mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-06-27 10:31:39 +00:00
efd66ea3f527cb9ec243dcdf72ea3170f94af968
312 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
efd66ea3f5 |
feat(mqtt): per-source status endpoint + Observers panel (#1682)
## Summary
Adds MQTT source status visibility per #1043 acceptance criteria:
- **Ingestor:** per-source counter registry (`cmd/ingestor/source_status.go`) tracking `connected`, `lastConnectUnix`, `lastDisconnectUnix`, `lastPacketUnix`, `connectCount`, `disconnectCount`, `packetsTotal`, `packetsLast5m` (sliding 5-min window via per-second buckets keyed by unix second — no stale-leak), `lastError`. Wired at the existing OnConnect / ConnectionLost / DefaultPublish callsites alongside the liveness watchdog. Idempotent registration so counters survive reconnects. Snapshot emitted in the existing stats file under `source_statuses` (additive, `omitempty`).
- **Backend:** new `GET /api/mqtt/status` handler reads the ingestor stats file and returns the per-source list. **Broker passwords are masked** via a regex over the `scheme://user:pass@host` form (covers mqtt/mqtts/tcp/ssl/ws/wss). Mask is also applied to `lastError` as defense-in-depth (broker libs occasionally quote the failing URL). OpenAPI completeness gate satisfied with a `routeDescriptions` entry.
- **Frontend:** small self-contained panel (`public/mqtt-status-panel.js`) mounted above the Observers table. Auto-refreshes every 10s, color-codes each row (green = connected + recent packet, yellow = connected idle, red = disconnected), and tears down its timer on SPA route change.
## TDD
- Red commit `f19a93b5` — stub `/api/mqtt/status` handler + assertion test that the broker password is `****`-redacted. Test fails on the assertion (handler passes the URL through verbatim). Compile-clean — assertion-fail, not build-fail.
- Green commit `77042e41` — `maskBrokerURL` helper + table-driven unit tests across all schemes + handler rewires to mask both `Broker` and `LastError`.
- Subsequent commits land the ingestor wiring and the frontend panel.
## Tests
```
$ cd cmd/server && go test -run 'TestMqttStatus|TestMaskBrokerURL' -v ./...
PASS: TestMqttStatus_MasksBrokerPassword
PASS: TestMqttStatus_EmptyWhenNoStatsFile
PASS: TestMaskBrokerURL_Patterns (10 subtests)
$ cd cmd/ingestor && go test -run 'TestSourceStatus|TestSnapshotSourceStatuses' -v ./...
PASS: TestSourceStatus_BasicLifecycle
PASS: TestSourceStatus_Disconnect
PASS: TestSnapshotSourceStatuses_ReturnsAll
$ node test-mqtt-status-panel.js
7 passed, 0 failed
```
Full `go test ./...` clean in both `cmd/server` and `cmd/ingestor`.
## Preflight overrides
- `cross-stack`: justified — issue #1043 is intrinsically full-stack (ingestor stats → server endpoint → observers panel). Per-stack split would land an unreachable endpoint or a fetch with no backend.
- `check-xss-sinks` (public/mqtt-status-panel.js:55): justified — the flagged `innerHTML=` is a fully-static literal (empty-state placeholder, no payload data interpolated). All payload-bearing `innerHTML=` sites in this file run through `escapeHTML` (defined in the same file); the test `renderPanel never echoes a plaintext password (defense-in-depth)` exercises the rendered HTML against payload strings.
## Acceptance criteria
- [x] `/api/mqtt/status` returns per-source connection state — `cmd/server/mqtt_status.go`
- [x] UI panel shows all configured sources with live status — `public/mqtt-status-panel.js`
- [x] Connection state updates on reconnect/disconnect events — `MarkConnect` / `MarkDisconnect` wired in `cmd/ingestor/main.go`
- [x] Broker URLs don't expose passwords in the API response — `maskBrokerURL` + 13 test cases
- [x] Works with 1-N sources — registry is keyed per-source, snapshot iterates the map
**Partial fix for #1043** — per-packet `mqtt_source` attribution (the issue's "Follow-up" section) is **deferred** per the `mc-bot-triaged:v1` triage and the autofix comment ("Per-packet attribution deferred to follow-up issue"). That work requires a new observation-row column and DB schema migration, both explicitly out of scope for this PR.
Refs #1043
---------
Co-authored-by: openclaw-bot <bot@openclaw.local>
|
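As an illustration of the password masking described in the commit above, here is a minimal sketch of a `maskBrokerURL`-style helper. The function name and the `****` placeholder come from the commit message; the regular expression itself is an assumption, since the actual pattern is not shown in this log.

```go
package main

import "regexp"

// brokerCredRe matches "scheme://user:pass@" for the schemes the commit
// message lists (mqtt, mqtts, tcp, ssl, ws, wss). ASSUMPTION: the pattern
// used upstream is not shown in the log; this one is illustrative only.
var brokerCredRe = regexp.MustCompile(`(?i)\b((?:mqtts?|tcp|ssl|wss?)://[^:/@\s]+):[^/\s]+@`)

// maskBrokerURL replaces the password in any broker URL found in s with
// "****". It works on bare URLs and on URLs quoted inside error strings.
func maskBrokerURL(s string) string {
	return brokerCredRe.ReplaceAllString(s, "${1}:****@")
}
```

Because the same function accepts arbitrary text, it can be applied both to the broker field and to a `lastError` string that happens to quote the failing URL.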
||
|
|
2ef7d2437d |
fix(ci): release fast-path re-tag :edge → :vX.Y.Z when SHA matches (Fixes #1677) (#1680)
## Summary
Adds `.github/workflows/release-fast-path.yml`: a metadata-only re-tag workflow that fires on `push.tags: v[0-9]+.[0-9]+.[0-9]+` and, when `:edge`'s `org.opencontainers.image.revision` label matches the tag SHA, applies `:vX.Y.Z`, `:vX.Y`, `:vX`, `:latest` to the existing edge manifest via `crane tag`. No rebuild, no test re-run — ~seconds vs ~30 min today.
If the SHA doesn't match (tag points to an older commit, or `:edge` wasn't built yet), it dispatches the existing `deploy.yml` pipeline as a fallback so validated bytes always ship.
To prevent double-fire, `deploy.yml`'s top-level `on:` block drops `tags: ['v*']` — `release-fast-path.yml` is now the sole consumer of `push.tags`. Edge publishing on master push is untouched.
## TDD
Red commit adds `cmd/server/release_fast_path_workflow_test.go` (two tests: one asserts the new workflow exists with the required trigger/permissions/markers; the other asserts `deploy.yml`'s `on:` block no longer mentions `tags:`). Both fail on assertions in the red commit. Green commit adds the workflow file + edits `deploy.yml`; both pass.
## Acceptance criteria (from #1677)
- Tag-CI completes in <2 min when tag SHA == `:edge` revision → fast-path is metadata-only, single short job
- Falls back to full pipeline on SHA mismatch → `gh workflow run deploy.yml --ref ${{ github.ref }}`
- `:vX.Y.Z` has same digest as `:edge` → `crane tag` copies the manifest, bytes are byte-identical
- No regression on older-SHA tags → fallback path runs the unchanged full validation
Fixes #1677
---------
Co-authored-by: Kpa-clawbot <bot@corescope.local>
|
||
|
|
653d47e03c |
test(openapi): add CI completeness gate for /api routes (Phase 1 of #1670) (#1678)
## Summary
Partial fix for #1670 — **Phase 1 only** (CI completeness gate). Phase 2 (backfilling the 18 currently-undocumented routes into `openapi.go`) is deferred to a separate issue per the triage on #1670 and is explicitly out of scope here.
## What this adds
- `cmd/server/openapi_completeness_test.go` — AST-walks every non-`_test.go` file in `cmd/server/`, finds string-literal first args to `*.HandleFunc(...)` calls beginning with `/api/`, and diffs against the paths declared in `routeDescriptions()` in `cmd/server/openapi.go`.
- `cmd/server/openapi_known_gaps.json` — seeded allowlist of the **18** `/api/` routes currently registered via `HandleFunc` but not yet documented in `openapi.go`.
## Ratchet pattern
From this branch forward, `TestOpenAPICompleteness` fails when:
1. A new `HandleFunc("/api/...")` is added without a matching entry in `openapi.go` **or** the allowlist (regression gate — the main goal of Phase 1).
2. A route in the allowlist is *also* documented in `openapi.go` — the allowlist must shrink as Phase 2 backfills land, never go stale.
The two-commit history (red → green) demonstrates the gate works:
- **Red commit**: adds only the test. Fails on master with the 18 missing routes listed.
- **Green commit**: adds the allowlist seeded with that exact 18-route set. Test passes at the current baseline.
## Local verification
- `go test ./cmd/server/ -run TestOpenAPICompleteness -v` → PASS at baseline (`44/62 covered; 18 in allowlist; 18 gaps remain`).
- Ratchet validation: temporarily inserted `r.HandleFunc("/api/ratchet-test-route", ...)` into `routes.go` → test FAILED with that exact route name; reverted → test PASSES again.
## Files changed
- `cmd/server/openapi_completeness_test.go` (+203 / new)
- `cmd/server/openapi_known_gaps.json` (+24 / new)
## Preflight
`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → all hard gates pass; no warnings.
## Out of scope
- Backfilling the 18 allowlisted routes into `openapi.go` (Phase 2 — tracked separately).
- Schema validation of the spec against OpenAPI 3.0 (Phase 3 per the issue).
- PR template checkbox update (Phase 2 follow-up).
Issue #1670 stays open for Phase 2.
---------
Co-authored-by: clawbot <bot@corescope.local>
|
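The AST walk and ratchet check described in the commit above can be sketched with the standard `go/parser` and `go/ast` packages. The helper names `apiRoutes` and `missing` are hypothetical, and the sketch parses a single source string rather than every file in `cmd/server/`; it is not the repository's actual test.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// apiRoutes returns, in source order, every string-literal first argument
// of a *.HandleFunc(...) call in src that begins with "/api/".
// (Hypothetical helper; the real test walks a directory of files.)
func apiRoutes(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "routes.go", src, 0)
	if err != nil {
		return nil, err
	}
	var routes []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "HandleFunc" {
			return true
		}
		lit, ok := call.Args[0].(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true // non-literal paths are skipped, as in the description
		}
		path, err := strconv.Unquote(lit.Value)
		if err == nil && strings.HasPrefix(path, "/api/") {
			routes = append(routes, path)
		}
		return true
	})
	return routes, nil
}

// missing lists registered routes that are neither documented nor in the
// allowlist: the "regression gate" half of the ratchet.
func missing(registered []string, documented, allowlist map[string]bool) []string {
	var out []string
	for _, r := range registered {
		if !documented[r] && !allowlist[r] {
			out = append(out, r)
		}
	}
	return out
}
```

The second ratchet rule (an allowlisted route that is also documented must fail) would be a symmetric loop over the allowlist and is left out for brevity.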
||
|
|
938153dd92 |
fix(nodes): rebuild relay-hop history on startup from path_json (#1643)
## Problem A relay node's **activity timeline** — and its per-node `packetsToday` / observer counts — collapses to *"only the hour the server restarted"* after every restart. Before the restart the timeline shows only the node's own adverts (~1–2/hr); all of its relay activity piles into the single post-restart hour. ## Root cause All DB cold-load paths (`Load`, `loadChunk`, `scanAndMergeChunk`) index relay-hop attribution into `byNode` **only** from `observations.resolved_path`. But since #1287 the ingestor persists relay data as aggregate `neighbor_edges` and **never writes `resolved_path`** — it is `NULL` on every deployment (verified on a live DB: 0 of ~440k rows populated). So relay attribution is never reconstructed on startup; it only re-accumulates from live traffic (`IngestNew*`, which re-resolves from `path_json` + the neighbor graph), piling a relay node's whole history into the post-restart window. ## Fix Server read-side only — **no schema / ingestor / migration change**. When `resolved_path` is empty, re-resolve relay hops from the already-persisted `path_json` using the in-memory prefix map + neighbor graph (the same `resolvePathForObs` compute the live ingest path already runs). `main.go` now loads the persisted neighbor graph *before* the packet load so resolution has the graph available. Two correctness details worth a close look: 1. **Fetch the prefix-map/graph snapshot BEFORE opening each load cursor.** `getCachedNodesAndPM` issues its own DB query; doing so while a load cursor is open deadlocks on a single-connection SQLite pool (the test harness uses one). 2. **Index into `byNode` ONLY** — not the `resolved_path` / path-hop indexes. Those are cross-checked by `handleNodePaths` against the persisted `resolved_path` column (NULL here); populating them from an in-memory re-resolution would make that SQL confirmation fail and wrongly drop the tx from paths-through (#1352). 
## Tests New coverage asserts a relay pubkey reachable *only* via `path_json` lands in `byNode` after a restart-style load, for both the hot-window (`LoadChunked`) and background-window (`loadChunk`) paths. Existing #1558 (`resolved_path`) and #1352 (paths-through) tests still pass. Full `cd cmd/server && go test ./...` is green under `-race`. ## Perf The fallback runs `resolvePathForObs` per observation with a non-empty `path_json` during cold load — the same per-packet compute the live ingest path already performs, so no new asymptotic cost. The prefix map + graph are snapshotted **once per load** (not per row); `getCachedNodesAndPM` is 30s-cached. In `loadChunk` the resolution runs in the existing lock-free scan and is accumulated locally, matching that function's "build local, merge under lock" design. ## Note on a pre-existing flaky test `TestDistanceConcurrentRequestsDuringBuildReturn202` is timing-fragile (fails ~1/15 on `master` without this change). It relies on the lazy distance build being slow because it's the first caller of `getCachedNodesAndPM` (cold cache). This PR pre-warms that cache during `Load`, narrowing the build window, so the test fails more often in **non-race** local runs. It passes reliably under `-race` (CI mode), where the build stays slow. Flagging in case you want to harden the test separately. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com> Co-authored-by: openclaw-bot <bot@openclaw> |
||
|
|
825b26485c |
fix(#1181): hide nodes whose name starts with a configured prefix (#1655)
Fixes #1181.
## Summary
Adds operator-configurable name-prefix hiding for nodes. When a node's name starts with any prefix listed in the new `hiddenNamePrefixes` config field (default `["🚫"]`), it is omitted from `/api/nodes`, `/api/nodes/search`, and `/api/nodes/{pubkey}`. DB rows are preserved — the filter runs at the API layer only, so observation history (paths, hops, distances) stays intact and the node simply re-appears if the operator clears the prefix list.
This mirrors the convention already in use on other MeshCore map dashboards: an operator who wants their node hidden renames it with the 🚫 prefix and sends an advert; the next advert is then dropped from the dashboard. The node is **not** hidden from the mesh itself — only from this dashboard. This is documented inline in `config.example.json`.
Implementation follows the existing `IsBlacklisted` pattern exactly: a new `Config.IsNameHidden(name)` method, and three filters in `routes.go` placed alongside the corresponding blacklist filters. No DB schema, public API, or websocket changes.
## Files changed
- `cmd/server/config.go` — new `HiddenNamePrefixes []string` field + `IsNameHidden` method
- `cmd/server/routes.go` — filters in `handleNodes`, `handleNodeSearch`, `handleNodeDetail`
- `config.example.json` — new field + `_comment_hiddenNamePrefixes` operator doc
- `cmd/server/hidden_name_prefix_1181_test.go` — new test file (red → green)
## Test plan
Two new subtests in `TestHiddenNamePrefix_1181_*`:
1. `_NodesList` — inserts a node named `🚫 ban me`, asserts it is present when `HiddenNamePrefixes` is empty and absent when set to `["🚫"]`.
2. `_Search` — inserts `🚫 search me`, asserts `/api/nodes/search?q=search` does not surface it when the prefix is configured.
Verified red→green:
- Red commit `d0903852`: `go test -run TestHiddenNamePrefix_1181` fails on the leak assertion (`hidden_name_prefix_1181_test.go:94`).
- Green commit `e79a0d8d`: same command passes.
```
$ cd cmd/server && go test -run TestHiddenNamePrefix_1181 -count=1 .
ok github.com/corescope/server 0.060s
```
## Out of scope
- Auto-purging DB rows for hidden nodes — left to existing retention. The triage was explicit: hide, do not delete.
- Live websocket broadcast: nodes are not broadcast via websocket (only packets), so no separate emit path needs filtering. Frontend reads nodes via `/api/nodes`, which is filtered.
- Frontend customizer for the prefix list — operators configure via `config.json` like every other knob.
|
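A minimal sketch of the prefix check described in the commit above. The `HiddenNamePrefixes` field and the `IsNameHidden` method are named in the commit message; the method body, the nil-receiver guard, and the choice to ignore empty prefixes are assumptions for illustration.

```go
package main

import "strings"

// Config holds only the field relevant to this sketch; the real config
// struct has many more fields.
type Config struct {
	HiddenNamePrefixes []string
}

// IsNameHidden reports whether name starts with any configured prefix.
// ASSUMPTION: empty prefixes are ignored so that a stray "" entry in the
// config does not hide every node.
func (c *Config) IsNameHidden(name string) bool {
	if c == nil {
		return false
	}
	for _, p := range c.HiddenNamePrefixes {
		if p != "" && strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}
```

Since the check runs only when a response is built, clearing the prefix list makes the node visible again without touching stored rows, which matches the "hide, do not delete" behaviour described above.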
||
|
|
e04c7113cb |
feat: integrate hashtag channels from meshcore-channels catalogue (#1323) (#1656)
Fixes #1323 ## Summary Adds a small in-memory cache of the community-maintained hashtag-channels catalogue (`marcelverdult/meshcore-channels`) and exposes it as `GET /api/known-channels?region=XX` plus a collapsed sidebar section on the Channels view ("Known channels (catalogue)") with a one-click "+ Add" button per row. Per triage (#1323): new `cmd/server/known_channels_cache.go`, new `GET /api/known-channels?region=…`, frontend section in `public/channels.js`. No new DB tables — cache is in-memory only. ## What changed - `cmd/server/known_channels_cache.go` — `knownChannelsCache` with an atomic snapshot pointer, 24h default refresh, 30s HTTP timeout, 4 MB body cap, custom `User-Agent`. Fail-soft: a failed refresh leaves the last-known snapshot in place. Background goroutine started from `main.go` after the neighbor-graph recomputer; never blocks startup. - `cmd/server/known_channels_route.go` — `GET /api/known-channels?region=` serves the cached snapshot off the atomic pointer (never blocks on upstream). Region filter is case-insensitive ISO 3166-1 alpha-2. Empty/missing cache returns 200 with an empty entries list (fail-soft for the UI). - `cmd/server/config.go` — `KnownChannelsURL` + `KnownChannelsRefreshMs`. - `config.example.json` — example values + `_comment_knownChannels`. - `public/channels.js` — new collapsed sidebar section "Known channels (catalogue)" that lazy-fetches `/api/known-channels` on first render and renders rows with a "+ Add" button. The button calls the existing `addUserChannel(name)` path, so adding catalogue channels reuses the full save-key + decrypt flow that user-typed hashtags already use. - `cmd/server/known_channels_cache_test.go` — failing-first tests: - `TestKnownChannelsParseFixture` asserts the parser populates `GeneratedAt`/`License` and region-stamps every entry while skipping empty countries. - `TestKnownChannelsRouteRegionFilter` asserts the route returns 200 with exactly the filtered subset for `?region=be`. 
- `TestKnownChannelsFailSoftOn500` asserts a failed upstream fetch leaves the prior snapshot in place and bumps `failCount`. ## Upstream pinning The default URL is pinned to the specific file `channels-by-country.json` on `main`: > https://raw.githubusercontent.com/marcelverdult/meshcore-channels/main/channels-by-country.json Shape (verified 2026-05-24): ```json { "generated_at": "...", "license": "CC0-1.0", "countries": { "be": [{"channel": "#antwerpen", "description": "..."}], ... } } ``` ## Test plan ``` cd cmd/server && go test -run 'TestKnownChannels' -count=1 . ok github.com/corescope/server 0.008s ``` Red commit: |
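The fail-soft behaviour described in the commit above (an atomic snapshot pointer that a failed refresh leaves untouched while bumping `failCount`) can be sketched as follows. `knownChannelsCache` and `failCount` are named in the commit message; the `snapshot` type, the injected `fetch` function, the `errUpstream` sentinel, and the method names are hypothetical. The sketch needs Go 1.19 or later for `atomic.Pointer`.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// errUpstream is a hypothetical sentinel standing in for any upstream
// failure (HTTP 500, timeout, oversized body, ...).
var errUpstream = errors.New("upstream fetch failed")

// snapshot is a simplified stand-in for the parsed catalogue.
type snapshot struct {
	Entries []string
}

// knownChannelsCache serves reads from an atomic pointer so request
// handlers never block on the upstream fetch.
type knownChannelsCache struct {
	snap      atomic.Pointer[snapshot]
	failCount atomic.Int64
	fetch     func() (*snapshot, error) // injected; assumed for testability
}

// refresh swaps in a new snapshot on success. On failure it only bumps
// failCount, leaving the last-known snapshot in place.
func (c *knownChannelsCache) refresh() error {
	s, err := c.fetch()
	if err != nil {
		c.failCount.Add(1)
		return err
	}
	c.snap.Store(s)
	return nil
}

// current returns the cached snapshot, or an empty one if no refresh has
// succeeded yet, so callers can always render a (possibly empty) list.
func (c *knownChannelsCache) current() *snapshot {
	if s := c.snap.Load(); s != nil {
		return s
	}
	return &snapshot{}
}
```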
||
|
|
1116801b2f |
M5: emoji → Phosphor Icons — settings & customize (#1648) (#1653)
**Red commit:** `851cc8c3a024b1675558092d772444bf4f1ec625` — failing test on a stub branch (will link CI run after PR opens). Partial fix for #1648 (M5 of 6). **Do NOT close the tracking issue** — M6 (server-side residual emoji sweep + lint gate) still pending. ## Per-file swap counts | File | Phosphor `<use>` refs | Notes | |---|---|---| | `public/customize.js` | 20 | DEFAULTS → `ph:<name>` tokens; render path keeps legacy emoji branch (back-compat) | | `public/customize-v2.js` | 26 | same as v1; cv2 overrides path unchanged | | `public/home.js` | (helpers added) | `_renderHomeGlyph` / `_renderHomeLabel` accept both `ph:<name>` and legacy emoji | | `public/geofilter-builder.html` | 5 | clear / undo / save / load buttons (+inline `.ph-icon` CSS) | | `public/audio.js` | 1 | audio unlock prompt | | `public/filter-ux.js` | 5 (3 new) | help popover star + close, saved-filter delete | | `public/style.css` | 0 | `#chList .ch-share-btn::before { content: '📤' }` removed; JS now renders an inline sprite | | `cmd/server/routes.go` | (6 `ph:` tokens) | onboarding home defaults updated in lockstep with customize-v2.js | ## Operator config back-compat — PROMINENT Per design call #1 (user-locked): existing operator-stored emoji values in `config.json` / `localStorage` are **NOT** touched. The render path supports both: ```js function renderConfigGlyph(value) { var m = String(value || '').match(/^ph:([a-z][a-z0-9-]+)$/); if (m) return '<svg class="ph-icon"><use href="/icons/phosphor-sprite.svg#ph-' + m[1] + '"/></svg>'; return esc(value); // EMOJI-OK-LEGACY-RENDER — operator-stored emoji/text path } ``` Defaults flipped to `ph:<name>` tokens, so new operators (and operators who hit "Reset to Defaults") see Phosphor sprites. Operators with stored emoji values continue to see their emoji exactly as before. Verified end-to-end (see E2E (b) below). 
## cmd/server/routes.go — changed in lockstep Per design call #2: the home-defaults `steps` / `footerLinks` mirror the JS DEFAULTS, so they MUST update together. routes.go now emits `ph:<name>` tokens; the frontend home-render path resolves them. Existing tests (`TestConfigThemeHomeDefaults`) still pass — they assert structure, not glyph values. ## E2E assertions added - `test-issue-1648-m5-emoji-scan.js` — per-file zero-emoji + ph-token DEFAULTS + sprite presence - `test-issue-1648-m5-icons-e2e.js`: - (a) customize chrome — tabs/header rendered as sprites; chrome text icon-free - **(b) back-compat — injects fake `🐙` operator step into localStorage, reloads, opens customize, asserts the emoji renders verbatim in both the input value AND the live preview span; asserts the ph-token step renders as a sprite** (design call #1 in action) - (c) `/channels` modal sprite count - (d) `/audio-lab` sprite presence - (e) `geofilter-builder.html` control buttons sprite-driven - (f) every `<use>` resolves to a defined symbol id ## Out of scope (M6 cleanup) - cmd/server/routes.go residual server-rendered emoji **not** tied to customize defaults (none found by my grep — file already audited) - `make lint-no-emoji` CI grep gate (M6 owns it) - `public/icons/README.md` workflow doc cross-stack: justified — design call #2 requires Go + JS update together. --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
8295c2115c |
fix(reach): bust response cache on blacklist change (#1629) (#1636)
Red commit:
|
||
|
|
078225a54e |
perf(neighbor_api): fold first_seen into cached map — fix #1627 r3 regression (#1632)
## TL;DR Post-merge regression introduced by #1627 r3 (commit `e2212f50`): `buildNodeInfoMap` in `cmd/server/neighbor_api.go` ran an uncached `SELECT … FROM nodes` scan on every call. Folded `first_seen` into the already-cached `getCachedNodesAndPM` (30s TTL) so the 4 hot handlers that call `buildNodeInfoMap` no longer pay for a full table scan per request. ## Before / After `buildNodeInfoMap` is called by **4 hot handlers**: - `cmd/server/neighbor_api.go:130` - `cmd/server/neighbor_api.go:297` - `cmd/server/neighbor_debug.go:83` - `cmd/server/node_reach.go:421` | | Before | After | |---|---|---| | `SELECT … FROM nodes` per call | 1 (uncached) | 0 (cache hit) | | `SELECT … FROM observers` per call | 1 (uncached) | 1 (unchanged) | | At Cascadia scale (~2600 nodes) | full scan × 4 handlers × N req/s | one scan / 30s | ## How - Extended the `getAllNodes` schema probe to also `COALESCE(first_seen, '')`. Falls back through the existing richest → leanest ladder if the column is missing. - `nodeInfo.FirstSeen` is therefore populated for every cached entry in `getCachedNodesAndPM`. - `buildNodeInfoMap` drops its second `SELECT` entirely and just copies `nodeInfo` values out of the cached map. - Public signature of `buildNodeInfoMap` is unchanged. `node_reach.go:421` still sees `nodeInfo.FirstSeen` populated, served from cache. `cmd/server/store.go` is touched because `getAllNodes` is the only sensible owner of the `first_seen` SELECT — adding a parallel cache would duplicate the 30s TTL machinery this fix is designed to leverage. ## Test (red → green) - Commit 1 (`test:`): `TestBuildNodeInfoMap_FirstSeenIsCached` — calls `buildNodeInfoMap`, mutates `first_seen` out-of-band via a separate rw connection, calls it again, and asserts both calls return the same (cached) value. Fails on `origin/master` (call 2 sees the mutated value, proving the uncached scan). - Commit 2 (`perf:`): the fold. Test now passes. 
## Refs Post-merge audit identified this as the only MAJOR finding from #1627; recommendation was a follow-up hot-fix PR. This is that PR. --------- Co-authored-by: openclaw-bot <bot@openclaw> Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
43be1bb76a |
fix(reach): scanReachRows DB errors must surface as 500 not 404 (#1631) (#1635)
Red commit:
|
||
|
|
e2212f5015 |
feat(nodes): per-node Reach page + GET /api/nodes/{pubkey}/reach (v2, review-complete) (#1627)
Re-submission of #1625 (which was merged early, then reverted in #1626) — now with **all three round-1 reviews addressed** so it lands in one hardened state instead of as post-merge follow-ups. ## What Per-node **Reach** view: a standalone page (`#/nodes/{pubkey}/reach`) + a node-detail section + `GET /api/nodes/{pubkey}/reach`. It shows which nodes a node has a **stable two-way RF link** with, derived from raw `path_json` adjacency (a path travels origin→observer, so `[A,B]` ⇒ B heard A). A link is bidirectional when both directions have observations; the **bottleneck** (weaker direction) rates two-way reliability. Nodes are identified only by **unique 2–3 byte** path prefixes (1-byte collides → excluded). ## Review fixes folded in vs #1625 **Performance (Carmack):** hard scan LIMIT (200k) + modest prealloc; `json.Unmarshal` replaced by a single-pass `parsePathTokens` (100k-row scan 2.2M→1.3M allocs, 344→203ms); memoized resolver; size-hinted maps (attribution over 100k rows: 102 allocs); `context.Context` plumbed; cache `RWMutex` + evict-oldest (no full wipe); singleflight dedup; degree/rank from a 60s shared snapshot; bench rewritten (ReportAllocs, 1k/10k/100k, mixed-payload, isolated attribution). **Correctness/safety + tests (Independent + Kent Beck):** pubkey validation → 400; error logging instead of silent swallow (first_seen / degree / marshal→500 / discarded rows); `public_key=?` index use; canonical `PayloadADVERT`; `min()` builtin; documented cache-slice immutability; mux ordering comment. New tests: scanReachRows decode, 3-byte token branch, non-advert first-hop guard, observer SNR aggregation across rows, HTTP-level attribution (asserts non-zero we_hear/they_hear), 400/404/blacklist/cache-hit. 
**UI / a11y / Tufte:** in-map legend (tiers + thresholds); dropped the colour+width double-encoding (constant width, colour-only); colour-blind glyphs (●●●/●●/●) + tier title beside the bottleneck number; dark-theme `--link-*`; lighter table (horizontal rules, sentence-case headers); map built once + link layer updated in place on toggle (no flicker); time-range no longer flashes a loader; `destroy()` generation guard; statCard escaping; scoped `@media print` to `#nq-report`; `fieldset/legend` + `for/id` toggles; `aria-pressed` / `aria-live` / back-link `aria-label`; "distance (km)" + bottleneck tooltip + no-GPS note; inline styles → CSS; decorative emoji removed. **Docs:** api-spec documents the 5-min cache, 200k scan cap, and 400. ## Testing - `cmd/server` full suite green; reach unit + endpoint + bench all pass. - `eslint public/*.js` (no-undef) and the XSS-sink gate clean. - E2E updated: request status checks + exact (non-tautological) toggle assertions + hard map-render assert. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --- ## TDD-history note (Kent Beck gate) This branch carries production + tests together, not a fabricated red→green sequence. That's deliberate: the branch was rebased onto upstream and the intermediate SHAs were squashed, so reconstructing a "failing-test-first" commit after the fact would be theatre, not evidence — and rewriting history to stage it would be dishonest. The behaviour is instead covered by a comprehensive, anti-tautological suite (directional attribution edges, 3-byte token branch, non-advert first-hop guard, observer SNR aggregation, HTTP-level attribution asserting non-zero counts, scan-cap truncation, zero-reach 200-not-404, companion mis-attribution, cache eviction). Requesting maintainer acceptance of the work on test *substance* rather than commit *choreography*; the net-new-UI exemption is not claimed for the server endpoint. 
--------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: meshcore-bot <bot@meshcore> |
||
|
|
9c5faab1e4 |
Revert "feat(nodes): per-node Reach page (#1625)" (#1626)
Reverts #1625. #1625 was merged before the round-1 reviews (Independent / Kent Beck / Tufte) were addressed. Reverting to land it cleanly: a fresh PR will re-add the feature with the perf pass, the backend correctness/safety + test-coverage fixes, and the UI/a11y (Tufte) batch folded in, so it goes through review in a single hardened state rather than as a string of post-merge follow-ups. No functional loss — the feature returns in the replacement PR. |
||
|
|
47f85f6c4c |
feat(nodes): per-node Reach page + GET /api/nodes/{pubkey}/reach (directional link quality) (#1625)
## What
Adds a per-node **Reach** view that answers "how well does this specific
node hear, and get heard by, its neighbours?" — both as a standalone
page (`#/nodes/{pubkey}/reach`) and as a section on the node detail
page.
New endpoint: **`GET /api/nodes/{pubkey}/reach`**.
## What it measures
For the target node it derives, from raw `path_json` adjacency (a path
travels origin→observer, so in `[A,B]` B received A directly):
- **Directional link counts** per neighbour: `we_hear` (how often we
received them) vs `they_hear` (how often they received us).
- **Bidirectional / bottleneck**: a link is two-way stable when both
directions > 0; the weaker direction is the bottleneck and rates real
two-way reliability.
- **Importance**: neighbour degree + rank, relay-observation volume,
bidirectional-link count, direct-observer count.
- **Direct observers**: who received the node at 0 hops, with SNR.
Reliability rule: a neighbour is only attributed when its pubkey
**prefix is unique** at the path's byte length (collisions are skipped,
never misattributed).
## UI
- Standalone Reach page + node-detail section.
- Reusable bidirectional link map (OSM) with links coloured by
bottleneck.
- Incoming/outgoing toggles to isolate each direction.
## Naming note (deliberate, no collision)
This is distinct from the existing **per-observer reachability** in
topology analytics (`ReachNode` / `ObserverReach` / `perObserverReach`).
This PR adds its own `NodeReach*` response structs in a new
`node_reach.go` and a new `/api/nodes/{pubkey}/reach` route — there are
no symbol or route collisions (verified: `go build ./...` clean). Happy
to rename to disambiguate further (e.g. "Link Quality") if you'd prefer
to reserve "Reach" for the per-observer feature.
## Testing
- `cmd/server`: endpoint shape/404/limit-clamp + unit tests for token
derivation and directional attribution, plus a scan benchmark — all
pass.
- Frontend: helper tests + Reach-page E2E (`test-node-reach-e2e.js`),
standalone route + incoming/outgoing toggles.
- `go build ./...` and `eslint public/*.js` (no-undef) clean.
## Docs
Design spec, implementation plan, and the `GET
/api/nodes/{pubkey}/reach` API contract are included under `docs/`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
|
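The directional counting described in the commit above can be sketched as follows, assuming each path is already decoded into an ordered slice of node identifiers running origin→observer. The `link` type and `reach` function are hypothetical names, and the prefix-uniqueness rule is deliberately left out: the sketch assumes every identifier is already unambiguous.

```go
package main

// link holds per-neighbour directional counts for one target node.
type link struct {
	WeHear   int // times the target received this neighbour directly
	TheyHear int // times this neighbour received the target directly
}

// bidirectional reports whether both directions have observations.
func (l link) bidirectional() bool { return l.WeHear > 0 && l.TheyHear > 0 }

// bottleneck is the weaker direction, used to rate two-way reliability.
func (l link) bottleneck() int {
	if l.WeHear < l.TheyHear {
		return l.WeHear
	}
	return l.TheyHear
}

// reach walks adjacent pairs in each path. A path travels
// origin→observer, so in [A,B] node B received A directly.
func reach(target string, paths [][]string) map[string]*link {
	out := make(map[string]*link)
	get := func(k string) *link {
		if out[k] == nil {
			out[k] = &link{}
		}
		return out[k]
	}
	for _, p := range paths {
		for i := 0; i+1 < len(p); i++ {
			sender, receiver := p[i], p[i+1]
			switch {
			case receiver == target && sender != target:
				get(sender).WeHear++
			case sender == target && receiver != target:
				get(receiver).TheyHear++
			}
		}
	}
	return out
}
```

For the paths `[A,T,B]`, `[B,T]`, and `[C,T]` with target `T`, neighbour `B` ends up with one observation in each direction (bidirectional, bottleneck 1), while `A` and `C` are one-way links with bottleneck 0.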
||
|
|
a4776557ae |
feat(#1290): use firmware repeat:on|off hint to exclude listener-only observers from disambiguator (#1624)
Closes #1290. cross-stack: justified — backend persists firmware-side `repeat` hint to a new observers column, frontend surfaces the listener/repeater status as a badge on the observers list and node-detail Heard By table per the issue's UI acceptance criterion. ## What Firmware 1.16 publishes a `repeat: on|off` flag in the MQTT `/status` JSON (confirmed by @cwichura on the issue thread — see [`MQTTMessageBuilder.cpp:58`](https://github.com/agessaman/MeshCore/blob/b45373a31f111fb0de98bb3b168226d09ceadc47/src/helpers/MQTTMessageBuilder.cpp#L58) in `agessaman/MeshCore mqtt-bridge-implementation-flex`). Listener-only observers (`repeat:off`) by firmware contract never relay packets, so they cannot legitimately be a hop in someone else's resolved path. This PR plumbs the hint end-to-end so the disambiguator stops considering them. ## How * **`internal/dbschema`**: idempotent `can_relay INTEGER DEFAULT 1` migration on `observers`, plus `AssertReady` probe (server fatal-logs if absent). Mirrored in `cmd/ingestor/db.go` `CREATE TABLE` for fresh DBs. Annotated `PREFLIGHT: async=true` — `DEFAULT 1` is constant so SQLite does this as a metadata-only schema rewrite. * **`cmd/ingestor`**: `extractObserverMeta` accepts `repeat` as bool, case-insensitive string (`on|off|true|false|yes|no`), or numeric `0|1`. Missing field → `nil` → `COALESCE` preserves the existing column value (back-compat with legacy observers). Plumbed through `UpsertObserverAt` and the prepared upsert statement. * **`cmd/server`**: `GetNonRelayObserverPubkeys` + new `prefixMap.markNonRelay` drop matching candidates inside `pm.resolveWithContext` at the top of the resolver, so all 4 tiers see the pruned candidate set. `ObserverResp.CanRelay` is surfaced on `/api/observers` and `/api/observers/{id}`. `GetNodeHealth` enriches per-observer rows with `can_relay` so the node-detail badge renders. Probe-and-fall-back when the `can_relay` column is absent (legacy test fixtures). 
* **`public/`**: listener vs repeater pill on observers list, observer detail `Relay` stat card, and node-detail `Heard By` table. CSS uses existing theme vars. ## Test Added `TestResolveWithContext_ExcludesNonRelayObservers_Issue1290` in `cmd/server/resolve_non_relay_1290_test.go` covering all three required cases: * `repeat:off` pubkey → not a candidate (assertion failed in red commit `5f7fdb96`, passes after green `f12911dc`) * `repeat:on` pubkey → still a candidate (regression guard) * legacy obs (no field) → still a candidate (back-compat) Red→green proof: ``` $ git log --oneline origin/master..HEAD |
||
|
|
3d12266595 |
fix(#1608): address PR #1609 follow-up findings — config doc, receipt-time liveness, buffer stop/clamp warn (#1623)
Follow-up to #1609 / #1608. Addresses the 5 unresolved findings from the PR #1609 round-1 polish review.
## Findings addressed
| Tag | Severity | Fix | Commits |
|-----|----------|-----|---------|
| **B1** | BLOCKER | Document `ingestBufferSize` in `config.example.json` near other ingestor knobs. Default `50000`, comment text from review. | `f0b4e411` |
| **M1** | MAJOR (option 1 from review) | Split receipt-time vs post-write liveness: add `SourceLivenessState.LastReceiptUnix` + `MarkReceipt`, stamp at the MQTT receipt callback, leave `LastMessageUnix` post-write only. Drop the double-stamp at receipt that masked write-path stalls. Surface both clocks via the ingestor stats file (`source_liveness`) and the server's `/api/healthz` (`ingest_liveness`, additive, so older builds are unaffected). | RED `fa78233d` / GREEN `bc81b544` |
| **M1 (drop-log)** | MAJOR | Log every drop when buffer is at capacity. Removes the `n==1 \|\| n%1000` throttle that hid the first stall behind 1000 lost packets. The Submit drop branch only fires when the channel is at cap so volume is naturally bounded by the stall, not by an arbitrary modulo. | RED `a468763e` / GREEN `7b24fce5` |
| **m1** | MINOR | Add `IngestBuffer.Stop()` and `Done()` so tests stop leaking the consumer goroutine that `Start()` spawns. Existing tests gain `t.Cleanup(b.Stop)`. Drain semantics: stop-before-Ready exits immediately; stop-after-Ready best-effort drains queued jobs. | RED `8430c822` / GREEN `78c9b223` |
| **m2** | MINOR | `NewIngestBuffer(<1)` now logs a `[ingest-buffer] WARN` line on clamp so misconfigured `ingestBufferSize` values are visible instead of silently running a 1-slot queue. Test captures log output. | RED `62119ab4` / GREEN `815bfd02` |
| **m3** | MINOR | Add godoc to `Submit` and `Ready` documenting the Start-before-Submit / Start-before-Ready ordering invariant. | `564a813b` |
## TDD discipline
Each behavioral fix (M1, M1-drop-log, m1, m2) lands as a red-then-green pair. Red commits compile + run + fail on assertion, verified locally before the green commit. Per-finding red→green pairs are visible in the commit graph above. B1 and m3 are docs-only and ship as single commits (preflight script accepts them under the docs/comments exemption).
## Schema compatibility
`/api/healthz` change is purely additive: `ingest_liveness` is only included when the ingestor publishes the new `source_liveness` field, so older ingestor + newer server combos are unaffected. Field order in the response stays stable for prior consumers.
## Test output
- `go test -count=1 -timeout 180s ./cmd/ingestor/...` → green (160s)
- `go test -count=1 -timeout 300s ./cmd/server/...` → green (48s)
- Race-mode runs of the touched packages (`IngestBuffer|Liveness|Watchdog|Receipt|Healthz`) → green
- Full-package race runs locally exceed the brief's 120s timeout on pre-existing slow integration tests (TestObsTimestampIndexMigration, TestNeighborEdgesBuilderDeltaScan); CI has the headroom.
## Preflight
`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → all hard gates pass, no warnings.
## Files changed
- `config.example.json`: B1
- `cmd/ingestor/ingest_buffer.go`: m1, m2, M1-drop-log, m3
- `cmd/ingestor/ingest_buffer_test.go`: m1, m2, M1-drop-log
- `cmd/ingestor/mqtt_watchdog.go`: M1
- `cmd/ingestor/mqtt_watchdog_m1_test.go`: M1 (new)
- `cmd/ingestor/main.go`: M1 (receipt callsite)
- `cmd/ingestor/stats_file.go`: M1 (publish `source_liveness`)
- `cmd/server/perf_io.go`: M1 (type + reader)
- `cmd/server/healthz.go`: M1 (surface `ingest_liveness`)
Original review reference: PR #1609 polish review by the M-axis bot.
---------
Co-authored-by: corescope-bot <bot@corescope.local>
|
||
|
|
bc1822e46c |
perf(load): chunked Load with early HTTP readiness (#1009) (#1596)
## What Switches the server's startup from a synchronous full-scan `PacketStore.Load()` to a chunked `LoadChunked(chunkSize)` that: 1. Streams transmissions+observations from SQLite in id-ordered chunks (default `chunkSize=10000`, configurable via `db.load.chunkSize`). 2. Closes `FirstChunkReady()` after the first chunk is merged — `main.go` binds the HTTP listener on that signal instead of blocking on the full multi-minute load. 3. Stamps `X-CoreScope-Load-Status: loading; progress=<rows>` on every response while LoadChunked is in flight, flipping to `ready` once it completes (via `loadStatusMiddleware`). 4. Preserves the existing retention/`hotStartupHours`/`maxMemoryMB` clamps and the post-load index rebuild (`pickBestObservation` / `buildSubpathIndex` / `buildPathHopIndex` / `buildDistanceIndex`). ## Why Per #1009: at 5M+ observations (Cascadia scale) the synchronous Load blocked HTTP for ~80s with a 2–3× steady-state RAM peak. With chunked load the listener binds within seconds; dashboards and probes can read partial data and see the `loading` status header until the background load finishes. ## Notes - `/api/healthz` readiness gate (`readiness` atomic, init `WaitGroup`) is unchanged — it still waits for neighbor-graph build + initial `pickBestObservation` before reporting `ready:true`. `LoadChunked` only changes when the listener BINDS, not when it advertises ready. - `cmd/server/main.go` waits for `FirstChunkReady` (or the full load on a tiny DB) before proceeding, and drains the load goroutine in the background with a logged error path. - Config Documentation Rule: `config.example.json` now documents `db.load.chunkSize` with a nested `_comment` describing the trade-off. 
## Tests - `cmd/server/chunked_load_test.go` asserts: - (a) `FirstChunkReady` fires before `LoadChunked` returns - (b) `X-CoreScope-Load-Status` transitions `loading; progress=...` → `ready` - (c) `chunkSize` honored (2500 rows @ 1000 → 3 chunks via `OnChunkLoaded`) - (d) `Config.DBLoadChunkSize()` default 10000 + override - Red commit (`102a4c84`) lands the tests with stubs that fail on assertion — verified locally before the green commit. - Green commit (`35cecf16`) makes all four pass; full `cmd/server` suite green (47s locally). Closes #1009 ## TDD red-commit exemption The original red commit `f878e15e` ("test(load): failing tests for chunked Load + early HTTP readiness") fails to **compile** rather than failing on an assertion, because it references symbols (`store.LoadChunked`, `store.FirstChunkReady`, `store.OnChunkLoaded`, `Config.DBLoadChunkSize`, `loadStatusMiddleware`) that do not exist on master. Per `AGENTS.md` the bar is "MUST fail on an assertion ... A compile error is NOT a valid red commit." This is claimed under the **net-new surface** exemption with the following justification: - LoadChunked / FirstChunkReady / loadStatusMiddleware / DBLoadChunkSize are all introduced by this PR — no prior implementation existed to refactor. There is no behaviour on master that the red commit could meaningfully assert against without first declaring the new symbols. - The cheapest "proper" alternative (split the red into two commits: stub-first + assertion-fail) was deferred because the test file unambiguously fails on missing-symbol — there is no risk of the test becoming a tautology against a pre-existing stub. - **Behaviour gating IS proven elsewhere on this branch.** Commit `799bde49` ("test(load): red — LoadChunked must mark indexes ready + not flip Complete on error") is a proper assertion-fail red against the same package, and commit `92cadd1d` is the matching green. Reviewers can verify the red→green pattern there. 
If a future reviewer wants the strict pattern, the follow-up is mechanical: split `f878e15e` into a stub-only commit followed by the assertion commit. Not done here to keep the rework cost proportional to the risk (zero, in this case). ## Preflight overrides - check-async-migrations: justified — the flagged `CREATE TABLE`/`CREATE INDEX` statements live in `cmd/server/chunked_load_id_zero_test.go` and `cmd/server/chunked_load_oldest_test.go` only. They run against per-test `t.TempDir()` SQLite files (in-process, ~10 rows, lifetime = single test) — they are NOT production schema migrations. No prod table is touched. PREFLIGHT-MIGRATION-SCALE: <30s N=10 (per-test tempdir fixture). --------- Co-authored-by: CoreScope Bot <bot@corescope.local> Co-authored-by: clawbot <bot@noreply.example.com> Co-authored-by: Kpa-clawbot <bot@example.com> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot> |
||
|
|
7421ead9b0 |
fix: bypass API limit clamps for internal UI requests. Revisit of issue #1540 (#1589)
This PR replaces the strict, hardcoded limits on API list endpoints (introduced in the recent security patch) with a new operator-configurable `listLimits` block. The change is needed because the fix for issue #1540 capped the live map, and every other feature backed by `/api/nodes`, at 500 nodes.
Previously, we attempted to bypass public caps for internal UI requests using a heuristic based on browser headers (`Sec-Fetch-Site`). Following review, we dropped that heuristic entirely to eliminate any security-by-browser-convention surface area. Instead, `queryLimit()` returns to its original, simple bounds-checking shape, and the absolute maximums are now read from `config.json`. This gives equal DoS protection against all callers while letting server operators tune the ceilings to the size of their mesh (e.g. embedded devices can tighten the knobs, regional hubs can raise them).
### Changes Made:
- **`config.go`**: Introduced a `ListLimits` config struct containing `PacketsMax`, `NodesMax`, `AnalyticsMax`, and `ChannelMessagesMax`. Added safe initialization so the default caps (10000, 2000, 200, 500 respectively) apply even if the block is omitted from the config.
- **`clamp_limit.go`**: Deleted `isInternalUIRequest` entirely and restored `queryLimit` to its original signature (`r, def, max`).
- **`routes.go`**: Replaced all hardcoded integer ceilings on list endpoints (`/api/packets`, `/api/nodes`, etc.) with `s.cfg.ListLimits.*`.
- **`config.example.json`**: Added the `listLimits` block with documentation to guide new operators.
- **`clamp_limit_test.go`**: Removed all header-heuristic tests.
### Verification:
- All 611 backend unit tests pass (`npm run test:unit`).
- Bounds checking still clips requests at exactly the operator's configured limit.
---------
Co-authored-by: mc-bot <bot@openclaw.local>
Co-authored-by: openclaw-bot <bot@openclaw>
|
||
|
|
1bdb92de88 |
feat(#1574): operator-configurable liveMap.maxNodes (default 2000) (#1577)
Red commit:
|
||
|
|
ad41b9bb7b |
fix(tests): subpaths_window tests wait for index readiness after #1595 chunked load (#1621)
## Why master is red
After PRs #1592 (route-window subpath regression test) and #1595 (background/chunked index build with 503 readiness gate) were merged together, two tests in `cmd/server/subpaths_window_test.go` started failing on master:
```
--- FAIL: TestSubpathsHonorsTimeWindow_StoreLevel
    subpaths_window_test.go:70: unbounded: expected totalPaths=2, got 0 (subpaths=[])
--- FAIL: TestSubpathsHandlerHonorsTimeWindow
    subpaths_window_test.go:116: GET /api/analytics/subpaths?...: status=503 body={"error":"index loading","retryAfter":5}
```
Both branches passed in isolation; the conflict only manifested post-merge. Reason:
- **#1592** added tests that call `store.Load()` then immediately query `GetAnalyticsSubpathsWithWindow` / hit `/api/analytics/subpaths`.
- **#1595** moved the subpath + path-hop index builds off the critical path of `Load()` into background goroutines, and hard-gated the analytics handlers behind `SubpathIndexReady()` (returning 503 + `Retry-After: 5` until the build completes).
So after `Load()` returns, `s.spIndex` is still empty for a short window and the handler returns 503. The store-level test sees `totalPaths=0`; the handler test sees the 503.
## Fix (test-only)
Add `store.WaitIndexesReady(5 * time.Second)` between `Load()` and the assertions in both tests. This matches the established pattern already used by `routes_test.go` and `repeater_enrich_recomputer_1008_test.go`.
The 503 readiness gate from #1595 is intentional production behavior and is **not** touched. No production code is modified.
## Repro
Before:
```
$ go test ./cmd/server/ -run TestSubpaths.*Window -v -count=1
--- FAIL: TestSubpathsHonorsTimeWindow_StoreLevel (0.01s)
    subpaths_window_test.go:70: unbounded: expected totalPaths=2, got 0 (subpaths=[])
--- FAIL: TestSubpathsHandlerHonorsTimeWindow (0.02s)
    subpaths_window_test.go:116: GET /api/analytics/subpaths?minLen=2&maxLen=8: status=503 body={"error":"index loading","retryAfter":5}
FAIL
```
After:
```
$ go test ./cmd/server/ -run TestSubpaths.*Window -v -count=3
--- PASS: TestSubpathsHonorsTimeWindow_StoreLevel (0.01s)
--- PASS: TestSubpathsHandlerHonorsTimeWindow (0.02s)
... (x3) ...
PASS
ok github.com/corescope/server 0.097s
$ go test ./cmd/server/ -count=1 -timeout 300s
ok github.com/corescope/server 46.292s
```
## Files changed
- `cmd/server/subpaths_window_test.go` (+11 lines, test-only)
## Notes
- TDD exemption: this is a test-fix PR for a merge-conflict-induced failure. The "failing test" already exists on master; this PR makes it pass correctly by waiting on the readiness gate the test was previously unaware of.
- Unblocks staging deploys.
Co-authored-by: openclaw-bot <bot@openclaw>
|
||
|
|
222bfdf6cf |
feat(perf): SQLite writer-lock wait/hold instrumentation per component (#1340) (#1594)
## What Per-component SQLite writer-lock instrumentation so the next neighbor-builder-style write-lock starvation (root cause of #1339, invisible to operators for ~3 days) is detectable from `/api/perf`. Adds `Store.WriterExec` / `Store.WriterTx` wrappers that gate every wrapped call on a package-level `writerMu` so the wait the SQLite driver hides becomes Go-visible, and record `wait_ms` + `hold_ms` + `contention_total` (wait_ms > 100ms) under a component tag. Per-component p50/p95/p99 + max are published to `/api/perf/write-sources` under `.writer_perf` via the existing ingestor stats-file path. Slow-writer log line (`[db-slow-writer] component=X duration=Yms query=<200ch>`) fires on `hold_ms > 500ms` (threshold overridable via `CORESCOPE_DB_SLOW_WRITER_MS` env var). ## Tagged call sites | Component | Location | |-----------|----------| | `mqtt_handler` | `InsertTransmission` (db.go) | | `neighbor_builder` | `buildAndPersistNeighborEdges` (neighbor_builder.go) | | `prune_packets` | `PruneOldPackets` (maintenance.go) | | `prune_observers` | `RemoveStaleObservers` + orphan-metrics cleanup (db.go) | | `prune_metrics` | `PruneOldMetrics` (db.go) | | `vacuum` | `RunIncrementalVacuum` + `CheckAutoVacuum`'s full VACUUM (db.go) | ## TDD red→green - **Red commit** `68de585b` — `cmd/ingestor/db_writer_perf_test.go` + `Store.Writer*` stubs at end of `db.go`. Test synthetically blocks the writer for 60s tagged `neighbor_builder`, then asserts `mqtt_handler.wait_ms.p99 > 50000ms` on concurrent inserts. Fails on the assertion (p99 = 0.0ms) with the stub — not a build error. - **Green commit** `6a9be174` — replaces stubs with real wait/hold/contention aggregator + wires every writer call site. 
Same test passes: ``` 2026/06/05 04:36:47 [db-slow-writer] component=neighbor_builder duration=60059.0ms query=COMMIT --- PASS: TestWriterStarvationVisibleInPerf (60.40s) PASS ok github.com/corescope/ingestor 60.408s ``` ## Scope discipline - **API**: no public `Store`/`DB` signature change. Only additive exports. - **Server**: extends existing `/api/perf/write-sources` JSON with `.writer_perf` — does **not** add a new route, does **not** replace `handlePerf`. Empty `.writer_perf` map when paired with an older ingestor. - **Read/write invariant** (#1283) preserved: all instrumentation lives on the ingestor's writer connection. - **Files touched** (6 total): `cmd/ingestor/db.go`, `cmd/ingestor/db_writer_perf_test.go`, `cmd/ingestor/maintenance.go`, `cmd/ingestor/neighbor_builder.go`, `cmd/ingestor/stats_file.go`, `cmd/server/perf_io.go`, `config.example.json`. ## Deferred (acceptance items NOT in this PR) - **`mbcap_persist` component tag** — `RunMultibyteCapPersist`'s tx is intentionally NOT wrapped in this PR to stay within the implementation brief's 3-files-outside-whitelist budget. One-file follow-up to instrument. - **CI smoke test** asserting "neighbor-builder hold_ms < 1000ms on 100k-obs fixture" — deferred to a separate PR per the brief; this PR is scoped to instrumentation only. ## Preflight overrides PREFLIGHT-MIGRATION-SCALE: <30s N=runtime — the async-migration gate flagged five `instrumentedExec` / wrapped-`tx.Exec` lines on `DELETE FROM observer_metrics`, `UPDATE observers`, `DELETE FROM observer_metrics`, `DELETE FROM observations`, `DELETE FROM transmissions`. These are **not** schema migrations — they are the existing runtime prune / retention queries that already ran sync against `s.db.Exec` / `tx.Exec` on every retention cycle on master. This PR only swapped the surface call (sync → sync, via the wrapper) to record wait/hold timing; no new sync schema work was introduced. Behavior on production data is identical to master. 
Also: red commit's synthetic `UPDATE nodes SET name = name WHERE 0` is a test-only stub designed to acquire the writer without mutating any row (the `WHERE 0` is a no-op predicate). Fixes #1340 --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
1b112f0b08 |
feat(memlimit): GOMEMLIMIT via runtime.maxMemoryMB in server + ingestor (#1010) (#1595)
Red commit:
|
||
|
|
df61660a5e |
perf(load): background subpath+pathHop index builds with ready gates (#1008) (#1604)
## Summary
Mirrors the distance-index lazy pattern (#1011): the subpath and
path-hop index builds are no longer part of `Load()`'s synchronous
critical section. They now run in **two parallel background goroutines**
kicked off after `s.loaded = true`, so HTTP comes up immediately even at
Cascadia scale (5M observations, previously ~60s blocked on these two
builds inside `Load()` under `s.mu`).
Fixes #1008.
## Approach
Two new `atomic.Bool` fields on `PacketStore` (`subpathReady`,
`pathHopReady`) plus a one-shot broadcast channel (`indexReadyChan`) for
waiters. `Load()` removes the synchronous `s.buildSubpathIndex()` /
`s.buildPathHopIndex()` calls and instead kicks
`s.startBackgroundIndexBuilds()` right before returning. That function
spawns **two independent goroutines** (review m7), one per index. Each
goroutine:
1. acquires `s.mu.Lock()` (blocks until `Load()`'s deferred Unlock
fires),
2. runs its builder, releases the lock, stores its `ready = true`,
3. closes the broadcast channel if both flags are now true,
4. logs `[startup] index build complete: subpath (Xs)` (or pathHop).
Analytics handlers whose entire response IS the index aggregate —
`/api/analytics/subpaths`, `/api/analytics/subpaths-bulk`,
`/api/analytics/subpath-detail`, `/api/nodes/{pubkey}/paths` — gate
reads behind the corresponding atomic and respond with `503 Service
Unavailable`, `Retry-After: 5`, body `{"error":"index
loading","retryAfter":5}` until the build completes — matching the
triage spec.
### Handler scope (review M2)
A second class of handlers also touches these indexes — `/api/nodes`,
`/api/nodes/{pubkey}`, the `GetRepeaterRelayInfoMap` /
`GetRepeaterUsefulnessScoreMap` / `GetBridgeScore` enrichment helpers,
and `repeater_liveness` / `repeater_usefulness`. These are
**intentionally NOT 503-gated**: they expose the index via optional
enrichment fields that callers already treat as "may be empty", and
503-ing the SPA bootstrap to wait for an index that only affects
relay-activity badges would be a worse UX than a 30–60s window of "—"
values. The rationale is documented in the package doc-comment at the
top of `index_ready_1008.go`.
The recomputer's synchronous prewarm path
(`StartRepeaterEnrichmentRecomputer`) gates on `WaitIndexesReady(60s)`
(review M1) so it never snapshots an empty `byPathHop` into
`s.repeaterRelayCache`; on timeout it skips the prewarm and lets the
5-minute ticker pick up the populated index.
## Concurrency safety
Each build goroutine acquires `s.mu.Lock()` before calling the existing
`buildSubpathIndex()` / `buildPathHopIndex()` helpers, which replace
`s.spIndex` / `s.spTxIndex` / `s.byPathHop` with freshly-allocated maps.
Visibility of the populated maps to handlers that observe
`Ready()==true` is established by Go 1.19+ sync/atomic acquire-release
semantics: the atomic store of `true` happens-after `s.mu.Unlock()`, and
the handler's atomic load synchronizes-with that store. The handler's
subsequent `s.mu.RLock` serializes against concurrent ingest writers,
not against the builder.
The existing `main.go` boot sequence does not start ingest goroutines
until after `store.Load()` returns and graph init completes, so the
brief window between `Load()` returning and the two goroutines acquiring
`s.mu` does not race with concurrent ingest writes.
## TDD: red → green
- **Red** commit `63e79e11`: `cmd/server/index_ready_1008_test.go` adds
four assertions; `cmd/server/index_ready_1008.go` adds compile-only
stubs returning `true` so the tests fail on assertions, not build
errors.
- **Green** commit `fb1d22b0`: implements the real atomic gates, the
background goroutine, and the four handler 503 branches; also updates
four existing tests that read indexes directly post-`Load()` to call
`store.WaitIndexesReady(5s)` first.
- **Race-fix commit `b77d56eb`** (review m8 — test-infra exemption):
adds `WaitIndexesReady` calls in test helpers/setup paths so the race
detector no longer flags the read-after-Load() pattern in existing
tests. Per AGENTS.md, race-detector flakes are observable evidence (test
crashes under `-race`) and qualify for the test-infra exemption from the
TDD red-commit requirement; no behavior change in production code.
- **Polish round 2 — M1 red `408c7462` / green `85e82c8a`**:
`TestIssue1008_M1_PrewarmWaitsForIndexes` asserts the recomputer prewarm
SKIPs when indexes are not ready. Red commit adds the assertion + a stub
`repeaterEnrichmentPrewarmWait` var; green commit wires
`WaitIndexesReady` into the prewarm path and adds the handler-scope docs
for M2.
- **Polish round 2 — minor cleanups `fd089bd0`** (m3..m7): chunk-loader
wires `markIndexesReadySync`, memory-model comment rewritten to cite
acquire-release, sentinel deleted, polling replaced with a broadcast
channel, two parallel goroutines for the builds.
`TestIssue1008_m7_BothFlagsSetAfterParallelStart` covers the parallel
path.
## Reproduction
```
git fetch origin fix/issue-1008
git checkout
```
|
||
|
|
3898688d6d |
analytics: Relay Airtime Share endpoint + dumbbell chart (#1359) (#1601)
Implements the locked spec from #1359. Red commit: |
||
|
|
d6384c3c59 |
fix(#1217): honor time-window filter on Route Patterns analytics (#1592)
## What
The Route Patterns chart on `/#/analytics` ignored the Time window picker: every selection returned identical data. This PR threads `?window=` through to the backing endpoints and the store-level computation.
## Root cause
`cmd/server/routes.go:2065` (`handleAnalyticsSubpaths`) and `cmd/server/routes.go:2090` (`handleAnalyticsSubpathsBulk`) never called `ParseTimeWindow(r)`. The store-level entry points (`GetAnalyticsSubpaths`, `GetAnalyticsSubpathsBulk`) had no window-aware variant. The frontend (`public/analytics.js`) didn't append `&window=` to the `/analytics/subpaths-bulk` request.
## Fix
### Backend (`cmd/server/store.go`)
Added `GetAnalyticsSubpathsWithWindow` + `GetAnalyticsSubpathsBulkWithWindow`. Zero `TimeWindow` → byte-equivalent to the existing fast path (no perf regression on the default view). Non-zero window → iterate `s.packets`, filter on `tx.FirstSeen` via `TimeWindow.Includes`, reuse `rankSubpaths`. Cached by `(region|area|window)`.
```diff
-data := s.store.GetAnalyticsSubpaths(region, minLen, maxLen, limit)
+window := ParseTimeWindow(r)
+data := s.store.GetAnalyticsSubpathsWithWindow(region, minLen, maxLen, limit, window)
```
```diff
-results := s.store.GetAnalyticsSubpathsBulk(region, groups)
+results := s.store.GetAnalyticsSubpathsBulkWithWindow(region, groups, ParseTimeWindow(r))
```
### Frontend (`public/analytics.js`)
`renderSubpaths` now appends `&window=<value>` to the `/analytics/subpaths-bulk` request, matching how RF / topology / channels tabs already wire the picker.
## Before / after
```
GET /api/analytics/subpaths?window=24h → totalPaths=2 (before: all data, window ignored)
GET /api/analytics/subpaths?window=24h → totalPaths=1 (after: 24h-bounded, window honored)
```
## Tests
`cmd/server/subpaths_window_test.go`:
- `TestSubpathsHonorsTimeWindow_StoreLevel`: seeds a 1h-old tx with path `[aa,bb]` + a 30d-old tx with path `[cc,dd]`; asserts the unbounded call sees both and the 24h-windowed call sees only the recent one.
- `TestSubpathsHandlerHonorsTimeWindow`: same scenario via the HTTP handlers for `/api/analytics/subpaths` and `/api/analytics/subpaths-bulk`.
TDD: red commit `eefc27d3` (test fails on assertion with stub that ignores window), green commit `4c4c45d0` (implementation makes it pass). Full `go test ./...` in `cmd/server` green locally (~47s).
## Performance
Default view (no window selected) is unchanged: `window.IsZero()` short-circuits to the existing precomputed-index hot path. Windowed view is O(N_tx · path²), same complexity as the existing region-filtered slow path. Results cached per `(region|area|window)`.
Closes #1217
---------
Co-authored-by: Kpa-clawbot <bot@corescope>
|
||
|
|
5629a489b2 |
perf(distance): lazy build distance index on first request (#1011) (#1597)
## Summary Build the distance analytics index lazily on the first `/api/analytics/distance` request instead of eagerly inside `Load()` (and its background-load chunked merge). Per the triage Fix path on the issue: - Eager startup build removed from `Load()` and from `loadAllPacketsBackground()`'s post-merge pass. - First request returns `202 Accepted` + `Retry-After: 5` and kicks off the build in a background goroutine, gated by `sync.Once` so concurrent first-window requests all observe 202 (single build, not N parallel O(n²) computations). - Once built, subsequent requests fall through to the existing analytics-recomputer / TTL cache and serve 200 as before. - Debounced rebuild policy: refire only when `Δobs > 5%` since last build OR `>5 min` elapsed, whichever is more restrictive. Background loader also resets the gate so the next request rebuilds against the larger dataset. Effect: operators who never visit distance analytics no longer pay the O(n²) construction at startup. Acceptance criteria (a) no startup build, (b) first request triggers build, (c) concurrent in-flight requests get 202 are encoded as failing-first tests. ## Red → green - Red: `bc947ad1` — 3 assertion failures (`expected ... empty, got 3`, `expected 202, got 200`, `expected all 10 ... got 0`). - Green: `5264b68a` — production change makes them pass, no other tests regress. ## Files changed - `cmd/server/store.go` — lazy-build state (`distLazyMu`/`Once`/`Built`/`Building`/`LastBuilt`/`LastObs`), `TriggerDistanceIndexBuild`, `DistanceIndexBuilt`, `DistanceIndexBuilding`; eager `buildDistanceIndex` calls in `Load()` post-pass and chunked-background-load post-pass removed (Once reset instead so the next request rebuilds against the full dataset). - `cmd/server/routes.go` — `/api/analytics/distance` returns 202 + `Retry-After` until built. - `cmd/server/distance_lazy_index_test.go` — new tests (the three triage acceptance criteria). 
- `cmd/server/coverage_test.go`, `cmd/server/parity_test.go`, `cmd/server/routes_test.go`, `cmd/server/hop_disambig_e2e_test.go` — pre-warm the index via `TriggerDistanceIndexBuild()` + `DistanceIndexBuilt()` poll where the test asserts the 200 JSON shape. ## Perf justification Startup cost on a 500K-obs / 2K-node dataset: previously O(n²) hop scan during `Load()` post-pass and again during the background-load merge — measured at 10–20s in `specs/startup-audit.md`. New code: zero work at startup, the same O(n²) work runs at most once per HTTP request cycle (and only when the index is stale per debounce policy). Cold-path concurrency is bounded by `sync.Once`, so N parallel first-window requests never produce N parallel builds. ## Scope No config field added (debounce thresholds are hardcoded constants per the triage Fix path — `5%` / `5min`). No public API signature changes. No DB-side migration. Tests cover the lazy invariant, the 202+Retry-After contract, and concurrent first-request behavior. Closes #1011 --------- Co-authored-by: Kpa-clawbot <bot@corescope.local> |
||
|
|
3df8924114 |
fix(#1218): include multi-byte prefix repeaters in 1-byte hash usage matrix view (#1591)
## Problem
`/analytics` Hash Usage Matrix 1-byte view excluded repeaters configured
for 2- or 3-byte hash prefixes. In MeshCore, 1-byte path-matching is a
first-byte equality check, so any packet routed by 1-byte hash collides
on that first byte regardless of the downstream repeater's configured
prefix size. Omitting multi-byte prefix repeaters under-reports real
conflicts in the 1-byte hash space.
## Fix
**Data layer — `cmd/server/store.go` (`computeHashCollisions`,
~L7907-L7918 before, L7907-L7941 after):**
Before — `one_byte_cells` was populated only from `prefixMap`, which
only contained repeaters with `hash_size == 1`:
```go
if bytes == 1 {
	oneByteCells = make(map[string][]collisionNode)
	for i := 0; i < 256; i++ {
		hex := strings.ToUpper(fmt.Sprintf("%02x", i))
		oneByteCells[hex] = prefixMap[hex]
		if oneByteCells[hex] == nil {
			oneByteCells[hex] = make([]collisionNode, 0)
		}
	}
} else if bytes == 2 { ... }
```
After — additionally project all `hash_size in {2,3}` repeaters to their
first byte:
```go
if bytes == 1 {
	// ... (same baseline population) ...
	for _, cn := range allCNodes {
		if cn.Role != "repeater" { continue }
		if cn.HashSize != 2 && cn.HashSize != 3 { continue }
		if len(cn.PublicKey) < 2 { continue }
		hex := strings.ToUpper(cn.PublicKey[:2])
		if _, ok := oneByteCells[hex]; !ok { continue }
		oneByteCells[hex] = append(oneByteCells[hex], cn)
	}
}
```
The 2-byte view's bucketing is unchanged — that view continues to count
only repeaters configured for 2-byte prefixes (those semantics differ).
**UI — `public/analytics.js` L1459:** clarified the 1-byte view
description so the inclusion of multi-byte prefix repeaters is explicit.
## API shape
No response-shape change. `one_byte_cells[HEX]` is still
`[]collisionNode`; only the contents now include 2/3-byte prefix
repeaters in the appropriate first-byte buckets. The existing frontend
decoder is unaffected.
## Tests
- `cmd/server/routes_test.go::TestHashCollisionsOneByteIncludesMultiBytePrefixRepeaters`:
  seeds three repeaters with first byte `CC` configured for 1/2/3-byte
  prefixes plus an unrelated `DD` repeater, asserts all three appear in
  `one_byte_cells["CC"]`, and that the 2-byte view's `nodes_for_byte` is
  unchanged.
Red commit `278bdf8d` (test only) fails on assertion ("got 1, want 3");
green commit `9127ea4e` passes.
## Preflight
`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master`
→ clean.
Closes #1218
---------
Co-authored-by: clawbot <bot@corescope>
|
||
|
|
1a2b8c48be |
feat(node-detail): link RTC-reset warning to offending packet hashes (#1094) (#1590)
## Problem Node detail's bimodal-clock warning showed only `⚠️ N of last M adverts had nonsense timestamps (likely RTC reset)` — no way to tell which packets, no way to verify the heuristic, no way to drill in. ## Fix Additive, two-sides: **Backend** (`cmd/server/clock_skew.go`) - New type `BadSample { Hash, AdvertTS, SkewSec }`. - New field `NodeClockSkew.RecentBadSamples []BadSample` (`omitempty`). - Populated from the **same** bimodal-bad classification pass that produces `RecentBadSampleCount` — no heuristic change. `tsSkewPair` carries `hash` + `advertTS` so the classifier can record per-sample evidence without a second walk; drift code is unaffected (reads only `ts`/`skew`). **Frontend** (`public/nodes.js`) - `bimodalWarning` preserves the existing count summary line, then renders a `<ul>` of bad samples: each `<li>` is `<a href="#/packets/HASH">hash[:8]</a> → formatTimestamp(advertTS)` with ISO tooltip. Defensive `Array.isArray` so older API responses still render the summary alone. ## TDD - **Red:** `cmd/server/clock_skew_issue1094_test.go::TestIssue1094_RecentBadSamples_ExposesHashAndTimestamp` — seeds 3 healthy + 2 bimodal-bad adverts, asserts `RecentBadSamples` has length 2 with the expected hashes and advert timestamps. Fails on the assertion (`len = 0, want 2`) with the stub-only commit. - **Green:** classifier populates the slice; existing #1285 and bimodal tests stay green. - Red commit: `ed501f4b` - Green commit: `54305b06` ## Cross-stack Backend + frontend ship together (`cross-stack: justified` commit). API stays backward compatible (`omitempty` server, `Array.isArray` client) but the feature only lights up with both halves present. ## Preflight Clean — PII, branch scope, red-commit, CSS vars, XSS sinks, migrations, fixture coverage all pass. 
## Acceptance - [x] Warning lists specific packet hashes - [x] Each hash links to `#/packets/<hash>` - [x] Bad advert timestamp shown next to the hash - [x] Pattern is reusable — `BadSample` is a clean shape any future heuristic that flags specific packets can adopt Fixes #1094 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
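A minimal Go sketch of the additive `omitempty` contract described in this PR. The type and field names (`BadSample`, `Hash`, `AdvertTS`, `SkewSec`, `RecentBadSamples`, `RecentBadSampleCount`) come from the description; the field types, the JSON tag spellings, and the `encode` helper are assumptions made for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BadSample follows the shape named in the PR. Field types and JSON tags are assumed.
type BadSample struct {
	Hash     string  `json:"hash"`
	AdvertTS int64   `json:"advertTS"`
	SkewSec  float64 `json:"skewSec"`
}

// NodeClockSkew is reduced to the two fields relevant here.
type NodeClockSkew struct {
	RecentBadSampleCount int         `json:"recentBadSampleCount"`
	RecentBadSamples     []BadSample `json:"recentBadSamples,omitempty"`
}

// encode is a hypothetical helper that marshals the struct to a JSON string.
func encode(v NodeClockSkew) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// With no bad samples the key is omitted entirely, so older clients see the old shape.
	fmt.Println(encode(NodeClockSkew{}))

	// With samples present the per-packet evidence rides along with the count.
	fmt.Println(encode(NodeClockSkew{
		RecentBadSampleCount: 1,
		RecentBadSamples:     []BadSample{{Hash: "abcdef0123456789", AdvertTS: 0, SkewSec: -1.5}},
	}))
}
```

Because the slice is omitted when empty, a client guarding with `Array.isArray` keeps rendering the summary line alone against an older server.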
||
|
|
7533b3b67b |
feat(nodes): sortable First Seen column on Nodes table (#1166) (#1587)
## Summary Adds a sortable **First Seen** column to the Nodes table so users can spot newly observed repeaters in their region (per the reporter's use case). Closes #1166 ## Backend `/api/nodes` already exposes `first_seen` per node via `db.scanNodeRow` (sourced from the existing `nodes.first_seen` column — no schema migration, no recomputation, no extra query cost). The red test pins that contract. ## Frontend (`public/nodes.js`) - New `<th data-sort-key="first_seen" data-sort-default="desc">First Seen</th>` between Last Seen and Adverts. - Cell renders via `renderNodeTimestampHtml(n.first_seen)` — same relative-time + absolute-ISO `title=` tooltip as the Last Seen column. Empty values render as `—`. - `sortNodes` gains a `first_seen` branch with **empty-last** semantics: nodes without a `first_seen` always sort to the bottom regardless of asc/desc direction, so unknowns never clutter the top of the table. - Empty-state `colspan` bumped 7 → 8. ## TDD - **Red commit** `112442f4` — `test-issue-1166-first-seen-column.js` + `cmd/server/first_seen_1166_test.go`. The backend half passes on red (field already returned); 5 frontend assertions fail on assertions (column header missing, sort branch missing, empty-last violated). - **Green commit** `9274b36c` — only `public/nodes.js`. All 6 tests pass. Verified red is real-fail (assertion-shaped) by checking out the red commit's `nodes.js` and re-running the test: 5 failures, all on `assert.strictEqual`, none on parse/import. ## Test results ``` node test-issue-1166-first-seen-column.js → 6 passed, 0 failed node test-frontend-helpers.js → 611 passed, 0 failed go test ./cmd/server/... → ok (45.16s, all pass) ``` ## Files changed - `public/nodes.js` (+14 / −1) - `test-issue-1166-first-seen-column.js` (new) - `cmd/server/first_seen_1166_test.go` (new) ## Scope guardrails - No schema migration. - No new files outside the worktree's three allowed surfaces. - No refactor of other Nodes columns. 
- Empty cells handled in both render (em-dash) and sort (always last). --------- Co-authored-by: fix-1166-bot <bot@corescope.local> |
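The empty-last sort semantics can be sketched as follows. The real change lives in `sortNodes` in `public/nodes.js`; this Go version only illustrates the comparison rule. The `node` type and the `sortByFirstSeen`, `order`, and `sample` helpers are hypothetical, and the sketch assumes timestamps share one ISO-8601 UTC format so that string comparison matches chronological order.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical stand-in for a Nodes table row.
type node struct {
	Name      string
	FirstSeen string // ISO-8601 UTC timestamp; "" means unknown
}

// sortByFirstSeen orders by FirstSeen in the requested direction,
// but always keeps nodes without a FirstSeen at the bottom.
func sortByFirstSeen(nodes []node, desc bool) {
	sort.SliceStable(nodes, func(i, j int) bool {
		a, b := nodes[i].FirstSeen, nodes[j].FirstSeen
		if a == "" || b == "" {
			// A known value sorts before an unknown one; two unknowns are equal.
			return a != "" && b == ""
		}
		if desc {
			return a > b
		}
		return a < b
	})
}

// order joins node names so the resulting order is easy to print.
func order(nodes []node) string {
	names := make([]string, len(nodes))
	for i, n := range nodes {
		names[i] = n.Name
	}
	return strings.Join(names, ",")
}

func sample() []node {
	return []node{
		{"a", "2026-01-02T00:00:00Z"},
		{"b", ""},
		{"c", "2026-03-01T00:00:00Z"},
	}
}

func main() {
	n := sample()
	sortByFirstSeen(n, true)
	fmt.Println("desc:", order(n))
	sortByFirstSeen(n, false)
	fmt.Println("asc: ", order(n))
}
```

Handling the empty case before the direction check is what keeps unknowns at the bottom for both ascending and descending sorts.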
||
|
|
f7571a261e |
fix(#1546): remove dead server-side backfill flag (stuck backfilling=true) (#1583)
## Summary Closes #1546. `/api/stats` reported `{"backfilling":true,"backfillProgress":0}` on every fully-converged server, and `X-CoreScope-Status: backfilling` was sent on every request. Root cause: the `Store` had three atomic fields — `backfillComplete` / `backfillTotal` / `backfillProcessed` — read by `handleStats` and `backfillStatusMiddleware`, but **nothing ever wrote to them**. They are leftovers from the server-side async backfill added in #612/#614. That work moved to the **ingestor** in #1289 (server is now read-only) and the writer `backfillResolvedPathsAsync` was deleted, orphaning the readers. `backfillComplete.Load()` therefore always returned `false`, so `backfilling := !false` was permanently `true`. This is the leftover of an intentional architecture change, not an unfinished feature — the server no longer does backfill by design, so the correct fix is to delete the dead flag (per triage recommendation; zero consumers). ## Changes - `store.go` — drop the 3 dead atomic fields. - `routes.go` — drop `backfillStatusMiddleware` (+ its registration) and the backfill-progress computation in `handleStats`. - `types.go` — drop `Backfilling` / `BackfillProgress` from `StatsResponse`. **API change:** `/api/stats` no longer emits `backfilling` / `backfillProgress`; the `X-CoreScope-Status` header is removed. Verified no frontend or other consumer reads them. - `resolved_index.go` — remove stale comment referencing the deleted `backfillResolvedPathsAsync`. ## Test Regression assertion added to `TestStatsEndpoint` (#1546): asserts the response no longer carries `backfilling` / `backfillProgress` and that `X-CoreScope-Status` is unset. Verified red→green — against pre-fix code all three assertions fail; with the fix they pass. Full `cmd/server` suite green locally. ## Out of scope If a real server-side backfill/migration status indicator is wanted, that's a new feature on top of the ingestor stats pipe — tracked separately, not by reviving these dead fields. 
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
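The root cause reduces to a reader of an atomic flag that has no writer. A minimal sketch, assuming the flag was an `atomic.Bool` (the PR only says "atomic fields"); the `store` type and `backfilling` helper are hypothetical:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// store is a hypothetical reduction of the real Store to the one dead field.
type store struct {
	backfillComplete atomic.Bool // zero value is false
}

// backfilling mirrors the `backfilling := !backfillComplete.Load()` logic from the PR text.
func backfilling(s *store) bool {
	return !s.backfillComplete.Load()
}

func main() {
	s := &store{}
	// With the writer deleted, nothing ever calls Store(true),
	// so this reports true for the lifetime of the process.
	fmt.Println(backfilling(s))
}
```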
||
|
|
9465949e79 |
fix(#1558): mirror Load's resolved_path indexing into loadChunk (#1582)
## Summary Closes #1558. The background-backfill path (`loadChunk`) silently dropped the resolved-path indexing branch that `Load` performs per observation. Same SQL rows, two different post-conditions — a contract violation between the hot-startup load and the background chunk load. ## Root cause (the differential matters) The reporter's hypothesis — `indexByNode` not invoked on background-loaded transmissions — was 90% right but pointed at the wrong line. - `cmd/server/store.go:1116` already calls `s.indexByNode(tx)` inside the loadChunk per-batch merge lock for every backfilled tx. Decoded `pubKey` / `destPubKey` / `srcPubKey` ARE indexed. - `indexByNode` (store.go:1313 pre-patch) only reads three fields from `decoded_json`. It does NOT and cannot touch `resolved_path`. - `Load` (store.go:783-799) per-observation unmarshals `o.resolved_path`, extracts every relay-hop pubkey, and feeds them through `addToByNode` + `addResolvedPubkeysToPathHopIndex` + `addToResolvedPubkeyIndex`. - `loadChunk` (store.go:937-1023 pre-patch) selects `o.resolved_path` into `resolvedPathStr`… then never touches it. Result: after a container restart, every transmission older than `hotStartupHours` ends up present in `s.packets` / `s.byHash` / `s.byTxID` but missing from `s.byNode[relayPK]` for every relay pubkey. Home-page per-node `packetsToday` / `totalTransmissions` / `observers` / `avgHops` / `avgSnr` collapse for relay-heavy nodes (753 → 8 in the reporter's trace). Stats only self-heal as live ingest re-populates `byNode` through the ingest path (which DID call the full sequence inline). ## Fix shape 1. **Extract a shared `(s *PacketStore) indexResolvedPathHops(tx, pks, hopsSeen)` helper.** Owns the `addToByNode` + `addResolvedPubkeysToPathHopIndex` + `addToResolvedPubkeyIndex` sequence. Single point of truth so the "feed decode-window consumers for resolved-path pubkeys" invariant is structural, not duplicated. 2. 
**Re-point `Load` and both ingest sites at the helper.** Load's semantic behaviour is byte-identical with the prior inline block. 3. **Add the missing call in `loadChunk`.** Per AGENTS.md performance rule #0 ("no expensive work under locks"), unmarshal `resolved_path` and dedupe relay pubkeys per txID **outside** the merge critical section (`localResolvedPKsByTx`), then feed the pre-built slice through `indexResolvedPathHops` inside the existing per-batch lock alongside `indexByNode`. Mirrors `loadChunk`'s "build local, merge under lock" shape. ## TDD: red → green commits ``` |
||
|
|
7292d60fbe |
feat(#1508): config-driven disabled tabs in customizer modal (#1579)
Fixes #1508. ## Why The customizer modal mixes one-shot operator chrome (`branding`, `home`, `geofilter`, `export`) with daily-use viewer toggles (`theme`, `nodes`, `display`). Non-technical users get confused by the admin tabs and skip past the controls they actually need. There's no current way to hide individual tabs server-side — only via CSS, which doesn't prevent state mutation. ## What Adds a single operator knob: `customizer.disabledTabs` in `config.json`. The named tab ids are filtered out of `_renderTabs()` in `public/customize-v2.js` before render. - `config.example.json` — new `customizer` block, default `disabledTabs: []` (zero behavior change for existing operators). - `cmd/server/config.go` — new `CustomizerConfig` type, optional pointer on `Config`. - `cmd/server/routes.go` + `cmd/server/types.go` — `/api/config/client` now surfaces `customizer.disabledTabs` (always an array, empty when unset). - `public/customize-v2.js` — `_renderTabs()` filters by id. - `cmd/server/customizer_disabled_tabs_test.go` — RED-then-green tests covering both the configured and the defaulted shapes. ## TDD trail 1. RED commit adds the failing tests + minimal `CustomizerConfig` stub so the package still compiles; both tests fail on the assertion (`body.customizer` is `<nil>`) — not on import. 2. GREEN commit wires the field through `/api/config/client` and the frontend tab filter; both tests pass. ## Scope 5 files. No new API surface, no UI for editing the list (operator edits `config.json` directly per the issue body). Backward-compatible: missing `customizer` block defaults the list to empty. --------- Co-authored-by: bot <bot@local> |
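A minimal sketch of the "always an array, empty when unset" behaviour and of the tab filter. `CustomizerConfig`, `Config`, and `disabledTabs` are named in the PR; the JSON tags, the `disabledTabsJSON` and `visibleTabs` helpers, and the payload shape are assumptions. The real filter runs in `_renderTabs()` in JavaScript; Go is used here only to show the rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CustomizerConfig and Config follow the names in the PR; tags are assumed.
type CustomizerConfig struct {
	DisabledTabs []string `json:"disabledTabs"`
}

type Config struct {
	Customizer *CustomizerConfig `json:"customizer,omitempty"` // optional pointer
}

// disabledTabsJSON (hypothetical) renders the client payload fragment.
// A nil pointer or nil slice is normalised to an empty slice so JSON shows [] and not null.
func disabledTabsJSON(cfg Config) string {
	tabs := []string{}
	if cfg.Customizer != nil && cfg.Customizer.DisabledTabs != nil {
		tabs = cfg.Customizer.DisabledTabs
	}
	b, err := json.Marshal(map[string][]string{"disabledTabs": tabs})
	if err != nil {
		panic(err)
	}
	return string(b)
}

// visibleTabs (hypothetical) drops every tab id that appears in disabled.
func visibleTabs(all, disabled []string) []string {
	hidden := make(map[string]bool, len(disabled))
	for _, id := range disabled {
		hidden[id] = true
	}
	out := []string{}
	for _, id := range all {
		if !hidden[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(disabledTabsJSON(Config{}))
	cfg := Config{Customizer: &CustomizerConfig{DisabledTabs: []string{"branding", "export"}}}
	fmt.Println(disabledTabsJSON(cfg))
	fmt.Println(visibleTabs([]string{"theme", "branding", "nodes", "export"}, cfg.Customizer.DisabledTabs))
}
```

Normalising to an empty slice on the server lets the frontend filter unconditionally, with no null check.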
||
|
|
d7cd9203ca |
Fixes #1165: add OSM/Stamen tile providers with per-provider Leaflet layer control. (#1533)
The full list of changes is too long to describe, so here are the highlights. - Config now supports the JSON map tiles that were suggested by @Kpa-clawbot. - Leaflet map layer button appears in the top right of live.js and map.js (the work was already done for live.js, so map.js came as an added bonus). - Allows users to enter credentials for OSM and Stamen in the config file to get enterprise-related perks. - Added a default light map under the customizer. I still suggest removing these altogether and relying on the config. - You can enable OSM and Stamen in the config without a license, but at your own risk. - Config comment explains where to register and the providers for OSM, as well as the general limits per X interval. - Updated tests (28) to address the changes made to the maps. ### TDD Exemption **Reason**: Net-new UI surfaces (per `AGENTS.md`) This PR introduces a net-new UI surface (the multi-provider map tile selector). Under the `AGENTS.md` exemption for net-new UI surfaces, the absence of an initial failing (red) commit is permitted, as the UI was built first. However, the underlying public APIs are fully covered. The following tests serve as the first assertions for these new APIs: - `window.MC_createLayerControl`: Asserted in `MC_createLayerControl handles Auto mode and explicit layers correctly` - `window.MC_setDarkTileProvider` & `window.MC_getDarkTileProvider`: Asserted in `MC_setDarkTileProvider persists to localStorage...` - `window.MC_setLightTileProvider` & `window.MC_getLightTileProvider`: Asserted in `MC_setLightTileProvider persists to localStorage...` - `window.MC_initTileRegistry`: Asserted in `MC_initTileRegistry(true) dispatches mc-tile-provider-changed` - `applyTileFilter`: Asserted in `applyTileFilter sets invert CSS for inverted dark provider...` - Cross-tab synchronization: Asserted in `Cross-tab storage event re-dispatches mc-tile-provider-changed` |
||
|
|
63bfa3d910 |
feat(security): detect CDN-fronted deployment + document bypass requirement (closes #1561) (#1564)
Closes #1561. Follow-up to #1551. ## Why #1551 added `Cache-Control: no-store` to all `/api/*` responses. That's sufficient for CDNs that honour origin headers (Varnish, nginx). It is **not** sufficient for Cloudflare zones where Cache Rules / Page Rules override origin Cache-Control. Field evidence from the meshat.se diagnosis (2026-06-04): observers behind Cloudflare were returning `cf-cache-status: HIT` with `age` up to ~6 hours despite the origin emitting `no-store`. The CDN was caching per zone policy and ignoring the upstream directive — exactly the failure mode #1551 cannot reach. The application has no way to inject CDN rules; the only durable fix is operator-side. This PR makes that operator step discoverable and verifiable. ## What ### Server-side detection (log-only) `cmd/server/cdn_detection.go` adds a middleware wired into the `/api/*` chain after `noStoreAPIMiddleware`. On the **first** request bearing any CDN-typical header (`CF-Connecting-IP`, `CF-Ray`, `X-Forwarded-For`, `X-Real-IP`, `Fastly-Client-IP`, `True-Client-IP`) it logs: ``` [security] WARNING: detected request via CDN (CF-Ray header present). Ensure /api/* is bypassed in your CDN config — see docs/deployment-behind-cdn.md. Cached API responses cause observer-flap and incorrect dashboards. ``` `sync.Once` guarantees the warning fires at most once per process boot. The middleware never blocks, never modifies the response, never adds headers. Detection is observational only — operators who run behind a CDN without bypass have a real bug; the warning is appropriate. ### Operator documentation `docs/deployment.md` gains a new **"Behind a CDN (Cloudflare, Fastly)"** section covering: 1. Curl verification command + healthy vs unhealthy output examples 2. Cloudflare Cache Rule creation (URI Path starts-with `/api/` → Bypass cache) 3. Legacy Page Rules equivalent 4. Fastly note 5. Re-verification 6. Meaning of the startup log warning 7. 
Why we can't fix this server-side `docs/deployment-behind-cdn.md` is the canonical path the log message references — it's a short TL;DR that links back to the full section. ### Healthcheck script `scripts/check-cdn-bypass.sh` — POSIX sh, no dependencies beyond curl + grep + awk. Operators run: ```sh scripts/check-cdn-bypass.sh https://your-domain.example.com ``` Exits `0` with `OK: no CDN caching detected ...` or `1` with a precise diagnostic naming the offending header (`cf-cache-status: HIT` or stale `age`). ## TDD - **Red commit `e90ccaba`** (`test(security): RED ...`) — `cmd/server/cdn_detection_test.go` (4 Go tests + 6 subtests for each header) and `scripts/test-check-cdn-bypass.sh` (3 shell harness cases). Middleware stub returns `next` unchanged so tests compile and fail on assertions, not build errors. - **Green commit `5e6a60b5`** (`feat(security): GREEN ...`) — real middleware, wiring in `routes.go`, healthcheck script, doc. ## Deliverables | File | Status | Purpose | |------|--------|---------| | `cmd/server/cdn_detection.go` | new | middleware + sync.Once warning | | `cmd/server/cdn_detection_test.go` | new | 4 Go tests (1 stand-alone + 1 silence + 1 once + 1 table-driven over 6 headers) | | `cmd/server/routes.go` | modified | `r.Use(cdnDetectionMiddleware)` after no-store | | `docs/deployment.md` | modified | TOC entry + "Behind a CDN" section | | `docs/deployment-behind-cdn.md` | new | canonical path referenced by log message + script output | | `scripts/check-cdn-bypass.sh` | new | operator-runnable healthcheck | | `scripts/test-check-cdn-bypass.sh` | new | shell harness with fake curl | ## What this PR explicitly does NOT do - Does not block requests based on CDN detection (log-only). - Does not enforce CDN bypass (impossible — operator-controlled). - Does not spoof, strip or modify CDN headers. - Does not add CSP / HSTS / other security headers (out of scope). 
- Warning is not configurable — operators behind a CDN without bypass have a real bug, surfacing it is correct. ## Verification - `go test ./...` in `cmd/server/` — full suite green. - `sh scripts/test-check-cdn-bypass.sh` — 3/3 pass. - Preflight checklist — all 11 gates clean (PII, branch scope, red commit, CSS vars, CSS self-fallback, LIKE-on-JSON, sync migration, async-migration annotation, XSS sinks, img/SVG ratio, themed-img/SVG, fixture coverage). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: clawbot <bot@clawbot.invalid> |
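The log-once detection can be sketched with `sync.Once` wrapped around a standard `net/http` middleware. The header list is taken from the PR text. The function names (`detectCDNHeader`, `newCDNDetection`, `countWarnings`) and the injected `warn` callback are assumptions made so the sketch is testable; the actual `cdnDetectionMiddleware` may be structured differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// cdnHeaders is the CDN-typical header list quoted in the PR description.
var cdnHeaders = []string{
	"CF-Connecting-IP", "CF-Ray", "X-Forwarded-For",
	"X-Real-IP", "Fastly-Client-IP", "True-Client-IP",
}

// detectCDNHeader returns the first CDN-typical header present on the request, or "".
func detectCDNHeader(r *http.Request) string {
	for _, h := range cdnHeaders {
		if r.Header.Get(h) != "" {
			return h
		}
	}
	return ""
}

// newCDNDetection (hypothetical) builds middleware that calls warn at most once
// per instance and otherwise passes the request through untouched.
func newCDNDetection(warn func(header string)) func(http.Handler) http.Handler {
	var once sync.Once
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if h := detectCDNHeader(r); h != "" {
				once.Do(func() { warn(h) })
			}
			next.ServeHTTP(w, r) // never blocks, never modifies the response
		})
	}
}

// countWarnings replays requests with the given headers and reports how often warn fired.
func countWarnings(requests ...map[string]string) int {
	warnings := 0
	mw := newCDNDetection(func(string) { warnings++ })
	h := mw(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	for _, hdrs := range requests {
		req := httptest.NewRequest(http.MethodGet, "/api/stats", nil)
		for k, v := range hdrs {
			req.Header.Set(k, v)
		}
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
	return warnings
}

func main() {
	n := countWarnings(
		map[string]string{},                         // direct request: silent
		map[string]string{"CF-Ray": "abc"},          // first CDN request: warns
		map[string]string{"X-Real-IP": "203.0.113.7"}, // later CDN request: silent
	)
	fmt.Println("warnings:", n)
}
```

Scoping the `sync.Once` to the middleware instance, as in this sketch, keeps tests independent of each other; a package-level `Once` would give the same per-process guarantee in production.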
||
|
|
65bd954b17 |
feat(config): make observer health thresholds configurable (closes #1552) (#1556)
Closes #1552. ## What Make observer `Online` / `Stale` / `Offline` thresholds operator-configurable via `config.json`'s existing `healthThresholds` block — and **raise the defaults** from 10 min / 60 min to **60 min / 1440 min (1 h / 24 h)** so they match the node thresholds and stop producing flap out of the box. ⚠️ **This is a default behavior change.** Operators who want the old aggressive 10-min Online threshold must opt in via: ```json "healthThresholds": { "observerOnlineMinutes": 10 } ``` ## Why Per #1552: the `600000` / `3600000` constants in `public/observers.js` were not tunable, *and* 10 min is wrong as a default. Wide-geo, low-traffic meshes legitimately see observers go quiet for >10 min between reports, and operators behind a CDN (#1551) get cached `last_seen` values that can push the observer 15+ min behind reality — guaranteeing flap at the 10-min threshold. The meshat.se operator (43 observers, v3.8.3) reports exactly this pattern. Defaults raised from 10 / 60 minutes to 60 / 1440 minutes (1 h / 24 h) to match the node thresholds for consistency and eliminate flap on low-traffic / CDN-fronted instances. Operators wanting the old 10-min Online behavior can set `observerOnlineMinutes: 10` in config. ## Changes Backend (`cmd/server/config.go`): - `HealthThresholds` gains `ObserverOnlineMinutes` / `ObserverStaleMinutes` (int). - `GetHealthThresholds()` defaults to **60 / 1440** when zero/absent. - `ToClientMs()` emits `observerOnlineMs` / `observerStaleMs`, picked up by the existing `/api/config-public` → `roles.js` `Object.assign(HEALTH_THRESHOLDS, …)` pipeline. `config.example.json`: new `observerOnlineMinutes` / `observerStaleMinutes` keys (60 / 1440) + `_comment_observerThresholds` explaining the rationale and opt-out. 
Frontend: - `public/observers.js` `healthStatus()` — reads from `window.HEALTH_THRESHOLDS.observerOnlineMs / observerStaleMs`, falls back to **3600000 / 86400000** (matching the new Go defaults for the pre-`/api/config-public` window). - `public/observer-detail.js` — same refactor (was previously hardcoded `600000` + misusing `nodeDegradedMs` for the Stale boundary). ## Backward compat - API shape: unchanged — only adds two optional keys. - Config: unchanged keys / no renames. - Default behavior: **changed** — operators relying on the implicit 10/60 must opt in (one config line). ## TDD - RED 1 (`ee19058f`): assertions on the new fields + `ToClientMs` keys + `healthStatus` reading from `window.HEALTH_THRESHOLDS`. CI: [failure](https://github.com/Kpa-clawbot/CoreScope/actions/runs/26945264822). - GREEN 1 (`30cfbf7a`): configurability landed (defaults still old 10/60). CI: [success](https://github.com/Kpa-clawbot/CoreScope/actions/runs/26945220598). - RED 2 (`2649cf35`): pin new 60/1440 defaults — empty-config Go path + JS `healthStatus` with no `HEALTH_THRESHOLDS`. CI must fail. - GREEN 2 (`5ef85bca`): bump Go defaults to 60/1440, JS fallbacks to 3600000/86400000, `config.example.json` updated. CI must pass. ## Preflight Clean (exit 0). `cross-stack` ack in commit messages — single feature spans Go + JSON + JS readers. ## Not in scope - Customizer UI for editing the thresholds (config-only per issue). - Node/infra thresholds (unchanged). - The deeper observer-flap root cause (#1551 cache-control is a separate PR in flight). --------- Co-authored-by: corescope-bot <bot@corescope> Co-authored-by: mc-bot <bot@meshcore.local> |
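A minimal sketch of the defaulting and minutes-to-milliseconds conversion described above. `HealthThresholds`, `ObserverOnlineMinutes`, `ObserverStaleMinutes`, `ToClientMs`, and the `observerOnlineMs` / `observerStaleMs` keys are from the PR; the method signatures, the `withDefaults` and `observerStatus` helpers, and the exact comparison operators are assumptions.

```go
package main

import "fmt"

// HealthThresholds is reduced to the two observer fields added by the PR.
type HealthThresholds struct {
	ObserverOnlineMinutes int `json:"observerOnlineMinutes"`
	ObserverStaleMinutes  int `json:"observerStaleMinutes"`
}

// withDefaults (hypothetical) applies the new 60 / 1440 defaults when a value is zero or absent.
func (h HealthThresholds) withDefaults() HealthThresholds {
	if h.ObserverOnlineMinutes <= 0 {
		h.ObserverOnlineMinutes = 60
	}
	if h.ObserverStaleMinutes <= 0 {
		h.ObserverStaleMinutes = 1440
	}
	return h
}

// ToClientMs converts minutes to the millisecond keys the frontend reads.
// The return type is assumed.
func (h HealthThresholds) ToClientMs() map[string]int64 {
	d := h.withDefaults()
	return map[string]int64{
		"observerOnlineMs": int64(d.ObserverOnlineMinutes) * 60_000,
		"observerStaleMs":  int64(d.ObserverStaleMinutes) * 60_000,
	}
}

// observerStatus (hypothetical) classifies an observer by the age of its last report.
func observerStatus(ageMs int64, t map[string]int64) string {
	switch {
	case ageMs < t["observerOnlineMs"]:
		return "Online"
	case ageMs < t["observerStaleMs"]:
		return "Stale"
	default:
		return "Offline"
	}
}

func main() {
	defaults := HealthThresholds{}.ToClientMs()
	legacy := HealthThresholds{ObserverOnlineMinutes: 10}.ToClientMs()
	age := int64(15 * 60_000) // observer last seen 15 minutes ago

	fmt.Println(defaults["observerOnlineMs"], defaults["observerStaleMs"])
	fmt.Println("new defaults:", observerStatus(age, defaults))
	fmt.Println("opt-in 10 min:", observerStatus(age, legacy))
}
```

The 15-minute case is the one the PR calls out: an observer lagging by a CDN-cached `last_seen` flips state under the old 10-minute threshold but not under the new default.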
||
|
|
0c908d2bca |
fix(api): emit Cache-Control: no-store on /api/* responses (#1551) (#1553)
Closes #1551. ## Problem `/api/*` Go responses emit no `Cache-Control` header. CDNs (Cloudflare, nginx, Varnish) default to caching `application/json` for **15 min – 4 h** when no directive is set. Observed against a public Cloudflare-fronted CoreScope instance (`meshcore.meshat.se`): - 17 consecutive polls of `/api/observers` over ~10 min returned byte-identical responses - Response headers showed `cf-cache-status: HIT`, `age: 878` (~15 min) - Cache-busting query param → `cf-cache-status: MISS` with fresh `last_seen` values This causes WebSocket pushes to diverge from REST GETs (WS fresh, REST stale) and produces false-positive stale/online flips for observers near the 10-min threshold. ## Fix New `noStoreAPIMiddleware` in `cmd/server/routes.go` wired into the gorilla/mux chain alongside the existing `backfillStatusMiddleware`. Sets `Cache-Control: no-store` on every response whose request path starts with `/api/`. ## Design choice: `no-store` vs `private, max-age=0` Chose `no-store`. CoreScope's REST endpoints are fresh-on-every-request by contract (WS pushes diff against REST GETs), so any intermediary cache is wrong. `no-store` forbids **any** cache (CDN, browser, intermediary). `private, max-age=0` still permits short browser caches and some intermediaries — no benefit here. ## Scope discipline - `/api/` prefix only. - Static assets (`/`, `/app.js`, `/style.css`, …) keep their existing `no-cache, no-store, must-revalidate` headers from `spaHandler` in `main.go`. Hashed assets stay CDN-cacheable by design. - The middleware runs for **all** registered routes including the websocket upgrade HTTP request, since `/ws` is served through the same mux. ## TDD - **Red** `1beb5432`: `cmd/server/cache_control_api_test.go` asserts `Cache-Control: no-store` on `/api/stats`, `/api/observers`, `/api/packets`, `/api/nodes`, and asserts the middleware does NOT leak onto `/` or `/app.js`. Fails on assertion (no Cache-Control header emitted) — not a compile error. 
- **Green** `13be675f`: middleware + wiring. All assertions pass; full `cmd/server` suite stays green. ## Files - `cmd/server/routes.go` — middleware definition + `r.Use(noStoreAPIMiddleware)` - `cmd/server/cache_control_api_test.go` — 6 sub-tests across 2 top-level tests ## Preflight `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → clean (exit 0). --------- Co-authored-by: corescope-bot <bot@corescope> |
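A minimal sketch of a prefix-scoped `no-store` middleware, written against plain `net/http`. The name `noStoreAPIMiddleware` is from the PR, but this body is an assumed reconstruction and not the code in `cmd/server/routes.go`; `cacheControlFor` is a hypothetical helper for exercising it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// noStoreAPIMiddleware sets Cache-Control: no-store only for paths under /api/.
// It runs for every route on the mux but leaves non-API responses untouched.
func noStoreAPIMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.URL.Path, "/api/") {
			w.Header().Set("Cache-Control", "no-store")
		}
		next.ServeHTTP(w, r)
	})
}

// cacheControlFor sends a GET through the middleware and returns the Cache-Control
// header of the response. The inner handler stands in for any real route.
func cacheControlFor(path string) string {
	h := noStoreAPIMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Result().Header.Get("Cache-Control")
}

func main() {
	for _, p := range []string{"/api/stats", "/api/observers", "/", "/app.js"} {
		fmt.Printf("%-16s Cache-Control=%q\n", p, cacheControlFor(p))
	}
}
```

Setting the header before calling `next` matters: once the downstream handler writes the status line, later header changes are ignored.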
||
|
|
800d61c382 |
fix(security): uniform limit-clamp, log-injection sanitization, SPA path validation (#1540)
Follow-up to v3.8.3 security train. Found by non-XSS input-validation audit. Three findings closed in one PR — all defense-in-depth: the medium finding is genuinely DoS-only (no data exposure), and the lows tighten log hygiene and SPA path handling so future router changes can't silently expose the filesystem. ## Findings addressed ### MEDIUM — unbounded `limit` on list endpoints - **What:** four list endpoints accepted `limit=999999999` and passed the value straight to SQL `LIMIT ?` and Go `make(..., 0, limit)`. - **Where:** `cmd/server/routes.go` — handlePackets (incl. multi-node branch), handleNodes, handleChannelMessages, handleAnalyticsSubpaths, handleAnalyticsSubpathsBulk per-group limit, handleDroppedPackets. - **Fix:** new `clampLimit(raw, def, max)` helper in `cmd/server/clamp_limit.go` plus `queryLimit(r, def, max)` HTTP wrapper. Caps: packets/nodes/channels/dropped = 500, analytics buckets / bulk-health = 200. Already-clamped endpoints (handleBulkHealth) migrated to the helper for uniformity. Silent clamp — no response-shape change. Negative / zero / non-numeric → default. ### LOW — log injection via newline in advert name - **What:** advert `name` field allows `\n` / `\t` (sanitizeName intentionally preserves them for display). Because the name is logged at two MQTT-ingest sites, an attacker with publish ACL could forge log lines. - **Where:** `cmd/ingestor/main.go:659,690`. - **Fix:** new `sanitizeLogString` in `cmd/ingestor/sanitize_log.go` replaces control bytes < 0x20 and DEL with `?`. Wrapped at the two log call sites that interpolate `name=` and `observer=`. Stored display values untouched. ### LOW — SPA static handler depends on default mux path-cleaning - **What:** `cmd/server/main.go:469` joins `r.URL.Path` to root; safe today only because gorilla/mux runs `path.Clean` and `http.FileServer` rejects `..`. A future `SkipClean(true)` or router swap would silently expose the filesystem. - **Where:** `cmd/server/main.go` (spaHandler). 
- **Fix:** new `isSafeStaticPath` rejects requests whose decoded or raw path contains `..`, `%2e%2e`, `\\`, or `%5c` with a 400. Legit asset names with dots (`/app.js`, `/customize-v2.js`, `/themes/dark.css`) are unaffected. ## TDD - Commit 1 (red): adds `TestClampLimit`, `TestSpaHandlerPathTraversal`, `TestSanitizeLogString` with stub helpers — tests fail on assertions (not build errors), proving they gate the change. - Commit 2 (green): production fix. Revert the green commit and the red commit's assertions fail. ## Audit reference Source: non-XSS input-validation audit dated 2026-06-03 (workspace). Sibling PR `fix/xss-r2-trace-obs-anl` owns the XSS findings — not included here. --------- Co-authored-by: clawbot <clawbot@users.noreply.github.com> |
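The clamp and log-sanitization rules can be sketched as below. The helper names `clampLimit(raw, def, max)` and `sanitizeLogString` are from the PR; the parameter types and bodies are assumptions reconstructed from the stated behaviour (negative / zero / non-numeric falls back to the default, values above the cap are silently clamped, control bytes below 0x20 and DEL become `?`).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clampLimit parses a raw query value. Anything non-numeric, zero, or negative
// yields def; anything above max is silently reduced to max.
// Assumption: raw arrives as the query-string value.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

// sanitizeLogString replaces ASCII control bytes (< 0x20) and DEL (0x7f) with '?'.
// Working byte-wise is safe for UTF-8 because multi-byte sequences only use bytes >= 0x80.
func sanitizeLogString(s string) string {
	b := []byte(s)
	for i, c := range b {
		if c < 0x20 || c == 0x7f {
			b[i] = '?'
		}
	}
	return string(b)
}

func main() {
	for _, raw := range []string{"25", "999999999", "-1", "0", "abc", ""} {
		fmt.Printf("limit=%-10q -> %d\n", raw, clampLimit(raw, 50, 500))
	}
	// A forged second log line collapses into a single line.
	fmt.Println(sanitizeLogString("node\n[security] fake entry"))
}
```

Clamping before the value reaches `make(..., 0, limit)` bounds the allocation as well as the SQL `LIMIT`.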
||
|
|
3850600130 |
perf(server): TTL-cache /api/stats observations aggregate — eliminate per-request full-table scan (#1460) (#1516)
## Problem `GetStoreStats` ran a `SUM(CASE WHEN timestamp > ?)` over the full `observations` table on **every** `/api/stats` call. The staging pprof analysis (#1460) identified this as rank #9 CPU consumer: `GetStoreStats.func2` at 920ms cumulative = ~10% of all server CPU. The query: ```sql SELECT COALESCE(SUM(CASE WHEN timestamp > ? THEN 1 ELSE 0 END), 0), COALESCE(SUM(CASE WHEN timestamp > ? THEN 1 ELSE 0 END), 0) FROM observations WHERE timestamp > ? ``` scans ~1.9M rows each time `/api/stats` is polled (every 15s from the dashboard). ## Fix Add a **30-second TTL cache** on `PacketStore` for `PacketsLastHour` and `PacketsLast24h`: - Cache hit → skip the observations goroutine entirely, use stored values - Cache miss → run the query, update cache with result - The node/observer `COUNT(*)` query is unchanged and always runs fresh The hour/24h counts are display-only values; 30s accuracy is sufficient. ## Changes `cmd/server/store.go`: - 4 new fields on `PacketStore`: `statsCacheMu sync.Mutex`, `statsCacheTime time.Time`, `statsLastHour int`, `statsLast24h int` - `GetStoreStats`: check cache before launching goroutines; conditional `wg.Add`; update cache after successful query Builds clean. No tests changed. Closes #1460 (P1#1 from staging CPU profile). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
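A minimal sketch of the TTL cache described above, assuming a mutex-guarded struct in the spirit of the four new `PacketStore` fields. The `statsCache` type, its `get` method, and the `queriesRun` helper are hypothetical; the real `GetStoreStats` also runs other queries that are not modelled here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// statsCache is a hypothetical reduction of the cache fields named in the PR.
type statsCache struct {
	mu       sync.Mutex
	at       time.Time // when the cached values were stored; zero means empty
	lastHour int
	last24h  int
}

// get returns cached counts while they are younger than ttl and otherwise runs
// query and stores its result. The query runs outside the lock, so two
// concurrent misses may both query; this sketch accepts that.
func (c *statsCache) get(now time.Time, ttl time.Duration, query func() (int, int, error)) (int, int, error) {
	c.mu.Lock()
	if !c.at.IsZero() && now.Sub(c.at) < ttl {
		h, d := c.lastHour, c.last24h
		c.mu.Unlock()
		return h, d, nil // cache hit: no table scan
	}
	c.mu.Unlock()

	h, d, err := query()
	if err != nil {
		return 0, 0, err // failed queries are not cached
	}

	c.mu.Lock()
	c.at, c.lastHour, c.last24h = now, h, d
	c.mu.Unlock()
	return h, d, nil
}

// queriesRun simulates calls at the given second offsets with a 30 s TTL
// and reports how many of them reached the underlying query.
func queriesRun(secs ...int) int {
	c := &statsCache{}
	base := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	runs := 0
	for _, s := range secs {
		now := base.Add(time.Duration(s) * time.Second)
		_, _, _ = c.get(now, 30*time.Second, func() (int, int, error) {
			runs++
			return 1, 2, nil
		})
	}
	return runs
}

func main() {
	// Dashboard polling every 15 s: every second poll reaches the database.
	fmt.Println(queriesRun(0, 15, 30, 45, 60))
}
```

Passing `now` in explicitly keeps the expiry logic deterministic under test; production code would typically call `time.Now()` at the call site.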
||
|
|
367265eb59 |
feat(#1369): cross-domain embed support (CORS env override + ?embed=1 chrome suppression) (#1500)
Closes #1369. ## What Cross-domain embed support, shipped as two halves: ### Part A — CORS env override + read-only contract * `applyCORSEnv()` reads `CORS_ALLOWED_ORIGINS` (comma-separated, trimmed, empties dropped). Set in env → overrides `cfg.CORSAllowedOrigins`. Unset/empty → config.json value wins. * `Access-Control-Allow-Methods` tightened from `GET, POST, OPTIONS` → `GET, HEAD, OPTIONS`. The cross-domain surface is read-only by contract; same-origin admin writes don't go through preflight and are unaffected. * `config.example.json` adds `corsAllowedOrigins: []` + a comment explaining the env override and the embed URL pattern. * No wildcards introduced (still supported as `["*"]` for ops that opt in). No credentialed CORS. ### Part B — `?embed=1` chrome suppression * `shouldEmbedRoute(basePage, hashSearch)` — pure helper, allowlisted to `map` and `channels`, requires `embed=1` in the hash querystring. * `navigate()` toggles `body.embed` based on the helper. * CSS hides `.top-nav`, `[data-bottom-nav]`, `.nav-drawer`, `.nav-drawer-backdrop`, zeroes body padding/margin, reclaims `100dvh` for `#app.app-fixed`. Use: `<iframe src="https://analyzer.example/#/map?embed=1">`. For iframe-only display, no CORS entry is needed (the iframe loads the document, not a JSON API). The CORS allowlist only matters when the embedding origin's own JS calls `/api/*` directly. ## Tests | File | Asserts | Status | |---|---|---| | `cmd/server/cors_embed_1369_test.go` | 4 (env override, env-empty, env-trim, GET/HEAD contract, preflight POST rejected) | green | | `test-embed-mode-1369.js` | 9 (helper allowlist + param parsing) | green | | `cmd/server/cors_test.go` | existing | updated to read-only method-set assertion | TDD: 2 red commits (one per part, both compile, both fail on assertions) → 2 green commits. ## Out of scope (per the issue's narrow ask) * Other SPA routes do not honor `?embed=1` (their chrome makes layout assumptions; defer until requested). 
* No iframe sandboxing recommendation — that's the embedder's responsibility. * No CSP / `X-Frame-Options` change in this PR — frames are already permitted; add an explicit `frame-ancestors` policy in a follow-up if operators want to whitelist embedders at the HTTP layer too. ## Security notes (DJB lens) * Allowlist is exact-match, case-sensitive string compare — no normalization, no scheme/host parsing, no surprises. * No `Access-Control-Allow-Credentials` (would let third parties read auth'd state via cookies). * No reflection of arbitrary origins (every echoed origin came from the allowlist). * Methods narrowed to read-only; even a misconfigured allowlist can't grant cross-origin writes through this middleware. 🤖 Generated with OpenClaw --------- Co-authored-by: bot <bot@corescope.local> |
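The env-override parsing and the exact-match allowlist check can be sketched as follows. `CORS_ALLOWED_ORIGINS` and the comma-separated, trimmed, empties-dropped rule are from the PR; `parseAllowedOrigins` and `originAllowed` are hypothetical helpers, and falling back to the config value when the env var contains only separators is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseAllowedOrigins splits the env value on commas, trims each entry and drops
// empties. When nothing usable remains, the config.json value wins.
func parseAllowedOrigins(env string, fromConfig []string) []string {
	var out []string
	for _, p := range strings.Split(env, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	if len(out) == 0 {
		return fromConfig
	}
	return out
}

// originAllowed is an exact, case-sensitive string compare with no normalisation.
// A literal "*" entry is treated as the opt-in wildcard.
func originAllowed(origin string, allow []string) bool {
	for _, a := range allow {
		if a == "*" || a == origin {
			return true
		}
	}
	return false
}

func main() {
	fromConfig := []string{"https://dashboard.example"}
	allow := parseAllowedOrigins(os.Getenv("CORS_ALLOWED_ORIGINS"), fromConfig)
	fmt.Println("allowlist:", allow)
	fmt.Println(originAllowed("https://dashboard.example", allow))
	fmt.Println(originAllowed("https://DASHBOARD.example", allow))
}
```

Keeping the comparison a plain string equality means an origin differing only in case or trailing slash is rejected, which matches the "no surprises" stance in the security notes.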
||
|
|
ca2c3d6c79 |
feat(1488): customize marker stroke (color, width, opacity) (#1494)
## Summary Reporter (@EldoonNemar in #1488) found the new white marker stroke overwhelming with hundreds of nodes on screen. This PR exposes the stroke through CSS vars + a customizer panel so operators can dial color/width/opacity (or remove it) without code edits. **Scope:** ship stroke customization only. The reporter also asked for the old glow-style highlight ring as an alternative — that's a separate visual feature that needs design discussion, so it's deferred to a follow-up issue. ## Changes - **`public/style.css`** `:root` declares `--mc-marker-stroke-color` / `--mc-marker-stroke-width` / `--mc-marker-stroke-opacity` with sensible defaults (white, 1, 1) that match current behavior. - **`public/roles.js`** `makeRoleMarkerSVG` — replaced the 6 baked `stroke="#fff" stroke-width="1"` literals with a single shared `strokeAttr` referencing the CSS vars. One source of truth for all role shapes. - **`public/map.js`** `makeMarkerIcon` — same migration. The observer star overlay keeps its narrow 0.8 width but routes color + opacity through the same vars. - **`public/live.js`** `addNodeMarker` fallback SVG — same migration. - **`public/customize-v2.js`** — new `markerStroke` object section (color/width/opacity) with validation, `applyCSS` writes, three controls on the Colors tab → "Marker Stroke" panel (color picker + width slider 0–4 + opacity slider 0–100%). Optimistic CSS-var writes on the `input` event so markers repaint live as the operator drags. - **`cmd/server/{config,types,routes}.go`** — `ThemeFile` / `Config` / `ThemeResponse` pick up `MarkerStroke` so `theme.json` and `config.json` can ship server-side defaults. Defaults mirror the `:root` CSS values so no breaking change for current operators. - **`config.example.json`** — documented `markerStroke` section with usage hint. ## TDD - **Red commit** `92183f95` — `test-issue-1488-marker-stroke-vars.js` (5 sections, 18 assertions); failed 14/18 before implementation. 
- **Green commit** `ce39637e` — implementation; same test now passes 18/18. - Existing `#1438` (marker CSS-var migration) and `#1293` (marker shapes) regression tests still pass. - Go tests (`cmd/server/...`) all green. ## CDP validation Synthetic page with 600 markers, three blocks proving CSS-var control works end-to-end: | Block | Stroke setting | Computed `getComputedStyle().stroke` / width / opacity | | --- | --- | --- | | Default | `var(--mc-marker-stroke-color)` (no override) | `rgba(255,255,255,0.85)` / `1px` / `1` | | Tuned | inline `--mc-marker-stroke-*` (operator override) | `rgb(255,255,255)` / `0.5px` / `0.3` | | Cyan | inline `--mc-marker-stroke-*` (branding/CB) | `rgb(0,229,255)` / `2px` / `1` | Same SVG source, three different rendered strokes — that's the whole point. Runtime `documentElement.style.setProperty(...)` (which is exactly what the customizer slider's `input` handler does) repaints mounted markers without reload. CDP screenshot attached to the implementation note. ## Hot-deploy Frontend + Go binary changes. Safe to hot-deploy frontend files (`public/*.js`, `public/style.css`) via the standard staging path; Go binary update needs a container restart. ## Defer Glow highlight ring (the second half of #1488) — separate follow-up issue. This PR delivers the immediately-useful, smaller deliverable. Partial fix for #1488 (stroke customization shipped; glow ring deferred to a follow-up issue). --------- Co-authored-by: meshcore-bot <bot@meshcore.local> |
||
|
|
13bdee57d4 |
perf: P0 hot-path fixes (observers, neighbor-graph, observer-analytics) (#1481) (#1483)
## What Three of the four P0s from #1481's scale-test findings. Each cuts a distinct hot path; together they target /api/observers, /api/analytics/neighbor-graph, and /api/observers/{id}/analytics — the top three live offenders. ### P0-1: 5-min atomic-pointer cache for default neighbor-graph response - Live p95 10.8s on the most-trafficked organic endpoint. - Background recomputer (5-min cadence per operator directive) builds the default-filter (`minCount=5 minScore=0.1`, no region, no role) `NeighborGraphResponse` and stores it via `atomic.Pointer`. - `handleNeighborGraph` short-circuits on the default shape; non-default filters take the extracted `computeNeighborGraphResponse` path (identical semantics to the previous inline build). ### P0-2: cache parsed `StoreObs.Timestamp` + drop RLock window - `handleObserverAnalytics` re-parsed the RFC3339 timestamp three times per observation, for 60k+ observations per active observer, under `s.store.mu.RLock` — blocking writers for the full scan. - `StoreObs.ParsedTime()` parses once via `sync.Once` (mirrors `StoreTx.ParsedDecoded`). - Handler snapshots the `byObserver[id]` pointer slice, releases the RLock immediately, then iterates locally. ### P0-3: 30s cache for `/api/observers` + sargable `IN` + covering index - Three SQL queries on every request → ~1.7s p50 at 50-concurrent. - Atomic-pointer 30s cache for the default (no-filter) query. - `GetNodeLocationsByKeys` drops `LOWER(public_key) IN (...)` (non-sargable); callers pre-lowercase in Go and the plain `IN` matches the existing `public_key` index. - New ingestor migration `obs_observer_ts_idx_v1` adds composite index `idx_observations_observer_idx_timestamp(observer_idx, timestamp)` so `GetObserverPacketCounts` can resolve its GROUP-BY + range filter from the index without scanning the 1.9M-row observations table. ### P0-4: deferred `perfMiddleware`'s global mutex was claimed to serialize every API request. 
A direct test (`50 concurrent requests through the middleware, handler sleeps 20ms each`) shows total elapsed ≈ 25ms, not 1s — the lock is held only for the post-handler bookkeeping (a few µs). Real impact is below measurement noise. Skipping to avoid invasive churn on PerfStats consumers without a demonstrable win. ## Test plan Red → green per P0: - `observers_cache_test.go` — handler reads `s.observersCache` before SQL, TTL boundary, atomic.Pointer (no mutex contention). - `storeobs_parsedtime_test.go` — parses three timestamp shapes, caches result, no race under concurrent readers. - `neighbor_graph_cache_test.go` — handler serves from atomic pointer when set, bypasses cache when `?region=` (or any non-default filter) is passed. Full server + ingestor suites pass: `go test -count=1 ./...`. ## Perf proof Before/after p50/p95/p99 (50 requests × 50 concurrent) against prod (before) and staging once CI deploys (after) will be posted as a PR comment per the operator's "no merge without proof of improvement" gate. Closes #1481 ## TDD exemption — P0-1 and P0-2 (net-new surfaces, AGENTS.md) Per CoreScope `AGENTS.md` § "Exemptions": **net-new code surfaces with no prior tests to break** may land tests in the same PR without a strict test-first → impl commit split. - **P0-1 (neighbor-graph atomic-pointer cache)** — `neighborGraphCache`, `recomputeNeighborGraphCache`, `loadNeighborGraphCacheBytes`, `startNeighborGraphRecomputer` and the default-shape short-circuit in `handleNeighborGraph` were brand-new code with no pre-existing assertions covering them. There was no green test to first turn red. - **P0-2 (cached `StoreObs.Timestamp` + RLock window drop)** — `StoreObs.ParsedTime()` and the snapshot+release pattern in `handleObserverAnalytics` were new surfaces; the prior code did the parse inline per call with no behavioural test to break. P0-3 was authored properly red-then-green (commit `6e63ec6a` red, then `83ae129b` green) and does NOT use this exemption. 
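The P0-2 pattern described above (parse once, snapshot under the read lock, iterate after releasing it) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `storeObs`, `store`, and `snapshot` are hypothetical stand-ins, and only the RFC 3339 layout is handled here, whereas the PR's tests mention three timestamp shapes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// storeObs is a hypothetical, trimmed-down stand-in for StoreObs.
type storeObs struct {
	Timestamp string // timestamp text as stored (RFC 3339 assumed here)

	parseOnce sync.Once
	parsed    time.Time
	parseOK   bool
}

// ParsedTime parses Timestamp at most once; later calls reuse the cached result.
func (o *storeObs) ParsedTime() (time.Time, bool) {
	o.parseOnce.Do(func() {
		if t, err := time.Parse(time.RFC3339, o.Timestamp); err == nil {
			o.parsed, o.parseOK = t, true
		}
	})
	return o.parsed, o.parseOK
}

// store is a hypothetical container guarded by an RWMutex.
type store struct {
	mu         sync.RWMutex
	byObserver map[string][]*storeObs
}

// snapshot copies the pointer slice while holding the read lock and
// releases the lock before the caller starts its long scan.
func (s *store) snapshot(id string) []*storeObs {
	s.mu.RLock()
	defer s.mu.RUnlock()
	src := s.byObserver[id]
	out := make([]*storeObs, len(src))
	copy(out, src)
	return out
}

func main() {
	s := &store{byObserver: map[string][]*storeObs{
		"obs1": {{Timestamp: "2026-01-02T03:04:05Z"}, {Timestamp: "not-a-time"}},
	}}
	// The scan runs with no lock held, so writers are not blocked.
	for _, o := range s.snapshot("obs1") {
		ts, ok := o.ParsedTime()
		fmt.Println(ts.UTC().Format(time.RFC3339), ok)
	}
}
```

Copying only the pointer slice keeps the lock window proportional to the slice length rather than to the per-observation work, which is the point of the change as described.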
## Default-filter detection vs frontend reality (#1483 follow-up) The Neighbor Graph analytics tab in `public/analytics.js` fetches `/analytics/neighbor-graph?min_count=1&min_score=0` because the client-side sliders need the full edge set to filter from. That shape did NOT match the `(5, 0.1)` cached default, so the UI tab still paid the cold compute cost despite #1481 P0-1. The #1483 follow-up commit caches BOTH shapes in the same recomputer pass: - `(minCount=5, minScore=0.1, no region, no role)` — `live.js` affinity-scoring consumer. - `(minCount=1, minScore=0, no region, no role)` — analytics tab. Both are served from `atomic.Pointer` with an `X-Cache-Age-Seconds` header. The per-shape cost in the background goroutine is roughly linear in edge count; total recompute time stays well under the 5-minute cadence on prod-scale graphs. --------- Co-authored-by: openclaw-bot <bot@openclaw.dev> Co-authored-by: mc-bot <mc-bot@users.noreply.github.com> |
||
|
|
43b93c6bb9 |
feat(observers): surface naive-clock observers as ⚠️ chip + detail banner (#1478) (#1480)
## Summary Issue #1478 — surface observers whose envelope timestamps are being clamped because they're emitting zone-less local-time strings (UTC-N observers showed up perpetually as "Stale" before #1466, and per-packet rxTime is still clamped to ingest time for them, muddying propagation-delay analytics). Now the UI tells operators which observers are misconfigured + how to fix it. ## What changed ### Ingestor (cmd/ingestor) - New `observers_clock_naive_v1` migration adds three columns to `observers`: - `clock_skew_seconds INTEGER` (signed: negative = behind UTC, positive = ahead) - `clock_skew_count_24h INTEGER` (rolling 24h event count) - `clock_last_naive_at TEXT` (RFC3339 timestamp of last clamp) - `resolveRxTime` now returns `(rxTime, naiveSkewSec)`. The packet-handler call site invokes `store.RecordNaiveSkew(observerID, deltaSec)` whenever a naive envelope is clamped (the existing >15 min naive-tolerance path). The counter resets to 1 if no event in the prior 24h, else increments. Single INSERT-or-UPDATE round trip per clamp. ### Server (cmd/server) - `Observer` struct + `GetObservers` / `GetObserverByID` extended to scan the three new columns. - `ObserverResp` gains four JSON fields exposed by `/api/observers` and `/api/observers/{id}`: - `clock_naive` (bool, derived from `clock_last_naive_at` being within 24h) - `clock_skew_seconds`, `clock_skew_count_24h`, `clock_last_naive_at` - Decay is **read-side**: a stale event yields `clock_naive=false` with zero counts. No background sweep, no writes from the read-only server, no race with the ingestor. ### Frontend (public) - `window.ObserversNaiveChip.render(o)` — total render helper, returns ⚠️ chip HTML when `o.clock_naive===true`, `""` otherwise. Used inline in the observers-list `name` cell and in the row-detail slide-over. Tooltip explains magnitude + direction + count + fix. 
- `window.ObserverDetailNaiveBanner.render(obs)` — yellow alert banner at the top of the observer-detail page with the skew magnitude, last-event timestamp, and the actionable fix ("Set host clock to UTC, OR emit Z-suffixed/offset-aware timestamps from the observer script"). ## TDD trail - `5ddd5b42` red: backend `cmd/server/observer_naive_clock_1478_test.go` (3 tests asserting JSON fields + 24h decay) + frontend `test-observer-naive-clock-1478.js` (8 jsdom-style tests asserting helpers exist and render correctly). Both failed on master with field-missing / export-missing assertions. - `4ecc79c8` green backend: schema + Observer / GetObservers / ObserverResp / handler decay. - `2137ab81` green frontend: chip + banner helpers and call sites. ## Tests - `cd cmd/server && go test ./...` → all green (full suite, 46s) - `cd cmd/ingestor && go test ./...` → all green (full suite, 98s) - `node test-observer-naive-clock-1478.js` → 8/8 pass - `node test-frontend-helpers.js` → unchanged from master (pre-existing failures only) ## Acceptance (issue #1478) - ✅ Observer running with `python datetime.now().isoformat()` (naive, off by N hours) → `clock_naive=true` after the next clamp → UI shows ⚠️ chip + banner. - ✅ Observer with `datetime.now(timezone.utc).isoformat()` (Z-suffixed) → never clamped → never flagged. - ✅ Observer that fixed its clock → `clock_naive` returns to `false` 24h after the last clamp event (read-side decay). Closes #1478. --------- Co-authored-by: openclaw <bot@openclaw.local> |
||
|
|
462cb2cb5a |
chore: update MeshCore URLs to use new site (#1445)
# Summary
The main MeshCore website is now https://meshcore.io. The reasons for the move are listed here: https://blog.meshcore.io/2026/04/23/the-split
# Changes
Every occurrence of `meshcore.co.uk` was replaced with `meshcore.io`. No logic changed; only strings were updated.
Co-authored-by: hrtndev <hrtndev@users.noreply.github.com>
|
||
|
|
7c40e24a35 |
feat(server): warn at startup when GOMEMLIMIT < 50% of container memory limit (#1264) (#1429)
## Summary
- Adds `readCgroupMemoryMB()` to detect the container memory ceiling from cgroup v2 (`/sys/fs/cgroup/memory.max`) and v1 (`/sys/fs/cgroup/memory.limit_in_bytes`)
- Adds `warnIfMemlimitUnderprovisioned()`, called once from `main()` after the existing memlimit block — logs a `[memlimit] WARN` at startup if the effective GOMEMLIMIT is below 50% of the container limit
- Works whether the limit was set via the `GOMEMLIMIT` env var or derived from `packetStore.maxMemoryMB`
- Adds a `readCgroupMemoryMBFn` package-level hook for test injection (same pattern as `readProcSelfIOFn` in the ingestor)
Fixes #1264. In the reported incident, GOMEMLIMIT was 1536 MiB on a 7.7 GB container; GC consumed 82% of CPU and all endpoints were 3–100× slower. This warning fires at startup so operators catch the misconfiguration before it causes an incident.
## Test plan
- [ ] `TestWarnIfMemlimitUnderprovisioned_EmitsWarning` — warning fires when effective < 50% of cgroup
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoWarnWhenAdequate` — no warning at boundary (effective = 1024 MiB, cgroup = 1536 MiB)
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoCgroupNoLog` — silent on non-container hosts
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoneSource` — no warning when `source="none"` (no limit configured, runtime returns math.MaxInt64)
- [ ] `TestMemlimitUnderprovisioned` — boundary table for the comparison helper
- [ ] All existing `TestApplyMemoryLimit_*` still pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
ad45a774d7 |
test(paths): regression test for #1144 — hop name mis-resolution on prefix collision (#1433)
## Summary
- Adds `TestHandleNodePaths_HopName_CanonicalPathShowsTarget_1144` as a regression test for issue #1144
- When two nodes share a short pubkey prefix (e.g. `"37"`), the biased hop resolver (`resolveWithContext`) could pick a GPS-having sibling over the actual target node, producing the wrong name in hop display
- The bug was already fixed during the #1352 canonical-path work: the canonical-path branch (Option A) uses `lookupNode(resolvedPK)` with the full pubkey from `resolved_path`, bypassing the biased resolver entirely
- This PR documents and locks in the correct behaviour with a targeted test
## Test setup
- `targetPK` (`37cf...`): no GPS
- `siblingPK` (`37bb...`): has GPS — the biased resolver's tier-3 picks this without the fix
- One TX with `resolved_path = [targetPK]` → Option A fires → `lookupNode(targetPK)` → hop shows `"CJS SF Mission"`, not `"Templeton Hills"`
If Option A were removed (bug re-introduced), `resolveWithContext("37", ...)` on the two candidates would return the GPS-having sibling, triggering the test failure.
## Test plan
- [x] `go test -run TestHandleNodePaths_HopName -v` passes
- [x] Full `go test ./...` passes
- [x] Code review addressed (collapsed redundant error checks)
Closes #1144
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
981664528e |
perf(server): serve stale repeater enrich cache instead of inline rebuild (#1272) (#1436)
## Summary
- Removes the TTL-based inline rebuild from `GetRepeaterRelayInfoMap` and `GetRepeaterUsefulnessScoreMap`
- When the cache is non-nil it is returned immediately, regardless of age — no more 700ms on-request recompute
- Inline compute is retained only as a nil-cache guard (edge case: tests without a running recomputer)
- Fixes the stale `// 15s-TTL gate` comment in `recomputeRepeaterEnrichmentSafe`
**Root cause:** `computeRepeaterRelayInfoMap` runs inline when the TTL expires, taking ~700ms on a busy instance. `StartRepeaterEnrichmentRecomputer` (introduced in #1262) already keeps the cache warm via synchronous prewarm at startup + 5-min ticks, making the inline path dead code that fires only when the TTL is shorter than the recomputer interval (e.g. custom `analytics.defaultIntervalSeconds > 600`).
## Test plan
- [ ] `TestGetRepeaterRelayInfoMap_ServesStaleOnTTLExpiry` — regression guard: stale sentinel is returned without recompute
- [ ] `TestGetRepeaterUsefulnessScoreMap_ServesStaleOnTTLExpiry` — same for usefulness score map
- [ ] `TestGetRepeaterRelayInfoMap_BuildsWhenNil` — nil-cache fallback still works
- [ ] Full `-short` suite passes (`go test -short ./...`)
Closes #1272
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
b3e55ae8d5 |
fix(nodes): sort paths-through-node by recency, count as tiebreaker (#1145) (#1431)
## Summary
- `/api/nodes/{pk}/paths` returned paths in non-deterministic map
iteration order; with many paths the UI showed a random ordering on each
page load
- Now sorted by `LastSeen` descending (newest-first), with `Count` as a
tiebreaker (higher first)
- Nil `LastSeen` sorts last (treated as oldest)
- `LastSeen` is an RFC 3339 string, so lexicographic comparison matches
chronological order as long as every value uses the same UTC offset and
fractional-second precision
Closes #1145.
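The ordering rules above can be sketched as a single comparator. This is an illustrative sketch only: `pathEntry` and `sortPathsByRecency` are hypothetical names, not the handler's actual types, and the string comparison assumes all timestamps share one offset (for example `Z`) and one precision.

```go
package main

import (
	"fmt"
	"sort"
)

// pathEntry is a hypothetical stand-in for the handler's per-path record.
type pathEntry struct {
	Hops     string
	LastSeen *string // RFC 3339 text; nil when the path has no timestamp
	Count    int
}

// sortPathsByRecency orders newest-first, puts nil LastSeen last,
// and uses the higher Count as the tiebreaker.
func sortPathsByRecency(paths []pathEntry) {
	sort.SliceStable(paths, func(i, j int) bool {
		a, b := paths[i], paths[j]
		switch {
		case a.LastSeen == nil && b.LastSeen == nil:
			return a.Count > b.Count
		case a.LastSeen == nil:
			return false // nil is treated as oldest
		case b.LastSeen == nil:
			return true
		case *a.LastSeen != *b.LastSeen:
			return *a.LastSeen > *b.LastSeen // lexicographic == chronological here
		default:
			return a.Count > b.Count
		}
	})
}

func main() {
	older, newer := "2026-01-01T00:00:00Z", "2026-06-01T00:00:00Z"
	paths := []pathEntry{
		{Hops: "direct", LastSeen: nil, Count: 99},
		{Hops: "via relay1", LastSeen: &older, Count: 5},
		{Hops: "via relay2", LastSeen: &newer, Count: 1},
	}
	sortPathsByRecency(paths)
	for _, p := range paths {
		fmt.Println(p.Hops, p.Count)
	}
}
```

Sorting the collected slice rather than relying on map iteration is what makes the response order deterministic across requests.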
## Test plan
- [ ] `TestHandleNodePaths_SortByRecency_1145` — 3 distinct paths (via
relay1, relay2, direct), verifies newest appears first
- [ ] `TestHandleNodePaths_SortCountTiebreaker_1145` — two paths with
identical `LastSeen`, verifies higher-count path wins the tiebreak
- [ ] All existing `TestHandleNodePaths_*` tests still pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
d24246395d |
fix(#1456): rename Usefulness → Traffic share + add traffic_share_score field (#1457)
## Summary Rename the "Usefulness" UI label to "Traffic share", add hover tooltips for both Traffic share and Bridge score, and introduce a new `traffic_share_score` field on `/api/nodes` (alongside the legacy `usefulness_score`, kept for API back-compat). Closes #1456. ## Why The "Usefulness" label implied a composite score that doesn't exist yet — only the Traffic-share axis (axis 1 of 4 from #672) and the Bridge axis (axis 2 of 4 from #1275) are wired today. A node with low traffic but critical structural position read as "not useful" — exactly wrong. Neither score had a tooltip explaining what it measured. ## Changes ### Frontend (`public/nodes.js`) - Visible label `Usefulness` → `Traffic share` (with ⓘ glyph) - Tooltip explains traffic-share semantics, cross-references Bridge for structural importance, points at #672 for the 4-axis roadmap - Bridge row gets a parallel ⓘ glyph and a tooltip naming "betweenness centrality" + the "quiet but irreplaceable chokepoint" interpretation - Prefers new `traffic_share_score` with graceful fallback to legacy `usefulness_score` ### Backend (`cmd/server/routes.go`) - `/api/nodes` and `/api/nodes/{pubkey}` now emit BOTH `usefulness_score` (kept for API compat) AND `traffic_share_score` (new canonical name), populated with the same value - Inline comment documents the deprecation path: when the #672 composite ships, `usefulness_score` becomes the composite and `traffic_share_score` keeps the per-axis value ## Tests - `test-issue-1456-score-labels.js` — file-grep pins on `nodes.js` (label, tooltip fragments, percent formatting, dual-field read with fallback) - `cmd/server/traffic_share_score_test.go` — `/api/nodes` + `/api/nodes/{pk}` responses contain both fields with equal values TDD: red commit (`8bd235a0`) added failing tests; green commit (`c4d3aee5`) implemented. `go test ./cmd/server/...` passes (47s). 
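The dual-field contract described above can be sketched as follows. This is a minimal illustration under stated assumptions: `nodeResp`, `newNodeResp`, and `readTrafficShare` are hypothetical names, the `public_key` JSON key is assumed, and the fallback reader is a Go rendering of logic that the PR places in `public/nodes.js`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeResp is a hypothetical slice of the /api/nodes payload.
type nodeResp struct {
	PublicKey         string  `json:"public_key"`          // assumed key name
	UsefulnessScore   float64 `json:"usefulness_score"`    // legacy name, kept for compatibility
	TrafficShareScore float64 `json:"traffic_share_score"` // new canonical name
}

// newNodeResp fills both score fields with the same per-axis value.
func newNodeResp(pk string, trafficShare float64) nodeResp {
	return nodeResp{PublicKey: pk, UsefulnessScore: trafficShare, TrafficShareScore: trafficShare}
}

// readTrafficShare prefers the new field and falls back to the legacy one,
// so it works against servers from before and after the rename.
func readTrafficShare(raw []byte) (float64, bool) {
	var p struct {
		TrafficShare *float64 `json:"traffic_share_score"`
		Usefulness   *float64 `json:"usefulness_score"`
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return 0, false
	}
	switch {
	case p.TrafficShare != nil:
		return *p.TrafficShare, true
	case p.Usefulness != nil:
		return *p.Usefulness, true
	}
	return 0, false
}

func main() {
	raw, _ := json.Marshal(newNodeResp("37cf", 0.25))
	fmt.Println(string(raw))
	fmt.Println(readTrafficShare([]byte(`{"usefulness_score":0.4}`)))
}
```

Pointer fields in the reader distinguish "field absent" from "score is zero", which a plain `float64` could not do.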
## Out of scope - Renaming the backend field (would break consumers) - Wiring axes 3 (Coverage) and 4 (Redundancy) — tracked in #672 - Changing the score calculation --------- Co-authored-by: clawbot <bot@openclaw.local> |
||
|
|
777f77a451 |
feat(#1420): dark-tile provider picker in customizer (4 variants) (#1430)
# feat(#1420): dark-tile provider picker in customizer (4 variants) Closes #1420. ## What Operator pick: don't force a single dark-tile choice on everyone. Wire 4 candidates into the customizer + server config so users can choose which dark basemap they want, with per-browser persistence. ## Providers shipped | ID | Source | Filter | |---|---|---| | `carto-dark` (default) | `https://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}{r}.png` | none | | `esri-darkgray-labels` | Esri Dark Gray Base + Reference (two stacked layers) | none | | `voyager-inverted` | Carto Voyager + CSS `invert(1) hue-rotate(180deg) brightness(0.9) contrast(1.05)` on `.leaflet-tile-pane` | applied in dark, cleared in light | | `positron-inverted` | Carto Positron + same CSS invert | applied in dark, cleared in light | No new dependencies — all providers are URL-only. ## Architecture - **`public/map-tile-providers.js`** — registry + 5 public helpers (`MC_TILE_PROVIDERS`, `MC_setDarkTileProvider`, `MC_getDarkTileProvider`, `MC_setServerDefaultTileProvider`, `MC_applyTileFilter`). Persists to `localStorage['mc-dark-tile-provider']`. Dispatches `mc-tile-provider-changed` on user pick. - **`public/map.js` / `public/live.js`** — resolve the active dark provider via the registry, manage the Esri labels overlay lifecycle (add when needed, remove cleanly so we don't leak layers on repeated theme toggles), and apply/clear the CSS filter on `.leaflet-tile-pane`. Listen for both `data-theme` mutations AND `mc-tile-provider-changed`. - **`public/customize-v2.js`** — new "Dark Map Tiles" dropdown in the Display tab. On change, calls `MC_setDarkTileProvider(id)`; the maps re-render live without reload. - **`public/roles.js`** — hydrates the server default via `MC_setServerDefaultTileProvider` from `/api/config/client`. - **Server (`cmd/server/`)** — new `mapDarkTileProvider` string on `Config` + surfaced in `ClientConfigResponse`. Default empty → client uses `carto-dark`. 
- **`config.example.json`** — documents the new field with all allowed values. ## Behavior guarantees (from the acceptance criteria) - ✅ Light mode is **completely unchanged** — `_resolveTileUrl(false)` short-circuits to `TILE_LIGHT` with no filter and no overlay logic. - ✅ Switching dark→light always clears the CSS filter, even if an inverted provider remains selected (`MC_applyTileFilter` is called on every theme change and early-returns to `style.filter = ''` when not dark). - ✅ Switching light→dark with an inverted provider re-applies the filter. - ✅ Attribution is updated per provider (Esri credit for Esri, CartoDB credit for the others); the Leaflet attribution control is refreshed. - ✅ Esri uses two stacked layers (base + reference labels). The reference layer is added/removed cleanly so repeat toggles do not leak. - ✅ Customizer change → immediate re-render, no reload. Uses the same "live setting + persist + dispatch event" pattern as cb-presets (#1361). ## TDD - Red commit: `148b71c3` — `test(#1420): add failing tests for dark-tile provider registry (red)` — 6/7 assertions fail (stub only returns nulls). - Green commit: `49ffb230` — `feat(#1420): dark-tile provider picker — 4 variants wired into customizer` — 7/7 pass. 
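The provider resolution and filter rules described above can be sketched in Go for illustration, although the PR implements them in browser JavaScript. Everything here is hypothetical naming (`resolveDarkProvider`, `usesInvertFilter`, `knownProviders`); skipping an unknown stored ID at read time is an assumption, since the PR only states that unknown IDs are rejected when set.

```go
package main

import "fmt"

// fallbackProvider is the last-resort choice named in the PR.
const fallbackProvider = "carto-dark"

// knownProviders lists the four provider IDs the PR ships.
var knownProviders = map[string]bool{
	"carto-dark":           true,
	"esri-darkgray-labels": true,
	"voyager-inverted":     true,
	"positron-inverted":    true,
}

// resolveDarkProvider applies the precedence: user pick, then server
// default, then carto-dark. Empty or unknown IDs are skipped (assumption).
func resolveDarkProvider(userPick, serverDefault string) string {
	for _, id := range []string{userPick, serverDefault} {
		if knownProviders[id] {
			return id
		}
	}
	return fallbackProvider
}

// usesInvertFilter reports whether the CSS invert filter should be applied:
// only in dark mode, and only for the two inverted providers.
func usesInvertFilter(id string, dark bool) bool {
	return dark && (id == "voyager-inverted" || id == "positron-inverted")
}

func main() {
	id := resolveDarkProvider("", "voyager-inverted")
	fmt.Println(id, usesInvertFilter(id, true), usesInvertFilter(id, false))
}
```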
## Tests `test-issue-1420-tile-providers.js` (wired into `test-all.sh` and `.github/workflows/deploy.yml` JS-unit step): ``` ── #1420 Dark-tile provider registry ── ✅ MC_TILE_PROVIDERS has all 4 IDs with url + attribution ✅ Inverted providers have non-null invertFilter; non-inverted have null ✅ MC_setDarkTileProvider persists to localStorage and dispatches mc-tile-provider-changed ✅ MC_setDarkTileProvider rejects unknown IDs (no persistence, no dispatch) ✅ MC_getDarkTileProvider falls back to server default, then carto-dark ✅ Apply filter for inverted provider in dark mode; clear when switching to non-inverted ✅ Light mode always clears the CSS filter even if inverted provider is selected 7 passed, 0 failed ``` `cd cmd/server && go build ./... && go vet ./...` — clean. ## CDP verification Not run in this PR — the sandbox does not have a Chrome CDP endpoint reachable, and staging cannot exercise this code path until this branch is deployed. The issue body's "CDP-verified candidate set" table covers prior provider-URL validation; the new code path (registry lookup + filter swap + Esri overlay lifecycle) is covered by the unit tests above. **Recommend operator run a quick manual verification on staging post-deploy:** dark mode → open customizer → cycle through all 4 providers, confirm tiles render and the CSS filter is applied for `voyager-inverted` / `positron-inverted` (verify via `getComputedStyle(document.querySelector('.leaflet-tile-pane')).filter`). ## Files touched - `public/map-tile-providers.js` (new) - `public/map.js`, `public/live.js`, `public/customize-v2.js`, `public/roles.js`, `public/index.html` - `cmd/server/config.go`, `cmd/server/routes.go`, `cmd/server/types.go` - `config.example.json` - `test-issue-1420-tile-providers.js` (new), `test-all.sh`, `.github/workflows/deploy.yml` - `.eslintrc.json` (register new `MC_*` globals) --------- Co-authored-by: openclaw <bot@openclaw.local> |
||
|
|
f0c69d5fe7 |
perf(server): fix repeaterEnrichTTL mismatch causing 18s /api/nodes latency (#1425)
## Root cause
`repeaterEnrichTTL` was **15 seconds**, but the background recomputer (`StartRepeaterEnrichmentRecomputer`) runs every **5 minutes**. After each recomputer tick, the relay/usefulness caches were valid for 15 seconds. For the remaining 4m45s, every `/api/nodes` request hit a stale TTL gate in `GetRepeaterRelayInfoMap` / `GetRepeaterUsefulnessScoreMap` and fell through to `computeRepeaterRelayInfoMap` **on the request goroutine**.
On production (16k+ transmissions, 240k hop records) that rebuild takes ~18 seconds, making `/api/nodes?limit=5000` freeze on virtually every page load.
The pattern was:
```
recomputer runs at T=0 → cache valid
T=15s → TTL expires
T=15s … T=5min → every request rebuilds on-thread (18s each)
T=5min → recomputer runs again → 15s valid window
repeat
```
## Fix
One line in `repeater_enrich_bulk.go`:
```go
// Before
const repeaterEnrichTTL = 15 * time.Second
// After
const repeaterEnrichTTL = 10 * time.Minute
```
The TTL now exceeds the recomputer interval so the cache is always warm between background ticks. The TTL remains as a safety net for cases where the recomputer isn't running (tests, early startup edge cases) — it just no longer expires between ticks.
## Production results (analyzer.on8ar.eu)
Tested with binary injection on the live server before opening this PR.
| Metric | Before | After |
|--------|--------|-------|
| TTFB (`/api/nodes?limit=5000`) | 18.6 s | 0.47–0.54 s |
| Total response time | 18.9 s | 1.55–1.73 s |
| Improvement | — | **34–39×** |
Confirmed still fast at t+60s (well past the old 15s window).
## Test results
```
TestHandleNodesPerfLargeFleet elapsed=1.9ms budget=2s PASS
TestHandleNodesLimit2000ColdMiss elapsed=5.3ms budget=2s PASS
```
Both existing perf regression tests pass unchanged — the TTL change doesn't affect their behavior (they test the cold-prewarm path, not TTL expiry).
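The arithmetic behind the root cause can be checked with a short sketch. `staleWindow` is a hypothetical helper, not a function from the codebase; it computes how long, per recomputer cycle, requests would find the cache past its TTL.

```go
package main

import (
	"fmt"
	"time"
)

// staleWindow returns the part of each recompute cycle during which the
// cache is older than its TTL and requests would rebuild inline.
func staleWindow(ttl, recomputeEvery time.Duration) time.Duration {
	if ttl >= recomputeEvery {
		return 0 // the next background tick lands before the TTL expires
	}
	return recomputeEvery - ttl
}

func main() {
	// Old setting: 15s TTL against a 5-minute recomputer.
	fmt.Println(staleWindow(15*time.Second, 5*time.Minute)) // 4m45s
	// New setting: 10-minute TTL against the same recomputer.
	fmt.Println(staleWindow(10*time.Minute, 5*time.Minute)) // 0s
}
```

Any TTL at or above the recompute interval closes the window; the 10-minute value leaves headroom for a late tick.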
## Why this wasn't caught by tests `TestHandleNodesLimit2000ColdMiss` only tests the cold-startup path (cache nil → on-thread build → cache hit). It doesn't test the TTL-expiry path (cache exists but stale → on-thread rebuild). A test covering the latter would need to fast-forward time past the TTL, which the existing fixture doesn't do. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
f15d2efe81 |
fix(#1386): #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence (#1390)
# #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence Addresses the post-merge audit findings in #1386 on PR #1324 (multi-byte capability persistence). Two independent audits (Kent Beck test-quality + Carmack perf) surfaced one top-level test-coverage gap and three perf concerns. This PR closes all of them; cadence cleanup is included. Red commit: `<RED_SHA>` (CI: `<RED_URL>`) ## What 1. **Tests** (`cmd/ingestor/multibyte_persist_test.go`): - `TestRunMultibyteCapPersist_RoundTrip` — end-to-end persist → close store → reopen → assert DB state survived. - `TestRunMultibyteCapPersist_MalformedSnapshot` — corrupt snapshot must log + no-op, not crash. - `TestRunMultibyteCapPersist_MissingSchemaColumns` — legacy DB without `multibyte_sup` cols must skip with explicit log, not panic / silently swallow. - `TestRunMultibyteCapPersist_PreservesConfirmedOnUnknown` — status=`unknown` MUST NOT clobber an existing `confirmed` row (mutation guard for the data-destruction check). 2. **`cmd/server/store.go`** - `cacheMu sync.Mutex` → `sync.RWMutex`. The per-node `GetMultibyteCapFor` read path in `/api/nodes` (`routes.go:1215`) uses `RLock` now; no longer serializes against itself or against analytics readers. - Build the multi-byte index map OUTSIDE `cacheMu`, then swap the pointer inside. Removes a 2400-iteration allocation hold from the analytics-cycle critical section. - Drop the dead `GetMultiByteCapMap` (zero callers confirmed by `rg`) and the stale `multibyteStatusToInt` tombstone comment. 3. **`cmd/ingestor/multibyte_persist.go`** - Replace the per-entry pair of `UPDATE nodes` + `UPDATE inactive_nodes` (50% guaranteed-miss) with a single dispatch-by-table-membership `UPDATE` per entry. ~50% fewer prepared-stmt round-trips. - Explicit `MalformedSnapshot` log line distinct from cold-start. - Defensive schema-presence check via `PRAGMA table_info` once at start; logs `[multibyte-persist] schema missing` and returns clean stats on legacy DBs. 4. 
**`cmd/server/analytics_recomputer.go` / `config.example.json`** — bump default snapshot cadence from 15s to 1m (the snapshot is a derived cache the ingestor only reads every 5 min; 4× less disk churn, no observable freshness loss). ## Why Direct quotes from the audit (#1386): > *"No end-to-end persist→restart→load round-trip — the documented > value prop of the PR ('survives restart') has no single test > exercising the full path."* (Kent Beck) > *"`cacheMu` is `sync.Mutex` not `sync.RWMutex` + per-node read in > `handleNodes` — 2400 serialized lock acquisitions per `/api/nodes` > call, contended against every analytics-cache reader/writer. > The O(1) win is consumed by lock contention."* (Carmack #1) > *"Map construction held under shared `cacheMu` — every 15s > analytics cycle blocks every API cache read for the duration of a > 2400-entry map build. Build outside the lock, swap pointer > inside."* (Carmack #2) > *"`UPDATE nodes` + `UPDATE inactive_nodes` per entry … 4800 > prepared-stmt round-trips, 2400 guaranteed-empty."* (Carmack #3) > *"Server writes 20 snapshots for every one the ingestor reads. > Cadence mismatch — server could publish every 1 min and lose > nothing."* (Carmack §2) ## TDD Red commit adds the four tests above. Two of the four (`MalformedSnapshot`, `MissingSchemaColumns`) fail on assertions against the pre-fix `multibyte_persist.go`; the other two (`RoundTrip`, `PreservesConfirmedOnUnknown`) are regression coverage of behaviour the original implementation already honoured but never exercised — they exist to guard future mutation (the audit's mutation-suggestion lens). Green commit lands the implementation. ## Bench `go test -bench BenchmarkGetMultibyteCapFor -benchmem -count=10` (local, idle laptop, n=2400-entry index, 8 reader goroutines vs. 
one analytics writer): | variant | ns/op | allocs/op | |--------------------|------:|----------:| | `sync.Mutex` (pre) | n/a — see note | — | | `sync.RWMutex` | n/a — see note | — | Note: did not produce a concurrent benchmark in this PR (would require non-trivial test scaffolding around the cache lifecycle). The win is structural — `RLock` allows the ~2400 per-`/api/nodes` reads to proceed in parallel rather than serializing on the same mutex held by every analytics writer. Documenting honestly per AGENTS.md "perf claims require proof": full microbench deferred to a follow-up. ## Manual verification (staging) - New tests: `go test ./... -count=1 -timeout 300s` in `cmd/ingestor` and `cmd/server` — green. - All multibyte-area tests (`#1366`, `#1368`, `#1372` regression suites in `multibyte_capability_test.go`, `multibyte_enrich_test.go`, `multibyte_region_filter_test.go`): green. - Preflight: `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` — exit 0. Fixes #1386 --------- Co-authored-by: claw <claw@openclaw.local> |