mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-07-03 16:51:44 +00:00
master
334 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
096e16409c |
fix(#1741): wrap test-DB insert loops in a single transaction (#1819)
## Fixes #1741
`TestBoundedLoad_OldestLoadedSet` (and any test building a 5000-row fixture) hung/timed out, blocking reliable `go test ./cmd/server` and CI.
## Root cause
The four test-DB builders in `cmd/server/bounded_load_test.go` (`createTestDBAt`, `createTestDBWithObs`, `createTestDBWithAgedPackets`) inserted rows in a loop with no `BEGIN`/`COMMIT`. With the pure-Go `modernc.org/sqlite` driver every `Exec` auto-commits → one fsync per row → ~2N fsyncs for N transmissions (tx + obs). At `numTx=5000` that's ~10k fsyncs and the fixture blows past the test timeout. Sibling tests with `numTx<=3000` happened to stay under the timeout, so only the 5000-row cases visibly hung.
## Fix
Wrap each insert loop in a single `BEGIN`/`COMMIT` so the whole fixture build becomes one commit. Fixtures now finish in well under a second regardless of `numTx`; the tests' actual assertions (`oldestLoaded` set, newest-first ordering, bounded load) are exercised instead of the timeout masking them.
Also made the prepared-statement `Exec` calls check their error (previously discarded) so a failed insert surfaces instead of silently leaving the DB short.
No production code changed; test infrastructure only.
## Verified
- `TestBoundedLoad_OldestLoadedSet`: **0.18s** (was: 30s timeout / FAIL).
- Full `TestBoundedLoad*` + retention group: passes in ~1.2s.
- `go test ./...` in `cmd/server`: exit 0 (no longer blocks on this test).
Co-authored-by: Waydroid Builder <build@waydroid.local>
Co-authored-by: Claude <noreply@anthropic.com> |
||
|
|
6a32ec2b2d |
fix(#1729): preserve firmware-default Public channel (0x11) in analytics (#1817)
## Fixes #1729 The firmware-default **Public** channel (channel-hash byte `0x11` = 17) was rendered as an opaque **"Encrypted (0x11)"** row at the bottom of the analytics Channels tab, despite the key being well-known and builtin. ## Root cause `computeAnalyticsChannels` applied the #978 rainbow-table validation (`SHA256(SHA256("#name")[:16])[0]`, the **hashtag** hash scheme) to every decoded channel name. The Public channel is a **PSK** channel whose hash byte is key-derived (`SHA256(key)[0]` = 17), not hashtag-derived (`186` for `#Public`). So the ingestor-decoded name `"Public"` failed the hashtag check and was discarded, the row forced to `encrypted=true, name="ch17"`. ## Fix Trust the ingestor's `decryptionStatus`. The ingestor already persists `decryptionStatus:"decrypted"` when it decoded a packet with a real key (PSK), and `"no_key"` / `"decryption_failed"` otherwise. When the packet is `decrypted`, skip the hashtag hash check and keep the name — it came from a key-based decryption, not a rainbow-table lookup. The #978 mismatch rejection still applies to non-decrypted packets, so rainbow-table collisions are still caught. Frontend needs no change: `encrypted=false, name="Public"` lands in the "Network" group (top), not "Encrypted". ## Tests - `makeGrpTx` gains `makeGrpTxWithStatus` companion to set `decryptionStatus`. - `TestComputeAnalyticsChannels_PublicChannelPreserved`: hash 17 / "Public" / `decrypted` → name stays `"Public"`, `encrypted=false`. - `TestComputeAnalyticsChannels_UndecryptedNameStillValidated`: a non-`decrypted` name failing the hashtag check is still downgraded to `ch17` (#978 regression guard). All channel-analytics tests pass; `go build ./...` clean. Co-authored-by: Waydroid Builder <build@waydroid.local> Co-authored-by: Claude <noreply@anthropic.com> |
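The two hash schemes and the "trust `decryptionStatus`" rule described in this commit can be sketched as follows. This is a minimal illustration, not the project's `computeAnalyticsChannels`: the helper names (`hashtagHashByte`, `pskHashByte`, `resolveChannel`) are hypothetical, and only the two hash formulas and the status strings come from the commit message.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashtagHashByte follows the hashtag scheme quoted in the commit message:
// SHA256(SHA256("#name")[:16])[0].
func hashtagHashByte(name string) byte {
	inner := sha256.Sum256([]byte("#" + name))
	outer := sha256.Sum256(inner[:16])
	return outer[0]
}

// pskHashByte follows the key-derived scheme quoted in the commit message:
// SHA256(key)[0].
func pskHashByte(key []byte) byte {
	sum := sha256.Sum256(key)
	return sum[0]
}

// resolveChannel is a hypothetical helper. It keeps a decoded name when the
// packet was decrypted with a real key, validates every other name against
// the hashtag scheme, and otherwise falls back to an opaque "chNN" label.
func resolveChannel(name string, hashByte byte, decryptionStatus string) (label string, encrypted bool) {
	if name != "" && decryptionStatus == "decrypted" {
		return name, false // key-based decryption: skip the hashtag check
	}
	if name != "" && hashtagHashByte(name) == hashByte {
		return name, false // rainbow-table name that matches its hash byte
	}
	return fmt.Sprintf("ch%d", hashByte), true
}

func main() {
	// The key below is a placeholder, not the real Public channel key.
	fmt.Println(pskHashByte([]byte("placeholder-key")))
	fmt.Println(resolveChannel("Public", 0x11, "decrypted"))
	fmt.Println(resolveChannel("Public", 0x11, "no_key"))
}
```

The design point is that the hashtag check only guards names that came from a lookup; a name that came from a successful key-based decryption carries stronger evidence than the check itself.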
||
|
|
750b8742a7 |
fix(staging-compose): decouple in-container mosquitto from standalone broker (#1813)
Red commit: `3898dbc5` (verified locally — CI run URL pending)
## Problem
A standalone `mqtt-broker` container (`eclipse-mosquitto:2`) was
provisioned out-of-band on the staging VM. It now owns MQTT, is attached
to external docker network `meshcore-net`, and binds host port `8883`.
The current `docker-compose.staging.yml` still:
- Publishes `1883:1883` on the host (dead weight; conflicts the moment
the broker moves to that port).
- Defaults `DISABLE_MOSQUITTO=false`, so the in-container mosquitto
burns RAM and briefly contests the `mqtt-broker` docker DNS name on cold
start.
- Doesn't join `meshcore-net`, so the ingestor can't resolve
`mqtt-broker:1883` via docker DNS without manual surgery.
## Fix (`docker-compose.staging.yml` only)
1. Remove the `1883:1883` host port publish from `staging-go`.
2. Flip `DISABLE_MOSQUITTO` default from `false` to `true`. Operators
can opt back in with the env var.
3. Attach `staging-go` to both `default` and `meshcore-net`; declare
`meshcore-net` as `external: true` so the file never tries to
create/destroy operator state.
Healthcheck and Caddy/443 plumbing untouched (out of scope).
## Test added (TDD framing: Option A — Go shape-asserts)
`cmd/server/staging_compose_broker_test.go:1` adds four regex-based
assertions on the compose file shape:
- staging-go does **not** bind port `1883` in ANY form (quoted/unquoted
short form, or long-form `target: 1883` / `published: 1883`).
- `DISABLE_MOSQUITTO` uses the interpolated default form
`${DISABLE_MOSQUITTO:-true}` (preserves operator override). Bare literal
`true`, or a later `=false` override in the same env block, is rejected.
- Top-level `networks:` declares `meshcore-net` as `external: true`.
- `staging-go` attaches to `meshcore-net` via a real
`services.staging-go.networks:` sub-key (comment-stripped so an
in-comment example can't masquerade).
Regex (not YAML byte-equality) so cosmetic edits don't break the guard.
No new go module deps. Red commit `3898dbc5` fails all 4 assertions on
master. Green commit `38297ff4` makes them pass. Round-1 hardening
commit `9f7155e2` tightens the regexes (per adversarial + kent-beck
must-fixes) and was verified against master's YAML shape — all 4 tests
fail on `origin/master`'s compose, pass on branch, proving the tightened
regexes still gate a real regression.
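A regex-based shape assertion of this kind might look like the sketch below. It is not the code in `staging_compose_broker_test.go`; the patterns, the function names and the naive comment stripping are assumptions made for illustration, and the short-form pattern would also match a bare `- 1883` in any other YAML list.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Naive comment stripping: assumes '#' never appears inside a quoted value.
	comment = regexp.MustCompile(`(?m)#.*$`)
	// Short form: - 1883:1883, - "8883:1883", - 1883
	shortPort = regexp.MustCompile(`(?m)^\s*-\s*["']?(?:\d+:)?1883(?::\d+)?["']?\s*$`)
	// Long form: target: 1883 / published: 1883
	longPort = regexp.MustCompile(`(?m)^\s*(?:-\s*)?(?:target|published):\s*["']?1883["']?\s*$`)
	// Interpolated default that preserves the operator override.
	defaultOn = regexp.MustCompile(`DISABLE_MOSQUITTO[=:]\s*["']?\$\{DISABLE_MOSQUITTO:-true\}`)
)

// bindsPort1883 reports whether any non-comment line mentions port 1883
// in short or long port syntax.
func bindsPort1883(compose string) bool {
	body := comment.ReplaceAllString(compose, "")
	return shortPort.MatchString(body) || longPort.MatchString(body)
}

// defaultsBrokerOff reports whether DISABLE_MOSQUITTO uses the
// ${DISABLE_MOSQUITTO:-true} form outside of comments.
func defaultsBrokerOff(compose string) bool {
	return defaultOn.MatchString(comment.ReplaceAllString(compose, ""))
}

const sample = `
services:
  staging-go:
    environment:
      - DISABLE_MOSQUITTO=${DISABLE_MOSQUITTO:-true}
    ports:
      - "443:443"
      # - "1883:1883"  (legacy opt-in)
`

func main() {
	fmt.Println(bindsPort1883(sample), defaultsBrokerOff(sample))
}
```

Matching on shape rather than exact bytes means reindenting or reordering keys does not trip the guard, while the commented-out legacy mapping is ignored because comments are removed first.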
## Risk
Low, with one intentional semantic change.
- **Semantic change (v3.7+):** `DISABLE_MOSQUITTO` in
`docker-compose.staging.yml` now defaults to `true`. This is a
**deliberate flip** — the standalone `mqtt-broker` container is now
authoritative on the staging host, and running the in-container
mosquitto alongside it wastes RAM and races the docker DNS name
`mqtt-broker` on cold start. Operators who want the pre-v3.7 shape
(in-container mosquitto + host-published `1883`) must explicitly opt
back in via env override AND re-add the `1883:1883` port mapping
(concrete snippet is inline in the compose file and in `DEPLOY.md` under
"Standalone MQTT broker (staging)"). This intent is called out in a
`SEMANTIC CHANGE (v3.7+)` header comment at the top of
`docker-compose.staging.yml`.
- **Deploy prereq:** the external `meshcore-net` docker network MUST
already exist on the host before `docker compose up`. If it doesn't,
compose refuses to start `staging-go`. This is documented inline in the
compose file (with the `docker network create meshcore-net` one-liner)
and in `DEPLOY.md`.
- **Only takes effect where the standalone broker is deployed** — which
it already is on staging today. The legacy `DISABLE_MOSQUITTO=false`
path remains reachable via env override; the ingestor's upstream config
is untouched.
Partial fix — no tracking issue; follow-up to operator-side broker
provisioning.
---------
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <bot@openclaw.local>
|
||
|
|
fa15ab0a30 |
fix(#1809): gate background loader on LoadChunked completion (#1811)
Partial fix for #1809. Red commit: |
||
|
|
242c7c609b |
fix(mqtt): escalate persistent paho disconnect + recover from emit panic + expose watchdog tick (#1749) (#1810)
# Partial fix for #1749 — MQTT watchdog escalation + panic recovery +
tick exposure
Red commit:
|
||
|
|
b74a64ccfa |
fix(ui): canonical payload label map across packets/live/packet-filter (#1799) (#1804)
## Summary Replaces the three drifted per-surface payload-type label vocabularies with a single canonical map keyed by firmware enum name. Per the locked triage comment on #1799 ([comment-4823975431](https://github.com/Kpa-clawbot/CoreScope/issues/1799#issuecomment-4823975431)): > Create `public/payload-labels.js` exporting `{GRP_DATA: {short:'Group Data', long:'Group data packet', enumId:6}, ...}`. Migrate `packets.js typeMap`, `packet-filter.js FW_PAYLOAD_TYPES`, `live.js TYPE_COLORS legend` to consume it. E2E that scrapes each surface and asserts label equality. ## Changes - **`public/payload-labels.js`** (new) — canonical map exposed as `window.PayloadLabels` and `window.PayloadLabelsApi`. Keys are firmware enum names; values carry `{short, long, enumId}` plus derived `SHORT_BY_ID` / `FW_PAYLOAD_TYPES` / `TYPE_ALIASES` for legacy callers. - **`public/packets.js`** — `TYPE_NAMES` + `typeMap` now read from `PayloadLabelsApi.SHORT_BY_ID`. Literal kept only as a defensive fallback for the case where the script tag fails to load. - **`public/packet-filter.js`** — `FW_PAYLOAD_TYPES` + `TYPE_ALIASES` now sourced from `PayloadLabelsApi`. Literal fallback retained so `node test-packet-filter.js` still works headlessly. - **`public/live.js`** — legend `<li>` rows are now generated from `window.PayloadLabels` in stable order, killing the third-vocabulary `Message — Group text` / `Direct — Direct message` drift the #1797 review surfaced. - **`public/index.html`** — `<script src="payload-labels.js">` loaded before `roles.js` / `packet-filter.js` / `packets.js`. - **`test-issue-1799-label-vocab-e2e.js`** (new) — Playwright E2E. Scrapes `#liveLegend` rows and the `/packets` type-filter checklist, asserts each label matches `window.PayloadLabels[ENUM].short` for `TXT_MSG`, `GRP_TXT`, `GRP_DATA`. Also verifies `window.PacketFilter` still recognises the enum names. - **`.github/workflows/deploy.yml`** — wired the new E2E into the existing Playwright block. 
## TDD trail - Red commit `eb392d4` — adds the failing E2E only (asserts `window.PayloadLabels` exists and labels match; both fail). - Green commit `44e902a` — introduces the canonical map and migrates the three surfaces. ## Verification - `node test-packet-filter.js` — 92/92 pass with the new fallback wiring. - Preflight: `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` — clean. Browser verified: E2E `test-issue-1799-label-vocab-e2e.js` exercises `/live` legend + `/packets` type filter against a Playwright headless Chromium; CI's Playwright block runs it on every push. E2E assertion added: `test-issue-1799-label-vocab-e2e.js:139` — `assert(fromLegend === canon, ...)` and `assert(fromPackets === canon, ...)` per enum. Fixes #1799 --------- Co-authored-by: mc-bot <bot@corescope> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: clawbot <clawbot@kpa.com> Co-authored-by: clawbot <bot@clawbot.local> |
||
|
|
9ae547ed7b |
test: de-flake distance-202 and anchor-bias tests (deterministic timing) (#1808)
Two server tests flaked intermittently and reddened CI on unrelated (frontend) PRs that merged master: - TestDistanceConcurrentRequestsDuringBuildReturn202 asserted all 10 concurrent requests get 202 'during the build window', but the lazy distance build on the tiny test DB finishes almost instantly, so on a fast machine some requests raced past it and got 200 (~50% flake). Add a nil-by-default distanceBuildHook seam on PacketStore (zero overhead in prod) that the test uses to hold the build open until all requests have been served — making the window guarantee deterministic. - TestHandleNodePaths_AnchorBiasInconsistency_Issue1278 queried /paths right after store.Load(), racing the path-hop index that Load() builds in a background goroutine (#1008); the membership/canonical result was thus non-deterministic (rarer flake, worse under suite load). Wait for PathHopIndexReady() before querying. Both run 30x green and pass -race. No production behavior change (hook is nil). Co-authored-by: Waydroid Builder <claude@michael.arcan.de> |
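The nil-by-default hook seam can be illustrated with a small sketch. The `store` type, its fields and `heldBuildStatuses` are hypothetical stand-ins, not the real `PacketStore` or `distanceBuildHook`; the point is only how a test can hold a build open so the "during the build" window becomes deterministic.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// store is a hypothetical stand-in for a type with a lazy build step.
type store struct {
	mu        sync.Mutex
	built     bool
	buildHook func() // nil in production; a test sets it to pause the build
}

// build runs the (here trivial) build, calling the hook first if one is set.
func (s *store) build() {
	if s.buildHook != nil {
		s.buildHook()
	}
	s.mu.Lock()
	s.built = true
	s.mu.Unlock()
}

// status mimics the handler contract: 202 while building, 200 once built.
func (s *store) status() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.built {
		return http.StatusOK
	}
	return http.StatusAccepted
}

// heldBuildStatuses holds the build open, samples n statuses, then releases
// the build and returns the final status.
func heldBuildStatuses(n int) (during []int, after int) {
	entered := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})
	s := &store{buildHook: func() {
		close(entered) // signal that the build has started
		<-release      // block until the test has made its requests
	}}
	go func() {
		s.build()
		close(done)
	}()
	<-entered
	for i := 0; i < n; i++ {
		during = append(during, s.status())
	}
	close(release)
	<-done
	return during, s.status()
}

func main() {
	during, after := heldBuildStatuses(10)
	fmt.Println(during, after)
}
```

Because the hook is a plain nil check, production code pays one branch and nothing else, while the test no longer depends on how fast the build happens to run.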
||
|
|
ec0ebeda2f |
fix(#1793): WebSocket CheckOrigin allowlist (block cross-origin scrapers) (#1795)
## Summary Closes the wide-open `/ws` WebSocket upgrader (`CheckOrigin: return true`) that lets any browser origin scrape live packet data. Replaces it with an explicit allowlist consulted from `cfg.CORSAllowedOrigins`, plus an implicit same-origin allowance and an empty-Origin (non-browser client) allowance. Fixes #1793. ## Rules (`Hub.checkOrigin`) - Empty `Origin` header → **allow** (non-browser clients; per-IP rate/deny gating tracked separately in #1794). - `Origin` host == request `Host` (case-insensitive) → **allow** (same-origin). - `Origin` matches an entry in `cfg.CORSAllowedOrigins` by exact case-insensitive match → **allow**. - `"*"` in `cfg.CORSAllowedOrigins` is **deliberately ignored** for `/ws`. A startup `[ws] WARNING:` is logged once when present. - Anything else → **reject** (gorilla returns 403). ### Deliberate divergence from CORS XHR CORS XHR (`corsMiddleware`) still honors `"*"` for read-only cross-origin GETs. The `/ws` upgrade does NOT, per OWASP's WebSocket Security Cheat Sheet: > Use an allowlist, not a denylist. Avoid wildcards or substring matching. — https://cheatsheetseries.owasp.org/cheatsheets/WebSocket_Security_Cheat_Sheet.html `"*"` on the WS path would re-open the exact CSWSH/scraping vector this PR closes, so it is rejected with a startup warning rather than silently honored. This intentional asymmetry is documented in the updated `_comment_corsAllowedOrigins` in `config.example.json`. ## TDD red → green - `e5974c6a` **RED** — adds `cmd/server/websocket_checkorigin_test.go` with five cases; `SetAllowedOrigins` introduced as an enforcement stub so the test compiles and fails on the assertion (CI fails on this commit by design). - `a4791dc3` **GREEN** — implements `Hub.checkOrigin`, wires `SetAllowedOrigins` from `main.go`, updates the config example. All tests pass. 
## Tests added (`cmd/server/websocket_checkorigin_test.go`) - `TestCheckOriginRejectsForeignOrigin` — foreign Origin → 403 - `TestCheckOriginAllowsEmptyOrigin` — non-browser client → 101 - `TestCheckOriginAllowsSameHost` — same-origin → 101 - `TestCheckOriginAllowsAllowlistedOrigin` — exact allowlist match → 101 - `TestCheckOriginWildcardDoesNotAllowForeignOrigin` — `"*"` in allowlist still rejects foreign origin → 403 ## Files changed - `cmd/server/websocket.go` — `Hub.allowedOrigins`, `SetAllowedOrigins`, `checkOrigin`, wired into `Upgrader.CheckOrigin`. - `cmd/server/main.go` — `hub.SetAllowedOrigins(cfg.CORSAllowedOrigins)` at the single call site. - `cmd/server/websocket_checkorigin_test.go` — new test file. - `config.example.json` — updated `_comment_corsAllowedOrigins` to document `/ws` gating and the `"*"` divergence. ## Out of scope (follow-up) - **#1794** — per-IP rate limit / deny list / connection cap for non-browser clients (which still bypass Origin because they don't send one). Layered defense; not in this PR. ## Verification - `go test ./cmd/server/...` — all server tests pass locally (574s). - Preflight clean (`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master`). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
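The rule list above can be sketched as a pure function. This is an illustration under stated assumptions, not the actual `Hub.checkOrigin`: the name `originAllowed` and its signature are hypothetical, and the choice to reject unparseable `Origin` values before consulting the allowlist is mine.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// originAllowed applies the rules in order: empty Origin, same host,
// exact allowlist match; "*" is never honored.
func originAllowed(origin, requestHost string, allowlist []string) bool {
	if origin == "" {
		return true // non-browser client: no Origin header sent
	}
	u, err := url.Parse(origin)
	if err != nil || u.Host == "" {
		return false // assumption: malformed origins are rejected outright
	}
	if strings.EqualFold(u.Host, requestHost) {
		return true // same-origin
	}
	for _, allowed := range allowlist {
		if allowed == "*" {
			continue // wildcard deliberately ignored on the WebSocket path
		}
		if strings.EqualFold(allowed, origin) {
			return true // exact, case-insensitive match only
		}
	}
	return false
}

func main() {
	allow := []string{"*", "https://dash.example"}
	fmt.Println(originAllowed("", "scope.example", allow))
	fmt.Println(originAllowed("https://Scope.Example", "scope.example", allow))
	fmt.Println(originAllowed("https://dash.example", "scope.example", allow))
	fmt.Println(originAllowed("https://evil.example", "scope.example", allow))
}
```

With gorilla/websocket, a function of this shape would typically be adapted to the `Upgrader.CheckOrigin` field, which receives the `*http.Request` and returns a bool.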
||
|
|
ae2e3933dd |
feat(server): store memory diagnostics + drop redundant obs.RawHex (#1773)
Drops the redundant per-observation RawHex (~98MB on a live store; reader already falls back to tx.RawHex #881) and adds an opt-in /api/perf?mem=1 memory breakdown (flood-forward share + per-component bytes). Profiled against a live instance. **Savings substantiation:** live-instance profiling shows ~1.66M observations in the store, each previously carrying its own per-observation `raw_hex` (avg ≈118 hex chars ≈59 bytes) that exactly duplicates the parent transmission's `raw_hex`. Dropping the duplicate on every load/ingest path eliminates ≈98 MB of redundant in-memory storage plus ~1.66M string allocations, with no data loss — the read path (`enrichObs`) already falls back to `tx.RawHex` when `obs.RawHex` is empty (verified by the new safety-gate test). The patched build cannot be run against the live instance here; instead the new opt-in `/api/perf?mem=1` diagnostic lets operators measure the real before/after (`trackedMB` and the per-component breakdown) directly after deploy. |
||
|
|
3efa37c46c |
feat(server): complete the #672 4-axis repeater usefulness score (#1762)
Adds Coverage (harmonic reach) + Redundancy (Tarjan articulation) axes + composite & grade. Closes #672. **TDD note (BLOCKER-1):** Community PR delivered as a single squashed commit, so there is no separate pre-fix failing-test commit; please accept as a community-PR exemption. The tests are *gating*, not just thorough: each axis test pins a specific topology outcome (coverage on line/star/disconnected/weight-sensitive; redundancy on line/triangle/star/bridged-cliques), and an end-to-end `/api/nodes` surface test drives the whole pipeline and asserts the composite diverges from the Traffic axis. Inverting the `1/weight` distance, dropping the NaN/Inf reject, removing the `redundancyMinWeight` floor, or aliasing `usefulness_score` back onto `traffic_share_score` each break a specific assertion. The axis functions are pure (no hidden state), so the suite fully characterises the behavior without the red anchor. Co-authored-by: Waydroid Builder <build@waydroid.local> |
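For readers unfamiliar with the Redundancy axis, the sketch below shows the textbook articulation-point detection it is named after, on the same toy topologies the tests mention (line, triangle, star, bridged cliques). How the project turns this into a score, and how `redundancyMinWeight` filters edges, is not described here, so the function name and the unweighted adjacency-list input are assumptions.

```go
package main

import "fmt"

// articulationPoints returns, in ascending order, the vertices whose removal
// disconnects a simple undirected graph given as adjacency lists (0..n-1).
func articulationPoints(adj [][]int) []int {
	n := len(adj)
	disc := make([]int, n) // DFS discovery time; 0 means unvisited
	low := make([]int, n)  // lowest discovery time reachable via one back edge
	isCut := make([]bool, n)
	timer := 0

	var dfs func(u, parent int)
	dfs = func(u, parent int) {
		timer++
		disc[u], low[u] = timer, timer
		children := 0
		for _, v := range adj[u] {
			if v == parent {
				continue // skip the tree edge back to the parent
			}
			if disc[v] != 0 {
				if disc[v] < low[u] {
					low[u] = disc[v] // back edge to an ancestor
				}
				continue
			}
			children++
			dfs(v, u)
			if low[v] < low[u] {
				low[u] = low[v]
			}
			// Non-root u is a cut vertex if v's subtree cannot reach above u.
			if parent != -1 && low[v] >= disc[u] {
				isCut[u] = true
			}
		}
		// The DFS root is a cut vertex only if it has several DFS children.
		if parent == -1 && children > 1 {
			isCut[u] = true
		}
	}
	for u := 0; u < n; u++ {
		if disc[u] == 0 {
			dfs(u, -1)
		}
	}
	var out []int
	for u, cut := range isCut {
		if cut {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	line := [][]int{{1}, {0, 2}, {1}}
	triangle := [][]int{{1, 2}, {0, 2}, {0, 1}}
	// Two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
	bridged := [][]int{{1, 2}, {0, 2}, {0, 1, 3}, {2, 4, 5}, {3, 5}, {3, 4}}
	fmt.Println(articulationPoints(line), articulationPoints(triangle), articulationPoints(bridged))
}
```

A node that is an articulation point has no redundant alternative: every path between the parts it joins must pass through it.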
||
|
|
17654dd090 |
docs(api): document per-node usefulness metrics in OpenAPI (#1769)
Documented Node schema (the four #672 usefulness axes + composite + A-F grade + relay fields) and response schemas on the node endpoints. Documentation-only; no behaviour change. Pairs with #1762 (documents the metrics it adds). Co-authored-by: Waydroid Builder <build@waydroid.local> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
fc26fb6b3a |
feat(#1751): show transported region scopes in repeater sidebar (#1752)
Closes #1751. |
||
|
|
f0763aecce |
fix(#1726): clear stale "varies" hash size once a node settles (#1788)
Fixes #1726. ## Problem A MeshCore v1.16.0 repeater configured for 2-byte path hashes (`path.hash.mode=1`) — e.g. `36f6c7c7…` (`DK_3400_RAK_TEST`) — kept showing as **"varies"** / mixed 1-byte + 2-byte for the full 7-day advert window. Per the live data in the issue triage: of the node's ~20 recent adverts, exactly **one** (2026-06-09, across 15 distinct observer paths) was a genuine 1-byte flood advert; every other advert was 2-byte. The flip-flop heuristic in `computeNodeHashSizeInfo` weighs that stale advert equally with recent ones, so an operator who flips `path.hash.mode` mid-flight (or a single old 1-byte advert) stays flagged for the full window with no way to signal "the config is settled now." ## Fix Two coupled changes in `cmd/server/store.go` `computeNodeHashSizeInfo`: 1. **Chronological ordering.** `byPayloadType[4]` iterates in insertion order, not timestamp order, so `HashSize = Seq[last]` could pick the wrong advert under out-of-order MQTT ingest or chunked cold-load (the "carmack" concern from triage). We now collect `(FirstSeen, size)` pairs and **stable-sort by `FirstSeen`**; ties keep insertion order, preserving prior behavior when timestamps are equal. 2. **Recency decay.** After `transitions >= 2` raises the flip-flop flag, clear it when the most recent `hashSizeRecentAgreeCount` (= **3**) non-zero-hop adverts all agree on a single size. A node still flapping (recent adverts disagree) stays flagged. `3` mirrors the existing ≥3-observation threshold used to raise the flag. ## Policy note Triage marked this **needs-operator-input** because the decay is a behavior/policy change. This PR implements the rule the triage proposed ("if the last 3 adverts agree, clear inconsistent"), which matches the reporter's stated expectation. Happy to adjust the threshold or gate it differently per your call. 
## Tests `cmd/server/issue1726_hash_decay_test.go`: - `TestIssue1726_SettledNodeNotInconsistent` — reporter's case (`[2,1,2,2,2]` within window) → `Inconsistent=false`, `HashSize=2`. - `TestIssue1726_HashSizeUsesChronologicallyLatest` — out-of-order insertion still reports the chronologically-latest size. - `TestIssue1726_ActiveFlapperStaysInconsistent` — a node whose recent adverts disagree stays flagged. Existing flip-flop / hash-collision tests unchanged and green; full `cmd/server` package suite passes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Erwin Fiten <e.fiten@opteco.be> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
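The two coupled changes (chronological ordering, then recency decay) can be sketched on a plain slice of adverts. This is an illustration, not `computeNodeHashSizeInfo`: the `advert` type, the function name and the constant name are hypothetical, and zero-hop filtering is assumed to have happened before the slice is built.

```go
package main

import (
	"fmt"
	"sort"
)

// advert is a hypothetical (FirstSeen, size) pair for one non-zero-hop advert.
type advert struct {
	FirstSeen int64 // unix seconds
	Size      int   // path-hash size in bytes
}

// recentAgreeCount mirrors the threshold of 3 named in the commit message.
const recentAgreeCount = 3

// hashSizeInfo reports the chronologically latest size and whether the node
// should still be flagged as inconsistent ("varies").
func hashSizeInfo(adverts []advert) (hashSize int, inconsistent bool) {
	if len(adverts) == 0 {
		return 0, false
	}
	// Stable sort on a copy: ties keep insertion order.
	sorted := append([]advert(nil), adverts...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FirstSeen < sorted[j].FirstSeen
	})

	transitions := 0
	for i := 1; i < len(sorted); i++ {
		if sorted[i].Size != sorted[i-1].Size {
			transitions++
		}
	}
	hashSize = sorted[len(sorted)-1].Size
	inconsistent = transitions >= 2

	// Recency decay: clear the flag when the latest adverts all agree.
	if inconsistent && len(sorted) >= recentAgreeCount {
		settled := true
		for _, a := range sorted[len(sorted)-recentAgreeCount:] {
			if a.Size != hashSize {
				settled = false
			}
		}
		if settled {
			inconsistent = false
		}
	}
	return hashSize, inconsistent
}

func main() {
	settled := []advert{{1, 2}, {2, 1}, {3, 2}, {4, 2}, {5, 2}}
	flapping := []advert{{1, 1}, {2, 2}, {3, 1}, {4, 2}, {5, 1}}
	fmt.Println(hashSizeInfo(settled))
	fmt.Println(hashSizeInfo(flapping))
}
```

The first input is the reporter's `[2,1,2,2,2]` case: two transitions raise the flag, and three agreeing trailing adverts clear it again.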
||
|
|
d437958474 |
fix(map): pin APC (Napa) and STS (Sonoma) observers (#1786) (#1787)
Fixes the map-coordinate gap in #1786.
## Problem
Observers tagged with IATA code **APC** (Napa County) or **STS** (Charles M. Schulz–Sonoma County) render with no location and never pin on the map.
## Root cause
`iataCoords` in `cmd/server/routes.go` is a hardcoded `IATA -> lat/lon` lookup used purely for placing observer/region markers on the map. It had no entry for APC or STS, so those observers had no coordinates to render with.
This is **display-only**. Ingestion is not gated on these codes: `IsObserverIATAAllowed` (`cmd/ingestor/config.go`) short-circuits to `true` when the observer IATA whitelist is empty, which is the staging configuration. The reporter's "packets disappear entirely" symptom is therefore **not** explained by this code path (likely an upstream `meshcoretomqtt`/broker topic issue; needs operator `mosquitto_sub` confirmation per triage).
## Fix
- Add `APC {38.2132, -122.2807}` and `STS {38.509, -122.8128}` to `iataCoords`, matching the airports' published coordinates.
- Add a regression test (`TestIataCoordsIncludesNapaAndSonoma`) asserting both are present with the expected coordinates.
## Verification
- `go test ./cmd/server/`: full package passes (`ok`).
- `go vet ./cmd/server/`: clean.
## Scope note
Checked the repo for other statically-enumerable region codes (`config.example.json` regions: SJC/SFO/OAK/MRY); all are already covered. The broader "are other in-use codes missing" question can only be answered against the live `cfg.Regions` + `db.GetDistinctIATAs()` set, which is operational, not in-tree.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Erwin Fiten <e.fiten@opteco.be>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
57956712e7 |
fix(#1768): Relay Airtime Share uses LoRa Time-on-Air (preamble-aware) — partial fix (#1776)
Partial fix for #1768 — Relay Airtime Share now uses closed-form LoRa Time-on-Air instead of a payload-bytes-only proxy, removing the ~3-4× bias against small frames (preamble + fixed-symbol intercept). cross-stack: justified — backend score formula needs a frontend caption change (`public/analytics.js` dumbbell preset banner + tooltip) so operators can interpret the assumed PHY block. Both move together or the metric is misleading. ## Red commit `8da57062` — failing test asserts ToA-based score (~83.48 % ADVERT share on the locked acceptance fixture) instead of the byte proxy's 95.24 %. `internal/lora.TimeOnAir` was a zero-returning stub at the red commit; tests failed with assertion errors, not build errors. ## Green commit `dd402edd` — implements `lora.TimeOnAir` (Semtech AN1200.13 / SX126x §6.1.4 closed form, cross-checked against RadioLib), wires `score = TimeOnAir(payloadBytes, preset) × distinctRelays` in `cmd/server/relay_airtime_share.go`, surfaces the preset in the JSON response and analytics caption. ## Config (per AGENTS Config Documentation Rule) New keys under existing `analytics` block: ```json "loraPreset": { "freq": 869600000, "bw": 62.5, "sf": 8, "cr": 5 } ``` Defaults match the deployment's actual `get radio` (869.6 MHz / BW 62.5 kHz / SF 8 / CR 4/5). `CRC=1`, `IH=0`, `DE = (T_sym ≥ 16 ms)`, and the SF-dependent preamble (32 for SF≤8 else 16, per firmware `preambleLengthForSF` / MeshCore PR #1954) are firmware-fixed constants in `internal/lora/toa.go` and intentionally NOT surfaced as config (per re-triage). 
## Scope In-scope files (6): - `internal/lora/toa.go` (new package — closed-form ToA) - `internal/lora/toa_test.go` (table-driven preset tests) - `cmd/server/relay_airtime_share.go` (wire ToA into score) - `cmd/server/relay_airtime_share_test.go` (recomputed expected values) - `cmd/server/config.go` + `config.example.json` (preset config keys) - `public/analytics.js` (preset caption on dumbbell chart + tooltip) Plus `cmd/server/go.mod` (replace directive for the new internal module). ## Deferred to v2 (separate issues per re-triage) - Per-observation SF/BW + radio-settings-aware dedup (blocked: ingestor stores SNR/RSSI only, no SF/BW on observations). - CR-per-hop dual-point sensitivity band (CR scales only the payload symbol term `(CR+4)`, not the preamble/header; second-order accuracy gain). - Cross-SF bridge accounting. ## Tests ``` cd internal/lora && go test ./... → PASS cd cmd/server && go test -run RelayAirtime → PASS ``` ## Preflight overrides - `check-branch-clean` (cross-stack): justified above — score formula change requires matching caption update; both files trace to the same issue. --------- Co-authored-by: kpa-clawbot <kpa-clawbot@users.noreply.github.com> Co-authored-by: Kpa-clawbot <bot@openclaw.local> Co-authored-by: bot <bot@meshcore> |
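The closed-form Time-on-Air referred to above can be sketched as follows. This is not the contents of `internal/lora/toa.go`; it is the widely published Semtech explicit-header formula with the constants the commit message lists (CRC on, explicit header, low-data-rate optimisation when the symbol time reaches 16 ms, preamble of 32 symbols for SF≤8 else 16). The type and function names are hypothetical, and `CR` is assumed to be the denominator of the 4/CR coding rate, as the `"cr": 5` config value suggests.

```go
package main

import (
	"fmt"
	"math"
)

// preset holds the assumed PHY parameters.
type preset struct {
	BWkHz float64 // bandwidth in kHz
	SF    int     // spreading factor
	CR    int     // coding-rate denominator: 5 means 4/5 (assumption)
}

// preambleSymbols follows the SF-dependent rule stated in the commit message.
func preambleSymbols(sf int) int {
	if sf <= 8 {
		return 32
	}
	return 16
}

// timeOnAirMs returns the airtime in milliseconds for one frame.
func timeOnAirMs(payloadBytes int, p preset) float64 {
	tSym := math.Pow(2, float64(p.SF)) / p.BWkHz // symbol time in ms
	de := 0
	if tSym >= 16 {
		de = 1 // low-data-rate optimisation
	}
	const crc, ih = 1, 0 // CRC enabled, explicit header
	num := float64(8*payloadBytes - 4*p.SF + 28 + 16*crc - 20*ih)
	den := float64(4 * (p.SF - 2*de))
	payloadSym := 8 + math.Max(math.Ceil(num/den)*float64(p.CR), 0)
	preambleSym := float64(preambleSymbols(p.SF)) + 4.25
	return (preambleSym + payloadSym) * tSym
}

func main() {
	def := preset{BWkHz: 62.5, SF: 8, CR: 5}
	for _, n := range []int{10, 200} {
		fmt.Printf("%d bytes: %.3f ms\n", n, timeOnAirMs(n, def))
	}
}
```

Under these assumptions a 10-byte frame costs about 242.7 ms and a 200-byte frame about 1225.7 ms, so the per-byte airtime of the small frame is roughly four times higher. That fixed preamble-plus-header cost is what a payload-bytes-only proxy ignores.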
||
|
|
22fe929da2 |
feat: opt-in mobile client-RX coverage (crowdsourced RF reach) + /api/nodes/resolve (#1728)
Implements #1727. ## What this adds **Mobile client-RX coverage** — an opt-in, crowdsourced RF-coverage feature. A roaming MeshCore **companion** radio (driven by the open-source [corescope-rx](https://github.com/efiten/corescope-rx) PWA, GPLv3) reports which nodes it heard directly, tagged with the phone's GPS and the packet's SNR/RSSI. CoreScope ingests these into a new `client_receptions` table and renders per-node **hex coverage** on the Reach page, plus a standalone **Coverage dashboard** (`#/rx-coverage`) with a top-mobile-observers leaderboard. Also includes **`GET /api/nodes/resolve?prefix=<hex>`** — a read-only node-name lookup by pubkey prefix (`{name, pubkey, ambiguous}`), used by the companion app for friendly names. ## Opt-in — default OFF (zero impact on existing deployments) The whole feature is gated behind one config flag, **disabled by default**: ```jsonc "clientRxCoverage": { "enabled": false } ``` When disabled (the default): the ingestor writes **no** `client_receptions`; the three coverage endpoints return a clean **404**; the UI hides the Coverage nav link, the `#/rx-coverage` route, and the Reach-page toggle. `/api/nodes/resolve` is always available (not coverage-specific). ## How it works ``` companion ──BLE 0x88 (snr+rssi+raw)──▶ corescope-rx PWA ──▶ MQTT meshcore/client/{pubkey}/packets │ ingestor (gated) ──▶ client_receptions (GPS + SNR + heard-key) │ server: pure-Go hex grid ──▶ GeoJSON ──▶ Reach hex overlay + Coverage dashboard ``` - **Direct-only capture:** records only what the companion heard itself and directly — a 0-hop advert's pubkey, or `path[last]` (last forwarder) for FLOOD routes; ≥2-byte path-hash required. Upstream hops discarded. - **No new deps:** hexbins are a pure-Go pointy-top grid over Web Mercator (`cmd/server/hexgrid.go`) computed at query time (`CGO_ENABLED=0` / `modernc.org/sqlite` friendly); frontend uses the existing Leaflet. 
- **Trust:** companion pubkey = identity; an EMQX ACL binds each client to publish only to its own `meshcore/client/{pubkey}/packets` topic. Payload contract in `docs/client-rx-coverage.md`. ## How to enable / try it 1. In `config.json`, set `"clientRxCoverage": { "enabled": true }` and restart server + ingestor. 2. Point an EMQX (or any broker) listener so a client can publish to `meshcore/client/<pubkey>/packets`; the ingestor already subscribes under `meshcore/#`. 3. Run the [corescope-rx](https://github.com/efiten/corescope-rx) PWA on an Android phone paired (BLE) to a MeshCore companion — it captures heard nodes + GPS and publishes. 4. View results: per-node Reach page → toggle **coverage**, or the **Coverage** dashboard at `#/rx-coverage`. ## What's where - **Ingestor:** `cmd/ingestor/client_reception.go` (ingest), `db.go` (`client_receptions` + `client_observers` schema), `main.go` (gated dispatch), `config.go` (flag). - **Server:** `cmd/server/rx_coverage.go` + `rx_dashboard.go` (endpoints, self-guard 404 when off), `hexgrid.go` (pure-Go grid), `node_resolve.go` (resolve), `routes.go` / `types.go` / `config.go` (wiring + flag + `/api/config/client` field). - **Frontend:** `public/rx-coverage.js` (dashboard), `node-reach-coverage.js` + `.css` (overlay), `node-reach.js` (Reach toggle, flag-gated), `roles.js` (reads the flag, hides nav when off). - **Docs:** `docs/client-rx-coverage.md`. ## Testing - Go: `cd cmd/server && go test ./...` and `cd cmd/ingestor && go test ./...` — green, including new gate tests (`coverage_gate_test.go` in both: off → no rows / 404, on → works) and the rx-coverage / resolve / hexgrid suites. - JS: `node test-coverage-gate.js`, `node test-node-reach-coverage.js` (wired into CI). The Playwright `test-node-reach-coverage-e2e.js` is wired into the e2e job and **skips when `clientRxCoverage` is disabled**, so it's safe under the default-off config. 
## Notes for reviewers - The four new routes are registered in `cmd/server/openapi_known_gaps.json` (the existing OpenAPI-completeness ratchet), matching how other not-yet-spec'd routes are tracked. Happy to write full OpenAPI spec entries instead if you prefer. - Commits are split per layer (ingestor / server endpoints / resolve / frontend / CI) for review. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Erwin Fiten <e.fiten@opteco.be> |
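One way a dependency-free pointy-top hex grid over Web Mercator could be computed is sketched below. It is not the code in `cmd/server/hexgrid.go`; the function names, the axial `(Q, R)` cell key and the cell-size parameter are assumptions, and the conversion is the standard pixel-to-axial formula with cube rounding.

```go
package main

import (
	"fmt"
	"math"
)

const earthRadiusM = 6378137.0 // spherical Web Mercator radius

// hexCell is an axial coordinate identifying one hexagon.
type hexCell struct{ Q, R int }

// mercator projects degrees to Web Mercator metres (valid away from the poles).
func mercator(latDeg, lonDeg float64) (x, y float64) {
	lat := latDeg * math.Pi / 180
	lon := lonDeg * math.Pi / 180
	return earthRadiusM * lon, earthRadiusM * math.Log(math.Tan(math.Pi/4+lat/2))
}

// cellForXY maps projected metres to the pointy-top hexagon containing them.
// sizeM is the hexagon's centre-to-corner distance in projected metres.
func cellForXY(x, y, sizeM float64) hexCell {
	q := (math.Sqrt(3)/3*x - y/3) / sizeM
	r := (2.0 / 3 * y) / sizeM
	s := -q - r
	// Cube rounding: round all three, then repair the one with the largest error.
	rq, rr, rs := math.Round(q), math.Round(r), math.Round(s)
	dq, dr, ds := math.Abs(rq-q), math.Abs(rr-r), math.Abs(rs-s)
	if dq > dr && dq > ds {
		rq = -rr - rs
	} else if dr > ds {
		rr = -rq - rs
	}
	return hexCell{int(rq), int(rr)}
}

// cellCenterXY is the inverse: the projected centre of a cell.
func cellCenterXY(c hexCell, sizeM float64) (x, y float64) {
	x = sizeM * (math.Sqrt(3)*float64(c.Q) + math.Sqrt(3)/2*float64(c.R))
	y = sizeM * 1.5 * float64(c.R)
	return x, y
}

// cellFor bins a GPS fix into a hexagon.
func cellFor(latDeg, lonDeg, sizeM float64) hexCell {
	x, y := mercator(latDeg, lonDeg)
	return cellForXY(x, y, sizeM)
}

func main() {
	// Coordinates below are arbitrary sample values.
	fmt.Println(cellFor(51.0, 4.5, 500))
	fmt.Println(cellFor(51.0001, 4.5001, 500))
}
```

Receptions could then be grouped by `hexCell` at query time and each group emitted as one GeoJSON polygon, which keeps the database schema free of any grid-specific columns.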
||
|
|
97833c523b |
fix(post-packets): use v3 observations schema (closes #1196) (#1704)
## Summary
`POST /api/packets` is broken on every v3-schema install, which is the default since #1289. The handler issues two writes against legacy v2 column names and silently swallows the observation insert's error, returning `200 OK` with `id>0` while persisting zero observation rows.
## Root cause
`cmd/server/routes.go:1225-1235` (pre-fix) used the v2 schema shape:
```go
INSERT INTO transmissions (... path_json ...)  // path_json removed in v3
INSERT INTO observations (transmission_id, observer_id, observer_name, snr, rssi, timestamp)  // v2 columns
// timestamp written as RFC3339 text; v3 wants unix INTEGER
// second Exec's error was discarded
```
v3 schema (`cmd/ingestor/db.go:289-304`): `observations.observer_idx INTEGER` (FK `observers.rowid`), `observations.timestamp INTEGER` (unix epoch), `path_json` lives here not on `transmissions`.
Reporter [@EldoonNemar](https://github.com/EldoonNemar) called this out precisely in #1196: both the schema mismatch and the divergence between the test harness (which uses the v3 shape) and the handler (v2 shape).
## Fix
`cmd/server/routes.go`:
- `transmissions` insert: drop `path_json` column.
- Observer resolution: `INSERT OR IGNORE INTO observers (id, name, ...)` then `SELECT rowid`, mirroring the ingestor resolver at `cmd/ingestor/db.go:778,906`.
- `observations` insert: write `observer_idx INTEGER` + `timestamp = time.Now().Unix()`; `path_json` moved here.
- **Propagate both insert errors** (transmission + observation) as `500` instead of swallowing them.
## TDD
| Step | Commit | Result |
| ----- | ------- | ------ |
| RED | `46d25389` | Test fails on master: `id=0` because the transmissions insert references a column not present in v3. |
| GREEN | `dae57d67` | Test passes; round-trip persists the observation with `observer_idx` resolved from the seeded `obs1` row and a unix-epoch `timestamp`. |

Local repro:
```
# RED on the test commit alone:
$ go test -run TestPostPacketPersistsV3Schema -count=1 .
--- FAIL: TestPostPacketPersistsV3Schema (0.03s)
    routes_test.go:4755: expected transmission id > 0, got 0 (body: {"id":0,"decoded":{...}})
FAIL
# GREEN on HEAD:
$ go test -run TestPostPacketPersistsV3Schema -count=1 .
ok  github.com/corescope/server  0.037s
```
## Scope
Two files, both in `cmd/server/`:
- `cmd/server/routes.go` (+38/-12): handler rewrite
- `cmd/server/routes_test.go` (+66): round-trip regression test

No public API signature changes. No DB schema changes (consumes the existing v3 schema correctly).
Closes #1196 |
||
|
|
76e130b313 |
fix(#1702): grant actions: write to release-fast-path workflow (#1703)
## Summary Fixes the missing `actions: write` permission on `.github/workflows/release-fast-path.yml` so the fallback `gh workflow run deploy.yml` dispatch no longer returns HTTP 403. ## Triage verdict From issue #1702 root-cause section: > Fast-path workflow YAML likely lacks: > ```yaml > permissions: > contents: read > packages: write > actions: write # MISSING — required to dispatch other workflows > ``` > ## Fix > One-line addition to `.github/workflows/release-fast-path.yml` permissions block. ## Root cause `.github/workflows/release-fast-path.yml` lines 16-18 (before this change) only granted `contents: read` and `packages: write`. The fallback step (`gh workflow run deploy.yml` when `:edge`'s `org.opencontainers.image.revision` label doesn't match the tag SHA) calls the GitHub Actions REST API, which requires `actions: write` on `GITHUB_TOKEN`. Without it, the dispatch fails with `Resource not accessible by integration` and the release stalls until an operator manually re-runs the fast-path job after `:edge` rebuilds. ## Change - `.github/workflows/release-fast-path.yml`: add `actions: write` to the workflow-level `permissions:` block. - `cmd/server/release_fast_path_workflow_test.go`: extend the existing config-gate test (issue #1677) to require `actions: write` alongside the previously asserted `contents: read` and `packages: write`. Two commits, red→green: 1. `test(#1702): assert release-fast-path.yml requires actions: write` — extends the assertion. Verified to fail on this commit (`release-fast-path.yml: missing required permission "actions: write"`). 2. `fix(#1702): grant actions: write to release-fast-path workflow` — adds the permission. Test green. ## TDD posture The repo already had a YAML-config gate at `cmd/server/release_fast_path_workflow_test.go` (parses the workflow as text and asserts required permission strings). Strict TDD applied: red commit extends the test, green commit fixes the workflow. No exemption needed. 
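The config gate described above (parse the workflow as text, assert required permission strings) could look roughly like the sketch below. The function name `missingPermissions` and the inline workflow snippet are assumptions made for illustration; the real test lives in `cmd/server/release_fast_path_workflow_test.go` and may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// missingPermissions reports which required "key: value" permission strings
// do not appear in the workflow text. It is a plain substring scan, not a
// YAML parse, so it only checks that the strings are present somewhere.
func missingPermissions(workflow string, required []string) []string {
	var missing []string
	for _, want := range required {
		if !strings.Contains(workflow, want) {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Hypothetical pre-fix permissions block.
	before := "permissions:\n  contents: read\n  packages: write\n"
	required := []string{"contents: read", "packages: write", "actions: write"}
	for _, m := range missingPermissions(before, required) {
		fmt.Printf("release-fast-path.yml: missing required permission %q\n", m)
	}
}
```

A substring scan is cheap and needs no YAML dependency, but it would also accept the string inside a comment, which is a trade-off worth keeping in mind.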
## Acceptance criteria (from #1702) - [x] `permissions.actions: write` added to the fast-path workflow - [ ] Manual test: tag a scratch SHA where `:edge` is stale; confirm fallback dispatches deploy.yml without 403 — by-design out of CI scope (would require a throwaway tag + race condition); covered by next real release. - [ ] Operator-felt: next release where notes-commit lands AFTER `:edge` build completes works in one pass without manual rerun — verifiable only on next release; in-scope of `Closes #1702` because bullet 1 (the structural defect) is the cause of bullets 2 and 3. ## Preflight `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → **clean** (all hard gates pass, no warnings). Closes #1702 --------- Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com> |
||
|
|
e96f0f9f9f |
fix(#1694): port extended ACK decoder to server (ackLen/ackAttempt/ackRand parity) (#1695)
## Summary

Ports the firmware-1.16.0 extended ACK decoding from the ingestor (PR #1618, issue #1610) into the server-side re-decoder. Previously `cmd/server/decoder.go` silently dropped `ackLen`, `ackAttempt`, and `ackRand` (and the multipart inner equivalents) — the server emitted plain 4-byte ACKs even when the wire carried the 5/6-byte extended form. Now both decoders agree byte-for-byte.

Closes #1694.

## What changed

- `cmd/server/decoder.go::decodeAck`: sets `AckLen` (capped at 6), `AckAttempt` (`buf[4]` when `len>=5`), `AckRand` (`buf[5]` when `len>=6`). Mirrors `cmd/ingestor/decoder.go:279-305`.
- `cmd/server/decoder.go::decodeMultipart` ACK branch: sets `InnerAckLen = len(buf)-1` (capped at 6), `InnerAckAttempt`, `InnerAckRand`. Mirrors `cmd/ingestor/decoder.go:696-714`.
- `Payload` struct gains six `*int` fields tagged `omitempty`: `AckLen`, `AckAttempt`, `AckRand`, `InnerAckLen`, `InnerAckAttempt`, `InnerAckRand`. Backward-compatible JSON — legacy 4-byte ACKs leave attempt/rand nil and the fields are omitted from the output.

No other decoder consumer is touched. Routes / store auto-surface the new fields via JSON marshaling.

## Test layout

`cmd/server/decoder_ack_extended_test.go` drives `decodeAck` table-driven across the three wire shapes:

| Buffer | AckLen | AckAttempt | AckRand |
|---|---|---|---|
| `EF BE AD DE` (CRC only) | 4 | nil | nil |
| `EF BE AD DE 07` | 5 | 7 | nil |
| `EF BE AD DE 07 42` | 6 | 7 | 0x42 |

Plus `TestDecodeMultipartAckExtendedInner` for a 7-byte multipart buffer (`0x33` header + 6-byte inner ACK), asserting `InnerAckLen=6`, `InnerAckAttempt=7`, `InnerAckRand=0x42`.

## TDD trail

- **Red commit** (test + struct stubs only, `decodeAck`/`decodeMultipart` unchanged) → assertions fail on `AckLen=nil`.
- **Green commit** (port implementation) → all assertions pass.

Full `cd cmd/server && go test ./...` passes locally.
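The extended ACK rules listed under "What changed" can be sketched as a small standalone function. This is an illustration under assumptions, not the code in `cmd/server/decoder.go`: `ackFields`, `intPtr`, and `decodeAckFields` are names made up here, and returning all-nil fields for buffers shorter than 4 bytes is a guess, since the description does not say how short buffers are handled.

```go
package main

import "fmt"

// ackFields holds the three optional fields named in the description.
// Pointers model "absent" so legacy 4-byte ACKs leave attempt/rand nil.
type ackFields struct {
	AckLen, AckAttempt, AckRand *int
}

func intPtr(v int) *int { return &v }

// decodeAckFields applies the stated rules: AckLen is the buffer length
// capped at 6, AckAttempt is buf[4] when len>=5, AckRand is buf[5] when len>=6.
func decodeAckFields(buf []byte) ackFields {
	var out ackFields
	if len(buf) < 4 {
		return out // assumption: too short to be an ACK, leave everything nil
	}
	n := len(buf)
	if n > 6 {
		n = 6
	}
	out.AckLen = intPtr(n)
	if len(buf) >= 5 {
		out.AckAttempt = intPtr(int(buf[4]))
	}
	if len(buf) >= 6 {
		out.AckRand = intPtr(int(buf[5]))
	}
	return out
}

// show renders a possibly-nil *int for printing.
func show(p *int) string {
	if p == nil {
		return "nil"
	}
	return fmt.Sprintf("%d", *p)
}

func main() {
	for _, buf := range [][]byte{
		{0xEF, 0xBE, 0xAD, 0xDE},
		{0xEF, 0xBE, 0xAD, 0xDE, 0x07},
		{0xEF, 0xBE, 0xAD, 0xDE, 0x07, 0x42},
	} {
		f := decodeAckFields(buf)
		fmt.Printf("% X -> len=%s attempt=%s rand=%s\n", buf, show(f.AckLen), show(f.AckAttempt), show(f.AckRand))
	}
}
```

The three buffers in `main` are the three wire shapes from the test table, so the printed fields can be compared against it row by row.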
## Firmware refs - `firmware/src/helpers/BaseChatMesh.cpp:218-234` (extended ACK layout) - firmware commit `f6e6fdaa` (attempt counter) - firmware commit `a130a95a` (RNG byte) --------- Co-authored-by: Kpa-clawbot <bot@kpa-clawbot> |
||
|
|
a8c99c61fd |
fix(#1659): block analytics endpoint until first pass complete (503 Retry-After) (#1688)
## Summary Fixes #1659 — analytics cards no longer show the post-restart slice when "All data" is selected. ## Root cause After server restart, `s.recompRF` / `s.recompTopology` / `s.recompChannels` cache the FIRST computation, which is the small in-RAM observations slice (background chunk-loader has not yet backfilled history). The recomputer serves that slice through `GetAnalyticsRFWithWindow`'s default shortcut for an entire recompute interval, while the client pins it via `CLIENT_TTL.analyticsRF`. UX: cards show a tiny window even when the user selects "All data". ## Fix shape (option B from the issue body) Server-side per-recomputer warm-up gate: - `cmd/server/analytics_warmup_1659.go` adds a per-recomputer `firstPassDoneNs` atomic timestamp, set ONLY by the first successful `runOnce()` (CAS-guarded for idempotency). `IsWarmingUp_1659()` / `FirstPassDoneAt_1659()` are lock-free reads. - `cmd/server/analytics_recomputer.go` `runOnce()` calls `markFirstPassDone_1659()` after every successful compute. - `cmd/server/routes.go` handlers for RF / Topology / Channels: when the request is the default shape (`region=="" && area=="" && window.IsZero()`) AND the matching recomputer is still warming up, return `503` + `Retry-After: 5` + `{"error":"analytics warming up","retry_after_s":5}`. Windowed / region-filtered requests bypass the gate (they already bypass the recomputer cache, so they are unaffected by the warm-up bug). Client-side: - `public/app.js` `api()` helper retries any 503 response, honoring `Retry-After`, with exponential backoff capped at 30s, max 6 attempts (~63s total). - Small "Computing analytics…" banner appears while any warm-up retry is in flight, dismissed once the request resolves. Pages can override via `window.onWarmup_1659`. ## Tests RED commit `8b2b2d7` ships failing-on-assertion tests + a stub. GREEN commit `2716c23` lands the fix and flips them green. 
- `cmd/server/analytics_warmup_1659_test.go` — 3 cases: 503 during warmup, 200 after first pass, windowed request bypasses gate. - `test-1659-analytics-warmup.js` — 3 cases: Retry-After honored, retry cap bounded, non-503 errors not retried. Wired into `.github/workflows/deploy.yml`. ## Preflight overrides - cross-stack: justified — server-side 503 contract MUST be paired with client-side retry-and-banner handling; splitting across two PRs would land a half-working fix. Fixes #1659. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw <openclaw@local> |
||
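The server-side warm-up gate from #1688 above can be approximated in a few lines. This is a sketch under assumptions rather than the code in `cmd/server/analytics_warmup_1659.go`: the type `warmupGate`, the helper `probe`, the route `/api/analytics/rf`, and the query parameter names are all hypothetical; only the CAS-guarded first-pass timestamp, the `503`, the `Retry-After: 5` header, and the JSON body come from the description.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// warmupGate records when the first successful compute finished.
type warmupGate struct {
	firstPassDoneNs atomic.Int64 // 0 means no successful pass yet
}

// markFirstPassDone is safe to call after every compute: the CAS only
// succeeds once, so later calls do not move the timestamp.
func (g *warmupGate) markFirstPassDone() {
	g.firstPassDoneNs.CompareAndSwap(0, time.Now().UnixNano())
}

func (g *warmupGate) isWarmingUp() bool { return g.firstPassDoneNs.Load() == 0 }

// wrap answers 503 for default-shape requests while warming up and lets
// filtered or windowed requests through untouched.
func (g *warmupGate) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		defaultShape := q.Get("region") == "" && q.Get("area") == "" && q.Get("window") == ""
		if defaultShape && g.isWarmingUp() {
			w.Header().Set("Retry-After", "5")
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusServiceUnavailable)
			fmt.Fprint(w, `{"error":"analytics warming up","retry_after_s":5}`)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real analytics handler.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "{}")
})

// probe issues one GET and returns the status code and Retry-After header.
func probe(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	g := &warmupGate{}
	h := g.wrap(okHandler)
	code, retry := probe(h, "/api/analytics/rf")
	fmt.Println("during warm-up:", code, "Retry-After:", retry)
	g.markFirstPassDone()
	code, _ = probe(h, "/api/analytics/rf")
	fmt.Println("after first pass:", code)
}
```

Keeping the gate as a single atomic value means the request path never takes a lock, which matches the "lock-free reads" wording in the description.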
|
|
048143f54f |
fix(#1690): cold-load uses last_seen (effective recency) instead of first_seen (#1691)
## #1690 — cold-load uses wrong time axis (RED → GREEN) The on-disk DB has thousands of long-lived hashes with recent traffic. Prod's cold-load filter (`transmissions.first_seen >= cutoff`) is bound to a column that is set once at insert time and never updated — so re-observation of an old hash does not move it into the hot window. Result: prod cold-loaded ~0.3% of the on-disk rows and flipped `backgroundLoadComplete=true` without ever walking the retention window (the `retentionHours - hotStartupHours <= 0` short-circuit at line 1353 of `cmd/server/store.go`). ### Three sub-fixes **A) Denormalize `transmissions.last_seen`** so cold-load can window on effective recency. - `internal/dbschema/dbschema.go::ensureTransmissionsLastSeenColumn` adds the column + `idx_tx_last_seen` (single-column INTEGER ALTER + index; both PREFLIGHT-annotated as cheap metadata-only ops). - `cmd/ingestor/db.go::OpenStoreWithInterval` schedules `tx_last_seen_backfill_v1` via `Store.RunAsyncMigration` — `UPDATE transmissions SET last_seen = MAX(observations.timestamp) WHERE last_seen = 0` — non-blocking on boot (1.9M+ obs row scan in prod). - Writer-side: `InsertTransmission` seeds `last_seen` on initial insert, and every observation insert bumps `last_seen = ?` via prepared statement `stmtBumpTxLastSeen` (conditional `last_seen < ?` so out-of-order ingest never goes backwards). - Reader-side: `cmd/server/store.go::Load`, `loadChunk`, and `cmd/server/chunked_load.go::LoadChunked` switch the WHERE/ORDER-BY clauses to `t.last_seen` when the column is present (PRAGMA-detected via `DB.hasLastSeen`). Test/legacy DBs without the column fall back to `first_seen` so existing fixtures stay green. **B) Honest `backgroundLoadComplete` gating.** - Drop the `retentionHours - hotStartupHours <= 0` short-circuit. Prod runs with both at 12h, which flipped Done=true immediately. 
- After the chunk loop, query `SELECT COUNT(*) FROM transmissions WHERE last_seen >= retentionFloor` and compute `loadCoverageRatio = inMem / inDB`. Done=true only when `ratio >= 0.90` AND no chunk errors. `backgroundLoadFailed=true` + `backgroundLoadError` populated otherwise (e.g. `"loaded 20.0% of 5000 rows (1000 in memory)"`). - `bgErrMu`-guarded `loadCoverageRatio` + `backgroundLoadErr` so the perf endpoint can read them without blocking the writer. **C) Perf exposure.** `PerfPacketStoreStats` gains `RetentionHours`, `OldestLoaded`, `LoadCoverageRatio`, `BackgroundLoadError` — surfaces what fraction of the on-disk DB the in-memory store currently reflects, so operators can see the 0.3% case in `/api/perf` without reading the logs. ### TDD trail - **RED**: `05f0c6dd2bea6dc37324c548a49564d739aca920` — failing tests + 21-line store.go scaffolding. CI on this commit failed on assertions (intended). - **GREEN**: this PR's HEAD commit (8 files, +271/-24). Targeted suite: `Test1690_ColdLoad_TimeAxis`, `Test1690_BackgroundLoadHonesty`, `Test1690_PerfStats_NewFields`, `TestHotStartup_*`, `TestIssue1690_LastSeenUpdatedOnObservation` — all pass. Anti-tautology: locally reverted the `if !s.backgroundLoadFailed.Load()` guard around `backgroundLoadDone.Store(true)` — `Test1690_BackgroundLoadHonesty` fails on the assertion `"backgroundLoadDone=true with only 1000/5000 packets loaded; must be false until coverage ≥ 90%"`. Restored. ### Async-migration preflight - `ensureTransmissionsLastSeenColumn` — ALTER + CREATE INDEX both `// PREFLIGHT: async=true reason="..."` annotated. - `tx_last_seen_backfill_v1` — wrapped in `Store.RunAsyncMigration`. - `stmtBumpTxLastSeen` prepared statement — annotated; it is a row-level UPDATE BY PRIMARY KEY, not a migration. 
### Preflight overrides PREFLIGHT-MIGRATION-SCALE: <30s N=5K - check-async-migration: justified for `cmd/server/issue1690_cold_load_test.go` CREATE TABLE/INDEX statements — these build an in-memory test fixture DB (≤5000 rows, runs in <1s in CI), not a prod migration. Fixes #1690. --------- Co-authored-by: meshcore-bot <bot@meshcore.local> Co-authored-by: bot <bot@example.com> |
||
|
|
d910ea0208 |
feat(#1638): confidence rating weighted by hash mode (#1687)
Fixes #1638. ## Problem `getConfidenceIndicator` in `public/nodes.js` treats every observation as equal evidence, so a node seen 5 times via 1-byte hash prefixes (which collide ~8-way across a typical mesh) scores the same as a node seen 5 times via 6-byte prefixes (effectively unambiguous). The user asked for confidence to respect ambiguity. ## Change - `cmd/server/neighbor_graph.go` — new `CountsByMode map[int]int` on `NeighborEdge`, bumped in `upsertEdge` / `upsertEdgeWithCandidates` based on the observation's hash-prefix byte length (1/2/4/6). Merged in `resolveEdge` when ambiguous→resolved edges collapse. - `cmd/server/neighbor_api.go` — `NeighborEntry.counts_by_mode` exposed (omitempty), and `dedupPrefixEntries` merges per-mode counts when an unresolved prefix entry collapses into a resolved one. Flat `Count` field preserved for back-compat. - `public/nodes.js::getConfidenceIndicator` — weights observations by mode: 1-byte=0.125, 2-byte=0.5, 4/6-byte=1.0. A single 6-byte sighting counts ~8× a raw 1-byte one. HIGH triggers when EITHER the legacy heuristic clears OR weighted count ≥3. Legacy entries without `counts_by_mode` keep working (default weight 0.5). - Tooltip now shows the per-mode breakdown (e.g. "Observations: 5 (1-byte: 3, 6-byte: 2)"). ## TDD - RED: `cmd/server/neighbor_graph_test.go::TestBuildNeighborGraph_CountsByMode` — fixture with 1/2/4-byte sightings asserts per-mode tally (commit `838965f3`). - RED: `test-confidence-indicator.js` — 6-byte mostly-sighted neighbor must outrank 1-byte mostly-sighted neighbor at equal flat count (commit `4bd5e18e`). - GREEN: implementation in commit `7511606d`. All 4 JS tests pass; new Go test passes; full Go suite passes (two pre-existing flakes unrelated, both pass when isolated). ## Browser verification Synthetic side-by-side of OLD vs NEW classifier against representative inputs — see screenshot. 
1-byte-only and 6-byte-only at the same flat count diverge from MEDIUM/MEDIUM to MEDIUM/HIGH, and 3 6-byte sightings now upgrade where 20 1-byte sightings stay MEDIUM. ## Preflight overrides - check-branch-scope: cross-stack: justified — backend exposes the new `counts_by_mode` field and the frontend consumes it; the whole point of the change. ## Compat - `Count` field unchanged in shape and value. - `counts_by_mode` is `omitempty`; legacy persisted edges (loaded from `neighbor_edges` via `neighbor_persist.go`) get no per-mode breakdown and fall back to the default weight (0.5) — no UI regression. --------- Co-authored-by: bot <bot@local> Co-authored-by: corescope-bot <bot@corescope.local> |
||
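The hash-mode weighting from #1687 above is implemented in JavaScript (`public/nodes.js`); the sketch below restates the same arithmetic in Go so the numbers in the description can be checked. `modeWeight` and `weightedCount` are hypothetical names, and giving unknown modes the 0.5 default is an assumption (the description only specifies 0.5 for legacy entries without `counts_by_mode`).

```go
package main

import "fmt"

// modeWeight maps a hash-prefix byte length to its evidence weight.
func modeWeight(prefixBytes int) float64 {
	switch prefixBytes {
	case 1:
		return 0.125
	case 2:
		return 0.5
	case 4, 6:
		return 1.0
	default:
		return 0.5 // assumption: unknown modes get the legacy default weight
	}
}

// weightedCount sums observations weighted by mode. Entries without a
// per-mode breakdown fall back to the flat count at the default weight 0.5.
func weightedCount(countsByMode map[int]int, flatCount int) float64 {
	if len(countsByMode) == 0 {
		return float64(flatCount) * 0.5
	}
	total := 0.0
	for mode, n := range countsByMode {
		total += modeWeight(mode) * float64(n)
	}
	return total
}

func main() {
	fmt.Println("3 six-byte sightings:", weightedCount(map[int]int{6: 3}, 3))
	fmt.Println("20 one-byte sightings:", weightedCount(map[int]int{1: 20}, 20))
	fmt.Println("legacy entry, flat count 5:", weightedCount(nil, 5))
}
```

Three 6-byte sightings reach the weighted threshold of 3, while twenty 1-byte sightings total 2.5 and stay below it, which is the divergence the browser verification describes.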
|
|
efd66ea3f5 |
feat(mqtt): per-source status endpoint + Observers panel (#1682)
## Summary Adds MQTT source status visibility per #1043 acceptance criteria: - **Ingestor:** per-source counter registry (`cmd/ingestor/source_status.go`) tracking `connected`, `lastConnectUnix`, `lastDisconnectUnix`, `lastPacketUnix`, `connectCount`, `disconnectCount`, `packetsTotal`, `packetsLast5m` (sliding 5-min window via per-second buckets keyed by unix second — no stale-leak), `lastError`. Wired at the existing OnConnect / ConnectionLost / DefaultPublish callsites alongside the liveness watchdog. Idempotent registration so counters survive reconnects. Snapshot emitted in the existing stats file under `source_statuses` (additive, `omitempty`). - **Backend:** new `GET /api/mqtt/status` handler reads the ingestor stats file and returns the per-source list. **Broker passwords are masked** via a regex over the `scheme://user:pass@host` form (covers mqtt/mqtts/tcp/ssl/ws/wss). Mask is also applied to `lastError` as defense-in-depth (broker libs occasionally quote the failing URL). OpenAPI completeness gate satisfied with a `routeDescriptions` entry. - **Frontend:** small self-contained panel (`public/mqtt-status-panel.js`) mounted above the Observers table. Auto-refreshes every 10s, color-codes each row (green = connected + recent packet, yellow = connected idle, red = disconnected), and tears down its timer on SPA route change. ## TDD - Red commit `f19a93b5` — stub `/api/mqtt/status` handler + assertion test that the broker password is `****`-redacted. Test fails on the assertion (handler passes the URL through verbatim). Compile-clean — assertion-fail, not build-fail. - Green commit `77042e41` — `maskBrokerURL` helper + table-driven unit tests across all schemes + handler rewires to mask both `Broker` and `LastError`. - Subsequent commits land the ingestor wiring and the frontend panel. ## Tests ``` $ cd cmd/server && go test -run 'TestMqttStatus|TestMaskBrokerURL' -v ./... 
PASS: TestMqttStatus_MasksBrokerPassword PASS: TestMqttStatus_EmptyWhenNoStatsFile PASS: TestMaskBrokerURL_Patterns (10 subtests) $ cd cmd/ingestor && go test -run 'TestSourceStatus|TestSnapshotSourceStatuses' -v ./... PASS: TestSourceStatus_BasicLifecycle PASS: TestSourceStatus_Disconnect PASS: TestSnapshotSourceStatuses_ReturnsAll $ node test-mqtt-status-panel.js 7 passed, 0 failed ``` Full `go test ./...` clean in both `cmd/server` and `cmd/ingestor`. ## Preflight overrides - `cross-stack`: justified — issue #1043 is intrinsically full-stack (ingestor stats → server endpoint → observers panel). Per-stack split would land an unreachable endpoint or a fetch with no backend. - `check-xss-sinks` (public/mqtt-status-panel.js:55): justified — the flagged `innerHTML=` is a fully-static literal (empty-state placeholder, no payload data interpolated). All payload-bearing `innerHTML=` sites in this file run through `escapeHTML` (defined in the same file); the test `renderPanel never echoes a plaintext password (defense-in-depth)` exercises the rendered HTML against payload strings. ## Acceptance criteria - [x] `/api/mqtt/status` returns per-source connection state — `cmd/server/mqtt_status.go` - [x] UI panel shows all configured sources with live status — `public/mqtt-status-panel.js` - [x] Connection state updates on reconnect/disconnect events — `MarkConnect` / `MarkDisconnect` wired in `cmd/ingestor/main.go` - [x] Broker URLs don't expose passwords in the API response — `maskBrokerURL` + 13 test cases - [x] Works with 1-N sources — registry is keyed per-source, snapshot iterates the map **Partial fix for #1043** — per-packet `mqtt_source` attribution (the issue's "Follow-up" section) is **deferred** per the `mc-bot-triaged:v1` triage and the autofix comment ("Per-packet attribution deferred to follow-up issue"). That work requires a new observation-row column and DB schema migration, both explicitly out of scope for this PR. 
Refs #1043 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
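For the broker-password masking in #1682 above, a regex over the `scheme://user:pass@host` form might look like the following. This is a sketch, not the `maskBrokerURL` helper from the PR: the function name `maskCredentials`, the exact pattern, and the choice to keep the username visible are assumptions; only the scheme list and the `****` replacement come from the description.

```go
package main

import (
	"fmt"
	"regexp"
)

// credentialsInURL matches scheme://user:password@ for the broker schemes
// listed in the description. Group 1 is everything up to the username.
var credentialsInURL = regexp.MustCompile(`(?i)\b((?:mqtts?|tcp|ssl|wss?)://[^:/@\s]+):[^@/\s]+@`)

// maskCredentials replaces the password part with **** and leaves the rest
// of the string untouched, so it can also be applied to free-form error text.
func maskCredentials(s string) string {
	return credentialsInURL.ReplaceAllString(s, "${1}:****@")
}

func main() {
	fmt.Println(maskCredentials("mqtt://user:secret@broker.example:1883"))
	fmt.Println(maskCredentials("mqtt://broker.example:1883"))
	fmt.Println(maskCredentials(`dial "mqtts://u:p@h:8883": connection refused`))
}
```

Because the function works on arbitrary strings rather than parsed URLs, the same call can cover both a `Broker` field and a `lastError` message that happens to quote the URL. Passwords containing `@` or `/` are not handled by this pattern.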
|
|
2ef7d2437d |
fix(ci): release fast-path re-tag :edge → :vX.Y.Z when SHA matches (Fixes #1677) (#1680)
## Summary Adds `.github/workflows/release-fast-path.yml`: a metadata-only re-tag workflow that fires on `push.tags: v[0-9]+.[0-9]+.[0-9]+` and, when `:edge`'s `org.opencontainers.image.revision` label matches the tag SHA, applies `:vX.Y.Z`, `:vX.Y`, `:vX`, `:latest` to the existing edge manifest via `crane tag`. No rebuild, no test re-run — ~seconds vs ~30 min today. If the SHA doesn't match (tag points to an older commit, or `:edge` wasn't built yet), it dispatches the existing `deploy.yml` pipeline as a fallback so validated bytes always ship. To prevent double-fire, `deploy.yml`'s top-level `on:` block drops `tags: ['v*']` — `release-fast-path.yml` is now the sole consumer of `push.tags`. Edge publishing on master push is untouched. ## TDD Red commit adds `cmd/server/release_fast_path_workflow_test.go` (two tests: one asserts the new workflow exists with the required trigger/permissions/markers; the other asserts `deploy.yml`'s `on:` block no longer mentions `tags:`). Both fail on assertions in the red commit. Green commit adds the workflow file + edits `deploy.yml`; both pass. ## Acceptance criteria (from #1677) - Tag-CI completes in <2 min when tag SHA == `:edge` revision → fast-path is metadata-only, single short job - Falls back to full pipeline on SHA mismatch → `gh workflow run deploy.yml --ref ${{ github.ref }}` - `:vX.Y.Z` has same digest as `:edge` → `crane tag` copies the manifest, bytes are byte-identical - No regression on older-SHA tags → fallback path runs the unchanged full validation Fixes #1677 --------- Co-authored-by: Kpa-clawbot <bot@corescope.local> |
||
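The decision made by the fast-path workflow in #1680 above (re-tag when the `:edge` revision matches the tag SHA, otherwise fall back to `deploy.yml`) can be modelled as plain logic. The real workflow is YAML driving `crane tag` and `gh workflow run`; the Go functions `fastPathTags` and `planRelease`, and the action strings they return, are hypothetical and exist only to make the branching and the tag fan-out concrete.

```go
package main

import (
	"fmt"
	"regexp"
)

// releaseTag accepts tags of the form vX.Y.Z, matching the workflow trigger.
var releaseTag = regexp.MustCompile(`^v(\d+)\.(\d+)\.(\d+)$`)

// fastPathTags expands vX.Y.Z into the four tags applied to the edge manifest.
func fastPathTags(tag string) ([]string, error) {
	m := releaseTag.FindStringSubmatch(tag)
	if m == nil {
		return nil, fmt.Errorf("not a vX.Y.Z tag: %q", tag)
	}
	return []string{tag, "v" + m[1] + "." + m[2], "v" + m[1], "latest"}, nil
}

// planRelease picks the metadata-only re-tag when the edge image was built
// from the tagged commit, and the full pipeline otherwise.
func planRelease(tag, tagSHA, edgeRevision string) (action string, tags []string, err error) {
	tags, err = fastPathTags(tag)
	if err != nil {
		return "", nil, err
	}
	if edgeRevision != "" && edgeRevision == tagSHA {
		return "retag", tags, nil
	}
	return "dispatch-deploy", nil, nil
}

func main() {
	action, tags, _ := planRelease("v3.4.1", "abc123", "abc123")
	fmt.Println(action, tags)
	action, tags, _ = planRelease("v3.4.1", "abc123", "def456")
	fmt.Println(action, tags)
}
```

Treating an empty `edgeRevision` as a mismatch covers the "`:edge` wasn't built yet" case by sending it down the fallback path.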
|
|
653d47e03c |
test(openapi): add CI completeness gate for /api routes (Phase 1 of #1670) (#1678)
## Summary Partial fix for #1670 — **Phase 1 only** (CI completeness gate). Phase 2 (backfilling the 18 currently-undocumented routes into `openapi.go`) is deferred to a separate issue per the triage on #1670 and is explicitly out of scope here. ## What this adds - `cmd/server/openapi_completeness_test.go` — AST-walks every non-`_test.go` file in `cmd/server/`, finds string-literal first args to `*.HandleFunc(...)` calls beginning with `/api/`, and diffs against the paths declared in `routeDescriptions()` in `cmd/server/openapi.go`. - `cmd/server/openapi_known_gaps.json` — seeded allowlist of the **18** `/api/` routes currently registered via `HandleFunc` but not yet documented in `openapi.go`. ## Ratchet pattern From this branch forward, `TestOpenAPICompleteness` fails when: 1. A new `HandleFunc("/api/...")` is added without a matching entry in `openapi.go` **or** the allowlist (regression gate — the main goal of Phase 1). 2. A route in the allowlist is *also* documented in `openapi.go` — the allowlist must shrink as Phase 2 backfills land, never go stale. The two-commit history (red → green) demonstrates the gate works: - **Red commit**: adds only the test. Fails on master with the 18 missing routes listed. - **Green commit**: adds the allowlist seeded with that exact 18-route set. Test passes at the current baseline. ## Local verification - `go test ./cmd/server/ -run TestOpenAPICompleteness -v` → PASS at baseline (`44/62 covered; 18 in allowlist; 18 gaps remain`). - Ratchet validation: temporarily inserted `r.HandleFunc("/api/ratchet-test-route", ...)` into `routes.go` → test FAILED with that exact route name; reverted → test PASSES again. ## Files changed - `cmd/server/openapi_completeness_test.go` (+203 / new) - `cmd/server/openapi_known_gaps.json` (+24 / new) ## Preflight `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → all hard gates pass; no warnings. 
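The AST walk and the two ratchet rules described above can be sketched with the standard `go/parser` and `go/ast` packages. This is an illustration, not the contents of `cmd/server/openapi_completeness_test.go`: `apiRoutes`, `ratchet`, and `sampleSource` are hypothetical names, and the sketch parses a source string instead of walking the files in `cmd/server/`.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// sampleSource is a stand-in for a routes file; it is parsed, never compiled.
const sampleSource = `package server

func register(r *Router) {
	r.HandleFunc("/api/nodes", nil)
	r.HandleFunc("/healthz", nil)
	r.HandleFunc("/api/ratchet-test-route", nil)
}`

// apiRoutes returns every string-literal first argument of a *.HandleFunc
// call that begins with /api/, in source order.
func apiRoutes(src string) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "routes.go", src, 0)
	if err != nil {
		return nil, err
	}
	var routes []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "HandleFunc" {
			return true
		}
		lit, ok := call.Args[0].(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		if path, err := strconv.Unquote(lit.Value); err == nil && strings.HasPrefix(path, "/api/") {
			routes = append(routes, path)
		}
		return true
	})
	return routes, nil
}

// ratchet returns routes that are neither documented nor allowlisted (gaps)
// and allowlisted routes that are documented anyway (stale allowlist entries).
func ratchet(routes []string, documented, allowlist map[string]bool) (gaps, stale []string) {
	for _, r := range routes {
		switch {
		case documented[r] && allowlist[r]:
			stale = append(stale, r)
		case !documented[r] && !allowlist[r]:
			gaps = append(gaps, r)
		}
	}
	return gaps, stale
}

func main() {
	routes, err := apiRoutes(sampleSource)
	if err != nil {
		panic(err)
	}
	gaps, stale := ratchet(routes, map[string]bool{"/api/nodes": true}, map[string]bool{})
	fmt.Println("routes:", routes)
	fmt.Println("gaps:", gaps, "stale:", stale)
}
```

Parsing alone is enough here because only call syntax and string literals are inspected, so the undefined `Router` type in the sample does not matter.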
## Out of scope - Backfilling the 18 allowlisted routes into `openapi.go` (Phase 2 — tracked separately). - Schema validation of the spec against OpenAPI 3.0 (Phase 3 per the issue). - PR template checkbox update (Phase 2 follow-up). Issue #1670 stays open for Phase 2. --------- Co-authored-by: clawbot <bot@corescope.local> |
||
|
|
938153dd92 |
fix(nodes): rebuild relay-hop history on startup from path_json (#1643)
## Problem A relay node's **activity timeline** — and its per-node `packetsToday` / observer counts — collapses to *"only the hour the server restarted"* after every restart. Before the restart the timeline shows only the node's own adverts (~1–2/hr); all of its relay activity piles into the single post-restart hour. ## Root cause All DB cold-load paths (`Load`, `loadChunk`, `scanAndMergeChunk`) index relay-hop attribution into `byNode` **only** from `observations.resolved_path`. But since #1287 the ingestor persists relay data as aggregate `neighbor_edges` and **never writes `resolved_path`** — it is `NULL` on every deployment (verified on a live DB: 0 of ~440k rows populated). So relay attribution is never reconstructed on startup; it only re-accumulates from live traffic (`IngestNew*`, which re-resolves from `path_json` + the neighbor graph), piling a relay node's whole history into the post-restart window. ## Fix Server read-side only — **no schema / ingestor / migration change**. When `resolved_path` is empty, re-resolve relay hops from the already-persisted `path_json` using the in-memory prefix map + neighbor graph (the same `resolvePathForObs` compute the live ingest path already runs). `main.go` now loads the persisted neighbor graph *before* the packet load so resolution has the graph available. Two correctness details worth a close look: 1. **Fetch the prefix-map/graph snapshot BEFORE opening each load cursor.** `getCachedNodesAndPM` issues its own DB query; doing so while a load cursor is open deadlocks on a single-connection SQLite pool (the test harness uses one). 2. **Index into `byNode` ONLY** — not the `resolved_path` / path-hop indexes. Those are cross-checked by `handleNodePaths` against the persisted `resolved_path` column (NULL here); populating them from an in-memory re-resolution would make that SQL confirmation fail and wrongly drop the tx from paths-through (#1352). 
## Tests New coverage asserts a relay pubkey reachable *only* via `path_json` lands in `byNode` after a restart-style load, for both the hot-window (`LoadChunked`) and background-window (`loadChunk`) paths. Existing #1558 (`resolved_path`) and #1352 (paths-through) tests still pass. Full `cd cmd/server && go test ./...` is green under `-race`. ## Perf The fallback runs `resolvePathForObs` per observation with a non-empty `path_json` during cold load — the same per-packet compute the live ingest path already performs, so no new asymptotic cost. The prefix map + graph are snapshotted **once per load** (not per row); `getCachedNodesAndPM` is 30s-cached. In `loadChunk` the resolution runs in the existing lock-free scan and is accumulated locally, matching that function's "build local, merge under lock" design. ## Note on a pre-existing flaky test `TestDistanceConcurrentRequestsDuringBuildReturn202` is timing-fragile (fails ~1/15 on `master` without this change). It relies on the lazy distance build being slow because it's the first caller of `getCachedNodesAndPM` (cold cache). This PR pre-warms that cache during `Load`, narrowing the build window, so the test fails more often in **non-race** local runs. It passes reliably under `-race` (CI mode), where the build stays slow. Flagging in case you want to harden the test separately. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com> Co-authored-by: openclaw-bot <bot@openclaw> |
||
|
|
825b26485c |
fix(#1181): hide nodes whose name starts with a configured prefix (#1655)
Fixes #1181. ## Summary Adds operator-configurable name-prefix hiding for nodes. When a node's name starts with any prefix listed in the new `hiddenNamePrefixes` config field (default `["🚫"]`), it is omitted from `/api/nodes`, `/api/nodes/search`, and `/api/nodes/{pubkey}`. DB rows are preserved — the filter runs at the API layer only, so observation history (paths, hops, distances) stays intact and the node simply re-appears if the operator clears the prefix list. This mirrors the convention already in use on other MeshCore map dashboards: an operator who wants their node hidden renames it with the 🚫 prefix and sends an advert; the next advert is then dropped from the dashboard. The node is **not** hidden from the mesh itself — only from this dashboard. This is documented inline in `config.example.json`. Implementation follows the existing `IsBlacklisted` pattern exactly: a new `Config.IsNameHidden(name)` method, and three filters in `routes.go` placed alongside the corresponding blacklist filters. No DB schema, public API, or websocket changes. ## Files changed - `cmd/server/config.go` — new `HiddenNamePrefixes []string` field + `IsNameHidden` method - `cmd/server/routes.go` — filters in `handleNodes`, `handleNodeSearch`, `handleNodeDetail` - `config.example.json` — new field + `_comment_hiddenNamePrefixes` operator doc - `cmd/server/hidden_name_prefix_1181_test.go` — new test file (red → green) ## Test plan Two new subtests in `TestHiddenNamePrefix_1181_*`: 1. `_NodesList` — inserts a node named `🚫 ban me`, asserts it is present when `HiddenNamePrefixes` is empty and absent when set to `["🚫"]`. 2. `_Search` — inserts `🚫 search me`, asserts `/api/nodes/search?q=search` does not surface it when the prefix is configured. Verified red→green: - Red commit `d0903852`: `go test -run TestHiddenNamePrefix_1181` fails on the leak assertion (`hidden_name_prefix_1181_test.go:94`). - Green commit `e79a0d8d`: same command passes. 
```
$ cd cmd/server && go test -run TestHiddenNamePrefix_1181 -count=1 .
ok github.com/corescope/server 0.060s
```

## Out of scope

- Auto-purging DB rows for hidden nodes — left to existing retention. The triage was explicit: hide, do not delete.
- Live websocket broadcast: nodes are not broadcast via websocket (only packets), so no separate emit path needs filtering. Frontend reads nodes via `/api/nodes`, which is filtered.
- Frontend customizer for the prefix list — operators configure via `config.json` like every other knob. |
||
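The name-prefix filter from #1655 above is small enough to sketch in full. `Config.IsNameHidden` and `HiddenNamePrefixes` are named in the description, but the method body below is a guess at its shape, `visibleNames` is a hypothetical helper, and skipping empty prefixes (which would otherwise hide every node) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Config carries only the field relevant to this sketch.
type Config struct {
	HiddenNamePrefixes []string
}

// IsNameHidden reports whether name starts with any configured prefix.
func (c *Config) IsNameHidden(name string) bool {
	for _, p := range c.HiddenNamePrefixes {
		// Assumption: ignore empty prefixes, since "" is a prefix of everything.
		if p != "" && strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

// visibleNames filters at the API layer only; the input slice is untouched,
// mirroring the "DB rows are preserved" behaviour.
func visibleNames(c *Config, names []string) []string {
	out := []string{}
	for _, n := range names {
		if !c.IsNameHidden(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"relay-north", "🚫 ban me", "hilltop"}
	fmt.Println(visibleNames(&Config{}, names))
	fmt.Println(visibleNames(&Config{HiddenNamePrefixes: []string{"🚫"}}, names))
}
```

Since the filter is a pure read-side check, clearing the prefix list makes the node reappear without any data migration.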
|
|
e04c7113cb |
feat: integrate hashtag channels from meshcore-channels catalogue (#1323) (#1656)
Fixes #1323 ## Summary Adds a small in-memory cache of the community-maintained hashtag-channels catalogue (`marcelverdult/meshcore-channels`) and exposes it as `GET /api/known-channels?region=XX` plus a collapsed sidebar section on the Channels view ("Known channels (catalogue)") with a one-click "+ Add" button per row. Per triage (#1323): new `cmd/server/known_channels_cache.go`, new `GET /api/known-channels?region=…`, frontend section in `public/channels.js`. No new DB tables — cache is in-memory only. ## What changed - `cmd/server/known_channels_cache.go` — `knownChannelsCache` with an atomic snapshot pointer, 24h default refresh, 30s HTTP timeout, 4 MB body cap, custom `User-Agent`. Fail-soft: a failed refresh leaves the last-known snapshot in place. Background goroutine started from `main.go` after the neighbor-graph recomputer; never blocks startup. - `cmd/server/known_channels_route.go` — `GET /api/known-channels?region=` serves the cached snapshot off the atomic pointer (never blocks on upstream). Region filter is case-insensitive ISO 3166-1 alpha-2. Empty/missing cache returns 200 with an empty entries list (fail-soft for the UI). - `cmd/server/config.go` — `KnownChannelsURL` + `KnownChannelsRefreshMs`. - `config.example.json` — example values + `_comment_knownChannels`. - `public/channels.js` — new collapsed sidebar section "Known channels (catalogue)" that lazy-fetches `/api/known-channels` on first render and renders rows with a "+ Add" button. The button calls the existing `addUserChannel(name)` path, so adding catalogue channels reuses the full save-key + decrypt flow that user-typed hashtags already use. - `cmd/server/known_channels_cache_test.go` — failing-first tests: - `TestKnownChannelsParseFixture` asserts the parser populates `GeneratedAt`/`License` and region-stamps every entry while skipping empty countries. - `TestKnownChannelsRouteRegionFilter` asserts the route returns 200 with exactly the filtered subset for `?region=be`. 
- `TestKnownChannelsFailSoftOn500` asserts a failed upstream fetch leaves the prior snapshot in place and bumps `failCount`. ## Upstream pinning The default URL is pinned to the specific file `channels-by-country.json` on `main`: > https://raw.githubusercontent.com/marcelverdult/meshcore-channels/main/channels-by-country.json Shape (verified 2026-05-24): ```json { "generated_at": "...", "license": "CC0-1.0", "countries": { "be": [{"channel": "#antwerpen", "description": "..."}], ... } } ``` ## Test plan ``` cd cmd/server && go test -run 'TestKnownChannels' -count=1 . ok github.com/corescope/server 0.008s ``` Red commit: |
||
|
|
1116801b2f |
M5: emoji → Phosphor Icons — settings & customize (#1648) (#1653)
**Red commit:** `851cc8c3a024b1675558092d772444bf4f1ec625` — failing test on a stub branch (will link CI run after PR opens). Partial fix for #1648 (M5 of 6). **Do NOT close the tracking issue** — M6 (server-side residual emoji sweep + lint gate) still pending. ## Per-file swap counts | File | Phosphor `<use>` refs | Notes | |---|---|---| | `public/customize.js` | 20 | DEFAULTS → `ph:<name>` tokens; render path keeps legacy emoji branch (back-compat) | | `public/customize-v2.js` | 26 | same as v1; cv2 overrides path unchanged | | `public/home.js` | (helpers added) | `_renderHomeGlyph` / `_renderHomeLabel` accept both `ph:<name>` and legacy emoji | | `public/geofilter-builder.html` | 5 | clear / undo / save / load buttons (+inline `.ph-icon` CSS) | | `public/audio.js` | 1 | audio unlock prompt | | `public/filter-ux.js` | 5 (3 new) | help popover star + close, saved-filter delete | | `public/style.css` | 0 | `#chList .ch-share-btn::before { content: '📤' }` removed; JS now renders an inline sprite | | `cmd/server/routes.go` | (6 `ph:` tokens) | onboarding home defaults updated in lockstep with customize-v2.js | ## Operator config back-compat — PROMINENT Per design call #1 (user-locked): existing operator-stored emoji values in `config.json` / `localStorage` are **NOT** touched. The render path supports both: ```js function renderConfigGlyph(value) { var m = String(value || '').match(/^ph:([a-z][a-z0-9-]+)$/); if (m) return '<svg class="ph-icon"><use href="/icons/phosphor-sprite.svg#ph-' + m[1] + '"/></svg>'; return esc(value); // EMOJI-OK-LEGACY-RENDER — operator-stored emoji/text path } ``` Defaults flipped to `ph:<name>` tokens, so new operators (and operators who hit "Reset to Defaults") see Phosphor sprites. Operators with stored emoji values continue to see their emoji exactly as before. Verified end-to-end (see E2E (b) below). 
## cmd/server/routes.go — changed in lockstep Per design call #2: the home-defaults `steps` / `footerLinks` mirror the JS DEFAULTS, so they MUST update together. routes.go now emits `ph:<name>` tokens; the frontend home-render path resolves them. Existing tests (`TestConfigThemeHomeDefaults`) still pass — they assert structure, not glyph values. ## E2E assertions added - `test-issue-1648-m5-emoji-scan.js` — per-file zero-emoji + ph-token DEFAULTS + sprite presence - `test-issue-1648-m5-icons-e2e.js`: - (a) customize chrome — tabs/header rendered as sprites; chrome text icon-free - **(b) back-compat — injects fake `🐙` operator step into localStorage, reloads, opens customize, asserts the emoji renders verbatim in both the input value AND the live preview span; asserts the ph-token step renders as a sprite** (design call #1 in action) - (c) `/channels` modal sprite count - (d) `/audio-lab` sprite presence - (e) `geofilter-builder.html` control buttons sprite-driven - (f) every `<use>` resolves to a defined symbol id ## Out of scope (M6 cleanup) - cmd/server/routes.go residual server-rendered emoji **not** tied to customize defaults (none found by my grep — file already audited) - `make lint-no-emoji` CI grep gate (M6 owns it) - `public/icons/README.md` workflow doc cross-stack: justified — design call #2 requires Go + JS update together. --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
8295c2115c |
fix(reach): bust response cache on blacklist change (#1629) (#1636)
Red commit:
|
||
|
|
078225a54e |
perf(neighbor_api): fold first_seen into cached map — fix #1627 r3 regression (#1632)
## TL;DR Post-merge regression introduced by #1627 r3 (commit `e2212f50`): `buildNodeInfoMap` in `cmd/server/neighbor_api.go` ran an uncached `SELECT … FROM nodes` scan on every call. Folded `first_seen` into the already-cached `getCachedNodesAndPM` (30s TTL) so the 4 hot handlers that call `buildNodeInfoMap` no longer pay for a full table scan per request. ## Before / After `buildNodeInfoMap` is called by **4 hot handlers**: - `cmd/server/neighbor_api.go:130` - `cmd/server/neighbor_api.go:297` - `cmd/server/neighbor_debug.go:83` - `cmd/server/node_reach.go:421` | | Before | After | |---|---|---| | `SELECT … FROM nodes` per call | 1 (uncached) | 0 (cache hit) | | `SELECT … FROM observers` per call | 1 (uncached) | 1 (unchanged) | | At Cascadia scale (~2600 nodes) | full scan × 4 handlers × N req/s | one scan / 30s | ## How - Extended the `getAllNodes` schema probe to also `COALESCE(first_seen, '')`. Falls back through the existing richest → leanest ladder if the column is missing. - `nodeInfo.FirstSeen` is therefore populated for every cached entry in `getCachedNodesAndPM`. - `buildNodeInfoMap` drops its second `SELECT` entirely and just copies `nodeInfo` values out of the cached map. - Public signature of `buildNodeInfoMap` is unchanged. `node_reach.go:421` still sees `nodeInfo.FirstSeen` populated, served from cache. `cmd/server/store.go` is touched because `getAllNodes` is the only sensible owner of the `first_seen` SELECT — adding a parallel cache would duplicate the 30s TTL machinery this fix is designed to leverage. ## Test (red → green) - Commit 1 (`test:`): `TestBuildNodeInfoMap_FirstSeenIsCached` — calls `buildNodeInfoMap`, mutates `first_seen` out-of-band via a separate rw connection, calls it again, and asserts both calls return the same (cached) value. Fails on `origin/master` (call 2 sees the mutated value, proving the uncached scan). - Commit 2 (`perf:`): the fold. Test now passes. 
## Refs Post-merge audit identified this as the only MAJOR finding from #1627; recommendation was a follow-up hot-fix PR. This is that PR. --------- Co-authored-by: openclaw-bot <bot@openclaw> Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
43be1bb76a |
fix(reach): scanReachRows DB errors must surface as 500 not 404 (#1631) (#1635)
Red commit:
|
||
|
|
e2212f5015 |
feat(nodes): per-node Reach page + GET /api/nodes/{pubkey}/reach (v2, review-complete) (#1627)
Re-submission of #1625 (which was merged early, then reverted in #1626) — now with **all three round-1 reviews addressed** so it lands in one hardened state instead of as post-merge follow-ups. ## What Per-node **Reach** view: a standalone page (`#/nodes/{pubkey}/reach`) + a node-detail section + `GET /api/nodes/{pubkey}/reach`. It shows which nodes a node has a **stable two-way RF link** with, derived from raw `path_json` adjacency (a path travels origin→observer, so `[A,B]` ⇒ B heard A). A link is bidirectional when both directions have observations; the **bottleneck** (weaker direction) rates two-way reliability. Nodes are identified only by **unique 2–3 byte** path prefixes (1-byte collides → excluded). ## Review fixes folded in vs #1625 **Performance (Carmack):** hard scan LIMIT (200k) + modest prealloc; `json.Unmarshal` replaced by a single-pass `parsePathTokens` (100k-row scan 2.2M→1.3M allocs, 344→203ms); memoized resolver; size-hinted maps (attribution over 100k rows: 102 allocs); `context.Context` plumbed; cache `RWMutex` + evict-oldest (no full wipe); singleflight dedup; degree/rank from a 60s shared snapshot; bench rewritten (ReportAllocs, 1k/10k/100k, mixed-payload, isolated attribution). **Correctness/safety + tests (Independent + Kent Beck):** pubkey validation → 400; error logging instead of silent swallow (first_seen / degree / marshal→500 / discarded rows); `public_key=?` index use; canonical `PayloadADVERT`; `min()` builtin; documented cache-slice immutability; mux ordering comment. New tests: scanReachRows decode, 3-byte token branch, non-advert first-hop guard, observer SNR aggregation across rows, HTTP-level attribution (asserts non-zero we_hear/they_hear), 400/404/blacklist/cache-hit. 
**UI / a11y / Tufte:** in-map legend (tiers + thresholds); dropped the colour+width double-encoding (constant width, colour-only); colour-blind glyphs (●●●/●●/●) + tier title beside the bottleneck number; dark-theme `--link-*`; lighter table (horizontal rules, sentence-case headers); map built once + link layer updated in place on toggle (no flicker); time-range no longer flashes a loader; `destroy()` generation guard; statCard escaping; scoped `@media print` to `#nq-report`; `fieldset/legend` + `for/id` toggles; `aria-pressed` / `aria-live` / back-link `aria-label`; "distance (km)" + bottleneck tooltip + no-GPS note; inline styles → CSS; decorative emoji removed. **Docs:** api-spec documents the 5-min cache, 200k scan cap, and 400. ## Testing - `cmd/server` full suite green; reach unit + endpoint + bench all pass. - `eslint public/*.js` (no-undef) and the XSS-sink gate clean. - E2E updated: request status checks + exact (non-tautological) toggle assertions + hard map-render assert. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --- ## TDD-history note (Kent Beck gate) This branch carries production + tests together, not a fabricated red→green sequence. That's deliberate: the branch was rebased onto upstream and the intermediate SHAs were squashed, so reconstructing a "failing-test-first" commit after the fact would be theatre, not evidence — and rewriting history to stage it would be dishonest. The behaviour is instead covered by a comprehensive, anti-tautological suite (directional attribution edges, 3-byte token branch, non-advert first-hop guard, observer SNR aggregation, HTTP-level attribution asserting non-zero counts, scan-cap truncation, zero-reach 200-not-404, companion mis-attribution, cache eviction). Requesting maintainer acceptance of the work on test *substance* rather than commit *choreography*; the net-new-UI exemption is not claimed for the server endpoint. 
--------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: meshcore-bot <bot@meshcore> |
||
|
|
9c5faab1e4 |
Revert "feat(nodes): per-node Reach page (#1625)" (#1626)
Reverts #1625. #1625 was merged before the round-1 reviews (Independent / Kent Beck / Tufte) were addressed. Reverting to land it cleanly: a fresh PR will re-add the feature with the perf pass, the backend correctness/safety + test-coverage fixes, and the UI/a11y (Tufte) batch folded in, so it goes through review in a single hardened state rather than as a string of post-merge follow-ups. No functional loss — the feature returns in the replacement PR. |
||
|
|
47f85f6c4c |
feat(nodes): per-node Reach page + GET /api/nodes/{pubkey}/reach (directional link quality) (#1625)
## What
Adds a per-node **Reach** view that answers "how well does this specific
node hear, and get heard by, its neighbours?" — both as a standalone
page (`#/nodes/{pubkey}/reach`) and as a section on the node detail
page.
New endpoint: **`GET /api/nodes/{pubkey}/reach`**.
## What it measures
For the target node it derives, from raw `path_json` adjacency (a path
travels origin→observer, so in `[A,B]` B received A directly):
- **Directional link counts** per neighbour: `we_hear` (how often we
received them) vs `they_hear` (how often they received us).
- **Bidirectional / bottleneck**: a link is two-way stable when both
directions > 0; the weaker direction is the bottleneck and rates real
two-way reliability.
- **Importance**: neighbour degree + rank, relay-observation volume,
bidirectional-link count, direct-observer count.
- **Direct observers**: who received the node at 0 hops, with SNR.
Reliability rule: a neighbour is only attributed when its pubkey
**prefix is unique** at the path's byte length (collisions are skipped,
never misattributed).
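The attribution rules above can be sketched as follows. This is an illustrative reading of the description, not the code in `node_reach.go`: it assumes path tokens are hex pubkey prefixes, and the names `linkCounts`, `resolveUnique`, and `attribute` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// linkCounts holds the two directional counters for one neighbour.
type linkCounts struct {
	WeHear   int // times the target received the neighbour directly
	TheyHear int // times the neighbour received the target directly
}

// Bottleneck is the weaker direction; it rates two-way reliability.
func (l linkCounts) Bottleneck() int {
	if l.WeHear < l.TheyHear {
		return l.WeHear
	}
	return l.TheyHear
}

// Bidirectional reports whether both directions have observations.
func (l linkCounts) Bidirectional() bool { return l.WeHear > 0 && l.TheyHear > 0 }

// resolveUnique maps a prefix to a pubkey only when exactly one known key matches.
func resolveUnique(prefix string, known []string) (string, bool) {
	if prefix == "" {
		return "", false
	}
	match := ""
	for _, k := range known {
		if strings.HasPrefix(k, prefix) {
			if match != "" {
				return "", false // collision: skip rather than misattribute
			}
			match = k
		}
	}
	return match, match != ""
}

// attribute walks adjacent hops; in [A,B] the path runs origin→observer, so B heard A.
func attribute(target string, known []string, paths [][]string) map[string]linkCounts {
	out := make(map[string]linkCounts)
	for _, path := range paths {
		for i := 0; i+1 < len(path); i++ {
			sender, okS := resolveUnique(path[i], known)
			receiver, okR := resolveUnique(path[i+1], known)
			if !okS || !okR || sender == receiver {
				continue
			}
			if receiver == target {
				c := out[sender]
				c.WeHear++
				out[sender] = c
			} else if sender == target {
				c := out[receiver]
				c.TheyHear++
				out[receiver] = c
			}
		}
	}
	return out
}

func main() {
	known := []string{"a1b2c3", "a1ffee", "d4e5f6", "0789ab"}
	paths := [][]string{
		{"a1b2", "d4e5"}, // target hears a1b2c3
		{"d4e5", "a1b2"}, // a1b2c3 hears target
		{"a1", "d4e5"},   // "a1" collides, so the hop is skipped
		{"0789", "d4e5"}, // target hears 0789ab, one direction only
	}
	for peer, c := range attribute("d4e5f6", known, paths) {
		fmt.Println(peer, c.WeHear, c.TheyHear, c.Bidirectional(), c.Bottleneck())
	}
}
```

In this sketch an ambiguous prefix drops the whole hop, which mirrors the "skipped, never misattributed" rule: under-counting a link is preferred to crediting the wrong neighbour.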
## UI
- Standalone Reach page + node-detail section.
- Reusable bidirectional link map (OSM) with links coloured by
bottleneck.
- Incoming/outgoing toggles to isolate each direction.
## Naming note (deliberate, no collision)
This is distinct from the existing **per-observer reachability** in
topology analytics (`ReachNode` / `ObserverReach` / `perObserverReach`).
This PR adds its own `NodeReach*` response structs in a new
`node_reach.go` and a new `/api/nodes/{pubkey}/reach` route — there are
no symbol or route collisions (verified: `go build ./...` clean). Happy
to rename to disambiguate further (e.g. "Link Quality") if you'd prefer
to reserve "Reach" for the per-observer feature.
## Testing
- `cmd/server`: endpoint shape/404/limit-clamp + unit tests for token
derivation and directional attribution, plus a scan benchmark — all
pass.
- Frontend: helper tests + Reach-page E2E (`test-node-reach-e2e.js`),
standalone route + incoming/outgoing toggles.
- `go build ./...` and `eslint public/*.js` (no-undef) clean.
## Docs
Design spec, implementation plan, and the `GET
/api/nodes/{pubkey}/reach` API contract are included under `docs/`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|
|
a4776557ae |
feat(#1290): use firmware repeat:on|off hint to exclude listener-only observers from disambiguator (#1624)
Closes #1290. cross-stack: justified — backend persists firmware-side `repeat` hint to a new observers column, frontend surfaces the listener/repeater status as a badge on the observers list and node-detail Heard By table per the issue's UI acceptance criterion. ## What Firmware 1.16 publishes a `repeat: on|off` flag in the MQTT `/status` JSON (confirmed by @cwichura on the issue thread — see [`MQTTMessageBuilder.cpp:58`](https://github.com/agessaman/MeshCore/blob/b45373a31f111fb0de98bb3b168226d09ceadc47/src/helpers/MQTTMessageBuilder.cpp#L58) in `agessaman/MeshCore mqtt-bridge-implementation-flex`). Listener-only observers (`repeat:off`) by firmware contract never relay packets, so they cannot legitimately be a hop in someone else's resolved path. This PR plumbs the hint end-to-end so the disambiguator stops considering them. ## How * **`internal/dbschema`**: idempotent `can_relay INTEGER DEFAULT 1` migration on `observers`, plus `AssertReady` probe (server fatal-logs if absent). Mirrored in `cmd/ingestor/db.go` `CREATE TABLE` for fresh DBs. Annotated `PREFLIGHT: async=true` — `DEFAULT 1` is constant so SQLite does this as a metadata-only schema rewrite. * **`cmd/ingestor`**: `extractObserverMeta` accepts `repeat` as bool, case-insensitive string (`on|off|true|false|yes|no`), or numeric `0|1`. Missing field → `nil` → `COALESCE` preserves the existing column value (back-compat with legacy observers). Plumbed through `UpsertObserverAt` and the prepared upsert statement. * **`cmd/server`**: `GetNonRelayObserverPubkeys` + new `prefixMap.markNonRelay` drop matching candidates inside `pm.resolveWithContext` at the top of the resolver, so all 4 tiers see the pruned candidate set. `ObserverResp.CanRelay` is surfaced on `/api/observers` and `/api/observers/{id}`. `GetNodeHealth` enriches per-observer rows with `can_relay` so the node-detail badge renders. Probe-and-fall-back when the `can_relay` column is absent (legacy test fixtures). 
* **`public/`**: listener vs repeater pill on observers list, observer detail `Relay` stat card, and node-detail `Heard By` table. CSS uses existing theme vars. ## Test Added `TestResolveWithContext_ExcludesNonRelayObservers_Issue1290` in `cmd/server/resolve_non_relay_1290_test.go` covering all three required cases: * `repeat:off` pubkey → not a candidate (assertion failed in red commit `5f7fdb96`, passes after green `f12911dc`) * `repeat:on` pubkey → still a candidate (regression guard) * legacy obs (no field) → still a candidate (back-compat) Red→green proof: ``` $ git log --oneline origin/master..HEAD |
||
|
|
3d12266595 |
fix(#1608): address PR #1609 follow-up findings — config doc, receipt-time liveness, buffer stop/clamp warn (#1623)
Follow-up to #1609 / #1608. Addresses the 5 unresolved findings from the PR #1609 round-1 polish review. ## Findings addressed | Tag | Severity | Fix | Commits | |-----|----------|-----|---------| | **B1** | BLOCKER | Document `ingestBufferSize` in `config.example.json` near other ingestor knobs. Default `50000`, comment text from review. | `f0b4e411` | | **M1** | MAJOR (option 1 from review) | Split receipt-time vs post-write liveness: add `SourceLivenessState.LastReceiptUnix` + `MarkReceipt`, stamp at the MQTT receipt callback, leave `LastMessageUnix` post-write only. Drop the double-stamp at receipt that masked write-path stalls. Surface both clocks via the ingestor stats file (`source_liveness`) and the server's `/api/healthz` (`ingest_liveness`, additive — older builds unaffected). | RED `fa78233d` / GREEN `bc81b544` | | **M1 (drop-log)** | MAJOR | Log every drop when buffer is at capacity. Removes the `n==1 \|\| n%1000` throttle that hid the first stall behind 1000 lost packets. The Submit drop branch only fires when the channel is at cap so volume is naturally bounded by the stall, not by an arbitrary modulo. | RED `a468763e` / GREEN `7b24fce5` | | **m1** | MINOR | Add `IngestBuffer.Stop()` and `Done()` so tests stop leaking the consumer goroutine that `Start()` spawns. Existing tests gain `t.Cleanup(b.Stop)`. Drain semantics: stop-before-Ready exits immediately; stop-after-Ready best-effort drains queued jobs. | RED `8430c822` / GREEN `78c9b223` | | **m2** | MINOR | `NewIngestBuffer(<1)` now logs a `[ingest-buffer] WARN` line on clamp so misconfigured `ingestBufferSize` values are visible instead of silently running a 1-slot queue. Test captures log output. | RED `62119ab4` / GREEN `815bfd02` | | **m3** | MINOR | Add godoc to `Submit` and `Ready` documenting the Start-before-Submit / Start-before-Ready ordering invariant. | `564a813b` | ## TDD discipline Each behavioral fix (M1, M1-drop-log, m1, m2) lands as a red-then-green pair. 
Red commits compile + run + fail on assertion, verified locally before the green commit. Per-finding red→green pairs are visible in the commit graph above. B1 and m3 are docs-only and ship as single commits (preflight script accepts them under the docs/comments exemption). ## Schema compatibility `/api/healthz` change is purely additive: `ingest_liveness` is only included when the ingestor publishes the new `source_liveness` field, so older ingestor + newer server combos are unaffected. Field order in the response stays stable for prior consumers. ## Test output - `go test -count=1 -timeout 180s ./cmd/ingestor/...` → green (160s) - `go test -count=1 -timeout 300s ./cmd/server/...` → green (48s) - Race-mode runs of the touched packages (`IngestBuffer|Liveness|Watchdog|Receipt|Healthz`) → green - Full-package race runs locally exceed the brief's 120s timeout on pre-existing slow integration tests (TestObsTimestampIndexMigration, TestNeighborEdgesBuilderDeltaScan); CI has the headroom. ## Preflight `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` → all hard gates pass, no warnings. ## Files changed - `config.example.json` — B1 - `cmd/ingestor/ingest_buffer.go` — m1, m2, M1-drop-log, m3 - `cmd/ingestor/ingest_buffer_test.go` — m1, m2, M1-drop-log - `cmd/ingestor/mqtt_watchdog.go` — M1 - `cmd/ingestor/mqtt_watchdog_m1_test.go` — M1 (new) - `cmd/ingestor/main.go` — M1 (receipt callsite) - `cmd/ingestor/stats_file.go` — M1 (publish `source_liveness`) - `cmd/server/perf_io.go` — M1 (type + reader) - `cmd/server/healthz.go` — M1 (surface `ingest_liveness`) Original review reference: PR #1609 polish review by the M-axis bot. --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
bc1822e46c |
perf(load): chunked Load with early HTTP readiness (#1009) (#1596)
## What Switches the server's startup from a synchronous full-scan `PacketStore.Load()` to a chunked `LoadChunked(chunkSize)` that: 1. Streams transmissions+observations from SQLite in id-ordered chunks (default `chunkSize=10000`, configurable via `db.load.chunkSize`). 2. Closes `FirstChunkReady()` after the first chunk is merged — `main.go` binds the HTTP listener on that signal instead of blocking on the full multi-minute load. 3. Stamps `X-CoreScope-Load-Status: loading; progress=<rows>` on every response while LoadChunked is in flight, flipping to `ready` once it completes (via `loadStatusMiddleware`). 4. Preserves the existing retention/`hotStartupHours`/`maxMemoryMB` clamps and the post-load index rebuild (`pickBestObservation` / `buildSubpathIndex` / `buildPathHopIndex` / `buildDistanceIndex`). ## Why Per #1009: at 5M+ observations (Cascadia scale) the synchronous Load blocked HTTP for ~80s with a 2–3× steady-state RAM peak. With chunked load the listener binds within seconds; dashboards and probes can read partial data and see the `loading` status header until the background load finishes. ## Notes - `/api/healthz` readiness gate (`readiness` atomic, init `WaitGroup`) is unchanged — it still waits for neighbor-graph build + initial `pickBestObservation` before reporting `ready:true`. `LoadChunked` only changes when the listener BINDS, not when it advertises ready. - `cmd/server/main.go` waits for `FirstChunkReady` (or the full load on a tiny DB) before proceeding, and drains the load goroutine in the background with a logged error path. - Config Documentation Rule: `config.example.json` now documents `db.load.chunkSize` with a nested `_comment` describing the trade-off. 
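The first-chunk signal described above can be illustrated with a close-once channel. The sketch below is a simplified, in-memory analogue under stated assumptions: the `chunkLoader` type and its slice-backed source are hypothetical, and the real `LoadChunked` streams from SQLite and takes only a chunk size.

```go
package main

import (
	"fmt"
	"sync"
)

// chunkLoader is a hypothetical stand-in for the packet store.
type chunkLoader struct {
	firstChunk chan struct{}
	once       sync.Once
	mu         sync.Mutex
	rows       []int
}

func newChunkLoader() *chunkLoader {
	return &chunkLoader{firstChunk: make(chan struct{})}
}

// FirstChunkReady is closed once the first chunk has been merged.
func (l *chunkLoader) FirstChunkReady() <-chan struct{} { return l.firstChunk }

// LoadChunked merges source in chunks of chunkSize and reports each chunk via onChunk.
func (l *chunkLoader) LoadChunked(source []int, chunkSize int, onChunk func(n int)) {
	if chunkSize < 1 {
		chunkSize = 1
	}
	// Also signal on return so waiters never hang on an empty source.
	defer l.once.Do(func() { close(l.firstChunk) })
	for start := 0; start < len(source); start += chunkSize {
		end := start + chunkSize
		if end > len(source) {
			end = len(source)
		}
		l.mu.Lock()
		l.rows = append(l.rows, source[start:end]...)
		l.mu.Unlock()
		l.once.Do(func() { close(l.firstChunk) })
		if onChunk != nil {
			onChunk(end - start)
		}
	}
}

func main() {
	source := make([]int, 2500)
	l := newChunkLoader()
	chunks := 0
	done := make(chan struct{})
	go func() {
		defer close(done)
		l.LoadChunked(source, 1000, func(int) { chunks++ })
	}()
	<-l.FirstChunkReady() // the HTTP listener would bind at this point
	<-done                // the rest of the load drains in the background
	fmt.Println("chunks:", chunks, "rows:", len(l.rows))
}
```

Using `sync.Once` keeps the close idempotent, so the same signal covers both the normal first-chunk case and the tiny-or-empty database case where the loop body may never run.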
## Tests - `cmd/server/chunked_load_test.go` asserts: - (a) `FirstChunkReady` fires before `LoadChunked` returns - (b) `X-CoreScope-Load-Status` transitions `loading; progress=...` → `ready` - (c) `chunkSize` honored (2500 rows @ 1000 → 3 chunks via `OnChunkLoaded`) - (d) `Config.DBLoadChunkSize()` default 10000 + override - Red commit (`102a4c84`) lands the tests with stubs that fail on assertion — verified locally before the green commit. - Green commit (`35cecf16`) makes all four pass; full `cmd/server` suite green (47s locally). Closes #1009 ## TDD red-commit exemption The original red commit `f878e15e` ("test(load): failing tests for chunked Load + early HTTP readiness") fails to **compile** rather than failing on an assertion, because it references symbols (`store.LoadChunked`, `store.FirstChunkReady`, `store.OnChunkLoaded`, `Config.DBLoadChunkSize`, `loadStatusMiddleware`) that do not exist on master. Per `AGENTS.md` the bar is "MUST fail on an assertion ... A compile error is NOT a valid red commit." This is claimed under the **net-new surface** exemption with the following justification: - LoadChunked / FirstChunkReady / loadStatusMiddleware / DBLoadChunkSize are all introduced by this PR — no prior implementation existed to refactor. There is no behaviour on master that the red commit could meaningfully assert against without first declaring the new symbols. - The cheapest "proper" alternative (split the red into two commits: stub-first + assertion-fail) was deferred because the test file unambiguously fails on missing-symbol — there is no risk of the test becoming a tautology against a pre-existing stub. - **Behaviour gating IS proven elsewhere on this branch.** Commit `799bde49` ("test(load): red — LoadChunked must mark indexes ready + not flip Complete on error") is a proper assertion-fail red against the same package, and commit `92cadd1d` is the matching green. Reviewers can verify the red→green pattern there. 
If a future reviewer wants the strict pattern, the follow-up is mechanical: split `f878e15e` into a stub-only commit followed by the assertion commit. Not done here to keep the rework cost proportional to the risk (zero, in this case). ## Preflight overrides - check-async-migrations: justified — the flagged `CREATE TABLE`/`CREATE INDEX` statements live in `cmd/server/chunked_load_id_zero_test.go` and `cmd/server/chunked_load_oldest_test.go` only. They run against per-test `t.TempDir()` SQLite files (in-process, ~10 rows, lifetime = single test) — they are NOT production schema migrations. No prod table is touched. PREFLIGHT-MIGRATION-SCALE: <30s N=10 (per-test tempdir fixture). --------- Co-authored-by: CoreScope Bot <bot@corescope.local> Co-authored-by: clawbot <bot@noreply.example.com> Co-authored-by: Kpa-clawbot <bot@example.com> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot> |
||
|
|
7421ead9b0 |
fix: bypass API limit clamps for internal UI requests. Revisit of issue #1540 (#1589)
This PR replaces the strict, hardcoded limits on API list endpoints (introduced in the recent security patch) with a new operator-configurable `listLimits` block. The change is needed because the fix for issue #1540 imposed a 500-node maximum on the live map and on every other feature that uses the `/api/nodes` backend.
Previously, we attempted to bypass public caps for internal UI requests using a heuristic based on browser headers (`Sec-Fetch-Site`). Following review, we decided to drop that heuristic entirely to eliminate any security-by-browser-convention surface area.
Instead, `queryLimit()` returns to its original, mathematically simple bounds-checking shape, and the absolute maximums are now drawn from `config.json`. This provides equal DoS protection against all callers while allowing server operators to tune the ceilings based on the size of their mesh (e.g. embedded devices can tighten the knobs, regional hubs can raise them).
### Changes Made:
- **`config.go`**: Introduced a `ListLimits` config struct containing `PacketsMax`, `NodesMax`, `AnalyticsMax`, and `ChannelMessagesMax`. Added safe initialization to ensure default caps (10000, 2000, 200, 500 respectively) apply even if the block is omitted from the config.
- **`clamp_limit.go`**: Deleted `isInternalUIRequest` entirely and restored `queryLimit` to its original signature (`r, def, max`).
- **`routes.go`**: Replaced all hardcoded integer ceilings on list endpoints (`/api/packets`, `/api/nodes`, etc.) with `s.cfg.ListLimits.*`.
- **`config.example.json`**: Added the `listLimits` block with documentation to guide new operators.
- **`clamp_limit_test.go`**: Removed all header-heuristic tests.
### Verification:
- All 611 backend unit tests pass (`npm run test:unit`).
- Bounds checking continues to clamp every request at exactly the operator's configured limit.
---------
Co-authored-by: mc-bot <bot@openclaw.local>
Co-authored-by: openclaw-bot <bot@openclaw>
|
||
|
|
1bdb92de88 |
feat(#1574): operator-configurable liveMap.maxNodes (default 2000) (#1577)
Red commit:
|
||
|
|
ad41b9bb7b |
fix(tests): subpaths_window tests wait for index readiness after #1595 chunked load (#1621)
## Why master is red After PRs #1592 (route-window subpath regression test) and #1595 (background/chunked index build with 503 readiness gate) were merged together, two tests in `cmd/server/subpaths_window_test.go` started failing on master: ``` --- FAIL: TestSubpathsHonorsTimeWindow_StoreLevel subpaths_window_test.go:70: unbounded: expected totalPaths=2, got 0 (subpaths=[]) --- FAIL: TestSubpathsHandlerHonorsTimeWindow subpaths_window_test.go:116: GET /api/analytics/subpaths?...: status=503 body={"error":"index loading","retryAfter":5} ``` Both branches passed in isolation; the conflict only manifested post-merge. Reason: - **#1592** added tests that call `store.Load()` then immediately query `GetAnalyticsSubpathsWithWindow` / hit `/api/analytics/subpaths`. - **#1595** moved the subpath + path-hop index builds off the critical path of `Load()` into background goroutines, and hard-gated the analytics handlers behind `SubpathIndexReady()` (returning 503 + `Retry-After: 5` until the build completes). So after `Load()` returns, `s.spIndex` is still empty for a short window and the handler returns 503. The store-level test sees `totalPaths=0`; the handler test sees the 503. ## Fix (test-only) Add `store.WaitIndexesReady(5 * time.Second)` between `Load()` and the assertions in both tests. This matches the established pattern already used by `routes_test.go` and `repeater_enrich_recomputer_1008_test.go`. The 503 readiness gate from #1595 is intentional production behavior and is **not** touched. No production code is modified. 
## Repro
Before:
```
$ go test ./cmd/server/ -run TestSubpaths.*Window -v -count=1
--- FAIL: TestSubpathsHonorsTimeWindow_StoreLevel (0.01s)
    subpaths_window_test.go:70: unbounded: expected totalPaths=2, got 0 (subpaths=[])
--- FAIL: TestSubpathsHandlerHonorsTimeWindow (0.02s)
    subpaths_window_test.go:116: GET /api/analytics/subpaths?minLen=2&maxLen=8: status=503 body={"error":"index loading","retryAfter":5}
FAIL
```
After:
```
$ go test ./cmd/server/ -run TestSubpaths.*Window -v -count=3
--- PASS: TestSubpathsHonorsTimeWindow_StoreLevel (0.01s)
--- PASS: TestSubpathsHandlerHonorsTimeWindow (0.02s)
... (x3) ...
PASS
ok github.com/corescope/server 0.097s
$ go test ./cmd/server/ -count=1 -timeout 300s
ok github.com/corescope/server 46.292s
```
## Files changed
- `cmd/server/subpaths_window_test.go` (+11 lines, test-only)
## Notes
- TDD exemption: this is a test-fix PR for a merge-conflict-induced failure. The "failing test" already exists on master; this PR makes it pass correctly by waiting on the readiness gate the test was previously unaware of.
- Unblocks staging deploys.
Co-authored-by: openclaw-bot <bot@openclaw>
|
||
|
|
222bfdf6cf |
feat(perf): SQLite writer-lock wait/hold instrumentation per component (#1340) (#1594)
## What Per-component SQLite writer-lock instrumentation so the next neighbor-builder-style write-lock starvation (root cause of #1339, invisible to operators for ~3 days) is detectable from `/api/perf`. Adds `Store.WriterExec` / `Store.WriterTx` wrappers that gate every wrapped call on a package-level `writerMu` so the wait the SQLite driver hides becomes Go-visible, and record `wait_ms` + `hold_ms` + `contention_total` (wait_ms > 100ms) under a component tag. Per-component p50/p95/p99 + max are published to `/api/perf/write-sources` under `.writer_perf` via the existing ingestor stats-file path. Slow-writer log line (`[db-slow-writer] component=X duration=Yms query=<200ch>`) fires on `hold_ms > 500ms` (threshold overridable via `CORESCOPE_DB_SLOW_WRITER_MS` env var). ## Tagged call sites | Component | Location | |-----------|----------| | `mqtt_handler` | `InsertTransmission` (db.go) | | `neighbor_builder` | `buildAndPersistNeighborEdges` (neighbor_builder.go) | | `prune_packets` | `PruneOldPackets` (maintenance.go) | | `prune_observers` | `RemoveStaleObservers` + orphan-metrics cleanup (db.go) | | `prune_metrics` | `PruneOldMetrics` (db.go) | | `vacuum` | `RunIncrementalVacuum` + `CheckAutoVacuum`'s full VACUUM (db.go) | ## TDD red→green - **Red commit** `68de585b` — `cmd/ingestor/db_writer_perf_test.go` + `Store.Writer*` stubs at end of `db.go`. Test synthetically blocks the writer for 60s tagged `neighbor_builder`, then asserts `mqtt_handler.wait_ms.p99 > 50000ms` on concurrent inserts. Fails on the assertion (p99 = 0.0ms) with the stub — not a build error. - **Green commit** `6a9be174` — replaces stubs with real wait/hold/contention aggregator + wires every writer call site. 
Same test passes:
```
2026/06/05 04:36:47 [db-slow-writer] component=neighbor_builder duration=60059.0ms query=COMMIT
--- PASS: TestWriterStarvationVisibleInPerf (60.40s)
PASS
ok github.com/corescope/ingestor 60.408s
```
## Scope discipline
- **API**: no public `Store`/`DB` signature change. Only additive exports.
- **Server**: extends the existing `/api/perf/write-sources` JSON with `.writer_perf`; it does **not** add a new route and does **not** replace `handlePerf`. The `.writer_perf` map is empty when paired with an older ingestor.
- **Read/write invariant** (#1283) preserved: all instrumentation lives on the ingestor's writer connection.
- **Files touched** (7 total): `cmd/ingestor/db.go`, `cmd/ingestor/db_writer_perf_test.go`, `cmd/ingestor/maintenance.go`, `cmd/ingestor/neighbor_builder.go`, `cmd/ingestor/stats_file.go`, `cmd/server/perf_io.go`, `config.example.json`.
## Deferred (acceptance items NOT in this PR)
- **`mbcap_persist` component tag**: `RunMultibyteCapPersist`'s tx is intentionally NOT wrapped in this PR, to stay within the implementation brief's 3-files-outside-whitelist budget. Instrumenting it is a one-file follow-up.
- **CI smoke test** asserting "neighbor-builder hold_ms < 1000ms on 100k-obs fixture": deferred to a separate PR per the brief; this PR is scoped to instrumentation only.
## Preflight overrides
PREFLIGHT-MIGRATION-SCALE: <30s N=runtime. The async-migration gate flagged five `instrumentedExec` / wrapped-`tx.Exec` lines on `DELETE FROM observer_metrics`, `UPDATE observers`, `DELETE FROM observer_metrics`, `DELETE FROM observations`, `DELETE FROM transmissions`. These are **not** schema migrations; they are the existing runtime prune / retention queries that already ran sync against `s.db.Exec` / `tx.Exec` on every retention cycle on master. This PR only swapped the surface call (sync → sync, via the wrapper) to record wait/hold timing; no new sync schema work was introduced. Behavior on production data is identical to master.
Also: red commit's synthetic `UPDATE nodes SET name = name WHERE 0` is a test-only stub designed to acquire the writer without mutating any row (the `WHERE 0` predicate matches no rows).
Fixes #1340
---------
Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
1b112f0b08 |
feat(memlimit): GOMEMLIMIT via runtime.maxMemoryMB in server + ingestor (#1010) (#1595)
Red commit:
|
||
|
|
df61660a5e |
perf(load): background subpath+pathHop index builds with ready gates (#1008) (#1604)
## Summary
Mirrors the distance-index lazy pattern (#1011): the subpath and
path-hop index builds are no longer part of `Load()`'s synchronous
critical section. They now run in **two parallel background goroutines**
kicked off after `s.loaded = true`, so HTTP comes up immediately even at
Cascadia scale (5M observations, previously ~60s blocked on these two
builds inside `Load()` under `s.mu`).
Fixes #1008.
## Approach
Two new `atomic.Bool` fields on `PacketStore` (`subpathReady`,
`pathHopReady`) plus a one-shot broadcast channel (`indexReadyChan`) for
waiters. `Load()` removes the synchronous `s.buildSubpathIndex()` /
`s.buildPathHopIndex()` calls and instead kicks
`s.startBackgroundIndexBuilds()` right before returning. That function
spawns **two independent goroutines** (review m7), one per index. Each
goroutine:
1. acquires `s.mu.Lock()` (blocks until `Load()`'s deferred Unlock
fires),
2. runs its builder, releases the lock, stores its `ready = true`,
3. closes the broadcast channel if both flags are now true,
4. logs `[startup] index build complete: subpath (Xs)` (or pathHop).
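The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `index_ready_1008.go`: the `store` type, its map fields, and the `build` helper are hypothetical, and a `sync.Once` is used here to make the channel close safe when both goroutines observe both flags set.
```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// store is a hypothetical stand-in for PacketStore with two lazily built indexes.
type store struct {
	mu           sync.RWMutex
	subpaths     map[string]int
	pathHops     map[string]int
	subpathReady atomic.Bool
	pathHopReady atomic.Bool
	closeOnce    sync.Once
	indexReady   chan struct{} // closed once both flags are true
}

func newStore() *store {
	return &store{indexReady: make(chan struct{})}
}

// startBackgroundIndexBuilds spawns one goroutine per index.
func (s *store) startBackgroundIndexBuilds() {
	go s.build("subpath", &s.subpathReady, func() {
		s.subpaths = map[string]int{"aa>bb": 1} // placeholder for the real builder
	})
	go s.build("pathHop", &s.pathHopReady, func() {
		s.pathHops = map[string]int{"aa": 1} // placeholder for the real builder
	})
}

// build runs one builder under the write lock, publishes its ready flag,
// and closes the broadcast channel once both flags are set.
func (s *store) build(name string, ready *atomic.Bool, builder func()) {
	s.mu.Lock()
	builder()
	s.mu.Unlock()
	ready.Store(true) // stored after Unlock, so readers that see true see the maps
	if s.subpathReady.Load() && s.pathHopReady.Load() {
		s.closeOnce.Do(func() { close(s.indexReady) })
	}
	fmt.Printf("[startup] index build complete: %s\n", name)
}

func main() {
	s := newStore()
	s.startBackgroundIndexBuilds()
	<-s.indexReady // wait for the broadcast
	s.mu.RLock()
	fmt.Println("indexes ready:", len(s.subpaths), len(s.pathHops))
	s.mu.RUnlock()
}
```
Because each goroutine stores its own flag before loading both, at least one of them is guaranteed to see both flags set, so the channel is always closed exactly once.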
Analytics handlers whose entire response IS the index aggregate —
`/api/analytics/subpaths`, `/api/analytics/subpaths-bulk`,
`/api/analytics/subpath-detail`, `/api/nodes/{pubkey}/paths` — gate
reads behind the corresponding atomic and respond with `503 Service
Unavailable`, `Retry-After: 5`, body `{"error":"index
loading","retryAfter":5}` until the build completes — matching the
triage spec.
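A gate of this shape could look like the sketch below. The status code, `Retry-After` value, and JSON body are taken from the description above; the `gateOnReady` middleware and the `statusFor` probe are hypothetical names, and the real handlers may inline the check instead of wrapping.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// gateOnReady answers 503 with Retry-After: 5 until the ready flag is set,
// then delegates to the wrapped handler.
func gateOnReady(ready *atomic.Bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.Header().Set("Content-Type", "application/json")
			w.Header().Set("Retry-After", "5")
			w.WriteHeader(http.StatusServiceUnavailable)
			_ = json.NewEncoder(w).Encode(map[string]any{
				"error":      "index loading",
				"retryAfter": 5,
			})
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor probes the gated handler in-process and returns the status code
// and Retry-After header for a given readiness state.
func statusFor(isReady bool) (int, string) {
	var ready atomic.Bool
	ready.Store(isReady)
	h := gateOnReady(&ready, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"subpaths":[]}`)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/analytics/subpaths", nil))
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	code, retry := statusFor(false)
	fmt.Println("before build:", code, "Retry-After:", retry)
	code, _ = statusFor(true)
	fmt.Println("after build:", code)
}
```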
### Handler scope (review M2)
A second class of handlers also touches these indexes — `/api/nodes`,
`/api/nodes/{pubkey}`, the `GetRepeaterRelayInfoMap` /
`GetRepeaterUsefulnessScoreMap` / `GetBridgeScore` enrichment helpers,
and `repeater_liveness` / `repeater_usefulness`. These are
**intentionally NOT 503-gated**: they expose the index via optional
enrichment fields that callers already treat as "may be empty", and
503-ing the SPA bootstrap to wait for an index that only affects
relay-activity badges would be a worse UX than a 30–60s window of "—"
values. The rationale is documented in the package doc-comment at the
top of `index_ready_1008.go`.
The recomputer's synchronous prewarm path
(`StartRepeaterEnrichmentRecomputer`) gates on `WaitIndexesReady(60s)`
(review M1) so it never snapshots an empty `byPathHop` into
`s.repeaterRelayCache`; on timeout it skips the prewarm and lets the
5-minute ticker pick up the populated index.
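The wait-with-timeout behavior described above can be approximated with a `select` over the broadcast channel and a timer. This is a sketch under assumptions: `waitReady`, `prewarm`, and `prewarmDemo` are hypothetical names, and the real `WaitIndexesReady` is a method on the store rather than a free function.
```go
package main

import (
	"fmt"
	"time"
)

// waitReady blocks until ready is closed or the timeout elapses.
// It reports whether the indexes became ready in time.
func waitReady(ready <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-ready:
		return true
	case <-timer.C:
		return false
	}
}

// prewarm takes the snapshot only if the indexes are ready; on timeout it
// skips, leaving the work to a later periodic recompute.
func prewarm(ready <-chan struct{}, timeout time.Duration, snapshot func()) bool {
	if !waitReady(ready, timeout) {
		return false
	}
	snapshot()
	return true
}

// prewarmDemo runs prewarm against a channel that is either already closed
// or never closed, and reports whether the snapshot ran.
func prewarmDemo(indexesBuilt bool, timeoutMs int) bool {
	ready := make(chan struct{})
	if indexesBuilt {
		close(ready)
	}
	ran := false
	prewarm(ready, time.Duration(timeoutMs)*time.Millisecond, func() { ran = true })
	return ran
}

func main() {
	fmt.Println("indexes built, prewarm ran:", prewarmDemo(true, 1000))
	fmt.Println("indexes missing, prewarm ran:", prewarmDemo(false, 20))
}
```
Skipping on timeout rather than snapshotting anyway is what keeps an empty index out of the cache.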
## Concurrency safety
Each build goroutine acquires `s.mu.Lock()` before calling the existing
`buildSubpathIndex()` / `buildPathHopIndex()` helpers, which replace
`s.spIndex` / `s.spTxIndex` / `s.byPathHop` with freshly-allocated maps.
Visibility of the populated maps to handlers that observe
`Ready()==true` is established by Go 1.19+ sync/atomic acquire-release
semantics: the atomic store of `true` happens-after `s.mu.Unlock()`, and
the handler's atomic load synchronizes-with that store. The handler's
subsequent `s.mu.RLock` serializes against concurrent ingest writers,
not against the builder.
The existing `main.go` boot sequence does not start ingest goroutines
until after `store.Load()` returns and graph init completes, so the
brief window between `Load()` returning and the two goroutines acquiring
`s.mu` does not race with concurrent ingest writes.
## TDD: red → green
- **Red** commit `63e79e11`: `cmd/server/index_ready_1008_test.go` adds
four assertions; `cmd/server/index_ready_1008.go` adds compile-only
stubs returning `true` so the tests fail on assertions, not build
errors.
- **Green** commit `fb1d22b0`: implements the real atomic gates, the
background goroutine, and the four handler 503 branches; also updates
four existing tests that read indexes directly post-`Load()` to call
`store.WaitIndexesReady(5s)` first.
- **Race-fix commit `b77d56eb`** (review m8 — test-infra exemption):
adds `WaitIndexesReady` calls in test helpers/setup paths so the race
detector no longer flags the read-after-Load() pattern in existing
tests. Per AGENTS.md, race-detector flakes are observable evidence (test
crashes under `-race`) and qualify for the test-infra exemption from the
TDD red-commit requirement; no behavior change in production code.
- **Polish round 2 — M1 red `408c7462` / green `85e82c8a`**:
`TestIssue1008_M1_PrewarmWaitsForIndexes` asserts the recomputer prewarm
SKIPs when indexes are not ready. Red commit adds the assertion + a stub
`repeaterEnrichmentPrewarmWait` var; green commit wires
`WaitIndexesReady` into the prewarm path and adds the handler-scope docs
for M2.
- **Polish round 2 — minor cleanups `fd089bd0`** (m3..m7): chunk-loader
wires `markIndexesReadySync`, memory-model comment rewritten to cite
acquire-release, sentinel deleted, polling replaced with a broadcast
channel, two parallel goroutines for the builds.
`TestIssue1008_m7_BothFlagsSetAfterParallelStart` covers the parallel
path.
## Reproduction
```
git fetch origin fix/issue-1008
git checkout
|
||
|
|
3898688d6d |
analytics: Relay Airtime Share endpoint + dumbbell chart (#1359) (#1601)
Implements the locked spec from #1359. Red commit: |
||
|
|
d6384c3c59 |
fix(#1217): honor time-window filter on Route Patterns analytics (#1592)
## What
The Route Patterns chart on `/#/analytics` ignored the Time window picker — every selection returned identical data. This PR threads `?window=` through to the backing endpoints and the store-level computation.
## Root cause
`cmd/server/routes.go:2065` (`handleAnalyticsSubpaths`) and `cmd/server/routes.go:2090` (`handleAnalyticsSubpathsBulk`) never called `ParseTimeWindow(r)`. The store-level entry points (`GetAnalyticsSubpaths`, `GetAnalyticsSubpathsBulk`) had no window-aware variant. The frontend (`public/analytics.js`) didn't append `&window=` to the `/analytics/subpaths-bulk` request.
## Fix
### Backend (`cmd/server/store.go`)
Added `GetAnalyticsSubpathsWithWindow` + `GetAnalyticsSubpathsBulkWithWindow`. Zero `TimeWindow` → byte-equivalent to the existing fast path (no perf regression on the default view). Non-zero window → iterate `s.packets`, filter on `tx.FirstSeen` via `TimeWindow.Includes`, reuse `rankSubpaths`. Cached by `(region|area|window)`.
```diff
-data := s.store.GetAnalyticsSubpaths(region, minLen, maxLen, limit)
+window := ParseTimeWindow(r)
+data := s.store.GetAnalyticsSubpathsWithWindow(region, minLen, maxLen, limit, window)
```
```diff
-results := s.store.GetAnalyticsSubpathsBulk(region, groups)
+results := s.store.GetAnalyticsSubpathsBulkWithWindow(region, groups, ParseTimeWindow(r))
```
### Frontend (`public/analytics.js`)
`renderSubpaths` now appends `&window=<value>` to the `/analytics/subpaths-bulk` request, matching how RF / topology / channels tabs already wire the picker.
## Before / after
```
GET /api/analytics/subpaths?window=24h → totalPaths=2 (all data — ignored window)
GET /api/analytics/subpaths?window=24h → totalPaths=1 (24h-bounded — honored)
```
## Tests
`cmd/server/subpaths_window_test.go`:
- `TestSubpathsHonorsTimeWindow_StoreLevel` — seeds a 1h-old tx with path `[aa,bb]` + a 30d-old tx with path `[cc,dd]`; asserts the unbounded call sees both and the 24h-windowed call sees only the recent one.
- `TestSubpathsHandlerHonorsTimeWindow` — same scenario via the HTTP handlers for `/api/analytics/subpaths` and `/api/analytics/subpaths-bulk`.
TDD: red commit `eefc27d3` (test fails on assertion with stub that ignores window), green commit `4c4c45d0` (implementation makes it pass). Full `go test ./...` in `cmd/server` green locally (~47s).
## Performance
Default view (no window selected) is unchanged — `window.IsZero()` short-circuits to the existing precomputed-index hot path. Windowed view is O(N_tx · path²), same complexity as the existing region-filtered slow path. Results cached per `(region|area|window)`.
Closes #1217
---------
Co-authored-by: Kpa-clawbot <bot@corescope> |
||
|
|
5629a489b2 |
perf(distance): lazy build distance index on first request (#1011) (#1597)
## Summary
Build the distance analytics index lazily on the first `/api/analytics/distance` request instead of eagerly inside `Load()` (and its background-load chunked merge). Per the triage Fix path on the issue:
- Eager startup build removed from `Load()` and from `loadAllPacketsBackground()`'s post-merge pass.
- First request returns `202 Accepted` + `Retry-After: 5` and kicks off the build in a background goroutine, gated by `sync.Once` so concurrent first-window requests all observe 202 (single build, not N parallel O(n²) computations).
- Once built, subsequent requests fall through to the existing analytics-recomputer / TTL cache and serve 200 as before.
- Debounced rebuild policy: refire only when `Δobs > 5%` since last build OR `>5 min` elapsed, whichever is more restrictive. Background loader also resets the gate so the next request rebuilds against the larger dataset.
Effect: operators who never visit distance analytics no longer pay the O(n²) construction at startup. Acceptance criteria (a) no startup build, (b) first request triggers build, (c) concurrent in-flight requests get 202 are encoded as failing-first tests.
## Red → green
- Red: `bc947ad1` — 3 assertion failures (`expected ... empty, got 3`, `expected 202, got 200`, `expected all 10 ... got 0`).
- Green: `5264b68a` — production change makes them pass, no other tests regress.
## Files changed
- `cmd/server/store.go` — lazy-build state (`distLazyMu`/`Once`/`Built`/`Building`/`LastBuilt`/`LastObs`), `TriggerDistanceIndexBuild`, `DistanceIndexBuilt`, `DistanceIndexBuilding`; eager `buildDistanceIndex` calls in `Load()` post-pass and chunked-background-load post-pass removed (Once reset instead so the next request rebuilds against the full dataset).
- `cmd/server/routes.go` — `/api/analytics/distance` returns 202 + `Retry-After` until built.
- `cmd/server/distance_lazy_index_test.go` — new tests (the three triage acceptance criteria).
- `cmd/server/coverage_test.go`, `cmd/server/parity_test.go`, `cmd/server/routes_test.go`, `cmd/server/hop_disambig_e2e_test.go` — pre-warm the index via `TriggerDistanceIndexBuild()` + `DistanceIndexBuilt()` poll where the test asserts the 200 JSON shape.
## Perf justification
Startup cost on a 500K-obs / 2K-node dataset: previously O(n²) hop scan during `Load()` post-pass and again during the background-load merge — measured at 10–20s in `specs/startup-audit.md`. New code: zero work at startup, the same O(n²) work runs at most once per HTTP request cycle (and only when the index is stale per debounce policy). Cold-path concurrency is bounded by `sync.Once`, so N parallel first-window requests never produce N parallel builds.
## Scope
No config field added (debounce thresholds are hardcoded constants per the triage Fix path — `5%` / `5min`). No public API signature changes. No DB-side migration. Tests cover the lazy invariant, the 202+Retry-After contract, and concurrent first-request behavior.
Closes #1011
---------
Co-authored-by: Kpa-clawbot <bot@corescope.local> |
||
|
|
3df8924114 |
fix(#1218): include multi-byte prefix repeaters in 1-byte hash usage matrix view (#1591)
## Problem
`/analytics` Hash Usage Matrix 1-byte view excluded repeaters configured
for 2- or 3-byte hash prefixes. In MeshCore, 1-byte path-matching is a
first-byte equality check, so any packet routed by 1-byte hash collides
on that first byte regardless of the downstream repeater's configured
prefix size. Omitting multi-byte prefix repeaters under-reports real
conflicts in the 1-byte hash space.
## Fix
**Data layer — `cmd/server/store.go` (`computeHashCollisions`,
~L7907-L7918 before, L7907-L7941 after):**
Before — `one_byte_cells` was populated only from `prefixMap`, which
only contained repeaters with `hash_size == 1`:
```go
if bytes == 1 {
oneByteCells = make(map[string][]collisionNode)
for i := 0; i < 256; i++ {
hex := strings.ToUpper(fmt.Sprintf("%02x", i))
oneByteCells[hex] = prefixMap[hex]
if oneByteCells[hex] == nil {
oneByteCells[hex] = make([]collisionNode, 0)
}
}
} else if bytes == 2 { ... }
```
After — additionally project all `hash_size in {2,3}` repeaters to their
first byte:
```go
if bytes == 1 {
// ... (same baseline population) ...
for _, cn := range allCNodes {
if cn.Role != "repeater" { continue }
if cn.HashSize != 2 && cn.HashSize != 3 { continue }
if len(cn.PublicKey) < 2 { continue }
hex := strings.ToUpper(cn.PublicKey[:2])
if _, ok := oneByteCells[hex]; !ok { continue }
oneByteCells[hex] = append(oneByteCells[hex], cn)
}
}
```
The 2-byte view's bucketing is unchanged — that view continues to count
only repeaters configured for 2-byte prefixes (those semantics differ).
**UI — `public/analytics.js` L1459:** clarified the 1-byte view
description so the inclusion of multi-byte prefix repeaters is explicit.
## API shape
No response-shape change. `one_byte_cells[HEX]` is still
`[]collisionNode`; only the contents now include 2/3-byte prefix
repeaters in the appropriate first-byte buckets. The existing frontend
decoder is unaffected.
## Tests
-
`cmd/server/routes_test.go::TestHashCollisionsOneByteIncludesMultiBytePrefixRepeaters`
— seeds three repeaters with first byte `CC` configured for 1/2/3-byte
prefixes plus an unrelated `DD` repeater, asserts all three appear in
`one_byte_cells["CC"]`, and that the 2-byte view's `nodes_for_byte` is
unchanged.
Red commit `278bdf8d` (test only) fails on assertion ("got 1, want 3");
green commit `9127ea4e` passes.
## Preflight
`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master`
→ clean.
Closes #1218
---------
Co-authored-by: clawbot <bot@corescope>
|
||
|
|
1a2b8c48be |
feat(node-detail): link RTC-reset warning to offending packet hashes (#1094) (#1590)
## Problem
Node detail's bimodal-clock warning showed only `⚠️ N of last M adverts had nonsense timestamps (likely RTC reset)` — no way to tell which packets, no way to verify the heuristic, no way to drill in.
## Fix
Additive, two-sides:
**Backend** (`cmd/server/clock_skew.go`)
- New type `BadSample { Hash, AdvertTS, SkewSec }`.
- New field `NodeClockSkew.RecentBadSamples []BadSample` (`omitempty`).
- Populated from the **same** bimodal-bad classification pass that produces `RecentBadSampleCount` — no heuristic change. `tsSkewPair` carries `hash` + `advertTS` so the classifier can record per-sample evidence without a second walk; drift code is unaffected (reads only `ts`/`skew`).
**Frontend** (`public/nodes.js`)
- `bimodalWarning` preserves the existing count summary line, then renders a `<ul>` of bad samples: each `<li>` is `<a href="#/packets/HASH">hash[:8]</a> → formatTimestamp(advertTS)` with ISO tooltip. Defensive `Array.isArray` so older API responses still render the summary alone.
## TDD
- **Red:** `cmd/server/clock_skew_issue1094_test.go::TestIssue1094_RecentBadSamples_ExposesHashAndTimestamp` — seeds 3 healthy + 2 bimodal-bad adverts, asserts `RecentBadSamples` has length 2 with the expected hashes and advert timestamps. Fails on the assertion (`len = 0, want 2`) with the stub-only commit.
- **Green:** classifier populates the slice; existing #1285 and bimodal tests stay green.
- Red commit: `ed501f4b`
- Green commit: `54305b06`
## Cross-stack
Backend + frontend ship together (`cross-stack: justified` commit). API stays backward compatible (`omitempty` server, `Array.isArray` client) but the feature only lights up with both halves present.
## Preflight
Clean — PII, branch scope, red-commit, CSS vars, XSS sinks, migrations, fixture coverage all pass.
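The backward-compatibility claim under Fix rests on how `omitempty` treats an empty slice. A minimal sketch, assuming hypothetical JSON tag names and field types (only the Go field names `Hash`, `AdvertTS`, `SkewSec`, `RecentBadSampleCount`, and `RecentBadSamples` come from the description above; `encode` and `hasKey` are illustration-only helpers):
```go
package main

import (
	"encoding/json"
	"fmt"
)

// BadSample mirrors the shape named in the PR; types and JSON tags are assumptions.
type BadSample struct {
	Hash     string  `json:"hash"`
	AdvertTS int64   `json:"advertTs"`
	SkewSec  float64 `json:"skewSec"`
}

// NodeClockSkew is reduced here to the two fields relevant to this change.
type NodeClockSkew struct {
	RecentBadSampleCount int         `json:"recentBadSampleCount"`
	RecentBadSamples     []BadSample `json:"recentBadSamples,omitempty"`
}

// encode marshals a NodeClockSkew built from the given samples.
func encode(samples []BadSample) string {
	b, err := json.Marshal(NodeClockSkew{
		RecentBadSampleCount: len(samples),
		RecentBadSamples:     samples,
	})
	if err != nil {
		return ""
	}
	return string(b)
}

// hasKey reports whether a JSON object string contains the given top-level key.
func hasKey(doc, key string) bool {
	var m map[string]json.RawMessage
	if err := json.Unmarshal([]byte(doc), &m); err != nil {
		return false
	}
	_, ok := m[key]
	return ok
}

func main() {
	// With no bad samples the field is omitted, so older clients see the old shape.
	fmt.Println(encode(nil))
	// With samples present the evidence list appears alongside the count.
	fmt.Println(encode([]BadSample{{Hash: "abcd1234", AdvertTS: 0, SkewSec: -1.5}}))
}
```
A client that guards with `Array.isArray` therefore handles both outputs: the key is simply absent when there is nothing to list.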
## Acceptance
- [x] Warning lists specific packet hashes
- [x] Each hash links to `#/packets/<hash>`
- [x] Bad advert timestamp shown next to the hash
- [x] Pattern is reusable — `BadSample` is a clean shape any future heuristic that flags specific packets can adopt
Fixes #1094
---------
Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
7533b3b67b |
feat(nodes): sortable First Seen column on Nodes table (#1166) (#1587)
## Summary
Adds a sortable **First Seen** column to the Nodes table so users can spot newly observed repeaters in their region (per the reporter's use case).
Closes #1166
## Backend
`/api/nodes` already exposes `first_seen` per node via `db.scanNodeRow` (sourced from the existing `nodes.first_seen` column — no schema migration, no recomputation, no extra query cost). The red test pins that contract.
## Frontend (`public/nodes.js`)
- New `<th data-sort-key="first_seen" data-sort-default="desc">First Seen</th>` between Last Seen and Adverts.
- Cell renders via `renderNodeTimestampHtml(n.first_seen)` — same relative-time + absolute-ISO `title=` tooltip as the Last Seen column. Empty values render as `—`.
- `sortNodes` gains a `first_seen` branch with **empty-last** semantics: nodes without a `first_seen` always sort to the bottom regardless of asc/desc direction, so unknowns never clutter the top of the table.
- Empty-state `colspan` bumped 7 → 8.
## TDD
- **Red commit** `112442f4` — `test-issue-1166-first-seen-column.js` + `cmd/server/first_seen_1166_test.go`. The backend half passes on red (field already returned); 5 frontend assertions fail on assertions (column header missing, sort branch missing, empty-last violated).
- **Green commit** `9274b36c` — only `public/nodes.js`. All 6 tests pass.
Verified red is real-fail (assertion-shaped) by checking out the red commit's `nodes.js` and re-running the test: 5 failures, all on `assert.strictEqual`, none on parse/import.
## Test results
```
node test-issue-1166-first-seen-column.js → 6 passed, 0 failed
node test-frontend-helpers.js → 611 passed, 0 failed
go test ./cmd/server/... → ok (45.16s, all pass)
```
## Files changed
- `public/nodes.js` (+14 / −1)
- `test-issue-1166-first-seen-column.js` (new)
- `cmd/server/first_seen_1166_test.go` (new)
## Scope guardrails
- No schema migration.
- No new files outside the worktree's three allowed surfaces.
- No refactor of other Nodes columns.
- Empty cells handled in both render (em-dash) and sort (always last).
---------
Co-authored-by: fix-1166-bot <bot@corescope.local> |