mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-05-19 09:55:05 +00:00
master
248 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
467b01a1b3 |
fix(#1285): exclude RTC-reset outliers from clock-skew hash median + recent bad count (#1288)
Red commit:
|
||
|
|
1da2034341 |
refactor(db): move all writes from server to ingestor; server truly read-only (fixes #1283) (#1286)
**Red commit:**
|
||
|
|
d667dc0a74 |
fix(#1278): /api/nodes/{pk}/paths uses canonical persisted resolved_path (drop anchor-bias inconsistency) (#1282)
First failing (RED) commit:
|
||
|
|
e6c30e1a7e |
feat(decoder): GRP_DATA + MULTIPART + advertRole fix + CONTROL flags (#1279 P0+P1) (#1280)
Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue #1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) | inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc | innerPayload}`. ### P1 — mis-classified / opaque - **#3 `advertRole()` raw-type fix.** Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - **#4 CONTROL (0x0B) byte0 flags + length.** Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. ### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. | Item | Tests | RED proof | |---|---|---| | #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6 assertions failed pre-impl | | #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed pre-impl | | #3 advertRole | ingestor + server: 7-row table | 3 assertions failed pre-impl | | #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions failed pre-impl | ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs #1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local> |
||
|
|
8bf7709970 |
feat(repeater): usefulness score — bridge axis (#672 axis 2 of 4) (#1275)
RED test commit: `fd661569` — CI will fail on this (stub returns empty map; assertions fail by design). GREEN: `bf4b8592`. ## What Implements **axis 2 of 4** for the repeater usefulness score per #672 ([status comment](https://github.com/Kpa-clawbot/CoreScope/issues/672#issuecomment-4484635378)). The Bridge axis measures *structural importance*: how many shortest paths between other nodes route through this one. A high-traffic redundant node and a low-traffic critical bridge will no longer look identical. ## Algorithm **Brandes' weighted betweenness centrality** with Dijkstra for shortest paths (`cmd/server/bridge_score.go`). - Nodes: pubkeys in the `neighbor_edges` graph - Edge weight: `Score(now) * Confidence()` — per the convention from #1235 (count + recency decay scaled by observer-diversity confidence). Geo-rejected edges already excluded at graph build time (#1230) so we don't re-filter here. - Dijkstra distance: `1 / max(epsilon, weight)` — high affinity = cheap cost. - Normalize: divide by max observed centrality so output is in `[0, 1]`. Cost: `O(V · (E + V log V))`. Staging-scale (~600 nodes / ~2 000 edges) ≈ ~4.8M ops, completes in milliseconds. ## Where it lives - `cmd/server/bridge_score.go` — pure algorithm, no locks - `cmd/server/bridge_recomputer.go` — background recomputer (mirrors #1240/#1262 pattern), 5-min default interval, initial sync prewarm, snapshot stored in `s.bridgeScoreMap atomic.Pointer[map[string]float64]` - `cmd/server/routes.go` — `handleNodes` adds `node["bridge_score"]` on repeater/room rows; node-detail handler adds it on the single-node path - `public/nodes.js` — separate **Bridge** row in the node detail panel, alongside the existing **Usefulness** (Traffic) row. Distinct colour-coded bar. ## What's NOT in this PR (still pending for #672) - **Coverage axis** (axis 3) — unique observer-pair connectivity - **Redundancy axis** (axis 4) — simulated node-removal impact - **Composite** — once all 4 axes ship, swap the `usefulness_score` formula from "traffic-only" to the weighted composite `Refs #672` (not `Fixes` — issue stays open until all 4 axes + composite ship). ## Tests - `TestComputeBridgeScores_LineGraph` — 4-node line: middles non-zero, leaves zero, max normalized to 1.0 - `TestComputeBridgeScores_TriangleNoBridge` — clique has zero bridges - `TestComputeBridgeScores_Empty` — defensive nil-safety - `TestComputeBridgeScores_WeightSensitive` — mutation guard: revert the `1/w` inversion and this test fails - `TestBridgeScore_HandleNodesSurface` — integration: `/api/nodes` returns `bridge_score` on repeater rows; middle nodes > 0, ends == 0 --------- Co-authored-by: clawbot <bot@meshcore.local> |
||
|
|
4cd8445233 |
perf(#1265): wire /api/observers/clock-skew + /api/nodes/clock-skew into analytics recomputer (#1266)
RED:
|
||
|
|
ae17a2be12 |
perf(#1262): /api/nodes?limit=2000 cold-miss 15.7s → <100ms — prewarm repeater enrichment cache (#1263)
RED commit: `22ce5736066142583017cad7303fa48d9e00ccf0` — CI on red: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1262 ## Problem After #1260 added a 15s-TTL bulk cache for repeater enrichment in `handleNodes`, `/api/nodes` (default limit) dropped to ~500ms. But `/api/nodes?limit=2000` — called by `public/live.js` at SPA startup for hop resolution — still took **15.7s cold** on staging (75k tx, 600 nodes). Warm hits were ~40ms. Root cause: the bulk cache was lazily populated on the first request after TTL expiry. The rebuild ran on the request-serving goroutine. Every cold SPA load triggered the rebuild and ate 15s. ## Fix Add `StartRepeaterEnrichmentRecomputer` — a steady-state background recomputer that mirrors the `analytics_recomputer.go` pattern from #1240: - **Prewarm**: initial synchronous compute on Start so the first request hits a populated cache. - **Steady-state**: ticker refreshes the snapshot every 5min (configurable via the existing analytics recompute interval knob). - **Panic-safe** + idempotent Start. Wired into `main.go` right after `StartAnalyticsRecomputers`, using `cfg.GetHealthThresholds().RelayActiveHours` as the window. ## Test `TestHandleNodesLimit2000ColdMiss` — seeds 600 nodes + 150k non-advert tx with repeaters indexed under a shared 1-byte hop prefix (matches production hop-prefix collisions), starts the recomputer, then issues `/api/nodes?limit=2000` with **no HTTP warmup**. | State | Latency | |---|---| | Before (master, on-thread rebuild) | 3.37s | | After (prewarm + steady-state) | 56ms | | Budget | 2s | Staging end-to-end: 15.7s → expected sub-100ms on the same call path. Red commit (`22ce5736066142583017cad7303fa48d9e00ccf0`) compiles with a no-op stub of the new method so the test fails on the latency **assertion**, not a missing symbol. Fixes #1262 --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
1efe93d7f6 |
perf(#1257): bulk-cache repeater enrichment in /api/nodes — 32s → <500ms (#1260)
RED commit `a2879e12` — perf regression test; CI run: see Actions tab. Fixes #1257. ## Root cause `handleNodes` looped over the response page and called `store.GetRepeaterRelayInfo(pk, win)` + `store.GetRepeaterUsefulnessScore(pk)` for every repeater/room. Each call: - grabbed its own `s.mu.RLock`, - walked `byPathHop[pk]` (+ the matching 1-byte raw-prefix bucket, which on busy networks fans out to nearly the entire non-advert tx set), - and re-parsed every `tx.FirstSeen` with `parseRelayTS`. Default page is the 50 most-recently-seen nodes — almost all hot repeaters — so the request did O(50) lock acquisitions and hundreds of thousands of timestamp parses on the same set of txs. That's the classic load-then-paginate / per-row N+1 shape called out in the issue (same family as #1226). The `?limit=2000` variant looks faster relatively only because per-node enrichment dwarfs serialization; on staging both still bottleneck on the same loop. ## Fix Two new bulk methods on `PacketStore`: - `GetRepeaterRelayInfoMap(windowHours)` → `pubkey → RepeaterRelayInfo` - `GetRepeaterUsefulnessScoreMap()` → `pubkey → 0..1` Both snapshot `byPathHop` under a single `RLock`, pre-parse each `FirstSeen` exactly once (a tx that appears in N hop buckets used to be parsed N times), and emit one entry per hop key. Cached 15s — same TTL as `GetNodeHashSizeInfo` / `GetMultiByteCapMap`, same status-column freshness budget. `handleNodes` is one map-lookup per node; behavior, output schema, and `RelayActive` / `RelayCount{1h,24h}` / `LastRelayed` / `usefulness_score` semantics are preserved. ## Why no `limit` default change The issue mentioned a default-limit knob. Investigated: `queryInt(r, "limit", 50)` already defaults to 50 — frontends calling `/api/nodes` (no limit) get a 50-row page today. Capping further would change behavior (live.js already passes `?limit=2000` when it wants more); the cost was per-repeater enrichment, not page size. Fixing the N+1 is the correct lever and preserves backward compat. ## Perf Regression test `TestHandleNodesPerfLargeFleet` (600 nodes, 150k non-advert tx, repeaters indexed under `byPathHop`): | | elapsed | vs 2s budget | |---|---|---| | before (master) | 4.72s | ✗ | | after | ~4ms | ✓ (~1000×) | ## TDD - RED: `a2879e12` — test fails at 4.72s on master. - GREEN: `c529d29a` — fix; full `cmd/server` + `cmd/ingestor` suites green. --------- Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
f81ed5b3cf |
perf(#1256): wire /api/analytics/roles into steady-state recomputer (#1259)
RED commit: `0190466d` — failing CI: https://github.com/Kpa-clawbot/CoreScope/actions (will populate after PR creation) ## Problem On staging (commit `d69d9fb`, 78k tx, 2.3M obs), `curl http://localhost/api/analytics/roles` times out at 60s with 0 bytes — the Roles tab is unusable. Issue #1256. PR #1248's steady-state recomputer fan-out (topology / rf / distance / channels / hash-collisions / hash-sizes) **didn't include roles**. The legacy handler: 1. Holds `s.mu.RLock` for the entire compute. 2. Calls `GetFleetClockSkew()`, which drives `clockSkew.Recompute(s)` over all ADVERT transmissions — O(78k) per request. 3. Concurrent ingest writers compound the latency through writer-starvation. Result: every request hits the cold path; the response never comes back inside the 60 s HTTP budget. ## Fix Add `roles` as the 7th endpoint in the recomputer fan-out — same pattern as #1248: - `PacketStore.recompRoles` slot, registered in `StartAnalyticsRecomputers` with default 5-min interval. - `PacketStore.GetAnalyticsRoles()` → atomic-pointer load from the snapshot (sub-ms), with a `computeAnalyticsRoles()` fallback only for the brief startup window before the initial sync compute completes. - Handler is now a thin wrapper — no lock-held work on the request path. - New optional `roles` key under `analytics.recomputeIntervalSeconds` in config; `config.example.json` and `_comment_analytics` updated. ## Latency (unit-scope benchmark) - Worst-of-50 handler latency: **<100 ms** (test budget; well under the 2 s p99 acceptance). - Compute itself is bounded by the existing 5-min recompute window — it runs once in the background, never on the request path. ## Tests - RED `0190466d`: asserts `recompRoles` is registered and the handler returns under the latency budget. Fails on master with `recompRoles not registered`. - GREEN `d7784f76`: registers the recomputer + snapshot accessor — both tests pass. Fixes #1256 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
d69d9fbf8e |
perf(#1247): surgical fix for resolveWithContext tier-1 hot path (4.6× speedup) (#1253)
## Summary Surgical fix for #1247: analytics endpoints regressed 3-9× between prod `d818527` and master. pprof against staging traced the regression to `resolveWithContext` tier-1 affinity loop running on every analytics `resolveHop` call (post-#1198 plumbing) with redundant per-(cand, ctx) work. **Result: 4.6× speedup on the synthetic hot-shape benchmark (202µs → 44µs / op).** ## Root cause - PR #1198 (`353c5264`) lit up `resolveWithContext` tier 1 from every analytics resolveHop closure (previously they passed `contextPubkeys=nil` and short-circuited the entire tier-1 block). - The inner loop did `N_cand × N_ctx` iterations where each one did: - `graph.Neighbors(strings.ToLower(ctxPK))` — graph RLock + ToLower allocation **per candidate**, redundantly - `strings.ToLower(cand.PublicKey)` per `ctxPK` - `strings.EqualFold(otherPK, ctxPK)` + `EqualFold(otherPK, candPK)` — both sides were already lowercased (`NeighborEdge.NodeA/B` via `makeEdgeKey`; `contextPubkeys` via `buildHopContextPubkeys`) - At staging scale (5k+ contextPubkeys × 30k+ resolveHop calls) this dominated `computeAnalyticsTopology` (37% of its CPU) and `computeAnalyticsRF` (55%). ## pprof attribution (staging, region-keyed queries bypassing #1240 cache) ``` computeAnalyticsTopology cum: 19.24% (5.45s / 28.32s sampled) └─ resolveWithContext 37% ├─ strings.ToLower 41% ├─ strings.EqualFold 28% └─ graph.Neighbors 24% computeAnalyticsRF cum: 10.38% ``` ## Fix (~80 LoC in `cmd/server/store.go`) 1. Lowercase `contextPubkeys` **once per call**, skipped entirely when already lowercased (the analytics fast path). 2. Lowercase candidate pubkeys **once per call**. 3. Invert the loop nesting: outer-ctx / inner-edge / candidate-map lookup. `graph.Neighbors` is called once per context pubkey instead of `N_cand` times. 4. Raw `==` instead of `strings.EqualFold` for pubkey comparisons (both sides lowercased by step 1/2). 5. Added a tiny `hasUpperASCII` byte-loop helper next to `isHexLower` for the fast-path check. Behavior preserved: same `Score × Confidence` formula, same tier-1 ratio + min-observations gate, same per-candidate "best edge wins" semantics. No change to tiers 2/3/4. ## TDD evidence - Red commit (`5f8d1564`): `TestResolveWithContextTier1Floor` asserts `<100 µs/call` on the hot shape. **199 µs/call on regressed master → FAIL.** - Green commit (`e3bdbc65`): surgical fix lands. **44 µs/call → PASS.** - Reverification: locally stashed the fix, ran the test → 199.5 µs FAIL; popped fix → 44 µs PASS. `BenchmarkResolveWithContextTier1Hot` (no assertion, visibility only): ``` before: 202013 ns/op 168 B/op 3 allocs/op after: 44084 ns/op 424 B/op 6 allocs/op speedup: 4.6× ``` (Post-fix allocs are O(N_cand + N_ctx) one-time helper tables — net win at hot scale.) ## Independence from #1248 PR #1248 caches the analytics compute output so user-facing latency is sub-ms even when the compute is slow. That's correct for UX but it masks the regression. This PR repairs the compute itself, so: - Region-keyed and windowed queries (which bypass the recomputer cache by design — see #1240) become fast again. - Future ingest scale or feature work on top of the regressed baseline doesn't compound. ## Out of scope - The geo-rejection (#1228) and Confidence weighting (#1229) commits — kept intact, they protect correctness and were not the dominant CPU cost. - Reverting any suspect commit — surgical only. ## Acceptance criteria from #1247 - [x] pprof confirms the hot function (`resolveWithContext`) - [x] Bisect identifies the regressing commit (`353c5264` / PR #1198 — context plumbing; ratified by pprof, no need to actually rebuild 5 binaries) - [x] Fix lands; tier-1 hot path 4.6× faster - [x] No regression in disambiguator correctness — full `go test ./...` green, all existing `ResolveWithContext` / `HopDisambig` / `NeighborGraph` / `Affinity` tests pass Fixes #1247 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
356f001027 |
perf(#1240): steady-state background recompute for analytics endpoints (#1248)
RED commit: `27630f6a` — adds latency test that fails on master (p99=225ms > 50ms budget) and a stub `StartAnalyticsRecomputers` that returns a no-op so the assertion (not a build error) gates the change. GREEN commit: `20fbbceb` — wires real background recompute infrastructure. Test passes at p99=~1µs. ## What changed Replaces the on-request "compute-then-cache" pattern for the default-shape analytics queries with a steady-state background recompute loop. Reads always hit an `atomic.Value` snapshot in <1µs regardless of compute cost or writer contention. Operator principle: serving slightly stale data quickly beats real-time data slowly. ## Endpoints converted (default 5min interval each) | Endpoint | Cold compute | Recomputer interval | |---|---|---| | `/api/analytics/topology` | ~5s | 5 min | | `/api/analytics/rf` | ~4s | 5 min | | `/api/analytics/distance` | ~3s | 5 min | | `/api/analytics/channels` | ~0.5s | 5 min | | `/api/analytics/hash-collisions` | ~0.5s | 5 min | | `/api/analytics/hash-sizes` | ~22ms | 5 min | All intervals configurable per-endpoint via `analytics.recomputeIntervalSeconds.<name>` in `config.json`; documented in `config.example.json`. Default override via `analytics.defaultIntervalSeconds`. ## Scope: default query only Only the canonical shape `(region="", window=zero)` is precomputed. Region- or window-filtered requests fall back to the legacy TTL cache + on-request compute — keeps recomputer count bounded (6, not 6×N×M). ## Latency Test `TestAnalyticsRecomputerSteadyStateLatency`: 100 concurrent readers + 4 writers churning `s.mu.Lock` on 20k distHops. - Before: p50=188ms p99=225ms (assertion failed) - After: p50=240ns p99=1.1µs (atomic load + map return) ## Shutdown integration `StartAnalyticsRecomputers` returns a stop closure invoked from `main.go`'s SIGTERM handler BEFORE `dbClose()` so any in-flight SQLite compute drains cleanly. `TestAnalyticsRecomputerShutdownNoLeak` confirms all 6 goroutines are reaped (Δ=6 within 2s). ## Safety details - Initial compute is synchronous in `Start()` — first read after startup never sees nil. - `recover()` inside `runOnce` keeps a compute panic from killing the goroutine; previous snapshot remains valid. - `analyticsRecomputerMu` is a sync.RWMutex; recomputer pointers are read-locked in the hot path. The atomic.Value swap inside `runOnce` is lock-free. Fixes #1240. --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local> |
||
|
|
b881a09f02 |
feat(#1188): show observer IATA on packets + filter grammar (#1189)
Red commit:
|
||
|
|
2754251a53 |
perf(#1239): /api/analytics/distance — TTL 15s→60s + drop main RLock around compute (#1241)
## Summary Fixes #1239 — `/api/analytics/distance` 15s cold on staging under heavy ingest. Two independent fixes. First commit on this branch is the RED test for Fix B (`a539882`), demonstrating reader/writer contention against the main store lock. CI: see Actions tab for the run on the test-only commit — it asserts >150µs avg writer cycle and fails at 82367µs pre-fix. GREEN commit (`d3938f1`) brings it to 1µs. ## Fix A — TTL bump 15s → 60s (`5eae1e0`) - `rfCacheTTL` default in `cmd/server/store.go` changed from `15 * time.Second` to `60 * time.Second`. This is the shared TTL for RF / topology / distance / hash-sizes / subpath / channel analytics caches. - Per operator clarification (issue thread): distance analytics IS viewed live during analysis sessions, not background-glanced. 60s smooths the cold-miss churn during heavy ingest without freezing data. - `config.example.json`: documented `cacheTTL.analyticsRF` with new default + caveat. - Existing assertions (`TestCacheTTLDefaults`, `TestHashCollisionsCacheTTL`) updated to the new default. ## Fix B — Drop main RLock around compute (`a539882` red, `d3938f1` green) `computeAnalyticsDistance` previously held `s.mu.RLock()` for the entire iteration: region match-set construction, hop/path filtering, sort, dedup, histogram, category stats, time series. Readers serialized writers (ingest, `buildDistanceIndex`). Refactor: hold the RLock only long enough to snapshot the `distHops`/`distPaths` slice headers AND build the region match-set (which reads `tx.Observations`, mutated under `s.mu.Lock`). For `region=""` (the hot cold-call path) the lock hold is just the header snapshot — microseconds. Everything else runs on the locally-captured slices outside the lock. Safety: `distHops`/`distPaths` are append-only via re-slice in `buildDistanceIndex` / `updateDistanceIndexForTxs` (both under `s.mu.Lock`). If the backing array reallocates after the snapshot, the snapshot still references the prior array (GC-pinned) at the consistent length captured under the lock. Records are value types — no torn writes. ## Test results `cmd/server/distance_lock_contention_test.go` (8 reader goroutines × 20k synthetic distHops × 200 writer Lock/Unlock cycles): - pre-fix avg writer cycle: **82367µs** (16.5s for 200 cycles) - post-fix avg writer cycle: **1µs** (279µs for 200 cycles) - ~82000× reduction in writer contention; reader result shape unchanged Full `go test ./cmd/server/...` green with `-race`. ## Out of scope (per issue) - Same lock pattern in topology / RF / hash / subpath analytics — file separately if needed. - Per-region cache key sharding. - WebSocket-driven cache invalidation. --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
2e28aa3e04 |
fix(#1229): source-diversity confidence weighting in neighbor-graph tier-1 resolver (#1235)
RED |
||
|
|
b21badbcbd |
fix(#1225): paginate channel messages at SQL level — 30s → <500ms (#1226)
## Summary Fixes #1225 — channel messages endpoint took ~30s on staging. ## Root cause `(*DB).GetChannelMessages` SELECTed every observation row for the channel (one row per observation, not per transmission), JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender, packetHash)`, then sliced the tail in Go for pagination. On staging `#wardriving`: - `transmissions` rows with `channel_hash='#wardriving' AND payload_type=5`: **5,703** - `observations` joined to those: **274,632** (~48× amplification) - `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s / 31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop) `EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being used — the cost was entirely in fetching, unmarshalling, and folding the full observation set per request even for `limit=50`. Hypothesis #1 from the issue (full table scan on `messages/decoded`) is rejected; #2 (missing index) is rejected; the actual cause was **pagination in Go instead of SQL** — request cost was O(observations) not O(limit). ## Fix Move pagination into SQL on the `transmissions` table. Because `transmissions.hash` is `UNIQUE` and the original dedup key was `(sender, hash)`, each transmission collapses to exactly one logical message — paginating on transmissions is semantically equivalent to the prior in-Go dedup + tail slice. New shape: 1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`). 2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ? OFFSET ?` to pick the page of newest transmissions. 3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` — typically 50 ids → a few hundred observation rows. 4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API contract. Region filtering, observation-count-as-`repeats`, and "first observation wins for hops/snr/observer" semantics are preserved (observations are scanned `ORDER BY o.id ASC`). ## Perf measurements **Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop): 30.04s, 31.41s, 31.48s, 35.33s, 34.05s. **Synthetic regression test** (`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs. - Broken impl: ~4.5s (test fails the 500ms budget — the RED commit). - Fixed impl: well under 500ms (test passes). **After (staging)**: will measure post-deploy and post-comment on issue with numbers. Synthetic scaling: staging is ~2× the test's transmission count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows on index) — expect <100ms p99. ## TDD - RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at ~4.5s. - GREEN: `3f1f82d3` — fix; full suite green, perf test passes. ## Hypotheses status | # | Hypothesis | Verdict | |---|---|---| | 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different mechanism — see root cause) | | 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash` exists & used) | | 3 | Frontend re-render storm | Not investigated (backend was clearly the bottleneck) | | 4 | Decode in request path | Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) | | 5 | WS subscription failure | Rejected | | 6 | Staging artifact | Rejected (reproducible) | ## Out of scope - The in-memory `(*PacketStore).GetChannelMessages` path (used when `s.db == nil`) has the same shape but operates on bounded in-memory data; not touched. If we ever fall back to it in production we'll revisit. --------- Co-authored-by: clawbot <bot@corescope> |
||
|
|
7179afcfde |
feat(#1228): reject geo-implausible neighbor-graph edges at build time (#1230)
Fixes #1228 — geo-implausible neighbor-graph edges are rejected at build time. Red commit: `5a6d9660` — failing tests for 4 cases (reject SF↔Berlin, accept local CA, accept no-GPS endpoint, counter increments). Live CI run (latest commit): https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1228 ## Why The disambiguator's tier-1 affinity graph is built blindly from path co-occurrence. On wide-geo MQTT deployments, a single bad hop disambiguation seeds an edge across geographically impossible distances (e.g. Bay Area ↔ Berlin), which then reinforces the same wrong resolution next time. Self-poisoning spiral. ## What changed - `upsertEdge` now consults a per-graph GPS index. When **both** endpoints have known GPS and their haversine distance exceeds the threshold, the edge is dropped and `NeighborGraph.RejectedEdgesGeoFar` (atomic) is incremented. - Either endpoint missing GPS ⇒ accept (no signal to reject), per acceptance criteria. - Threshold is configurable via `neighborGraph.maxEdgeKm` (default **500 km** — well above any plausible terrestrial LoRa hop, including satellite-assisted). 0 ⇒ use default; negative ⇒ disable the filter. Exposed via `Config.NeighborMaxEdgeKm()`. - New `BuildFromStoreWithOptions` carrying the threshold; `BuildFromStore` and `BuildFromStoreWithLog` are kept as thin wrappers. - Stats are surfaced under `GET /api/analytics/neighbor-graph` as `stats.rejected_edges_geo_far`. - All rejection logs PII-truncate pubkeys to 8 hex chars (public repo discipline). - `config.example.json` updated with the new field + comment. ## Follow-up #1229 (per-region scoped affinity graphs) depends on this landing first. --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
eba9e89a72 |
fix(#1203): path-inspector — singleflight + stale-while-revalidate (#1208)
Red commit:
|
||
|
|
11d2026bb1 |
feat(startup): hot startup — load hotStartupHours synchronously, fill retentionHours in background (#1187)
Closes #1183 ## Summary - Adds `packetStore.hotStartupHours` config key (float64, default 0 = disabled). When set, `Load()` loads only that many hours of data synchronously, reducing startup time on large DBs. Background goroutine fills the remaining `retentionHours` window in daily chunks after startup completes. - A background goroutine (`loadBackgroundChunks`) fills the remaining `retentionHours` window in daily chunks after startup completes. Analytics indexes are rebuilt once at the end. - `QueryPackets` and `QueryGroupedPackets` check `oldestLoaded` and fall back to `db.QueryPackets()` for any query whose `Since`/`Until` predates the in-memory window — covering days 8–30 permanently (beyond `retentionHours`) and the background-fill gap during startup. - `/api/perf` gains `hotStartupHours`, `backgroundLoadComplete`, and `backgroundLoadProgress` fields inside `packetStore` so operators can monitor the fill. ### Drive-by fixes - E2E: added `gotoPackets` navigation helper used across packet-related tests - E2E: rewrote stripe assertion to check per-row stripe parity rather than a fragile computed-style comparison - E2E: theme test updated to use `#/home` as the initial route (was `#/`) - `db.go`: removed the RFC3339→unix-timestamp subquery path in `buildTransmissionWhere`; `t.first_seen` is now always compared directly as a string for both RFC3339 and non-RFC3339 inputs ## Configuration ```json "packetStore": { "retentionHours": 168, "hotStartupHours": 24 } ``` `hotStartupHours: 0` (default) preserves existing behavior exactly. Recommended for large DBs to reduce startup time; set to 0 to disable (loads full retentionHours at startup, legacy behavior). ## Test plan - [x] `TestHotStartupConfig_Clamp` — clamping when `hotStartupHours > retentionHours` - [x] `TestHotStartupConfig_ZeroIsDisabled` — zero leaves feature disabled - [x] `TestHotStartup_LoadsOnlyHotWindow` — only hot-window packets in memory after `Load()` - [x] `TestHotStartup_DisabledWhenZero` — all retention packets loaded when disabled - [x] `TestHotStartup_loadChunk_AddsOlderData` — chunk merges correctly, ASC order maintained - [x] `TestHotStartup_BackgroundFillsToRetention` — background goroutine fills to `retentionHours` - [x] `TestHotStartup_ChunkErrorRecovery` — chunk SQL failure logged and skipped, loop terminates - [x] `TestHotStartup_SQLFallback_TriggeredForOldDate` — query before `oldestLoaded` routes to SQL - [x] `TestHotStartup_SQLFallback_NotTriggeredForRecentDate` — recent query stays in-memory - [x] `TestHotStartup_PerfStats` — new fields present in `GetPerfStoreStats()` (backs the perf endpoint) - [x] `TestHotStartup_PerfStoreHTTP` — HTTP-level: GET /api/perf returns `hotStartupHours`, `backgroundLoadComplete`, `backgroundLoadProgress` in `packetStore` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: CoreScope Bot <bot@corescope.local> |
||
|
|
85e97d2f37 |
fix(#1211): bounds-check path length to prevent slice [218:15] panic in MQTT decode (#1214)
**RED commit:** `65d9f57b` (CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions after PR opens) Fixes #1211 ## Root cause `decodePath()` returns `bytesConsumed = hash_size * hash_count` where both come straight from the wire-supplied `pathByte` (upper 2 bits → `hash_size`, lower 6 bits → `hash_count`). Max claimable: 4 × 63 = 252 bytes. A malformed packet on the wire claimed `pathByte=0xF6` (hash_size=4, hash_count=54 → 216 path bytes) inside a 15-byte buffer. The inner hop-extraction loop in `decodePath` did break early on overflow — but `bytesConsumed` was still returned at face value (216). `DecodePacket` then did `offset += 216` (offset=218) and `payloadBuf := buf[offset:]` panicked with the prod-observed signature: ``` runtime error: slice bounds out of range [218:15] ``` The handler-level `defer/recover` at `cmd/ingestor/main.go:258-263` caught it, but the message was silently dropped with no usable diagnostic. ## Fix Add a `if offset > len(buf)` guard at BOTH decoder sites (same pattern, same panic potential): - `cmd/ingestor/decoder.go` — DecodePacket after decodePath - `cmd/server/decoder.go` — DecodePacket after decodePath Return a descriptive error citing the claimed length and pathByte hex so operators can reproduce. Also: `cmd/ingestor/main.go` decode-error log now includes `topic`, `observer`, and `rawHexLen` so future malformed packets are reproducible without needing to attach a debugger. ## Tests (TDD red → green) Both packages got two new tests: - **`TestDecodePacketBoundsFromWire_Issue1211`** — feeds the exact wire shape from the prod log (`pathByte=0xF6` inside a 15-byte buf). Asserts `DecodePacket` does NOT panic and returns an error. - **`TestDecodePacketFuzzTruncated_Issue1211`** — sweeps every `(header, pathByte)` combination with tails 0..19 bytes (≈1.3M inputs). Asserts zero panics. ### Red commit proof On commit `65d9f57b` (RED), both tests fail with the panic: ``` === RUN TestDecodePacketBoundsFromWire_Issue1211 decoder_test.go:1996: DecodePacket panicked on malformed input: runtime error: slice bounds out of range [218:15] --- FAIL: TestDecodePacketBoundsFromWire_Issue1211 (0.00s) === RUN TestDecodePacketFuzzTruncated_Issue1211 decoder_test.go:2010: DecodePacket panicked during fuzz: runtime error: slice bounds out of range [3:2] --- FAIL: TestDecodePacketFuzzTruncated_Issue1211 (0.01s) ``` On commit `7a6ae52c` (GREEN), full suites pass: - `cmd/ingestor`: `ok 53.988s` - `cmd/server`: `ok 29.456s` ## Acceptance criteria - [x] Identify the slice op producing `[218:15]` — `payloadBuf := buf[offset:]` in `DecodePacket` (decoder.go), where `offset` had been advanced by an unchecked `bytesConsumed` from `decodePath()`. - [x] Bounds check added at the identified site(s) — both ingestor and server decoders. - [x] Test with crafted payload (length-field > remaining buffer) — `TestDecodePacketBoundsFromWire_Issue1211`. - [x] Log topic, observer ID, payload byte length on drop — updated `MQTT [%s] decode error` log line. - [x] Existing tests stay green — confirmed both packages. ## Out of scope Reconnect-after-disconnect (#1212) — handled by a separate subagent. This PR touches NO reconnect logic. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
dbb013a6bf |
test(#1201): regression coverage for hop disambiguator tier-1 + end-to-end top-hops fixture (#1202)
Mutation test confirmed: reverting cmd/server/store.go:2975 (`setContext(buildHopContextPubkeys(tx, pm))` → `setContext(nil)`) in `buildDistanceIndex` produces failing assertion in `TestTopHopsRespectsContextAcrossAllCallSites`: top-hops ranking flips to `72dddd→8acccc@13.0km` (Berlin↔Berlin misresolution), CA↔CA pair absent. After reverting the mutation, the test passes again. Fixes #1201 ## Summary Pure test addition. No production code changed. Adds regression coverage for the hop disambiguator's tier-1 (neighbor affinity) path and an end-to-end fixture that catches revert-to-nil-context regressions across all 9 call sites of `pm.resolveWithContext`. ## Sub-tasks (all 4 landed) 1. **Tier-1 explicit** — `hop_disambig_tier1_test.go`: - `Tier1_StrongAffinityPicksX` (strong-X edge wins) - `Tier1_StrongAffinityPicksY` (reverse weights — proves score is read) - `Tier1_AmbiguousEdgeSkipsToTier2` (`Ambiguous=true` → skip) 2. **Tier ordering** — `Tier1_BeatsTier2WhenBothSignal` (tier 1 wins when both signal) 3. **Tier-1 fallback** — - `Tier1_EmptyGraphFallsThrough` (graph has no edges for context) - `Tier1_NilGraphFallsThrough` (graph is nil) - `Tier1_ScoresTooCloseFallsThrough` (best < `affinityConfidenceRatio` × runner-up) 4. **End-to-end fixture** — `hop_disambig_e2e_test.go`: - 9 nodes with intentional prefix collisions across SLO/LA/NYC/Berlin (prefix `72`) and SF/CA/Berlin (prefix `8a`); Berlin candidates have `obsCount=200` so they'd win tier-3 absent context. - 50 transmissions path `["72","8a"]`, sender + observer in CA. - Affinity graph seeded with strong `sender↔72aa` and `sender↔8aaa` edges. - Asserts: CA↔CA hop present, no Berlin pubkeys in `distHops`, max distance < 300 km cap. ## TDD exemption Net-new regression-sentinel tests for behavior already correct on master post-#1198. Each test passed on first run (no production bug surfaced). The mutation test on sub-task 4 is the gating proof: forcing `setContext(nil)` at `store.go:2975` makes the test fail with the exact misresolution class the issue describes (Berlin↔Berlin leaks into top-hops). ## Acceptance criteria - [x] Tier-1 affinity test added with 3 cases - [x] Tier-ordering test added - [x] Tier-1 fallback tests added (nil / empty / scores-too-close) - [x] End-to-end fixture added with multi-candidate-prefix nodes - [x] End-to-end fixture fails if any call site reverts to `nil` context (mutation-verified) - [x] Test files live in `cmd/server/` alongside `prefix_map_role_test.go` --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
2beeb2b324 |
fix(#1199): 6 deferred quality items from PR #1198 r2 review (#1200)
Red commit: |
||
|
|
353c5264ad |
fix(#1197): plumb hop-context + observation-count tiebreak to disambiguator (#1198)
Red commit:
|
||
|
|
f4cf2acbc0 |
perf: cancelled writes + ingestor I/O + threshold tests (#1120 follow-up) (#1167)
Red commit:
|
||
|
|
fb744d895f |
fix(#1143): structural pubkey attribution via from_pubkey column (#1152)
Fixes #1143. ## Summary Replaces the structurally unsound `decoded_json LIKE '%pubkey%'` (and `OR LIKE '%name%'`) attribution path with an exact-match lookup on a dedicated, indexed `transmissions.from_pubkey` column. This closes both holes documented in #1143: - **Hole 1** — same-name false positives via `OR LIKE '%name%'` - **Hole 2a** — adversarial spoofing: a malicious node names itself with another node's pubkey and gets attributed to the victim - **Hole 2b** — accidental false positive when any free-text field (path elements, channel names, message bodies) contains a 64-char hex substring matching a real pubkey - **Perf** — query now uses an index instead of a full-table scan against `LIKE '%substring%'` ## TDD Two-commit history shows red-then-green: | Commit | Status | Purpose | |---|---|---| | `7f0f08e` | RED — tests assertion-fail on master behaviour | Adversarial fixtures + spec | | `59327db` | GREEN — schema + ingestor + server + migration | Implementation | The red commit's test schema includes the new column so the file compiles, but the production code still uses LIKE — the assertions fail because the malicious / same-name / free-text rows are returned. The green commit changes the query plus adds the migration/ingest path. ## Changes ### Schema - new column `transmissions.from_pubkey TEXT` - new index `idx_transmissions_from_pubkey` ### Ingestor (`cmd/ingestor/`) - `PacketData.FromPubkey` populated from decoded ADVERT `pubKey` at write time. Cheap — already parsing `decoded_json`. Non-ADVERTs stay NULL. - `stmtInsertTransmission` writes the column. - Migration `from_pubkey_v1` ALTERs legacy DBs to add the column + index. - Bonus: rewrote the recipe in the gated one-shot `advert_count_unique_v1` migration to use `from_pubkey` (already marked done on existing DBs; kept correct for fresh installs). ### Server (`cmd/server/`) - `ensureFromPubkeyColumn` mirrors the ingestor migration so the server can boot against a DB the ingestor has never touched (e2e fixture, fresh installs). - `backfillFromPubkeyAsync` runs **after** HTTP starts. Scans `WHERE from_pubkey IS NULL AND payload_type = 4` in 5000-row chunks with a 100ms yield between chunks. Cannot block boot even on prod-sized DBs (100K+ transmissions). Queries handle NULL gracefully (return empty for that pubkey, same as today's unknown-pubkey path). - All in-scope LIKE call sites switched to exact match: | Site | Before | After | |---|---|---| | `buildPacketWhere` (was db.go:582) | `decoded_json LIKE '%pubkey%'` | `from_pubkey = ?` | | `buildTransmissionWhere` (was db.go:626) | `t.decoded_json LIKE '%pubkey%'` | `t.from_pubkey = ?` | | `GetRecentTransmissionsForNode` (was db.go:910) | `LIKE '%pubkey%' OR LIKE '%name%'` | `t.from_pubkey = ?` | | `QueryMultiNodePackets` (was db.go:1785) | `decoded_json LIKE '%pubkey%' OR ...` | `t.from_pubkey IN (?, ?, ...)` | | `advert_count_unique_v1` (was ingestor/db.go:257) | `decoded_json LIKE '%' \|\| nodes.public_key \|\| '%'` | `t.from_pubkey = nodes.public_key` | `GetRecentTransmissionsForNode` signature simplifies: the `name` parameter is gone (it was only ever used for the legacy `OR LIKE '%name%'` fallback). Sole caller in `routes.go:1243` updated. ### Tests - `cmd/server/from_pubkey_attribution_test.go` — adversarial fixtures + Hole 1/2a/2b/QueryMultiNodePackets exact-match assertions, EXPLAIN QUERY PLAN index check, migration backfill correctness. - `cmd/ingestor/from_pubkey_test.go` — write-time correctness (BuildPacketData populates FromPubkey for ADVERT only; InsertTransmission persists it; non-ADVERTs stay NULL). - Existing test schemas (server v2, server v3, coverage) get the new column **plus a SQLite trigger** that auto-populates `from_pubkey` from `decoded_json` on ADVERT inserts. This means existing fixtures (which only seed `decoded_json`) keep attributing correctly without per-test edits. - `seedTestData`'s ADVERTs explicitly set `from_pubkey`. ## Performance — index is used ``` $ EXPLAIN QUERY PLAN SELECT id FROM transmissions WHERE from_pubkey = ? SEARCH transmissions USING INDEX idx_transmissions_from_pubkey (from_pubkey=?) ``` Asserted in `TestFromPubkeyIndexUsed`. ## Migration approach - **Sync at boot**: `ALTER TABLE transmissions ADD COLUMN from_pubkey TEXT` is a metadata-only operation in SQLite — microseconds regardless of table size. `CREATE INDEX IF NOT EXISTS idx_transmissions_from_pubkey` is **not** metadata-only: it scans the table once. Empirically a few hundred ms on a 100K-row table; expect a few seconds on a 10M-row table (one-time cost, blocking boot during that window). Subsequent boots no-op via `IF NOT EXISTS`. If this boot delay becomes an operational concern at prod scale we can defer the `CREATE INDEX` to a goroutine — for now a few-second one-time delay is acceptable. - **Async**: row-level backfill of legacy NULL ADVERTs (chunked 5000 / 100ms yield). On a 100K-ADVERT prod DB, this completes in seconds in the background; HTTP is fully available throughout. - **Safety**: queries handle NULL gracefully — a node whose ADVERTs haven't backfilled yet returns empty, identical to today's behaviour for unknown pubkeys. No half-state regression. ## Out of scope (intentionally) The free-text `LIKE` paths the issue explicitly leaves alone (e.g. user-typed packet search) are untouched. Only the pubkey-attribution sites get the column treatment. ## Cycle-3 review fixes | Finding | Status | Commit | |---|---|---| | **M1c** — async-contract test was tautological (test's own `go`, not production's) | Fixed | `23ace71` (red) → `a05b50c` (green) | | **m1c** — package-global atomic resets unsafe under `t.Parallel()` | Fixed (`// DO NOT t.Parallel` comment + `Reset()` helper) | rolled into `23ace71` / `241ec69` | | **m2c** — `/api/healthz` read 3 atomics non-atomically (torn snapshot) | Fixed (single RWMutex-guarded snapshot + race test) | `241ec69` | | **n3c.m1** — vestigial OR-scaffolding in `QueryMultiNodePackets` | Fixed (cleanup) | `5a53ceb` | | **n3c.m2** — verify PR body language about `ALTER` vs `CREATE INDEX` | Verified accurate (already corrected in cycle 2) | (no change) | | **n3c.m3** — `json.Unmarshal` per row in backfill → could use SQL `json_extract` | **Deferred as known followup** — pure perf optimization (current per-row Unmarshal is correct, just slower); SQL rewrite would unwind the chunked-yield architecture and is non-trivial. Acceptable for one-time backfill at boot on legacy DBs. | ### M1c implementation detail `startFromPubkeyBackfill(dbPath, chunkSize, yieldDuration)` is now the single production entry point used by `main.go`. It internally does `go backfillFromPubkeyAsync(...)`. The test calls `startFromPubkeyBackfill` (no `go` prefix) and asserts the dispatch returns within 50ms — so if anyone removes the `go` keyword inside the wrapper, the test fails. **Manually verified**: removing the `go` keyword causes `TestBackfillFromPubkey_DoesNotBlockBoot` to fail with "backfill dispatch took ~1s (>50ms): not async — would block boot." ### m2c implementation detail `fromPubkeyBackfillTotal/Processed/Done` are now plain `int64`/`bool` package globals guarded by a single `sync.RWMutex`. `fromPubkeyBackfillSnapshot()` returns all three under one RLock. `TestHealthzFromPubkeyBackfillConsistentSnapshot` races a writer (lock-step total/processed updates with periodic done flips) against 8 readers hammering `/api/healthz`, asserting `processed<=total` and `(done => processed==total)` on every response. Verified the test catches torn reads (manually injected a 3-RLock implementation; test failed within milliseconds with "processed>total" and "done=true but processed!=total" errors). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: openclaw-bot <bot@openclaw.dev> |
||
|
|
74dffa2fb7 |
feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123)
## Summary Implements per-component disk I/O + write source metrics on the Perf page so operators can self-diagnose write-volume anomalies (cf. the BackfillPathJSON loop debugged in #1119) without SSHing in to run iotop/fatrace. Partial fix for #1120 ## What's done (4/6 ACs) - ✅ `/api/perf/io` — server-process `/proc/self/io` delta rates (read/write bytes per sec, syscalls) - ✅ `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate - ✅ `/api/perf/write-sources` — per-component counters from ingestor (tx/obs/upserts/backfill_*) - ✅ Frontend Perf page — three new sections with anomaly thresholds + per-second rate columns ## What's NOT done (deferred to follow-up) - ❌ `cancelledWriteBytesPerSec` field — issue #1120 lists this under server-process I/O ("writes the kernel discarded — interesting signal"); not exposed in this PR - ❌ Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and server"; only server-process I/O lands here. Adding ingestor I/O requires either a unix socket back to the server, or surfacing the ingestor pid through the stats file. Doable without changing the existing API shape. - ❌ Adaptive baselining — anomaly thresholds remain static (10×, 100 MB, 90%); steady-state baselining can come once we have enough deployed Perf-page telemetry Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than "Fixes #1120" so the issue stays open until the remaining ACs land. ## Backend **Server (`cmd/server/perf_io.go`)** - `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate `{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since last call (in-memory tracker, no allocation per sample). - `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount, pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the in-process row cache (closest available signal under the modernc sqlite driver). - `GET /api/perf/write-sources` — reads the ingestor's stats JSON file and returns a flat `{sources: {...}, sampleAt}` payload. **Ingestor (`cmd/ingestor/`)** - `DBStats` gains `WALCommits atomic.Int64` (incremented on every successful `tx.Commit()` and on every auto-commit `InsertTransmission` write) and `BackfillUpdates sync.Map` keyed by backfill name with `IncBackfill(name)` / `SnapshotBackfills()` helpers. - `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]` per row write — the BackfillPathJSON-style infinite loop becomes immediately visible at `backfill_path_json` in the Write Sources table. - New `StartStatsFileWriter` publishes a JSON snapshot to `/tmp/corescope-ingestor-stats.json` (override via `CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode `0o600` so a pre-planted symlink in a world-writable `/tmp` cannot redirect the write to an arbitrary file. ## Frontend (`public/perf.js`) Three new sections on the Perf page, all auto-refreshed via the existing 5s interval: - **Disk I/O (server process)** — read/write rates (formatted B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️. - **Write Sources** — sorted table of per-component counters with a per-second rate column derived from snapshot deltas. Backfill rows show ⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the backfill's per-second rate exceeds 10× the live tx rate. Avoids the startup-spurious-alarm where cumulative-vs-cumulative was a tautology. - **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count, page size, cache hit rate (⚠️ when <90%). ## Tests - **Backend** (`cmd/server/perf_io_test.go`) — `TestPerfIOEndpoint_ReturnsValidJSON`, `TestPerfSqliteEndpoint_ReturnsValidJSON`, `TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new endpoints. Skips the `/proc/self/io` non-zero-rate assertion when `/proc` is unavailable. - **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js` with stubbed `fetch`, asserts the three new sections render with their headings + values. E2E assertion added: test-perf-disk-io-1120.js:91 ## TDD 1. Red commit (`21abd22`) — added the three handlers as no-op stubs returning empty values; tests fail on assertion mismatches (non-zero rate, `pageSize > 0`, headings present). 2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser, PRAGMA queries, ingestor stats writer, and Perf page rendering. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com> |
||
|
|
5fa3b56ccb |
fix(#662): GetRepeaterRelayInfo also looks up byPathHop by 1-byte prefix (#1086)
## Summary Partial fix for #662. `GetRepeaterRelayInfo` was reporting "never observed as relay hop" / `RelayCount24h=0` for nodes that clearly DO have packets passing through them — visible on the same node detail page in the "Paths seen through node" view. ## Root cause The `byPathHop` index is keyed by **both**: - full resolved pubkey (populated when neighbor-affinity resolution succeeds), and - raw 1-byte hop prefix from the wire (e.g. `"a3"`) `GetRepeaterRelayInfo` only looked up the full-pubkey key. Many ingested non-advert packets only carry the raw 1-byte hop — so any repeater whose path appearances are all raw-hop entries returned 0, even though the path-listing endpoint (which prefix-matches) renders them. Example node: an `a3…` repeater on staging has ~dozens of paths through it in the UI but the relay-info function returns 0. ## Fix Look up under both keys (full pubkey + 1-byte prefix) and de-dup by tx ID before counting. ## Trade-off The 1-byte prefix CAN over-count when multiple nodes share a first byte. This trades a possible over-count for clearly false zeros. The richer disambiguation done by the path-listing endpoint (resolved-path SQL post-filter via `confirmResolvedPathContains`) is out of scope for this partial fix — adding it here would mean disk I/O inside what is currently a pure in-memory lookup. Worth a follow-up if over-counting shows up in practice. ## TDD - Red commit (`test: failing test for relay-info prefix-hop mismatch`): adds `TestRepeaterRelayActivity_PrefixHop` that builds a non-advert packet with `PathJSON: ["a3"]`, indexes it via `addTxToPathHopIndex`, then asserts `RelayCount24h>=1` for the full pubkey starting with `a3…`. Fails on the assertion (got 0), not a build error. - Green commit (`fix: GetRepeaterRelayInfo also looks up byPathHop by 1-byte prefix`): the lookup change. All five `TestRepeaterRelayActivity_*` tests pass. ## Scope This is a **partial** fix — addresses the read-side prefix mismatch only. Issue #662 is a 4-axis epic (also covers ingest indexing consistency, UI surfacing, and schema). Leaving #662 open. --------- Co-authored-by: corescope-bot <bot@corescope> Co-authored-by: clawbot <clawbot@users.noreply.github.com> |
||
|
|
136e1d23c8 |
feat(#730): foreign-advert detection — flag instead of silent drop (#1084)
## Summary **Partial fix for #730 (M1 only — M2 frontend and M3 alerting deferred).** Today the ingestor **silently drops** ADVERTs whose GPS lies outside the configured `geo_filter` polygon. That's the wrong default for an analytics tool — operators get zero visibility into bridged or leaked meshes. This PR makes the new default **flag, don't drop**: foreign adverts are stored, the node row is tagged `foreign_advert=1`, and the API surfaces `"foreign": true` so dashboards / map overlays can be built on top. ## Behavior | Mode | What happens to an ADVERT outside `geo_filter` | |---|---| | (default) flag | Stored, marked `foreign_advert=1`, exposed via API | | drop (legacy) | Silently dropped (preserves old behavior for ops who want it) | ## What's done (M1 — Backend) - ingestor stores foreign adverts instead of dropping - `nodes.foreign_advert` column added (migration) - `/api/nodes` and `/api/nodes/{pk}` expose `foreign: true` field - Config: `geofilter.action: "flag"|"drop"` (default `flag`) - Tests + config docs ## What's NOT done (deferred to M2 + M3) - **M2 — Frontend:** Map overlay showing foreign adverts as distinct markers, foreign-advert filter on packets/nodes pages, dedicated foreign-advert dashboard - **M3 — Alerting:** Time-series detection of bridging events, alert when foreign advert rate spikes, identify bridge entry-point nodes Issue #730 remains open for M2 and M3. --------- Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
3ab404b545 |
feat(node-battery): voltage trend chart + /api/nodes/{pubkey}/battery (#663) (#1082)
## Summary Closes #663 (Phase 2 + 3 partial — time-series tracking + thresholds for nodes that are also observers). Adds a per-node battery voltage trend chart and `/api/nodes/{pubkey}/battery` endpoint, sourced from the existing `observer_metrics.battery_mv` samples populated by observer status messages. No new ingest or schema changes — purely surfaces data we were already collecting. ## Scope (TDD red→green) **RED commit:** test(node-battery) — DB query, endpoint shape (200/404/no-data), and config getters all asserted. **GREEN commit:** feat(node-battery) — implementation only. ## Changes ### Backend - `cmd/server/node_battery.go` (new): - `DB.GetNodeBatteryHistory(pubkey, since)` — pulls `(timestamp, battery_mv)` rows from `observer_metrics WHERE LOWER(observer_id) = LOWER(public_key) AND battery_mv IS NOT NULL`. Case-insensitive join tolerates historical pubkey casing variation (observers persist uppercase, nodes lowercase in this DB). - `Server.handleNodeBattery` — `GET /api/nodes/{pubkey}/battery?days=N` (default 7, max 365). Returns `{public_key, days, samples[], latest_mv, latest_ts, status, thresholds}`. - `Config.LowBatteryMv()` / `CriticalBatteryMv()` — defaults 3300 / 3000 mV. - `cmd/server/config.go` — `BatteryThresholds *BatteryThresholdsConfig` field. - `cmd/server/routes.go` — route registration alongside existing `/health`, `/analytics`. ### Frontend - `public/node-analytics.js` — new "Battery Voltage" chart card with status badge (🔋 OK / ⚠️ Low / 🪫 Critical / No data). Renders dashed threshold lines at `lowMv` and `criticalMv`. Empty-state message when no samples in window. ### Config - `config.example.json` — `batteryThresholds: { lowMv: 3300, criticalMv: 3000 }` with `_comment` per Config Documentation Rule. ## Status semantics | latest_mv | status | |-----------------------|------------| | no samples in window | `unknown` | | `>= lowMv` | `ok` | | `< lowMv`, `>= critMv`| `low` | | `< criticalMv` | `critical` | ## What this PR does NOT do (deferred) The issue's full Phase 1 (writing decoded sensor advert telemetry into `nodes.battery_mv` / `temperature_c` from server-side decoder) and Phase 4 (firmware/active polling for repeaters without observers) are out of scope here. This PR delivers the requested Phase 2/3 surfacing for the data path that already lands rows: `observer_metrics`. Repeaters that are also observers (i.e. publish status to MQTT) will get a voltage trend immediately; pure passive nodes won't until Phase 1 lands. ## Tests - `TestGetNodeBatteryHistory_FromObserverMetrics` — case-insensitive join, NULL skipping, ordering. - `TestNodeBatteryEndpoint` — full happy path with thresholds + status. - `TestNodeBatteryEndpoint_NoData` — 200 + status=unknown. - `TestNodeBatteryEndpoint_404` — unknown node. - `TestBatteryThresholds_ConfigOverride` — config getters + defaults. `cd cmd/server && go test ./...` — green. ## Performance Endpoint is per-pubkey (called once on analytics page open), indexed by `(observer_id, timestamp)` PK on `observer_metrics`. No hot-path impact. --------- Co-authored-by: bot <bot@corescope> |
||
|
|
f33801ecb4 |
feat(repeater): usefulness score — traffic axis (#672) (#1079)
## Summary Implements the **Traffic axis** of the repeater usefulness score (#672). Does NOT close #672 — Bridge, Coverage, and Redundancy axes are deferred to follow-up PRs. Adds `usefulness_score` (0..1) to repeater/room node API responses representing what fraction of non-advert traffic passes through this repeater as a relay hop. ## Why traffic-axis-first The issue proposes a 4-axis composite (Bridge, Coverage, Traffic, Redundancy). Bridge/Coverage/Redundancy require betweenness centrality and neighbor graph infrastructure (#773 Neighbor Graph V2). Traffic axis can ship independently using existing path-hop data. ## Remaining work for #672 - Bridge axis (betweenness centrality — depends on #773) - Coverage axis (observer reach comparison) - Redundancy axis (node-removal simulation — depends on #687) - Composite score combining all 4 axes Partial fix for #672. --------- Co-authored-by: meshcore-bot <bot@meshcore.local> |
||
|
|
d05e468598 |
feat(memlimit): GOMEMLIMIT support, derive from packetStore.maxMemoryMB (#836) (#1077)
## Summary Implements **part 1** of #836 — `GOMEMLIMIT` support so the Go runtime self-throttles GC under cgroup memory pressure instead of getting SIGKILLed. (Parts 2 & 3 — bounded cold-load batching + README ops docs — land in follow-up PRs.) ## Behavior On startup `cmd/server/main.go` now calls `applyMemoryLimit(maxMemoryMB, envSet)`: | Condition | Action | Log | |---|---|---| | `GOMEMLIMIT` env set | Honor the runtime's parse, do nothing | `[memlimit] using GOMEMLIMIT from environment (...)` | | env unset, `packetStore.maxMemoryMB > 0` | `debug.SetMemoryLimit(maxMB * 1.5 MiB)` | `[memlimit] derived from packetStore.maxMemoryMB=512 → 768 MiB (1.5x headroom)` | | env unset, `maxMemoryMB == 0` | No-op | `[memlimit] no soft memory limit set ... recommend setting one to avoid container OOM-kill` | The 1.5x headroom covers Go's NextGC trigger at ~2× live heap (per #836 heap profile: 680 MB live → 1.38 GB NextGC). ## Tests (TDD red→green visible in commit history) - `TestApplyMemoryLimit_FromEnv` — env wins, function does not override - `TestApplyMemoryLimit_DerivedFromMaxMemoryMB` — verifies bytes computation + `debug.SetMemoryLimit` actually applied at runtime - `TestApplyMemoryLimit_None` — no env, no config → reports `"none"`, no side effect Red commit: `7de3c62` (assertion failures, builds clean) Green commit: `454516d` ## Config docs `config.example.json` `packetStore._comment_gomemlimit` documents env/derived/override behavior. ## Out of scope - Cold-load transient bounding (item 2 in #836) - README container-size table (item 3) - QA §1.1 rewrite Closes part 1 of #836. --------- Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
45f30fcadc |
feat(repeater): liveness detection — distinguish actively relaying from advert-only (#662) (#1073)
## Summary Implements repeater liveness detection per #662 — distinguishes a repeater that is **actively relaying traffic** from one that is **alive but idle** (only sending its own adverts). ## Approach The backend already maintains a `byPathHop` index keyed by lowercase hop/pubkey for every transmission. Decode-window writes also key it by **resolved pubkey** for relay hops. We just weren't surfacing it. `GetRepeaterRelayInfo(pubkey, windowHours)`: - Reads `byPathHop[pubkey]`. - Skips packets whose `payload_type == 4` (advert) — a self-advert proves liveness, not relaying. - Returns the most recent `FirstSeen` as `lastRelayed`, plus `relayActive` (within window) and the `windowHours` actually used. ## Three states (per issue) | State | Indicator | Condition | |---|---|---| | 🟢 Relaying | green | `last_relayed` within `relayActiveHours` | | 🟡 Alive (idle) | yellow | repeater is in the DB but `relay_active=false` (no recent path-hop appearance, or none ever) | | ⚪ Stale | existing | falls out of the existing `getNodeStatus` logic | ## API - `GET /api/nodes` — repeater/room rows now include `last_relayed` (omitted if never observed) and `relay_active`. - `GET /api/nodes/{pubkey}` — same fields plus `relay_window_hours`. ## Config New optional field under `healthThresholds`: ```json "healthThresholds": { ..., "relayActiveHours": 24 } ``` Default 24h. Documented in `config.example.json`. ## Frontend Node detail page gains a **Last Relayed** row for repeaters/rooms with the 🟢/🟡 state badge. Tooltip explains the distinction from "Last Heard". ## TDD - **Red commit** `4445f91`: `repeater_liveness_test.go` + stub `GetRepeaterRelayInfo` returning zero. Active and Stale tests fail on assertion (LastRelayed empty / mismatched). Idle and IgnoresAdverts already match the desired behavior under the stub. Compiles, runs, fails on assertions — not on imports. - **Green commit** `5fcfb57`: Implementation. All four tests pass. Full `cmd/server` suite green (~22s). ## Performance `O(N)` over `byPathHop[pubkey]` per call. The index is bounded by store eviction; a single repeater has at most a few hundred entries on real data. The `/api/nodes` loop adds one map read + scan per repeater row — negligible against the existing enrichment work. ## Limitations (per issue body) 1. Observer coverage gaps — if no observer hears a repeater's relay, it'll show as idle even when actively relaying. This is inherent to passive observation. 2. Low-traffic networks — a repeater in a quiet area legitimately shows idle. The 🟡 indicator copy makes that explicit ("alive (idle)"). 3. Hash collisions are mitigated by the existing `resolveWithContext` path before pubkeys land in `byPathHop`. Fixes #662 --------- Co-authored-by: clawbot <bot@corescope.local> |
||
|
|
83881e6b71 |
fix(#688): auto-discover hashtag channels from message text (#1071)
## Summary Auto-discovers previously-unknown hashtag channels by scanning decoded channel message text for `#name` mentions and surfacing them via `GetChannels`. Workflow (per the issue): 1. New channel message arrives on a known channel 2. Decoded text is scanned for `#hashtag` mentions 3. Any mention that doesn't match an existing channel is surfaced as a discovered channel (`discovered: true`, `messageCount: 0`) 4. Future traffic on that channel will populate the entry once it has its own packets ## Changes - `cmd/server/discovered_channels.go` — new file. `extractHashtagsFromText` parses `#name` mentions from free text, deduped, order-preserving. Trailing punctuation is excluded by the character class. - `cmd/server/store.go` — `GetChannels` now scans CHAN packet text for hashtags after building the primary channel map, and appends any unseen hashtag mentions as discovered entries. - `cmd/server/discovered_channels_test.go` — new tests covering parser edge cases (single, multi, dedup, punctuation, none, bare `#`) and end-to-end discovery via `GetChannels`. ## TDD - Red: `34f1817` — stub returns `nil`, both new tests fail on assertion (verified). - Green: `d27b3ed` — real implementation, full `cmd/server` test suite passes (21.7s). ## Notes - Discovered channels carry `messageCount: 0` and `lastActivity` set to the most recent mention's `firstSeen`, so they sort naturally alongside real channels. - Names are matched against existing entries by both `#name` and bare `name` so a channel that already has decoded traffic isn't double-listed. - The existing `channelsCache` (15s) covers the new code path; no separate invalidation needed since the source data (`byPayloadType[5]`) drives both maps. Fixes #688 --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
d144764d38 |
fix(analytics): multiByteCapability missing under region filter → all rows 'unknown' (#1049)
## Bug `https://meshcore.meshat.se/#/analytics`: - Unfiltered → 0 adopter rows show "unknown" (correct). - Region filter `JKG` → 14 rows show "unknown" (wrong — same nodes, all confirmed when unfiltered). Multi-byte capability is a property of the NODE, derived from its own adverts (the full pubkey is in the advert payload, no prefix collision risk). The observing region should only control which nodes appear in the analytics list — it must not change a node's cap evidence. ## Root cause `PacketStore.GetAnalyticsHashSizes(region)` only attached `result["multiByteCapability"]` when `region == ""`. Under any region filter the field was absent. The frontend (`public/analytics.js:1011`) does `data.multiByteCapability || []`, so every adopter row falls through the merge with no cap status and renders as "unknown". ## Fix Always populate `multiByteCapability`. When a region filter is active, source the global adopter hash-size set from a no-region compute pass so out-of-region observers' adverts still count as evidence. ## TDD Red commit (`0968137`): adds `cmd/server/multibyte_region_filter_test.go`, asserts that `GetAnalyticsHashSizes("JKG")` returns a populated `multiByteCapability` with Node A as `confirmed`. Fails on the assertion (field missing) before the fix. Green commit (`6616730`): always compute capability against the global advert dataset. ## Files changed - `cmd/server/store.go` — `GetAnalyticsHashSizes`: drop the `region == ""` gate, always populate `multiByteCapability`. - `cmd/server/multibyte_region_filter_test.go` — new red→green test. ## Verification ``` go test ./... -count=1 # all server tests pass (21s) ``` --------- Co-authored-by: clawbot <bot@corescope.local> |
||
|
|
9f55ef802b |
fix(#804): attribute analytics by repeater home region, not observer (#1025)
Fixes #804. ## Problem Analytics filtered region purely by **observer** region: a multi-byte repeater whose home is PDX would leak into SJC results whenever its flood adverts were relayed past an SJC observer. Per-node groupings (`multiByteNodes`, `distributionByRepeaters`) inherited the same bug. ## Fix Two new helpers in `cmd/server/store.go`: - `iataMatchesRegion(iata, regionParam)` — case-insensitive IATA→region match using the existing `normalizeRegionCodes` parser. - `computeNodeHomeRegions()` — derives each node's HOME IATA from its zero-hop DIRECT adverts. Path byte for those packets is set locally on the originating radio and the packet has not been relayed, so the observer that hears it must be in direct RF range. Plurality vote when zero-hop adverts span multiple regions. `computeAnalyticsHashSizes` now applies these in two ways: 1. **Observer-region filter is relaxed for ADVERT packets** when the originator's home region matches the requested region. A flood advert from a PDX repeater that's only heard by an SJC observer still attributes to PDX. 2. **Per-node grouping** (`multiByteNodes`, `distributionByRepeaters`) excludes nodes whose HOME region disagrees with the requested region. Falls back to the observer-region filter when home is unknown. Adds `attributionMethod` to the response (`"observer"` or `"repeater"`) so operators can tell which method was applied. ## Backwards compatibility - No region filter requested → behavior unchanged (`attributionMethod` is `"observer"`). - Region filter requested but no zero-hop direct adverts seen for a node → falls back to the prior observer-region check for that node. - Operators without IATA-tagged observers see no change. ## TDD - **Red commit** (`c35d349`): adds `TestIssue804_AnalyticsAttributesByRepeaterRegion` with three subtests (PDX leak into SJC, attributionMethod field present, SJC leak into PDX). Compiles, runs, fails on assertions. - **Green commit** (`11b157f`): the implementation. All subtests pass, full `cmd/server` package green. ## Files changed - `cmd/server/store.go` — helpers + analytics filter logic (+236/-51) - `cmd/server/issue804_repeater_region_test.go` — new test (+147) --------- Co-authored-by: CoreScope Bot <bot@corescope.local> Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
1f4969c1a6 |
fix(#770): treat region 'All' as no-filter + document region behavior (#1026)
## Summary Fixes #770 — selecting "All" in the region filter dropdown produced an empty channel list. ## Root cause `normalizeRegionCodes` (cmd/server/db.go) treated any non-empty input as a literal IATA code. The frontend region filter labels its catch-all option **"All"**; while `region-filter.js` normally sends an empty string when "All" is selected, any code path that ends up sending `?region=All` (deep-link URLs, manual queries, future callers) caused the function to return `["ALL"]`. Downstream queries then filtered observers for `iata = 'ALL'`, which never matches anything → empty response. ## Fix `normalizeRegionCodes` now treats `All` / `ALL` / `all` (case-insensitive, with optional whitespace, mixed in CSV) as equivalent to an empty value, returning `nil` to signal "no filter". Real IATA codes (`SJC`, `PDX`, `sjc,PDX` → `[SJC PDX]`) still pass through unchanged. This is a defensive server-side fix: a single chokepoint that all region-aware endpoints already flow through (channels, packets, analytics, encrypted channels, observer ID resolution). ## Documentation Expanded `_comment_regions` in `config.example.json` to explain: - How IATA codes are resolved (payload > topic > source config — set in #1012) - What the `regions` map controls (display labels) vs runtime-discovered codes - That observers without an IATA tag only appear under "All Regions" - That the `All` sentinel is server-side safe ## TDD - **Red commit** (`4f65bf4`): `cmd/server/region_filter_test.go` — `TestNormalizeRegionCodes_AllIsNoFilter` asserts `All` / `ALL` / `all` / `""` / `"All,"` all collapse to `nil`. Compiles, runs, fails on assertion (`got [ALL], want nil`). Companion test `TestNormalizeRegionCodes_RealCodesPreserved` locks in that `sjc,PDX` still returns `[SJC PDX]`. - **Green commit** (`c9fb965`): two-line change in `normalizeRegionCodes` + docs update. ## Verification ``` $ go test -run TestNormalizeRegionCodes -count=1 ./cmd/server ok github.com/corescope/server 0.023s $ go test -count=1 ./cmd/server ok github.com/corescope/server 21.454s ``` Full suite green; no existing region tests regressed. Fixes #770 --------- Co-authored-by: Kpa-clawbot <bot@corescope> |
||
|
|
b06adf9f2a |
feat: /api/backup — one-click SQLite database export (#474) (#1022)
## Summary Implements `GET /api/backup` — one-click SQLite database export per #474. Operators can now grab a complete, consistent snapshot of the analyzer DB with a single authenticated request — no SSH, no scripts, no DB tooling. ## Endpoint ``` GET /api/backup X-API-Key: <key> # required → 200 OK Content-Type: application/octet-stream Content-Disposition: attachment; filename="corescope-backup-<unix>.db" <body: complete SQLite database file> ``` ## Approach Uses SQLite's `VACUUM INTO 'path'` to produce an atomic, defragmented copy of the database into a fresh file: - **Consistent**: VACUUM INTO runs at read isolation — the snapshot reflects a single point in time even while the ingestor is writing to the WAL. - **Non-blocking**: writers continue uninterrupted; we never hold a write lock. - **Works on read-only connections**: verified manually against a WAL-mode source DB (`mode=ro` connection successfully produces a snapshot). - **No corruption risk**: even if the live on-disk DB has issues, VACUUM INTO surfaces what the server can read rather than copying broken pages byte-for-byte. The snapshot is staged in `os.MkdirTemp(...)` and removed after the response body is fully streamed (deferred cleanup). Requesting client IP is logged for audit. The issue suggested an alternative in-memory rebuild path; `VACUUM INTO` is simpler, faster, and produces a strictly more accurate copy of what the server actually sees, so going with it. ## Security - Mounted under `requireAPIKey` middleware — same gate as other admin endpoints (`/api/admin/prune`, `/api/perf/reset`). - Returns 401 without a valid `X-API-Key` header. - Returns 403 if no API key is configured server-side. - `X-Content-Type-Options: nosniff` set on the response. ## TDD - **Red** (`99548f2`): `cmd/server/backup_test.go` adds `TestBackupRequiresAPIKey` + `TestBackupReturnsValidSQLiteSnapshot`. Stub handler returns 200 with no body so the tests fail on assertions (Content-Type / Content-Disposition / SQLite magic header), not on import or build errors. - **Green** (`837b2fe`): real implementation lands; both tests pass; full `go test ./...` suite stays green. ## Files - `cmd/server/backup.go` — handler implementation - `cmd/server/backup_test.go` — red-then-green tests - `cmd/server/routes.go` — route registration under `requireAPIKey` - `cmd/server/openapi.go` — OpenAPI metadata so `/api/openapi` advertises the endpoint ## Out of scope (follow-ups) - Rate limiting (issue suggested 1 req/min). Not added here — admin-key-gated endpoint with a fast snapshot path is acceptable for v1; happy to add a token-bucket limiter in a follow-up if operators report hammering. - UI button to trigger the download (frontend work — separate PR). Fixes #474 --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
51b9fed15e |
feat(roles): /#/roles page + /api/analytics/roles endpoint (Fixes #818) (#1023)
## Summary Implements `/#/roles` per QA #809 §5.4 / issue #818. The page previously showed "Page not yet implemented." ### Backend - New `GET /api/analytics/roles` returns `{ totalNodes, roles: [{ role, nodeCount, withSkew, meanAbsSkewSec, medianAbsSkewSec, okCount, warningCount, criticalCount, absurdCount, noClockCount }] }`. - Pure `computeRoleAnalytics(nodesByPubkey, skewByPubkey)` does the bucketing/aggregation — no store/lock dependency, fully unit-testable. - Roles are normalised (lowercased + trimmed; empty bucketed as `unknown`). ### Frontend - New `public/roles-page.js` renders a distribution table: count, share, distribution bar, w/ skew, median |skew|, mean |skew|, severity breakdown (OK / Warning / Critical / Absurd / No-clock). - Registered as the `roles` page in the SPA router and linked from the main nav. - Auto-refreshes every 60 s, with a manual refresh button. ### Tests (TDD) - **Red commit** (`9726d5b`): two assertion-failing tests against a stub `computeRoleAnalytics` that returns an empty result. Compiles, runs, fails on `TotalNodes = 0, want 5` and `len(Roles) = 0, want 1`. - **Green commit** (`7efb76a`): full implementation, route wiring, frontend page + nav, plus E2E test in `test-e2e-playwright.js` covering both the empty-state contract (no "Page not yet implemented" placeholder) and the populated-table case (header columns, body rows, API response shape). ### Verification - `go test ./cmd/server/...` green. - Local server with the e2e fixture: `GET /api/analytics/roles` returns `{"totalNodes":200,"roles":[{"role":"repeater","nodeCount":168,...},{"role":"room","nodeCount":23,...},{"role":"companion","nodeCount":9,...}]}`. Fixes #818 --------- Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
a56ee5c4fe |
feat(analytics): selectable timeframes via ?window/?from/?to (#842) (#1018)
## Summary Selectable analytics timeframes (#842). Adds backend support for `?window=1h|24h|7d|30d` and `?from=&to=` on the three main analytics endpoints (`/api/analytics/rf`, `/api/analytics/topology`, `/api/analytics/channels`), and a time-window picker in the Analytics page UI that drives them. Default behavior with no query params is unchanged. ## TDD trail - Red: `bbab04d` — adds `TimeWindow` + `ParseTimeWindow` stub and tests; tests fail on assertions because the stub returns the zero window. - Green: `75d27f9` — implements `ParseTimeWindow`, threads `TimeWindow` through `compute*` loops + caches, wires HTTP handlers, adds frontend picker + E2E. ## Backend changes - `cmd/server/time_window.go` — full `ParseTimeWindow` (`?window=` aliases + `?from=/&to=` RFC3339 absolute range; invalid input → zero window for backwards compatibility). - `cmd/server/store.go` — new `GetAnalytics{RF,Topology,Channels}WithWindow` wrappers; `compute*` loops skip transmissions whose `FirstSeen` (or per-obs `Timestamp` for the region+observer slice) falls outside the window. Cache key composes `region|window` so different windows do not poison each other. - `cmd/server/routes.go` — handlers call `ParseTimeWindow(r)` and dispatch to the `*WithWindow` methods. ## Frontend changes - `public/analytics.js` — new `<select id="analyticsTimeWindow">` rendered under the region filter (All / 1h / 24h / 7d / 30d). Selecting an option triggers `loadAnalytics()` which appends `&window=…` to every analytics fetch. ## Tests - `cmd/server/time_window_test.go` — covers all aliases, absolute range, no-params backwards compatibility, `Includes()` bounds, and `CacheKey()` distinctness. - `cmd/server/topology_dedup_test.go`, `cmd/server/channel_analytics_test.go` — updated callers to pass `TimeWindow{}`. ## E2E (rule 18) `test-e2e-playwright.js:592-611` — opens `/#/analytics`, asserts the picker is rendered with a `24h` option, then asserts that selecting `24h` triggers a network request to `/api/analytics/rf?…window=24h`. ## Backwards compatibility No params → zero `TimeWindow` → original code paths (no filter, region-only cache key). Verified by `TestParseTimeWindow_NoParams_BackwardsCompatible` and by the existing analytics tests still passing unchanged on `_wt-fix-842`. Fixes #842 --------- Co-authored-by: you <you@example.com> Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
df69a17718 |
feat(#772): short pubkey-prefix URLs for mesh sharing (#1016)
## Summary Fixes #772 — adds a short-URL form for node detail pages so operators can paste node links into a mesh chat without bringing along a 64-hex-char public key. ## Approach **Pubkey-prefix resolution** (no allocator, no lookup table). - The SPA hash route `#/nodes/<key>` already accepts whatever pubkey-shaped string the user pastes; the front end forwards it to `GET /api/nodes/<key>`. - When that lookup misses **and** the path is 8..63 hex chars, the backend now calls `DB.GetNodeByPrefix` and: - returns the matching node when exactly one node has that prefix, - returns **409 Conflict** when multiple nodes share the prefix (with a "use a longer prefix" hint), - falls through to the existing 404 otherwise. - 8 hex chars = 32 bits of entropy, which is enough for fleets in the low thousands. Operators can extend to 10–12 chars if collisions become common. - The full-screen node detail card gets a new **📡 Copy short URL** button that copies `…/#/nodes/<first 8 hex chars>`. ### Why not an opaque ID table (`/s/<id>`)? Considered and rejected: - Needs persistence + an allocator + cleanup story. - IDs aren't self-describing — operators can't sanity-check them. - IDs don't survive a DB rebuild. - 32 bits of pubkey already buys us collision resistance with zero moving parts. If the directory grows past the point where 8-char prefixes routinely collide, we can extend the minimum length without changing the URL shape. ## Changes - `cmd/server/db.go` — new `GetNodeByPrefix(prefix)` returning `(node, ambiguous, error)`. Validates hex; rejects <8 chars; `LIMIT 2` to detect collisions cheaply. - `cmd/server/routes.go` — `handleNodeDetail` falls back to prefix resolution; canonicalizes pubkey downstream; emits 409 on ambiguity; honors blacklist on the resolved pubkey. - `public/nodes.js` — adds **📡 Copy short URL** button + handler on the full-screen node detail card. - `cmd/server/short_url_test.go` — Go tests (red-then-green). - `test-e2e-playwright.js` — E2E: navigates via prefix-only URL and asserts the new button surfaces. ## TDD evidence - Red commit: `2dea97a` — tests added with a stub `GetNodeByPrefix` returning `(nil, false, nil)`. All four assertions failed (assertion failures, not build errors): expected node got nil; expected ambiguous=true got false; route 404 vs expected 200/409. - Green commit: `9b8f146` — implementation lands; `go test ./...` passes locally in `cmd/server`. ## Compatibility - Existing 64-char pubkey URLs are untouched (exact lookup runs first). - Blacklist is enforced both on the raw input and on the resolved pubkey. - No new config knobs. ## What I did **not** touch - `cmd/server/db_test.go`, other route tests — unchanged. - Packet-detail short URLs (issue scopes nodes; revisit in a follow-up if asked). Fixes #772 --------- Co-authored-by: clawbot <bot@corescope.local> |
||
|
|
c186129d47 |
feat: parse and display per-hop SNR values for TRACE packets (#1007)
## Summary Parse and display per-hop SNR values from TRACE packets in the Packet Byte Breakdown panel. ## Changes ### Backend (`cmd/server/decoder.go`) - Added `SNRValues []float64` field to Payload struct (`json:"snrValues,omitempty"`) - In the TRACE-specific block, extract SNR from header path bytes before they're overwritten with route hops - Each header path byte is `int8(SNR_dB * 4.0)` per firmware — decode by dividing by 4.0 ### Frontend (`public/packets.js`) - Added "SNR Path" section in `buildFieldTable()` showing per-hop SNR values in dB when packet type is TRACE - Added TRACE-specific payload rendering (trace tag, auth code, flags with hash_size, route hops) ## TDD - Red commit: `4dba4e8` — test asserts `Payload.SNRValues` field (compile fails, field doesn't exist) - Green commit: `5a496bd` — implementation passes all tests ## Testing - `go test ./...` passes (all existing + 2 new TRACE SNR tests) - No frontend test changes needed (no existing TRACE UI tests; rendering is additive) Fixes #979 --------- Co-authored-by: you <you@example.com> |
||
|
|
e86b5a3a0c |
feat: show multi-byte hash support indicator on map markers (#1002)
## Summary Show 2-byte hash support indicator on map markers. Fixes #903. ## What changed ### Backend (`cmd/server/store.go`, `cmd/server/routes.go`) - **`EnrichNodeWithMultiByte()`** — new enrichment function that adds `multi_byte_status` (confirmed/suspected/unknown), `multi_byte_evidence` (advert/path), and `multi_byte_max_hash_size` fields to node API responses - **`GetMultiByteCapMap()`** — cached (15s TTL) map of pubkey → `MultiByteCapEntry`, reusing the existing `computeMultiByteCapability()` logic that combines advert-based and path-hop-based evidence - Wired into both `/api/nodes` (list) and `/api/nodes/{pubkey}` (detail) endpoints ### Frontend (`public/map.js`) - Added **"Multi-byte support"** checkbox in the map Display controls section - When toggled on, repeater markers change color: - 🟢 Green (`#27ae60`) — **confirmed** (advertised with hash_size ≥ 2) - 🟡 Yellow (`#f39c12`) — **suspected** (seen as hop in multi-byte path) - 🔴 Red (`#e74c3c`) — **unknown** (no multi-byte evidence) - Popup tooltip shows multi-byte status and evidence for repeaters - State persisted in localStorage (`meshcore-map-multibyte-overlay`) ## TDD - Red commit: `2f49cbc` — failing test for `EnrichNodeWithMultiByte` - Green commit: `4957782` — implementation + passing tests ## Performance - `GetMultiByteCapMap()` uses a 15s TTL cache (same pattern as `GetNodeHashSizeInfo`) - Enrichment is O(n) over nodes, no per-item API calls - Frontend color override is computed inline during existing marker render loop — no additional DOM rebuilds --------- Co-authored-by: you <you@example.com> |
||
|
|
564d93d6aa |
fix: dedup topology analytics by resolved pubkey (#998)
## Fix topology analytics double-counting repeaters/pairs (#909) ### Problem `computeAnalyticsTopology()` aggregates by raw hop hex string. When firmware emits variable-length path hashes (1-3 bytes per hop), the same physical node appears multiple times with different prefix lengths (e.g. `"07"`, `"0735bc"`, `"0735bc6d"` all referring to the same node). This inflates repeater counts and creates duplicate pair entries. ### Solution Added a confidence-gated dedup pass after frequency counting: 1. **For each hop prefix**, check if it resolves unambiguously (exactly 1 candidate in the prefix map) 2. **Unambiguous prefixes** → group by resolved pubkey, sum counts, keep longest prefix as display identifier 3. **Ambiguous prefixes** (multiple candidates for that prefix) → left as separate entries (conservative) 4. **Same treatment for pairs**: canonicalize by sorted pubkey pair ### Addressing @efiten's collision concern At scale (~2000+ repeaters), 1-byte prefixes (256 buckets) WILL collide. This fix explicitly checks the prefix map candidate count. Ambiguous prefixes (where `len(pm.m[hop]) > 1`) are never merged — they remain as separate entries. Only prefixes with a single matching node are eligible for dedup. ### TDD - **Red commit**: `4dbf9c0` — added 3 failing tests - **Green commit**: `d6cae9a` — implemented dedup, all tests pass ### Tests added - `TestTopologyDedup_RepeatersMergeByPubkey` — verifies entries with different prefix lengths for same node merge to single entry with summed count - `TestTopologyDedup_AmbiguousPrefixNotMerged` — verifies colliding short prefix stays separate from unambiguous longer prefix - `TestTopologyDedup_PairsMergeByPubkey` — verifies pair entries merge by resolved pubkey pair Fixes #909 --------- Co-authored-by: you <you@example.com> |
||
|
|
b7c280c20a |
fix: drop/filter packets with null hash or timestamp (closes #871) (#993)
## Summary Closes #871 The `/api/packets` endpoint could return packets with `null` hash or timestamp fields. This was caused by legacy data in SQLite (rows with empty `hash` or `NULL`/empty `first_seen`) predating the ingestor's existing validation guard (`if hash == "" { return false, nil }` at `cmd/ingestor/db.go:610`). ## Root Cause `cmd/server/store.go` `filterPackets()` had no data-integrity guard. Legacy rows with empty `hash` or `first_seen` were loaded into the in-memory store and returned verbatim. The `strOrNil("")` helper then serialized these as JSON `null`. ## Fix Added a data-integrity predicate at the top of `filterPackets`'s scan callback (`cmd/server/store.go:2278`): ```go if tx.Hash == "" || tx.FirstSeen == "" { return false } ``` This filters bad legacy rows at query time. The write path (ingestor) already rejects empty hashes, so no new bad data enters. ## TDD Evidence - **Red commit:** `15774c3` — test `TestIssue871_NoNullHashOrTimestamp` asserts no packet in API response has null/empty hash or timestamp - **Green commit:** `281fd6f` — adds the filter guard, test passes ## Testing - `go test ./...` in `cmd/server` passes (full suite) - Client-side defensive filter from PR #868 remains as defense-in-depth --------- Co-authored-by: you <you@example.com> |
||
|
|
dd2f044f2b |
fix: cache RW SQLite connection + dedup DBConfig (closes #921) (#982)
Closes #921 ## Summary Follow-up to #920 (incremental auto-vacuum). Addresses both items from the adversarial review: ### 1. RW connection caching Previously, every call to `openRW(dbPath)` opened a new SQLite RW connection and closed it after use. This happened in: - `runIncrementalVacuum` (~4x/hour) - `PruneOldPackets`, `PruneOldMetrics`, `RemoveStaleObservers` - `buildAndPersistEdges`, `PruneNeighborEdges` - All neighbor persist operations Now a single `*sql.DB` handle (with `MaxOpenConns(1)`) is cached process-wide via `cachedRW(dbPath)`. The underlying connection pool manages serialization. The original `openRW()` function is retained for one-shot test usage. ### 2. DBConfig dedup `DBConfig` was defined identically in both `cmd/server/config.go` and `cmd/ingestor/config.go`. Extracted to `internal/dbconfig/` as a shared package; both binaries now use a type alias (`type DBConfig = dbconfig.DBConfig`). ## Tests added | Test | File | |------|------| | `TestCachedRW_ReturnsSameHandle` | `cmd/server/rw_cache_test.go` | | `TestCachedRW_100Calls_SingleConnection` | `cmd/server/rw_cache_test.go` | | `TestGetIncrementalVacuumPages_Default` | `internal/dbconfig/dbconfig_test.go` | | `TestGetIncrementalVacuumPages_Configured` | `internal/dbconfig/dbconfig_test.go` | ## Verification ``` ok github.com/corescope/server 20.069s ok github.com/corescope/ingestor 47.117s ok github.com/meshcore-analyzer/dbconfig 0.003s ``` Both binaries build cleanly. 100 sequential `cachedRW()` calls return the same handle with exactly 1 entry in the cache map. --------- Co-authored-by: you <you@example.com> |
||
|
|
fc57433f27 |
fix(analytics): merge channel buckets by hash byte; reject rainbow-table mismatches (closes #978) (#980)
## Summary Closes #978 — analytics channels duplicated by encrypted/decrypted split + rainbow-table collisions. ## Root cause Two distinct bugs in `computeAnalyticsChannels` (`cmd/server/store.go`): 1. **Encrypted/decrypted split**: The grouping key included the decoded channel name (`hash + "_" + channel`), so packets from observers that could decrypt a channel created a separate bucket from packets where decryption failed. Same physical channel, two entries. 2. **Rainbow-table collisions**: Some observers' lookup tables map hash bytes to wrong channel names. E.g., hash `72` incorrectly claimed to be `#wardriving` (real hash is `129`). This created ghost 1-message entries. ## Fix 1. **Always group by hash byte alone** (drop `_channel` suffix from `chKey`). When any packet decrypts successfully, upgrade the bucket's display name from placeholder (`chN`) to the real name (first-decrypter-wins for stability). 2. **Validate channel names** against the firmware hash invariant: `SHA256(SHA256("#name")[:16])[0] == channelHash`. Mismatches are treated as encrypted (placeholder name, no trust in decoded channel). Guard is in the analytics handler (not the ingestor) to avoid breaking other surfaces that use the decoded field for display. ## Verification (e2e-fixture.db) | Metric | BEFORE | AFTER | |--------|--------|-------| | Total channels | 22 | 19 | | Duplicate hash bytes | 3 (hashes 217, 202, 17) | 0 | ## Tests added - `TestComputeAnalyticsChannels_MergesEncryptedAndDecrypted` — same hash, mixed encrypted/decrypted → ONE bucket - `TestComputeAnalyticsChannels_RejectsRainbowTableMismatch` — hash 72 claimed as `#wardriving` (real=129) → rejected, stays `ch72` - `TestChannelNameMatchesHash` — unit test for hash validation helper - `TestIsPlaceholderName` — unit test for placeholder detection Anti-tautology gate: both main tests fail when their respective fix lines are reverted. Co-authored-by: you <you@example.com> |
||
|
|
4b8d8143f4 |
feat(server): explicit CORS policy with configurable origin allowlist (#883) (#971)
## Summary Adds explicit CORS policy support to the CoreScope API server, closing #883. ### Problem The API relied on browser same-origin defaults with no way for operators to configure cross-origin access. Operators running dashboards or third-party frontends on different origins had no supported way to make API calls. ### Solution **New config option:** `corsAllowedOrigins` (string array, default `[]`) **Middleware behavior:** | Config | Behavior | |--------|----------| | `[]` (default) | No `Access-Control-*` headers added — browsers enforce same-origin. **Preserves current behavior.** | | `["https://dashboard.example.com"]` | Echoes matching `Origin`, sets `Allow-Methods`/`Allow-Headers` | | `["*"]` | Sets `Access-Control-Allow-Origin: *` (explicit opt-in only) | **Headers set when origin matches:** - `Access-Control-Allow-Origin: <origin>` (or `*`) - `Access-Control-Allow-Methods: GET, POST, OPTIONS` - `Access-Control-Allow-Headers: Content-Type, X-API-Key` - `Vary: Origin` (non-wildcard only) **Preflight handling:** `OPTIONS` → `204 No Content` with CORS headers (or `403` if origin not in allowlist). ### Config example ```json { "corsAllowedOrigins": ["https://dashboard.example.com", "https://monitor.internal"] } ``` ### Files changed | File | Change | |------|--------| | `cmd/server/cors.go` | New CORS middleware | | `cmd/server/cors_test.go` | 7 unit tests covering all branches | | `cmd/server/config.go` | `CORSAllowedOrigins` field | | `cmd/server/routes.go` | Wire middleware before all routes | ### Testing **Unit tests (7):** - Default config → no CORS headers - Allowlist match → headers present with `Vary: Origin` - Allowlist miss → no CORS headers - Preflight allowed → 204 with headers - Preflight rejected → 403 - Wildcard → `*` without `Vary` - No `Origin` header → pass-through **Live verification (Rule 18):** ``` # Default (empty corsAllowedOrigins): $ curl -I -H "Origin: https://evil.example" localhost:19883/api/health HTTP/1.1 200 OK # No Access-Control-* headers ✓ # With corsAllowedOrigins: ["https://good.example"]: $ curl -I -H "Origin: https://good.example" localhost:19884/api/health Access-Control-Allow-Origin: https://good.example Access-Control-Allow-Methods: GET, POST, OPTIONS Access-Control-Allow-Headers: Content-Type, X-API-Key Vary: Origin ✓ $ curl -I -H "Origin: https://evil.example" localhost:19884/api/health # No Access-Control-* headers ✓ $ curl -I -X OPTIONS -H "Origin: https://good.example" localhost:19884/api/health HTTP/1.1 204 No Content Access-Control-Allow-Origin: https://good.example ✓ ``` Closes #883 Co-authored-by: you <you@example.com> |
||
|
|
3364eed303 |
feat: separate "Last Status Update" from "Last Packet Observation" for observers (v3 rebase) (#969)
Rebased version of #968 (which was itself a rebase of #905) — resolves merge conflict with #906 (clock-skew UI) that landed on master. ## Conflict resolution **`public/observers.js`** — master (#906) added "Clock Offset" column to observer table; #968 split "Last Seen" into "Last Status" + "Last Packet" columns. Combined both: the table now has Status | Name | Region | Last Status | Last Packet | Packets | Packets/Hour | Clock Offset | Uptime. ## What this PR adds (unchanged from #968/#905) - `last_packet_at` column in observers DB table - Separate "Last Status Update" and "Last Packet Observation" display in observers list and detail page - Server-side migration to add the column automatically - Backfill heuristic for existing data - Tests for ingestor and server ## Verification - All Go tests pass (`cmd/server`, `cmd/ingestor`) - Frontend tests pass (`test-packets.js`, `test-hash-color.js`) - Built server, hit `/api/observers` — `last_packet_at` field present in JSON - Observer table header has all 9 columns including both Last Packet and Clock Offset ## Prior PRs - #905 — original (conflicts with master) - #968 — first rebase (conflicts after #906 landed) - This PR — second rebase, resolves #906 conflict Supersedes #968. Closes #905. --------- Co-authored-by: you <you@example.com> |
||
|
|
40c3aa13f9 |
fix(paths): exclude false-positive paths from short-prefix collisions (#930)
Fixes #929 ## Summary - `handleNodePaths` pulls candidates from `byPathHop` using 2-char and 4-char prefix keys (e.g. `"7a"` for a node using 1-byte adverts) - When two nodes share the same short prefix, paths through the *other* node are included as candidates - The `resolved_path` post-filter covers decoded packets but falls through conservatively (`inIndex = true`) when `resolved_path` is NULL, letting false positives reach the response **Fix:** during the aggregation phase (which already calls `resolveHop` per hop), add a `containsTarget` check. If every hop resolves to a different node's pubkey, skip the path. Packets confirmed via the full-pubkey index key or via SQL bypass the check. Unresolvable hops are kept conservatively. ## Test plan - [x] `TestHandleNodePaths_PrefixCollisionExclusion`: two nodes sharing `"7a"` prefix; verifies the path with no `resolved_path` (false positive) is excluded and the SQL-confirmed path (true positive) is included - [x] Full test suite: `go test github.com/corescope/server` — all pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
b47587f031 |
feat(#690): expose observer skew + per-hash evidence in clock UI (#906)
## Summary UI completion of #690 — surfaces observer clock skew and per-hash evidence that the backend already computes but wasn't exposed in the frontend. **Not related to #845/PR #894** (bimodal detection) — this is the UI surface for the original #690 scope. ## Changes ### Backend: per-hash evidence in node clock-skew API (commit 1) - Extended `GET /api/nodes/{pubkey}/clock-skew` to return `recentHashEvidence` (most recent 10 hashes with per-observer raw/corrected skew and observer offset) and `calibrationSummary` (total/calibrated/uncalibrated counts). - Evidence is cached during `ClockSkewEngine.Recompute()` — route handler is cheap. - Fleet endpoint omits evidence to keep payload small. ### Frontend: observer list page — clock offset column (commit 2) - Added "Clock Offset" column to observers table. - Fetches `/api/observers/clock-skew` once on page load, joins by ObserverID. - Color-coded severity badge + sample count tooltip. - Singleton observers show "—" not "0". ### Frontend: observer-detail clock card (commit 3) - Added clock offset card mirroring node clock card style. - Shows: offset value, sample count, severity badge. - Inline explainer describing how offset is computed from multi-observer packets. ### Frontend: node clock card evidence panel (commit 4) - Collapsible "Evidence" section in existing node clock skew card. - Per-hash breakdown: observer count, median corrected skew, per-observer raw/corrected/offset. - Calibration summary line and plain-English severity reason at top. ## Test Results ``` go test ./... (cmd/server) — PASS (19.3s) go test ./... (cmd/ingestor) — PASS (31.6s) Frontend helpers: 610 passed, 0 failed ``` New test: `TestNodeClockSkew_EvidencePayload` — 3-observer scenario verifying per-hash array shape, corrected = raw + offset math, and median. No frontend JS smoke test added — no existing test harness for clock/observer rendering. Noted for future. ## Screenshots Screenshots TBD ## Perf justification Evidence is computed inside the existing `Recompute()` cycle (already O(n) on samples). The `hashEvidence` map adds ~32 bytes per sample of memory. Evidence is stripped from fleet responses. Per-node endpoint returns at most 10 evidence entries — bounded payload. --------- Co-authored-by: you <you@example.com> |
||
|
|
b3a9677c52 |
feat(ingestor + server): observerBlacklist config (#962) (#963)
## Summary Implements `observerBlacklist` config — mirrors the existing `nodeBlacklist` pattern for observers. Drop observers by pubkey at ingest, with defense-in-depth filtering on the server side. Closes #962 ## Changes ### Ingestor (`cmd/ingestor/`) - **`config.go`**: Added `ObserverBlacklist []string` field + `IsObserverBlacklisted()` method (case-insensitive, whitespace-trimmed) - **`main.go`**: Early return in `handleMessage` when `parts[2]` (observer ID from MQTT topic) matches blacklist — before status handling, before IATA filter. No UpsertObserver, no observations, no metrics insert. Log line: `observer <pubkey-short> blacklisted, dropping` ### Server (`cmd/server/`) - **`config.go`**: Same `ObserverBlacklist` field + `IsObserverBlacklisted()` with `sync.Once` cached set (same pattern as `nodeBlacklist`) - **`routes.go`**: Defense-in-depth filtering in `handleObservers` (skip blacklisted in list) and `handleObserverDetail` (404 for blacklisted ID) - **`main.go`**: Startup `softDeleteBlacklistedObservers()` marks matching rows `inactive=1` so historical data is hidden - **`neighbor_persist.go`**: `softDeleteBlacklistedObservers()` implementation ### Tests - `cmd/ingestor/observer_blacklist_test.go`: config method tests (case-insensitive, empty, nil) - `cmd/server/observer_blacklist_test.go`: config tests + HTTP handler tests (list excludes blacklisted, detail returns 404, no-blacklist passes all, concurrent safety) ## Config ```json { "observerBlacklist": [ "EE550DE547D7B94848A952C98F585881FCF946A128E72905E95517475F83CFB1" ] } ``` ## Verification (Rule 18 — actual server output) **Before blacklist** (no config): ``` Total: 31 DUBLIN in list: True ``` **After blacklist** (DUBLIN Observer pubkey in `observerBlacklist`): ``` [observer-blacklist] soft-deleted 1 blacklisted observer(s) Total: 30 DUBLIN in list: False ``` Detail endpoint for blacklisted observer returns **404**. All existing tests pass (`go test ./...` for both server and ingestor). --------- Co-authored-by: you <you@example.com> |