Commit Graph

280 Commits

Author SHA1 Message Date
Kpa-clawbot 2e28aa3e04 fix(#1229): source-diversity confidence weighting in neighbor-graph tier-1 resolver (#1235)
RED 235b65b4 (CI will surface URL after PR open) — `test(#1229): tier-1
must prefer multi-observer edges`. Green: 841fc5de.

## Summary
Implements **Option C** from issue #1229: edge source-diversity
confidence weighting. Each neighbor-graph edge already tracks the set of
distinct observers that contributed to it (`NeighborEdge.Observers`).
This PR is the first to consume that signal in the disambiguator.

Tier-1 score in `pm.resolveWithContext` becomes `Score(now) ×
Confidence()` where:

```
Confidence() = min(1.0, max(1, |Observers|) / 3.0)
```

- 1 observer → 1/3 weight (single-source, suspect)
- 2 observers → 2/3 weight
- ≥3 observers → 1.0 (saturated, full historical weight)

A 6-observer edge (30 obs) now beats a 1-observer edge (25 obs) by 3.6×
(vs. 1.2× before) — enough to clear `affinityConfidenceRatio` and skip
the tier-2 geo fallback that was misresolving in cross-region cases.
Stacks with the geo-rejection filter merged in #1228/#1230 to give two
independent defenses against cross-region prefix-collision pollution.

## Why C over A/B
- **A (per-observer graphs):** N×memory cost, biggest refactor surface.
- **B (per-region/IATA segmented):** requires region attribution on
every packet + per-region cache plumbing; deferred follow-up.
- **C:** smallest diff (~30 lines), no schema migration, leverages an
existing field, composes additively with #1228.

A and B remain valid follow-ups if C proves insufficient.

## Backward compatibility (persistence)
`neighbor_edges` schema is **unchanged**. `Observers` is rebuilt by
`BuildFromStoreWithOptions` from live observations on every graph
refresh (5-min TTL). Persisted rows carry an empty set only during the
post-restart warm-up; `Confidence()` defaults n→1 when `|Observers|==0`,
so legacy rows resolve as single-observer (degraded but non-zero)
confidence rather than disappearing. Defensive.

## Tests
- `cmd/server/hop_disambig_confidence_test.go:48` — RED-then-GREEN E2E:
two `8a` candidates from the same anchor, candX placed geo-near with 1
observer × 25 obs, candY placed geo-far with 6 observers × 5 obs.
Without confidence weighting tier-1 falls through (1.2× ratio) and
tier-2 picks the wrong (geo-near) candX. With confidence weighting
tier-1 fires and picks candY. Asserts `method == "neighbor_affinity"` to
pin the resolver path.
- `TestNeighborEdge_ObserverSetIsDistinct` — guards the source-diversity
counter against double-counting same-observer contributions and pins the
`Confidence()` formula at both endpoints (single → fractional, ≥3 →
1.0).

All existing tier-1 tests (`hop_disambig_tier1_test.go`) continue to
pass — they seed with a single observer, so their weights drop from 1.0
to 1/3 uniformly across candidates, preserving the ratio guard outcome.

Fixes #1229

---------

Co-authored-by: bot <bot@corescope.local>
2026-05-16 19:55:00 +00:00
Kpa-clawbot b21badbcbd fix(#1225): paginate channel messages at SQL level — 30s → <500ms (#1226)
## Summary
Fixes #1225 — channel messages endpoint took ~30s on staging.

## Root cause
`(*DB).GetChannelMessages` SELECTed every observation row for the
channel (one row per observation, not per transmission),
JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender,
packetHash)`, then sliced the tail in Go for pagination.

On staging `#wardriving`:
- `transmissions` rows with `channel_hash='#wardriving' AND
payload_type=5`: **5,703**
- `observations` joined to those: **274,632** (~48× amplification)
- `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s /
31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop)

`EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being
used — the cost was entirely in fetching, unmarshalling, and folding the
full observation set per request even for `limit=50`.

Hypothesis #1 from the issue (full table scan on `messages/decoded`) is
rejected; #2 (missing index) is rejected; the actual cause was
**pagination in Go instead of SQL** — request cost was O(observations)
not O(limit).

## Fix
Move pagination into SQL on the `transmissions` table. Because
`transmissions.hash` is `UNIQUE` and the original dedup key was
`(sender, hash)`, each transmission collapses to exactly one logical
message — paginating on transmissions is semantically equivalent to the
prior in-Go dedup + tail slice.

New shape:
1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`).
2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ?
OFFSET ?` to pick the page of newest transmissions.
3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` —
typically 50 ids → a few hundred observation rows.
4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API
contract.

Region filtering, observation-count-as-`repeats`, and "first observation
wins for hops/snr/observer" semantics are preserved (observations are
scanned `ORDER BY o.id ASC`).
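
A rough Go sketch of the reassembly in step 4, assuming the observation rows arrive sorted by `id` ascending. The `obsRow` and `message` types and the `foldPage` function are hypothetical stand-ins, not the actual types used by `(*DB).GetChannelMessages`.

```go
package main

import "fmt"

// obsRow is a hypothetical projection of one observations row.
type obsRow struct {
	ID             int64
	TransmissionID int64
	Observer       string
	SNR            float64
}

// message is a hypothetical per-transmission result.
type message struct {
	TransmissionID int64
	Observer       string
	SNR            float64
	Repeats        int // number of observations for this transmission
}

// foldPage groups rows (assumed sorted by ID ascending) into one message per
// transmission and returns them in the order given by pageIDs.
func foldPage(pageIDs []int64, rows []obsRow) []message {
	byTx := make(map[int64]*message, len(pageIDs))
	for _, r := range rows {
		m, ok := byTx[r.TransmissionID]
		if !ok {
			// First observation wins for observer and SNR.
			m = &message{TransmissionID: r.TransmissionID, Observer: r.Observer, SNR: r.SNR}
			byTx[r.TransmissionID] = m
		}
		m.Repeats++
	}
	out := make([]message, 0, len(pageIDs))
	for _, id := range pageIDs {
		if m, ok := byTx[id]; ok {
			out = append(out, *m)
		}
	}
	return out
}

func main() {
	rows := []obsRow{
		{ID: 1, TransmissionID: 10, Observer: "obs-a", SNR: 1.5},
		{ID: 2, TransmissionID: 11, Observer: "obs-b", SNR: 2.0},
		{ID: 3, TransmissionID: 10, Observer: "obs-c", SNR: 3.0},
	}
	for _, m := range foldPage([]int64{10, 11}, rows) {
		fmt.Printf("%+v\n", m)
	}
}
```

The cost of this fold is bounded by the observations belonging to one page of transmissions rather than by the whole channel.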

## Perf measurements
**Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop):
30.04s, 31.41s, 31.48s, 35.33s, 34.05s.
**Synthetic regression test**
(`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs.
- Broken impl: ~4.5s (test fails the 500ms budget — the RED commit).
- Fixed impl: well under 500ms (test passes).
**After (staging)**: to be measured post-deploy; the numbers will be
posted as a comment on the issue. Synthetic scaling: staging is ~2× the
test's transmission count, and fixed-path cost scales with `limit` (50)
+ `COUNT(*)` (~5k rows on index), so expect <100ms p99.

## TDD
- RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at
~4.5s.
- GREEN: `3f1f82d3` — fix; full suite green, perf test passes.

## Hypotheses status
| # | Hypothesis | Verdict |
|---|---|---|
| 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different mechanism — see root cause) |
| 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash` exists & used) |
| 3 | Frontend re-render storm | Not investigated (backend was clearly the bottleneck) |
| 4 | Decode in request path | Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) |
| 5 | WS subscription failure | Rejected |
| 6 | Staging artifact | Rejected (reproducible) |

## Out of scope
- The in-memory `(*PacketStore).GetChannelMessages` path (used when
`s.db == nil`) has the same shape but operates on bounded in-memory
data; not touched. If we ever fall back to it in production we'll
revisit.

---------

Co-authored-by: clawbot <bot@corescope>
2026-05-16 17:28:40 +00:00
Kpa-clawbot 7179afcfde feat(#1228): reject geo-implausible neighbor-graph edges at build time (#1230)
Fixes #1228 — geo-implausible neighbor-graph edges are rejected at build
time.

Red commit: `5a6d9660` — failing tests for 4 cases (reject SF↔Berlin,
accept local CA, accept no-GPS endpoint, counter increments). Live CI
run (latest commit):
https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1228

## Why

The disambiguator's tier-1 affinity graph is built blindly from path
co-occurrence. On wide-geo MQTT deployments, a single bad hop
disambiguation seeds an edge across geographically impossible distances
(e.g. Bay Area ↔ Berlin), which then reinforces the same wrong
resolution next time. Self-poisoning spiral.

## What changed

- `upsertEdge` now consults a per-graph GPS index. When **both**
endpoints have known GPS and their haversine distance exceeds the
threshold, the edge is dropped and `NeighborGraph.RejectedEdgesGeoFar`
(atomic) is incremented.
- Either endpoint missing GPS ⇒ accept (no signal to reject), per
acceptance criteria.
- Threshold is configurable via `neighborGraph.maxEdgeKm` (default **500
km** — well above any plausible terrestrial LoRa hop, including
satellite-assisted). 0 ⇒ use default; negative ⇒ disable the filter.
Exposed via `Config.NeighborMaxEdgeKm()`.
- New `BuildFromStoreWithOptions` carrying the threshold;
`BuildFromStore` and `BuildFromStoreWithLog` are kept as thin wrappers.
- Stats are surfaced under `GET /api/analytics/neighbor-graph` as
`stats.rejected_edges_geo_far`.
- All rejection logs PII-truncate pubkeys to 8 hex chars (public repo
discipline).
- `config.example.json` updated with the new field + comment.
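
The rejection rule can be sketched in Go as follows. This is a simplified illustration rather than the code in `upsertEdge`: `point`, `haversineKm` and `rejectEdge` are hypothetical names, and the coordinates are approximate city centres chosen for the demo.

```go
package main

import (
	"fmt"
	"math"
)

const defaultMaxEdgeKm = 500.0 // default threshold quoted above

// point is a hypothetical endpoint record; HasGPS is false when no position is known.
type point struct {
	Lat, Lon float64
	HasGPS   bool
}

// haversineKm returns the great-circle distance between two points in km.
func haversineKm(a, b point) float64 {
	const earthRadiusKm = 6371.0
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(b.Lat - a.Lat)
	dLon := toRad(b.Lon - a.Lon)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(a.Lat))*math.Cos(toRad(b.Lat))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Min(1, math.Sqrt(h)))
}

// rejectEdge applies the threshold semantics described above:
// 0 means use the default, negative disables the filter, and an
// endpoint without GPS never causes a rejection.
func rejectEdge(a, b point, maxEdgeKm float64) bool {
	if maxEdgeKm < 0 {
		return false // filter disabled
	}
	if maxEdgeKm == 0 {
		maxEdgeKm = defaultMaxEdgeKm
	}
	if !a.HasGPS || !b.HasGPS {
		return false // no signal to reject on
	}
	return haversineKm(a, b) > maxEdgeKm
}

func main() {
	sf := point{Lat: 37.7749, Lon: -122.4194, HasGPS: true}
	sanJose := point{Lat: 37.3382, Lon: -121.8863, HasGPS: true}
	berlin := point{Lat: 52.52, Lon: 13.405, HasGPS: true}
	noGPS := point{}

	fmt.Println("SF-Berlin rejected:", rejectEdge(sf, berlin, 0))
	fmt.Println("SF-San Jose rejected:", rejectEdge(sf, sanJose, 0))
	fmt.Println("SF-unknown rejected:", rejectEdge(sf, noGPS, 0))
}
```

A real implementation would also need the atomic rejection counter and the truncated-pubkey logging listed above, which this sketch leaves out.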

## Follow-up

#1229 (per-region scoped affinity graphs) depends on this landing first.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-16 10:14:44 -07:00
Kpa-clawbot 170f0ac66d fix(#1212): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death (#1216)
RED commit: `1cd25f7b` — CI (failing on assertion):
https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f

Fixes #1212

## Root cause

NOT that `AutoReconnect` was off — it was set;
`MaxReconnectInterval=30s` was set (PR #949); a `SetReconnectingHandler`
was wired. The defect was an **observability gap**:

`SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine.
If that goroutine never iterates (status race after the recovered
handler panic at 21:07:13, or an internal abort), operators see ONLY the
`disconnected: pingresp not received` line and then total silence. They
cannot distinguish "paho is patiently retrying" from "paho gave up and
the goroutine is gone." That ambiguity is what turned a 30s blip into 6h
of downtime.

## Changes

### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler`
Fires on every TCP/TLS dial — the initial `Connect()` AND every
reconnect — independent of paho's internal reconnect-loop state. Logs:

```
MQTT [staging] connection attempt #1 to tcp://broker:1883
MQTT [staging] connection attempt #2 to tcp://broker:1883
```

Per-source attempt counter via `atomic.AddInt64`.

### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog
Satisfies the watchdog acceptance criterion. Even when paho reports
`connected`, if no MQTT messages have flowed for >5m, log a WARN line
every 60s:

```
MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall
```

Catches half-open TCP and broker-accepted-but-not-forwarding scenarios
that look "connected" to paho.

Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog
scans the registry once a minute.

### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new)
- `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts
`OnConnectAttempt` is wired in `buildMQTTOpts`.
- `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent +
5m threshold → stall flagged.
- `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall.
- `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no
stall (paho's reconnect logging covers it).

## TDD
- RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns
no-stall, `OnConnectAttempt` nil).
- GREEN `2527be6f` — implementation; all ingestor tests pass.

## Out of scope
- Slice-bounds decode panic (#1211, separate PR).
- A full in-process MQTT broker integration test would require a new dep
(mochi-mqtt) — the observability and watchdog behaviors are
independently verifiable by the unit tests above, and the reconnect path
itself is paho's responsibility (we already test it's configured via
`mqtt_opts_test.go`).

---------

Co-authored-by: bot <bot@example.com>
Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>
2026-05-15 22:46:29 -07:00
Kpa-clawbot eba9e89a72 fix(#1203): path-inspector — singleflight + stale-while-revalidate (#1208)
Red commit: c84a8f575a (CI run: pending
push)

Fixes #1203 — path-inspector 503 storm.

Three sub-fixes, each shipped as red→green per AGENTS TDD:

**A. Singleflight on rebuild** (`ensureNeighborGraph`)
Hand-rolled `sync.Mutex + chan` singleflight — no new deps (x/sync was
not in cmd/server's go.mod). Concurrent callers attach to one in-flight
rebuild instead of N parallel `BuildFromStore` goroutines.
- Red: `7340f23b` — test asserts ≤1 build under 10 concurrent callers
(saw 10 on master)
- Green: `abac6b3c`

**B. Stale-while-revalidate** (`handlePathInspect`)
Stale non-nil graph is served immediately with `"stale": true` while a
background rebuild runs (deduped by A). The 2s synchronous gate is gone.
Stale responses are not cached, so the next request after rebuild lands
fresh.
- Red: `c84a8f57` — test asserts 200+`stale:true`+rebuild-kickoff
(master returned 503)
- Green: `5eb86975`

**C. Cold-start 503 still kicks rebuild**
True cold start (`graph == nil`) is the only path that still returns 503
`{"retry": true}`, but it now spawns an async `ensureNeighborGraph` so
the very next request warms up.
- Green test: `f5ac7059` (passed on top of A+B)

Singleflight verified: `TestEnsureNeighborGraph_Singleflight`
Stale-while-revalidate verified:
`TestHandlePathInspect_StaleWhileRevalidate`
Cold-start verified: `TestHandlePathInspect_ColdStartKicksRebuild`
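
For illustration, a mutex-plus-channel singleflight can look roughly like the Go sketch below. It is not the code in `ensureNeighborGraph`: the `rebuilder` type, its fields and `runConcurrent` are hypothetical, and the `attached` counter exists only so the demo can wait deterministically for callers to park.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// rebuilder deduplicates concurrent rebuild requests.
type rebuilder struct {
	mu       sync.Mutex
	inflight chan struct{} // non-nil while a rebuild is running
	attached int64         // callers that joined an in-flight rebuild (demo only)
	build    func()
}

// ensure runs build once per burst; concurrent callers wait for that run.
func (r *rebuilder) ensure() {
	r.mu.Lock()
	if ch := r.inflight; ch != nil {
		atomic.AddInt64(&r.attached, 1)
		r.mu.Unlock()
		<-ch // attach to the in-flight rebuild
		return
	}
	ch := make(chan struct{})
	r.inflight = ch
	r.mu.Unlock()

	defer func() {
		r.mu.Lock()
		r.inflight = nil
		r.mu.Unlock()
		close(ch) // wake every attached caller
	}()
	r.build()
}

// runConcurrent fires n concurrent callers and returns how many builds ran.
func runConcurrent(n int) int64 {
	var builds int64
	release := make(chan struct{})
	r := &rebuilder{build: func() {
		atomic.AddInt64(&builds, 1)
		<-release // simulate a slow rebuild
	}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensure()
		}()
	}
	// Wait until n-1 callers are parked on the in-flight rebuild.
	for atomic.LoadInt64(&r.attached) < int64(n-1) {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&builds)
}

func main() {
	fmt.Println("builds under 10 concurrent callers:", runConcurrent(10))
}
```

The deferred cleanup clears the in-flight marker even if the build panics, so later callers are not left waiting on a channel that never closes.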

**Acceptance criteria (issue #1203):**
- [x] Concurrent requests share ONE rebuild
- [x] Stale non-nil graph served with `stale:true`; rebuild runs async
- [x] 503 only on true cold-start
- [x] Cold-start 503 kicks rebuild → follow-up warm
- [ ] p99 < 500ms under load (not unit-testable; design satisfies it)
- [x] No regression in existing tests

**Out of scope (per issue):** 5-min TTL constant, `BuildFromStore` perf,
`/api/analytics/topology`, persist-lock contention.

No new deps.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: corescope-bot <bot@corescope.dev>
2026-05-15 22:46:28 -07:00
efiten 11d2026bb1 feat(startup): hot startup — load hotStartupHours synchronously, fill retentionHours in background (#1187)
Closes #1183

## Summary

- Adds `packetStore.hotStartupHours` config key (float64, default 0 =
disabled). When set, `Load()` loads only that many hours of data
synchronously, reducing startup time on large DBs.
- A background goroutine (`loadBackgroundChunks`) fills the remaining
`retentionHours` window in daily chunks after startup completes.
Analytics indexes are rebuilt once at the end.
- `QueryPackets` and `QueryGroupedPackets` check `oldestLoaded` and fall
back to `db.QueryPackets()` for any query whose `Since`/`Until` predates
the in-memory window — covering days 8–30 permanently (beyond
`retentionHours`) and the background-fill gap during startup.
- `/api/perf` gains `hotStartupHours`, `backgroundLoadComplete`, and
`backgroundLoadProgress` fields inside `packetStore` so operators can
monitor the fill.

### Drive-by fixes

- E2E: added `gotoPackets` navigation helper used across packet-related
tests
- E2E: rewrote stripe assertion to check per-row stripe parity rather
than a fragile computed-style comparison
- E2E: theme test updated to use `#/home` as the initial route (was
`#/`)
- `db.go`: removed the RFC3339→unix-timestamp subquery path in
`buildTransmissionWhere`; `t.first_seen` is now always compared directly
as a string for both RFC3339 and non-RFC3339 inputs

## Configuration

```json
"packetStore": {
  "retentionHours": 168,
  "hotStartupHours": 24
}
```

`hotStartupHours: 0` (default) disables the feature and preserves the
legacy behavior exactly: the full `retentionHours` window is loaded at
startup. A non-zero value is recommended for large DBs to reduce
startup time.

## Test plan

- [x] `TestHotStartupConfig_Clamp` — clamping when `hotStartupHours >
retentionHours`
- [x] `TestHotStartupConfig_ZeroIsDisabled` — zero leaves feature
disabled
- [x] `TestHotStartup_LoadsOnlyHotWindow` — only hot-window packets in
memory after `Load()`
- [x] `TestHotStartup_DisabledWhenZero` — all retention packets loaded
when disabled
- [x] `TestHotStartup_loadChunk_AddsOlderData` — chunk merges correctly,
ASC order maintained
- [x] `TestHotStartup_BackgroundFillsToRetention` — background goroutine
fills to `retentionHours`
- [x] `TestHotStartup_ChunkErrorRecovery` — chunk SQL failure logged and
skipped, loop terminates
- [x] `TestHotStartup_SQLFallback_TriggeredForOldDate` — query before
`oldestLoaded` routes to SQL
- [x] `TestHotStartup_SQLFallback_NotTriggeredForRecentDate` — recent
query stays in-memory
- [x] `TestHotStartup_PerfStats` — new fields present in
`GetPerfStoreStats()` (backs the perf endpoint)
- [x] `TestHotStartup_PerfStoreHTTP` — HTTP-level: GET /api/perf returns
`hotStartupHours`, `backgroundLoadComplete`, `backgroundLoadProgress` in
`packetStore`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: openclaw-bot <bot@openclaw.local>
Co-authored-by: CoreScope Bot <bot@corescope.local>
2026-05-15 22:46:25 -07:00
Kpa-clawbot 85e97d2f37 fix(#1211): bounds-check path length to prevent slice [218:15] panic in MQTT decode (#1214)
**RED commit:** `65d9f57b` (CI run will appear at
https://github.com/Kpa-clawbot/CoreScope/actions after PR opens)

Fixes #1211

## Root cause

`decodePath()` returns `bytesConsumed = hash_size * hash_count` where
both come straight from the wire-supplied `pathByte` (upper 2 bits →
`hash_size`, lower 6 bits → `hash_count`). Max claimable: 4 × 63 = 252
bytes.

A malformed packet on the wire claimed `pathByte=0xF6` (hash_size=4,
hash_count=54 → 216 path bytes) inside a 15-byte buffer. The inner
hop-extraction loop in `decodePath` did break early on overflow — but
`bytesConsumed` was still returned at face value (216). `DecodePacket`
then did `offset += 216` (offset=218) and `payloadBuf := buf[offset:]`
panicked with the prod-observed signature:

```
runtime error: slice bounds out of range [218:15]
```

The handler-level `defer/recover` at `cmd/ingestor/main.go:258-263`
caught it, but the message was silently dropped with no usable
diagnostic.

## Fix

Add an `if offset > len(buf)` guard at BOTH decoder sites (same pattern,
same panic potential):

- `cmd/ingestor/decoder.go` — DecodePacket after decodePath
- `cmd/server/decoder.go` — DecodePacket after decodePath

Return a descriptive error citing the claimed length and pathByte hex so
operators can reproduce.

Also: `cmd/ingestor/main.go` decode-error log now includes `topic`,
`observer`, and `rawHexLen` so future malformed packets are reproducible
without needing to attach a debugger.
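
A simplified Go sketch of the guard, not the actual `DecodePacket`. The function names are hypothetical, and two layout details are inferred from the numbers above rather than stated: `hash_size` is the upper two bits plus one (so `0xF6` gives 4), and the path starts at offset 2 (so 2 + 216 = 218).

```go
package main

import (
	"errors"
	"fmt"
)

// pathLen splits the wire-supplied path byte into hash size and hash count.
// The +1 on the size is inferred from 0xF6 decoding to hash_size=4.
func pathLen(pathByte byte) (hashSize, hashCount int) {
	hashSize = int(pathByte>>6) + 1  // upper 2 bits
	hashCount = int(pathByte & 0x3F) // lower 6 bits
	return hashSize, hashCount
}

// payloadAfterPath returns the bytes following the path, or an error when the
// claimed path length runs past the end of the buffer.
func payloadAfterPath(buf []byte) ([]byte, error) {
	if len(buf) < 2 {
		return nil, errors.New("packet too short for header and path byte")
	}
	size, count := pathLen(buf[1])
	offset := 2 + size*count
	if offset > len(buf) {
		return nil, fmt.Errorf("path claims %d bytes (pathByte=0x%02X) but only %d remain",
			size*count, buf[1], len(buf)-2)
	}
	return buf[offset:], nil
}

func main() {
	// Same shape as the malformed packet described above: 0xF6 in a 15-byte buffer.
	bad := make([]byte, 15)
	bad[1] = 0xF6
	if _, err := payloadAfterPath(bad); err != nil {
		fmt.Println("rejected:", err)
	}

	// A well-formed packet: one 1-byte hop followed by one payload byte.
	good := []byte{0x00, 0x01, 0xAA, 0xBB}
	payload, err := payloadAfterPath(good)
	fmt.Println(payload, err)
}
```

Checking `offset` against `len(buf)` before slicing turns the panic into an ordinary error that carries the claimed length and the path byte.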

## Tests (TDD red → green)

Both packages got two new tests:

- **`TestDecodePacketBoundsFromWire_Issue1211`** — feeds the exact wire
shape from the prod log (`pathByte=0xF6` inside a 15-byte buf). Asserts
`DecodePacket` does NOT panic and returns an error.
- **`TestDecodePacketFuzzTruncated_Issue1211`** — sweeps every `(header,
pathByte)` combination with tails 0..19 bytes (≈1.3M inputs). Asserts
zero panics.

### Red commit proof

On commit `65d9f57b` (RED), both tests fail with the panic:
```
=== RUN   TestDecodePacketBoundsFromWire_Issue1211
    decoder_test.go:1996: DecodePacket panicked on malformed input: runtime error: slice bounds out of range [218:15]
--- FAIL: TestDecodePacketBoundsFromWire_Issue1211 (0.00s)
=== RUN   TestDecodePacketFuzzTruncated_Issue1211
    decoder_test.go:2010: DecodePacket panicked during fuzz: runtime error: slice bounds out of range [3:2]
--- FAIL: TestDecodePacketFuzzTruncated_Issue1211 (0.01s)
```

On commit `7a6ae52c` (GREEN), full suites pass:
- `cmd/ingestor`: `ok 53.988s`
- `cmd/server`:   `ok 29.456s`

## Acceptance criteria

- [x] Identify the slice op producing `[218:15]` — `payloadBuf :=
buf[offset:]` in `DecodePacket` (decoder.go), where `offset` had been
advanced by an unchecked `bytesConsumed` from `decodePath()`.
- [x] Bounds check added at the identified site(s) — both ingestor and
server decoders.
- [x] Test with crafted payload (length-field > remaining buffer) —
`TestDecodePacketBoundsFromWire_Issue1211`.
- [x] Log topic, observer ID, payload byte length on drop — updated
`MQTT [%s] decode error` log line.
- [x] Existing tests stay green — confirmed both packages.

## Out of scope

Reconnect-after-disconnect (#1212) — handled by a separate subagent.
This PR touches NO reconnect logic.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope>
2026-05-15 22:34:21 -07:00
Kpa-clawbot dbb013a6bf test(#1201): regression coverage for hop disambiguator tier-1 + end-to-end top-hops fixture (#1202)
Mutation test confirmed: reverting cmd/server/store.go:2975
(`setContext(buildHopContextPubkeys(tx, pm))` → `setContext(nil)`) in
`buildDistanceIndex` produces failing assertion in
`TestTopHopsRespectsContextAcrossAllCallSites`: top-hops ranking flips
to `72dddd→8acccc@13.0km` (Berlin↔Berlin misresolution), CA↔CA pair
absent. After reverting the mutation, the test passes again.

Fixes #1201

## Summary
Pure test addition. No production code changed. Adds regression coverage
for the hop disambiguator's tier-1 (neighbor affinity) path and an
end-to-end fixture that catches revert-to-nil-context regressions across
all 9 call sites of `pm.resolveWithContext`.

## Sub-tasks (all 4 landed)

1. **Tier-1 explicit** — `hop_disambig_tier1_test.go`:
   - `Tier1_StrongAffinityPicksX` (strong-X edge wins)
   - `Tier1_StrongAffinityPicksY` (reverse weights — proves score is read)
   - `Tier1_AmbiguousEdgeSkipsToTier2` (`Ambiguous=true` → skip)
2. **Tier ordering** — `Tier1_BeatsTier2WhenBothSignal` (tier 1 wins
   when both signal)
3. **Tier-1 fallback** —
   - `Tier1_EmptyGraphFallsThrough` (graph has no edges for context)
   - `Tier1_NilGraphFallsThrough` (graph is nil)
   - `Tier1_ScoresTooCloseFallsThrough` (best < `affinityConfidenceRatio` ×
     runner-up)
4. **End-to-end fixture** — `hop_disambig_e2e_test.go`:
   - 9 nodes with intentional prefix collisions across SLO/LA/NYC/Berlin
     (prefix `72`) and SF/CA/Berlin (prefix `8a`); Berlin candidates have
     `obsCount=200` so they'd win tier-3 absent context.
   - 50 transmissions path `["72","8a"]`, sender + observer in CA.
   - Affinity graph seeded with strong `sender↔72aa` and `sender↔8aaa`
     edges.
   - Asserts: CA↔CA hop present, no Berlin pubkeys in `distHops`, max
     distance < 300 km cap.

## TDD exemption
Net-new regression-sentinel tests for behavior already correct on master
post-#1198. Each test passed on first run (no production bug surfaced).
The mutation test on sub-task 4 is the gating proof: forcing
`setContext(nil)` at `store.go:2975` makes the test fail with the exact
misresolution class the issue describes (Berlin↔Berlin leaks into
top-hops).

## Acceptance criteria
- [x] Tier-1 affinity test added with 3 cases
- [x] Tier-ordering test added
- [x] Tier-1 fallback tests added (nil / empty / scores-too-close)
- [x] End-to-end fixture added with multi-candidate-prefix nodes
- [x] End-to-end fixture fails if any call site reverts to `nil` context
(mutation-verified)
- [x] Test files live in `cmd/server/` alongside
`prefix_map_role_test.go`

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-15 20:24:55 -07:00
Kpa-clawbot 2beeb2b324 fix(#1199): 6 deferred quality items from PR #1198 r2 review (#1200)
Red commit: 75563ce (CI run: pending — pushed at branch open)

Follows up PR #1198 round-2 adversarial review (issue #1199). Six
robustness / perf-hot-path / maintenance items, one commit per logical
change. Stacked on top of `fix/issue-1197` (PR #1198) — base must move
to `master` after #1198 merges.

| # | Item | Commit(s) | Discipline |
|---|---|---|---|
| 1 | Brittle static-grep regex → go/parser AST walk in `resolve_context_callsites_test.go` | 33d80b6 (RED) → 450236d (GREEN) | red→green |
| 2 | `computeAnalyticsTopology` double-pass filter → materialize `filteredTxs` once | 00005f6 | refactor |
| 3 | `BenchmarkBuildAggregateHopContextPubkeys` baseline + tiny smoke test | b520048 | net-new bench/test |
| 4 | `hopResolverPerTx` CONCURRENCY doc — single-goroutine invariant | 155ff07 | doc-only |
| 5 | `schemaDegradationLogged` package-level `sync.Map` → PacketStore field | 75563ce (RED) → 7dbf193 (GREEN) | red→green |
| 6 | `buildHopContextPubkeys` `out` slice cap hint (`make([]string, 0, 16)`) | 2040962 | refactor |

Items 2 & 6 are pure refactors, so no test files were modified for them
(per AGENTS.md exemption rule). Existing tests stay green and unaltered.

Item 4 is doc-only (CONCURRENCY: comment); no behavior change.

Item 3 adds a bench + a smoke assertion for the aggregate helper that
previously had no coverage. Local arm64 baseline: ~72ms/op, 130k allocs
at 5k txs.

Items 1 & 5 follow red→green: 33d80b6 demonstrates the regex blindspot
via a synthetic AST-detectable input the regex misses; 75563ce
demonstrates per-store log dedup leaks across instances. Both flips
visible in branch history.

Full `go test ./cmd/server/...` runs clean post-amend.

Fixes #1199

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
2026-05-15 16:21:14 +00:00
Kpa-clawbot 353c5264ad fix(#1197): plumb hop-context + observation-count tiebreak to disambiguator (#1198)
Red commit: 5ffdf6b07c (CI run: pending —
see PR Checks tab)

Fixes #1197

## What this changes

Two-part fix matching the issue spec:

1. **Tier-3/4 tiebreak by observation count, not slice order**
   (`store.go` resolver + `getAllNodes`).
   - Plumbs `nodes.advert_count` → new `nodeInfo.ObservationCount` field
     via the existing `getAllNodes` query (graceful fallback when the column
     is absent on legacy DBs).
   - `resolveWithContext` tier 3 (GPS preference) now picks the GPS-having
     candidate with the highest observation count.
   - Tier 4 (no-GPS fallback) likewise picks by observation count instead
     of `candidates[0]`.
2. **Plumb hop-context to the resolver** at all four call sites called
   out in the issue.
   - New `buildHopContextPubkeys(tx, pm)` collects: sender pubkey from
     `tx.DecodedJSON.pubKey`, observer pubkey from `tx.ObserverID`, plus
     unambiguous-prefix anchors (single-candidate prefixes in the path).
   - Wired into the four sites: broadcast distance compute (~1707),
     recompute-on-path-change (~2944), `buildDistanceIndex` (~2982),
     `computeAnalyticsTopology` (~5125).
   - Per-tx hop caches were moved inside the per-tx loop on the distance
     paths since context now varies per tx (was safely shared before only
     because every caller passed `nil`).
   - `computeAnalyticsTopology` aggregates context across the analytics
     scan rather than per-tx because `resolveHop` is called outside the scan
     loop downstream.
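
The tier-3/4 tiebreak from part 1 can be sketched in Go as follows. The `nodeInfo` struct here is a hypothetical subset of the real type, `pickByObservationCount` is an illustrative name, and keeping the earlier candidate on an exact tie is an assumption.

```go
package main

import "fmt"

// nodeInfo is a hypothetical subset of the fields the resolver looks at.
type nodeInfo struct {
	PubKey           string
	HasGPS           bool
	ObservationCount int
}

// pickByObservationCount prefers candidates with GPS (tier 3) and, within the
// same GPS class, the one with the highest observation count. With no GPS
// candidates it degrades to highest observation count overall (tier 4).
// On an exact tie the earlier candidate is kept (an assumption).
func pickByObservationCount(cands []nodeInfo) (nodeInfo, bool) {
	best := -1
	for i, c := range cands {
		if best == -1 {
			best = i
			continue
		}
		b := cands[best]
		switch {
		case c.HasGPS && !b.HasGPS:
			best = i
		case c.HasGPS == b.HasGPS && c.ObservationCount > b.ObservationCount:
			best = i
		}
	}
	if best == -1 {
		return nodeInfo{}, false
	}
	return cands[best], true
}

func main() {
	cands := []nodeInfo{
		{PubKey: "8aaa", HasGPS: true, ObservationCount: 5},
		{PubKey: "8abb", HasGPS: true, ObservationCount: 200},
		{PubKey: "8acc", HasGPS: false, ObservationCount: 999},
	}
	winner, ok := pickByObservationCount(cands)
	fmt.Println(winner.PubKey, ok)
}
```

Unlike `candidates[0]`, the result no longer depends on the order in which candidates happen to be loaded, except on exact ties.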

## Tests

Red→green pairs visible in the commit history:

- Pair A — tier-3 observation-count tiebreak
(`TestResolveWithContext_Tier3_PicksHigherObservationCount`).
- Pair B — context plumbing
(`TestBuildHopContextPubkeys_IncludesSenderAndUnambiguousAnchors`) +
tier-2 geo-proximity
(`TestResolveWithContext_Tier2_PicksGeographicallyCloserCandidate`).

`go test ./...` green on `cmd/server`.

## Out of scope (per issue)

300 km hop cap, API confidence/alternative-count surfacing, firmware
prefix-collision space — all explicitly excluded in #1197.

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: Kpa-clawbot <bot@kpa-clawbot.local>
2026-05-15 09:16:39 -07:00
Kpa-clawbot f4cf2acbc0 perf: cancelled writes + ingestor I/O + threshold tests (#1120 follow-up) (#1167)
Red commit: e964ec9c46 (CI run: pending —
workflow only triggers on PR open)

Partial fix for #1120 — finishes the four follow-up items left open
after PR #1123 (cancelled writes, ingestor I/O, threshold-flag tests,
docs).

## What's done

- **`cancelledWriteBytesPerSec`** — server `/proc/self/io` parser
handles `cancelled_write_bytes`; `/api/perf/io` exposes the per-second
rate; Perf page renders it next to Read/Write with ⚠️ when sustained >1
MB/s.
- **Ingestor `/proc/<pid>/io`** — `cmd/ingestor/stats_file.go` samples
its own `/proc/self/io` each tick and includes `procIO` in the snapshot.
The server's `/api/perf/io` reads it and surfaces `.ingestor`. Frontend
renders an `Ingestor process` Disk I/O block alongside the existing
`server process` block (issue mockup: "Both ingestor and server").
- **Threshold + anomaly tests** — `test-perf-disk-io-1120.js` now
asserts ⚠️ fires/suppresses on WAL>100MB, cache_hit<90%, and the
backfill-rate-vs-tx-rate guard with the `tx_inserted >= 100` baseline
floor. Drops the tautological `|| ... === false` short-circuits flagged
in MINOR m4.
- **Docs (m8)** — `config.example.json` adds `_comment_ingestorStats`
(env var, default path, shared-tmp security note);
`cmd/ingestor/README.md` adds `CORESCOPE_INGESTOR_STATS` to the env-var
table plus a `Stats file` section.

## What's NOT done (deferred)

m1 sync.Map → map+RWMutex, m2 perfIOMu rate caching, m3 negative
cacheSize translation, m5 deterministic-write test, m7 ctx-aware
shutdown — pure polish; will file a follow-up issue if the operator
wants them tracked.

## TDD

- Red: `e964ec9` — adds failing tests + stub field/handler shape
(cancelled missing from struct, ingestor stub returns nil, ingestor
procIO absent).
- Green: `1240703` — wires up the parser case, ingestor sampler,
frontend rendering, docs.

E2E assertion added: test-perf-disk-io-1120.js:108

---------

Co-authored-by: clawbot <clawbot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <bot@kpa-clawbot.local>
Co-authored-by: Kpa-clawbot <bot@kpa-clawbot>
2026-05-08 16:29:23 -07:00
Kpa-clawbot fb744d895f fix(#1143): structural pubkey attribution via from_pubkey column (#1152)
Fixes #1143.

## Summary

Replaces the structurally unsound `decoded_json LIKE '%pubkey%'` (and
`OR LIKE '%name%'`) attribution path with an exact-match lookup on a
dedicated, indexed `transmissions.from_pubkey` column.

This closes both holes documented in #1143:
- **Hole 1** — same-name false positives via `OR LIKE '%name%'`
- **Hole 2a** — adversarial spoofing: a malicious node names itself with
another node's pubkey and gets attributed to the victim
- **Hole 2b** — accidental false positive when any free-text field (path
elements, channel names, message bodies) contains a 64-char hex
substring matching a real pubkey
- **Perf** — query now uses an index instead of a full-table scan
against `LIKE '%substring%'`

## TDD

Two-commit history shows red-then-green:

| Commit | Status | Purpose |
|---|---|---|
| `7f0f08e` | RED — tests assertion-fail on master behaviour | Adversarial fixtures + spec |
| `59327db` | GREEN — schema + ingestor + server + migration | Implementation |

The red commit's test schema includes the new column so the file
compiles, but the production code still uses LIKE — the assertions fail
because the malicious / same-name / free-text rows are returned. The
green commit changes the query plus adds the migration/ingest path.

## Changes

### Schema
- new column `transmissions.from_pubkey TEXT`
- new index `idx_transmissions_from_pubkey`

### Ingestor (`cmd/ingestor/`)
- `PacketData.FromPubkey` populated from decoded ADVERT `pubKey` at
write time. Cheap — already parsing `decoded_json`. Non-ADVERTs stay
NULL.
- `stmtInsertTransmission` writes the column.
- Migration `from_pubkey_v1` ALTERs legacy DBs to add the column +
index.
- Bonus: rewrote the recipe in the gated one-shot
`advert_count_unique_v1` migration to use `from_pubkey` (already marked
done on existing DBs; kept correct for fresh installs).

### Server (`cmd/server/`)
- `ensureFromPubkeyColumn` mirrors the ingestor migration so the server
can boot against a DB the ingestor has never touched (e2e fixture, fresh
installs).
- `backfillFromPubkeyAsync` runs **after** HTTP starts. Scans `WHERE
from_pubkey IS NULL AND payload_type = 4` in 5000-row chunks with a
100ms yield between chunks. Cannot block boot even on prod-sized DBs
(100K+ transmissions). Queries handle NULL gracefully (return empty for
that pubkey, same as today's unknown-pubkey path).
- All in-scope LIKE call sites switched to exact match:

| Site | Before | After |
|---|---|---|
| `buildPacketWhere` (was db.go:582) | `decoded_json LIKE '%pubkey%'` | `from_pubkey = ?` |
| `buildTransmissionWhere` (was db.go:626) | `t.decoded_json LIKE '%pubkey%'` | `t.from_pubkey = ?` |
| `GetRecentTransmissionsForNode` (was db.go:910) | `LIKE '%pubkey%' OR LIKE '%name%'` | `t.from_pubkey = ?` |
| `QueryMultiNodePackets` (was db.go:1785) | `decoded_json LIKE '%pubkey%' OR ...` | `t.from_pubkey IN (?, ?, ...)` |
| `advert_count_unique_v1` (was ingestor/db.go:257) | `decoded_json LIKE '%' \|\| nodes.public_key \|\| '%'` | `t.from_pubkey = nodes.public_key` |

`GetRecentTransmissionsForNode` signature simplifies: the `name`
parameter is gone (it was only ever used for the legacy `OR LIKE
'%name%'` fallback). Sole caller in `routes.go:1243` updated.

### Tests
- `cmd/server/from_pubkey_attribution_test.go` — adversarial fixtures +
Hole 1/2a/2b/QueryMultiNodePackets exact-match assertions, EXPLAIN QUERY
PLAN index check, migration backfill correctness.
- `cmd/ingestor/from_pubkey_test.go` — write-time correctness
(BuildPacketData populates FromPubkey for ADVERT only;
InsertTransmission persists it; non-ADVERTs stay NULL).
- Existing test schemas (server v2, server v3, coverage) get the new
column **plus a SQLite trigger** that auto-populates `from_pubkey` from
`decoded_json` on ADVERT inserts. This means existing fixtures (which
only seed `decoded_json`) keep attributing correctly without per-test
edits.
- `seedTestData`'s ADVERTs explicitly set `from_pubkey`.

## Performance — index is used

```
$ EXPLAIN QUERY PLAN SELECT id FROM transmissions WHERE from_pubkey = ?
SEARCH transmissions USING INDEX idx_transmissions_from_pubkey (from_pubkey=?)
```

Asserted in `TestFromPubkeyIndexUsed`.

## Migration approach

- **Sync at boot**: `ALTER TABLE transmissions ADD COLUMN from_pubkey
TEXT` is a metadata-only operation in SQLite — microseconds regardless
of table size. `CREATE INDEX IF NOT EXISTS
idx_transmissions_from_pubkey` is **not** metadata-only: it scans the
table once. Empirically a few hundred ms on a 100K-row table; expect a
few seconds on a 10M-row table (one-time cost, blocking boot during that
window). Subsequent boots no-op via `IF NOT EXISTS`. If this boot delay
becomes an operational concern at prod scale we can defer the `CREATE
INDEX` to a goroutine — for now a few-second one-time delay is
acceptable.
- **Async**: row-level backfill of legacy NULL ADVERTs (chunked 5000 /
100ms yield). On a 100K-ADVERT prod DB, this completes in seconds in the
background; HTTP is fully available throughout.
- **Safety**: queries handle NULL gracefully — a node whose ADVERTs
haven't backfilled yet returns empty, identical to today's behaviour for
unknown pubkeys. No half-state regression.
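
The chunk-and-yield shape of the async backfill can be sketched as follows. This is an illustration only: `backfillChunked` and its `apply` callback are hypothetical names, and the slice of IDs stands in for the `WHERE from_pubkey IS NULL AND payload_type = 4` selection rather than reproducing the real query loop.

```go
package main

import (
	"fmt"
	"time"
)

// backfillChunked processes pending IDs in fixed-size chunks and sleeps
// between chunks so other work can use the database. apply is a hypothetical
// callback standing in for the per-chunk UPDATE. It returns the chunk count.
func backfillChunked(pending []int, chunkSize int, yield time.Duration, apply func([]int)) int {
	if chunkSize <= 0 {
		chunkSize = 1 // guard against a zero or negative chunk size
	}
	chunks := 0
	for len(pending) > 0 {
		n := chunkSize
		if n > len(pending) {
			n = len(pending)
		}
		apply(pending[:n])
		pending = pending[n:]
		chunks++
		if len(pending) > 0 {
			time.Sleep(yield) // yield only when more work remains
		}
	}
	return chunks
}

func main() {
	pending := make([]int, 12000)
	done := 0
	chunks := backfillChunked(pending, 5000, 10*time.Millisecond, func(c []int) {
		done += len(c)
	})
	fmt.Println("chunks:", chunks, "rows:", done)
}
```

With 12000 pending rows and a chunk size of 5000 the sketch runs three chunks; the yield between chunks is what keeps a long backfill from monopolising the connection.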

## Out of scope (intentionally)

The free-text `LIKE` paths the issue explicitly leaves alone (e.g.
user-typed packet search) are untouched. Only the pubkey-attribution
sites get the column treatment.



## Cycle-3 review fixes

| Finding | Status | Commit |
|---|---|---|
| **M1c**: async-contract test was tautological (test's own `go`, not production's) | Fixed | `23ace71` (red) → `a05b50c` (green) |
| **m1c**: package-global atomic resets unsafe under `t.Parallel()` | Fixed (`// DO NOT t.Parallel` comment + `Reset()` helper) | rolled into `23ace71` / `241ec69` |
| **m2c**: `/api/healthz` read 3 atomics non-atomically (torn snapshot) | Fixed (single RWMutex-guarded snapshot + race test) | `241ec69` |
| **n3c.m1**: vestigial OR-scaffolding in `QueryMultiNodePackets` | Fixed (cleanup) | `5a53ceb` |
| **n3c.m2**: verify PR body language about `ALTER` vs `CREATE INDEX` | Verified accurate (already corrected in cycle 2) | (no change) |
| **n3c.m3**: `json.Unmarshal` per row in backfill → could use SQL `json_extract` | **Deferred as known followup**: pure perf optimization (current per-row Unmarshal is correct, just slower); SQL rewrite would unwind the chunked-yield architecture and is non-trivial. Acceptable for one-time backfill at boot on legacy DBs. |

### M1c implementation detail

`startFromPubkeyBackfill(dbPath, chunkSize, yieldDuration)` is now the
single production entry point used by `main.go`. It internally does `go
backfillFromPubkeyAsync(...)`. The test calls `startFromPubkeyBackfill`
(no `go` prefix) and asserts the dispatch returns within 50ms — so if
anyone removes the `go` keyword inside the wrapper, the test fails.
**Manually verified**: removing the `go` keyword causes
`TestBackfillFromPubkey_DoesNotBlockBoot` to fail with "backfill
dispatch took ~1s (>50ms): not async — would block boot."
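
The wrapper-owns-the-`go` pattern can be sketched as below. This is not the project's code: `startBackfill`, `slowWork`, and `dispatchIsAsync` are hypothetical names, and the 200ms sleep stands in for a long backfill. The point is that the timing check goes through the same entry point production uses, so removing the `go` inside the wrapper makes the check fail.

```go
package main

import (
	"fmt"
	"time"
)

// startBackfill is a hypothetical wrapper that owns the `go` keyword, so
// callers and tests share one dispatch path.
func startBackfill(work func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		work()
	}()
	return done
}

// slowWork stands in for a backfill that takes far longer than the budget.
func slowWork() { time.Sleep(200 * time.Millisecond) }

// dispatchIsAsync reports whether the wrapper returned within budget, and
// hands back the completion channel so the caller can wait for the work.
func dispatchIsAsync(work func(), budget time.Duration) (bool, <-chan struct{}) {
	start := time.Now()
	done := startBackfill(work)
	return time.Since(start) < budget, done
}

func main() {
	async, done := dispatchIsAsync(slowWork, 50*time.Millisecond)
	fmt.Println("dispatch returned within budget:", async)
	<-done // wait for the background work before exiting
}
```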

### m2c implementation detail

`fromPubkeyBackfillTotal/Processed/Done` are now plain `int64`/`bool`
package globals guarded by a single `sync.RWMutex`.
`fromPubkeyBackfillSnapshot()` returns all three under one RLock.
`TestHealthzFromPubkeyBackfillConsistentSnapshot` races a writer
(lock-step total/processed updates with periodic done flips) against 8
readers hammering `/api/healthz`, asserting `processed<=total` and
`(done => processed==total)` on every response. Verified the test
catches torn reads (manually injected a 3-RLock implementation; test
failed within milliseconds with "processed>total" and "done=true but
processed!=total" errors).
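
A minimal sketch of the single-lock snapshot follows, assuming hypothetical type and method names (`progress`, `Update`, `Snapshot`); the real globals and the HTTP handler are not reproduced. It races one writer against several readers and counts invariant violations, mirroring the shape of the race test described above.

```go
package main

import (
	"fmt"
	"sync"
)

// progress keeps all three fields behind one lock so a reader can never
// observe a mix of old and new values.
type progress struct {
	mu        sync.RWMutex
	total     int64
	processed int64
	done      bool
}

// progressSnapshot is a consistent copy of the three fields.
type progressSnapshot struct {
	Total, Processed int64
	Done             bool
}

func (p *progress) Update(total, processed int64, done bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.total, p.processed, p.done = total, processed, done
}

func (p *progress) Snapshot() progressSnapshot {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return progressSnapshot{p.total, p.processed, p.done}
}

// countViolations runs one writer against several readers and counts
// snapshots that break processed<=total or (done => processed==total).
func countViolations(readers, iterations int) int {
	var p progress
	var wg sync.WaitGroup
	var mu sync.Mutex
	violations := 0
	stop := make(chan struct{})
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
				}
				s := p.Snapshot()
				if s.Processed > s.Total || (s.Done && s.Processed != s.Total) {
					mu.Lock()
					violations++
					mu.Unlock()
				}
			}
		}()
	}
	for i := 1; i <= iterations; i++ {
		n := int64(i)
		p.Update(n, n, i%10 == 0) // lock-step update with periodic done flips
	}
	close(stop)
	wg.Wait()
	return violations
}

func main() {
	fmt.Println("violations:", countViolations(8, 10000))
}
```

Reading each field under its own `RLock` would let a writer slip in between reads, which is the torn-read case the manual injection demonstrated.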

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
Co-authored-by: openclaw-bot <bot@openclaw.dev>
2026-05-06 23:50:44 -07:00
Kpa-clawbot 5a5df5d92b revert: group commit M1 (#1117) — starves MQTT, refs #1129 (#1130)
## Why

Diagnostic on #1129 shows PR #1117 (group commit M1 for #1115) is
fundamentally broken: it starves the MQTT goroutine via `gcMu` lock
contention, causing pingresp disconnects and lost packets at modest
ingest rates.

## Three structural defects

1. **Lock held across `sql.Stmt.Exec`** — every concurrent
`InsertTransmission` blocks for the full SQLite write latency, not just
the brief queue mutation.
2. **Lock held across `tx.Commit`** — the WAL fsync runs *under* `gcMu`,
so any backlog blocks all ingest writers AND the flusher ticker,
snowballing under load.
3. **Single-conn DB** (`MaxOpenConns=1`) — the flusher and the ingest
path serialise on one connection, turning the lock into a global ingest
stall.

Net effect: at modest packet rates the MQTT client loop misses its own
pingresp deadline, the broker drops the connection, and packets received
during the stall are lost.

## What this PR removes

- `Store.SetGroupCommit`, `Store.FlushGroupTx`, `Store.flushLocked`,
`Store.GroupCommitMs`
- `gcMu`, `activeTx`, `pendingRows`, `groupCommitMs`,
`groupCommitMaxRows` Store fields
- `groupCommitMs` / `groupCommitMaxRows` config fields and
`GroupCommitMsOrDefault` / `GroupCommitMaxRowsOrDefault` accessors
- The flusher goroutine in `cmd/ingestor/main.go`
- `cmd/ingestor/group_commit_test.go`
- The `if s.activeTx != nil { … pendingRows … }` branch in
`InsertTransmission` — reverts to plain prepared-stmt usage

## What this PR keeps (merged after #1117)

- #1119 `BackfillPathJSON` `path_json='[]'` fix
- #1120/#1123 perf metrics endpoints — `WALCommits` counter retained
- `GroupCommitFlushes` JSON field on `/api/perf/write-sources` is kept
as always-0 for API stability (server `perf_io.go` references it as a
string field name; no client breakage)
- `DBStats.GroupCommitFlushes` atomic field is removed from the Go
struct

## Tests

`cd cmd/ingestor && go test ./... -run "Test"` → `ok` (47.8s).
`cd cmd/server && go build ./...` → clean.

## #1115 stays open

The group-commit *idea* is sound — batching observation INSERTs would
meaningfully reduce WAL fsync rate. But it needs a redesign that does
**not** hold a mutex across blocking SQLite calls. Suggested directions
for a future M1:
- Channel-fed writer goroutine (single owner of the tx, ingest path is
non-blocking enqueue)
- Per-batch DB handle so the flusher doesn't serialise the ingest
connection
- Bounded queue with backpressure rather than a shared lock
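
One possible shape for the first and third directions is sketched below. It is an assumption-laden illustration, not a proposed implementation: `batchWriter`, `tryEnqueue`, and `runDemo` are hypothetical names, rows are plain strings, and `commit` stands in for the `BEGIN`/`COMMIT` work. The writer goroutine is the only owner of the batch, so no mutex is held across the blocking commit.

```go
package main

import (
	"fmt"
	"time"
)

// batchWriter is the single owner of the pending batch. Producers only
// enqueue; nothing is locked while commit runs.
func batchWriter(in <-chan string, maxRows int, window time.Duration, commit func([]string)) {
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	var batch []string
	flush := func() {
		if len(batch) > 0 {
			commit(batch)
			batch = nil
		}
	}
	for {
		select {
		case row, ok := <-in:
			if !ok {
				flush() // drain remaining rows on shutdown
				return
			}
			batch = append(batch, row)
			if len(batch) >= maxRows {
				flush() // size cap reached
			}
		case <-ticker.C:
			flush() // time window elapsed
		}
	}
}

// tryEnqueue reports a full queue instead of blocking the caller, which is
// one way to surface backpressure to the ingest path.
func tryEnqueue(q chan<- string, row string) bool {
	select {
	case q <- row:
		return true
	default:
		return false
	}
}

// runDemo enqueues n rows and returns how many rows and commits resulted.
func runDemo(n, maxRows int) (rows, commits int) {
	q := make(chan string, n)
	done := make(chan struct{})
	go func() {
		defer close(done)
		batchWriter(q, maxRows, time.Hour, func(b []string) {
			rows += len(b)
			commits++
		})
	}()
	for i := 0; i < n; i++ {
		tryEnqueue(q, "row")
	}
	close(q)
	<-done
	return rows, commits
}

func main() {
	rows, commits := runDemo(7, 3)
	fmt.Println("rows:", rows, "commits:", commits)
}
```

With a cap of 3 and 7 rows the sketch commits batches of 3, 3, and 1. What to do when `tryEnqueue` returns false (drop, retry, or block briefly) is a design decision the sketch deliberately leaves open.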

Refs #1117 #1129
2026-05-05 19:02:43 -07:00
Kpa-clawbot 74dffa2fb7 feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123)
## Summary

Implements per-component disk I/O + write source metrics on the Perf
page so operators can self-diagnose write-volume anomalies (cf. the
BackfillPathJSON loop debugged in #1119) without SSHing in to run
iotop/fatrace.

Partial fix for #1120

## What's done (4/6 ACs)
- `/api/perf/io`: server-process `/proc/self/io` delta rates
(read/write bytes per sec, syscalls)
- `/api/perf/sqlite`: WAL size, page count, page size, cache hit rate
- `/api/perf/write-sources`: per-component counters from ingestor
(tx/obs/upserts/backfill_*)
- Frontend Perf page: three new sections with anomaly thresholds +
per-second rate columns

## What's NOT done (deferred to follow-up)
- `cancelledWriteBytesPerSec` field: issue #1120 lists this under
server-process I/O ("writes the kernel discarded — interesting signal");
not exposed in this PR
- Ingestor `/proc/<pid>/io`: issue #1120 says "Both ingestor and
server"; only server-process I/O lands here. Adding ingestor I/O
requires either a unix socket back to the server, or surfacing the
ingestor pid through the stats file. Doable without changing the
existing API shape.
- Adaptive baselining: anomaly thresholds remain static (10×, 100 MB,
90%); steady-state baselining can come once we have enough deployed
Perf-page telemetry

Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than
"Fixes #1120" so the issue stays open until the remaining ACs land.

## Backend

**Server (`cmd/server/perf_io.go`)**
- `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate
`{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since
last call (in-memory tracker, no allocation per sample).
- `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount,
pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the
in-process row cache (closest available signal under the modernc sqlite
driver).
- `GET /api/perf/write-sources` — reads the ingestor's stats JSON file
and returns a flat `{sources: {...}, sampleAt}` payload.

**Ingestor (`cmd/ingestor/`)**
- `DBStats` gains `WALCommits atomic.Int64` (incremented on every
successful `tx.Commit()` and on every auto-commit `InsertTransmission`
write) and `BackfillUpdates sync.Map` keyed by backfill name with
`IncBackfill(name)` / `SnapshotBackfills()` helpers.
- `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]`
per row write — the BackfillPathJSON-style infinite loop becomes
immediately visible at `backfill_path_json` in the Write Sources table.
- New `StartStatsFileWriter` publishes a JSON snapshot to
`/tmp/corescope-ingestor-stats.json` (override via
`CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The
tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode
`0o600` so a pre-planted symlink in a world-writable `/tmp` cannot
redirect the write to an arbitrary file.
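
The tmp+rename write with `O_NOFOLLOW` can be sketched as follows. This is a minimal illustration under stated assumptions: `writeStatsAtomic`, `demo`, and `symlinkRejected` are hypothetical names, the `.tmp` suffix is an assumption, and `syscall.O_NOFOLLOW` is only defined on Unix-like platforms.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// writeStatsAtomic writes v as JSON to path via a temp file plus rename, so
// readers never see a partially written file. O_NOFOLLOW makes the open fail
// when the temp path is a symlink (Unix-only flag).
func writeStatsAtomic(path string, v any) error {
	data, err := json.Marshal(v)
	if err != nil {
		return err
	}
	tmp := path + ".tmp" // suffix is an assumption for this sketch
	f, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC|syscall.O_NOFOLLOW, 0o600)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, path)
}

// demo writes one snapshot into a fresh temp dir and returns the file content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "stats")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "stats.json")
	if err := writeStatsAtomic(path, map[string]int64{"tx_inserted": 42}); err != nil {
		return "", err
	}
	b, err := os.ReadFile(path)
	return string(b), err
}

// symlinkRejected pre-plants a symlink at the temp path and reports whether
// the write refused to follow it.
func symlinkRejected() bool {
	dir, err := os.MkdirTemp("", "stats")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "stats.json")
	if err := os.Symlink(filepath.Join(dir, "victim"), path+".tmp"); err != nil {
		return false
	}
	return writeStatsAtomic(path, 1) != nil
}

func main() {
	content, err := demo()
	fmt.Println(content, err)
	fmt.Println("symlink rejected:", symlinkRejected())
}
```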

## Frontend (`public/perf.js`)

Three new sections on the Perf page, all auto-refreshed via the existing
5s interval:

- **Disk I/O (server process)** — read/write rates (formatted
B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️.
- **Write Sources** — sorted table of per-component counters with a
per-second rate column derived from snapshot deltas. Backfill rows show
⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the
backfill's per-second rate exceeds 10× the live tx rate. Avoids the
startup-spurious-alarm where cumulative-vs-cumulative was a tautology.
- **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count,
page size, cache hit rate (⚠️ when <90%).
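
The static thresholds above amount to a handful of comparisons. The sketch below restates them in Go purely for illustration (the actual logic lives in `public/perf.js`); the `perfSample` type, its field names, and the warning strings are hypothetical.

```go
package main

import "fmt"

// perfSample is a hypothetical bundle of the values the Perf page inspects.
type perfSample struct {
	TxInserted    int64   // cumulative tx inserts (baseline gate)
	TxRate        float64 // live tx rows per second
	BackfillRate  float64 // backfill rows per second
	WriteMBPerSec float64
	WALSizeMB     float64
	CacheHitPct   float64
}

// backfillAnomalous applies the two-part rule: a meaningful baseline AND a
// backfill rate above 10x the live tx rate.
func backfillAnomalous(s perfSample) bool {
	return s.TxInserted >= 100 && s.BackfillRate > 10*s.TxRate
}

// warnings lists which of the static thresholds the sample trips.
func warnings(s perfSample) []string {
	var out []string
	if s.WriteMBPerSec > 10 {
		out = append(out, "write rate >10 MB/s")
	}
	if backfillAnomalous(s) {
		out = append(out, "backfill rate >10x tx rate")
	}
	if s.WALSizeMB > 100 {
		out = append(out, "WAL >100 MB")
	}
	if s.CacheHitPct < 90 {
		out = append(out, "cache hit rate <90%")
	}
	return out
}

func main() {
	startup := perfSample{TxInserted: 5, TxRate: 1, BackfillRate: 500, CacheHitPct: 99}
	steady := perfSample{TxInserted: 5000, TxRate: 20, BackfillRate: 500, CacheHitPct: 99}
	fmt.Println("startup:", warnings(startup))
	fmt.Println("steady: ", warnings(steady))
}
```

The `TxInserted >= 100` gate is what suppresses the startup alarm: the first sample has a high backfill rate but too little tx history to compare against.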

## Tests

- **Backend** (`cmd/server/perf_io_test.go`) —
`TestPerfIOEndpoint_ReturnsValidJSON`,
`TestPerfSqliteEndpoint_ReturnsValidJSON`,
`TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new
endpoints. Skips the `/proc/self/io` non-zero-rate assertion when
`/proc` is unavailable.
- **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js`
with stubbed `fetch`, asserts the three new sections render with their
headings + values.

E2E assertion added: test-perf-disk-io-1120.js:91

## TDD

1. Red commit (`21abd22`) — added the three handlers as no-op stubs
returning empty values; tests fail on assertion mismatches (non-zero
rate, `pageSize > 0`, headings present).
2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser,
PRAGMA queries, ingestor stats writer, and Perf page rendering.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>
2026-05-05 17:56:56 -07:00
Kpa-clawbot 76d89e6578 fix(ingestor): exclude path_json='[]' rows from backfill WHERE (#1119) (#1121)
## Summary

`BackfillPathJSONAsync` re-selected observations whose `path_json` was
already `'[]'`, rewrote them to `'[]'`, and looped forever. The
`len(batch) == 0` exit condition was never reached, the migration marker
was never recorded, and the ingestor sustained 2–3 MB/s WAL writes at
idle (76% of CPU in `sqlite.Exec` per pprof).

## Fix

Drop `'[]'` from the WHERE clause:

```diff
WHERE o.raw_hex IS NOT NULL AND o.raw_hex != ''
- AND (o.path_json IS NULL OR o.path_json = '' OR o.path_json = '[]')
+ AND (o.path_json IS NULL OR o.path_json = '')
```

`'[]'` is the "already attempted, no hops" sentinel (still written at
line 994 of `cmd/ingestor/db.go` when `DecodePathFromRawHex` returns no
hops). Excluding it from the WHERE lets the loop terminate after one
full pass and the migration marker `backfill_path_json_from_raw_hex_v1`
to be recorded.
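
The termination argument can be checked with a small in-memory model. This is a sketch, not the ingestor's loop: `obs`, `runBackfill`, and the two predicates are hypothetical, an empty string models both NULL and `''`, and every row is assumed to decode to zero hops so it is written back as `'[]'`.

```go
package main

import "fmt"

// obs is a hypothetical stand-in for an observation row.
type obs struct{ PathJSON string }

// buggyNeedsWork mirrors the old WHERE clause, which also selected "[]".
func buggyNeedsWork(o obs) bool { return o.PathJSON == "" || o.PathJSON == "[]" }

// fixedNeedsWork mirrors the new WHERE clause, which skips the sentinel.
func fixedNeedsWork(o obs) bool { return o.PathJSON == "" }

// runBackfill repeatedly selects rows matching needsWork and rewrites them
// to the "[]" sentinel (every row here decodes to zero hops). maxPasses
// keeps the demo from spinning forever.
func runBackfill(rows []obs, needsWork func(obs) bool, maxPasses int) (passes int, terminated bool) {
	for passes < maxPasses {
		batch := 0
		for i := range rows {
			if needsWork(rows[i]) {
				rows[i].PathJSON = "[]"
				batch++
			}
		}
		passes++
		if batch == 0 {
			return passes, true // empty batch: the marker could be recorded here
		}
	}
	return passes, false
}

// seed returns n rows that already carry the sentinel.
func seed(n int) []obs {
	rows := make([]obs, n)
	for i := range rows {
		rows[i].PathJSON = "[]"
	}
	return rows
}

func main() {
	fmt.Println(runBackfill(seed(100), buggyNeedsWork, 5))
	fmt.Println(runBackfill(seed(100), fixedNeedsWork, 5))
}
```

Under the old predicate the batch is never empty, so the loop only stops at the artificial pass cap; under the new predicate the first pass finds nothing to do and exits.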

## TDD

- **Red commit** (`19f8004`):
`TestBackfillPathJSONAsync_BracketRowsTerminate` — seeds 100
observations with `path_json='[]'` and a `raw_hex` that decodes to zero
hops, asserts the migration marker is written within 5s. Fails on master
with *"backfill never recorded migration marker within 5s — infinite
loop on path_json='[]' rows"*.
- **Green commit** (`7019100`): WHERE-clause fix + updates
`TestBackfillPathJsonFromRawHex` row 1 expectation (the pre-seeded
`'[]'` row is now correctly skipped instead of being re-decoded).

## Test results

```
ok  	github.com/corescope/ingestor	49.656s
```

## Acceptance criteria from #1119

- [x] Backfill terminates within 1 polling cycle of having no progress
to make
- [x] Migration marker `backfill_path_json_from_raw_hex_v1` written
after termination
- [x] On restart, backfill recognizes migration done and exits
immediately (existing behavior — the migration check at the top of
`BackfillPathJSONAsync` was always correct; the bug was that the marker
never got written)
- [x] Test: seed DB with N observations all having `path_json = '[]'` →
backfill runs once → no UPDATEs issued, migration marker written
- [ ] Disk write rate on idle staging drops from 2–3 MB/s to <100 KB/s —
to be verified by the user post-deploy

Fixes #1119.

---------

Co-authored-by: OpenClaw Bot <bot@openclaw.local>
2026-05-05 17:35:16 -07:00
Kpa-clawbot 45f2607f75 perf(ingestor): group commit observation INSERTs by time window (M1, refs #1115) (#1117)
## Summary

Implements **M1 from #1115**: batches observation/transmission INSERTs
into a single SQLite `BEGIN/COMMIT` window instead of fsyncing per
packet. At ~250 obs/sec this drops WAL fsync rate from ~20/s to ~1/s and
eliminates the `obs-persist skipped` / `SQLITE_BUSY` log spam that the
issue documents.

This is a **partial fix** — it ships the group-commit mechanism.
Acceptance items 6–7 (measured fsync rate / measured `obs-persist
skipped` rate at staging steady-state) require post-deploy observation,
and M2 (per-`tx_hash` observation buffering) is intentionally deferred.
The issue stays open for the user to verify on staging.

> Partial fix for #1115 — does not auto-close. Refs #1115.

## Mechanism

- `Store` gains an active `*sql.Tx`, `pendingRows` counter, `gcMu`, and
the `groupCommitMs` / `groupCommitMaxRows` knobs. `SetGroupCommit(ms,
maxRows)` enables the mode; `FlushGroupTx()` commits the in-flight tx.
- `InsertTransmission` lazily opens a tx on the first call after each
flush, then issues all writes through `tx.Stmt()` bindings of the
existing prepared statements. With `MaxOpenConns(1)` the connection is
already serialized; `gcMu` serializes group-commit state without
contention.
- A goroutine in `cmd/ingestor/main.go` calls `FlushGroupTx()` every
`groupCommitMs` ms. `pendingRows >= groupCommitMaxRows` triggers an
eager flush. `Close()` flushes before the WAL checkpoint so no rows are
lost on graceful shutdown.
- `groupCommitMs == 0` short-circuits to the legacy per-call auto-commit
path (statements bound to `s.db`, no tx) — current behavior preserved
byte-for-byte for operators who opt out.

## Config

Two new optional fields (ingestor-only), both documented in
`config.example.json`:

| Field | Default | Effect |
|---|---|---|
| `groupCommitMs` | `1000` | Flush window in ms. `0` disables batching (legacy per-packet auto-commit). |
| `groupCommitMaxRows` | `1000` | Safety cap; once `pendingRows` reaches it the queue flushes immediately to bound memory and the crash-loss window. |

No DB schema change. No required config change on upgrade.

## Tests (TDD red → green visible in commits)

`cmd/ingestor/group_commit_test.go` — three assertions, written first as
the red commit:

- `TestGroupCommit_BatchesInsertsIntoOneTx` — 50 `InsertTransmission`
calls inside a wide window produce **0** commits until `FlushGroupTx`,
then exactly **1**; all 50 rows visible after flush. (This is the spec's
"50 observations → 1 SQLite write transaction" assertion.)
- `TestGroupCommit_Disabled` — `groupCommitMs=0` keeps every insert
immediately visible and `GroupCommitFlushes` never advances. (Spec's
"groupCommitMs=0 reverts to per-packet behavior" assertion.)
- `TestGroupCommit_MaxRowsForcesEarlyFlush` — cap=3, 7 inserts → 2
auto-flushes from the cap + 1 final manual flush = 3 total.

Red commit: `e2b0370` (stubs `SetGroupCommit` / `FlushGroupTx` so the
tests compile and fail on **assertions**, not import errors).
Green commit: `73f3559`.

Full ingestor suite (`go test ./...` in `cmd/ingestor`) stays green, ~49
s.

## Performance

This PR is the perf change itself. Local micro-test (the new
`TestGroupCommit_BatchesInsertsIntoOneTx`) shows the structural
property: 50 inserts → 1 commit. The fsync-rate measurement called out
in the M1 acceptance criteria (`~20/s → ~1/s` at 250 obs/sec) requires
staging deployment to confirm — that's the remaining open item that
keeps #1115 open after this merges.

No hot-path regressions: when `groupCommitMs > 0` we acquire one mutex
per insert (uncontended in the steady state — the connection was already
single-threaded via `MaxOpenConns(1)`). When `groupCommitMs == 0` the
code path is identical to before plus one nil-tx check.

## What this PR does NOT do (per spec)

- Does not collapse "30 observations of one packet" into 1 row write —
that's M2.
- Does not eliminate dual-writer contention with `cmd/server`'s
`resolved_path` writes.
- Does not change observation ordering or live broadcast latency.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-05 16:38:43 -07:00
Kpa-clawbot 5fa3b56ccb fix(#662): GetRepeaterRelayInfo also looks up byPathHop by 1-byte prefix (#1086)
## Summary

Partial fix for #662.

`GetRepeaterRelayInfo` was reporting "never observed as relay hop" /
`RelayCount24h=0` for nodes that clearly DO have packets passing through
them — visible on the same node detail page in the "Paths seen through
node" view.

## Root cause

The `byPathHop` index is keyed by **both**:
- full resolved pubkey (populated when neighbor-affinity resolution
succeeds), and
- raw 1-byte hop prefix from the wire (e.g. `"a3"`)

`GetRepeaterRelayInfo` only looked up the full-pubkey key. Many ingested
non-advert packets only carry the raw 1-byte hop — so any repeater whose
path appearances are all raw-hop entries returned 0, even though the
path-listing endpoint (which prefix-matches) renders them.

Example node: an `a3…` repeater on staging shows dozens of paths through
it in the UI, but the relay-info function returns 0.

## Fix

Look up under both keys (full pubkey + 1-byte prefix) and de-dup by tx
ID before counting.
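
A minimal sketch of the two-key lookup with de-duplication, assuming a simplified `tx` type and a plain map for `byPathHop` (both hypothetical shapes); the 24h window and locking are left out to keep the idea visible.

```go
package main

import (
	"fmt"
	"strings"
)

// tx is a hypothetical, simplified transmission record.
type tx struct {
	ID          int
	PayloadType int
}

const payloadAdvert = 4

// relayCount looks the node up under its full pubkey and under its 1-byte
// (two hex chars) prefix, then counts distinct non-advert tx IDs.
func relayCount(byPathHop map[string][]tx, pubkey string) int {
	pubkey = strings.ToLower(pubkey)
	keys := []string{pubkey}
	if len(pubkey) >= 2 {
		keys = append(keys, pubkey[:2])
	}
	seen := map[int]bool{}
	for _, k := range keys {
		for _, t := range byPathHop[k] {
			if t.PayloadType == payloadAdvert {
				continue // a self-advert is not relay activity
			}
			seen[t.ID] = true // de-dup: a tx indexed under both keys counts once
		}
	}
	return len(seen)
}

func main() {
	index := map[string][]tx{
		"a3":     {{ID: 1, PayloadType: 2}, {ID: 2, PayloadType: 2}},
		"a3ffee": {{ID: 2, PayloadType: 2}, {ID: 3, PayloadType: payloadAdvert}},
	}
	fmt.Println(relayCount(index, "A3FFEE"))
}
```

Note that any other node whose pubkey starts with `a3` would get the same prefix hits, which is the over-count the trade-off section accepts.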

## Trade-off

The 1-byte prefix CAN over-count when multiple nodes share a first byte.
This trades a possible over-count for clearly false zeros. The richer
disambiguation done by the path-listing endpoint (resolved-path SQL
post-filter via `confirmResolvedPathContains`) is out of scope for this
partial fix — adding it here would mean disk I/O inside what is
currently a pure in-memory lookup. Worth a follow-up if over-counting
shows up in practice.

## TDD

- Red commit (`test: failing test for relay-info prefix-hop mismatch`):
adds `TestRepeaterRelayActivity_PrefixHop` that builds a non-advert
packet with `PathJSON: ["a3"]`, indexes it via `addTxToPathHopIndex`,
then asserts `RelayCount24h>=1` for the full pubkey starting with `a3…`.
Fails on the assertion (got 0), not a build error.
- Green commit (`fix: GetRepeaterRelayInfo also looks up byPathHop by
1-byte prefix`): the lookup change. All five
`TestRepeaterRelayActivity_*` tests pass.

## Scope

This is a **partial** fix — addresses the read-side prefix mismatch
only. Issue #662 is a 4-axis epic (also covers ingest indexing
consistency, UI surfacing, and schema). Leaving #662 open.

---------

Co-authored-by: corescope-bot <bot@corescope>
Co-authored-by: clawbot <clawbot@users.noreply.github.com>
2026-05-05 02:33:27 -07:00
Kpa-clawbot 136e1d23c8 feat(#730): foreign-advert detection — flag instead of silent drop (#1084)
## Summary

**Partial fix for #730 (M1 only — M2 frontend and M3 alerting
deferred).**

Today the ingestor **silently drops** ADVERTs whose GPS lies outside the
configured `geo_filter` polygon. That's the wrong default for an
analytics tool — operators get zero visibility into bridged or leaked
meshes.

This PR makes the new default **flag, don't drop**: foreign adverts are
stored, the node row is tagged `foreign_advert=1`, and the API surfaces
`"foreign": true` so dashboards / map overlays can be built on top.

## Behavior

| Mode | What happens to an ADVERT outside `geo_filter` |
|---|---|
| (default) flag | Stored, marked `foreign_advert=1`, exposed via API |
| drop (legacy) | Silently dropped (preserves old behavior for ops who want it) |

## What's done (M1 — Backend)
- ingestor stores foreign adverts instead of dropping
- `nodes.foreign_advert` column added (migration)
- `/api/nodes` and `/api/nodes/{pk}` expose `foreign: true` field
- Config: `geofilter.action: "flag"|"drop"` (default `flag`)
- Tests + config docs

## What's NOT done (deferred to M2 + M3)

- **M2 — Frontend:** Map overlay showing foreign adverts as distinct
markers, foreign-advert filter on packets/nodes pages, dedicated
foreign-advert dashboard
- **M3 — Alerting:** Time-series detection of bridging events, alert
when foreign advert rate spikes, identify bridge entry-point nodes

Issue #730 remains open for M2 and M3.

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-05 01:58:52 -07:00
Kpa-clawbot 3ab404b545 feat(node-battery): voltage trend chart + /api/nodes/{pubkey}/battery (#663) (#1082)
## Summary

Closes #663 (Phase 2 + 3 partial — time-series tracking + thresholds for
nodes that are also observers).

Adds a per-node battery voltage trend chart and
`/api/nodes/{pubkey}/battery` endpoint, sourced from the existing
`observer_metrics.battery_mv` samples populated by observer status
messages. No new ingest or schema changes — purely surfaces data we were
already collecting.

## Scope (TDD red→green)

**RED commit:** test(node-battery) — DB query, endpoint shape
(200/404/no-data), and config getters all asserted.
**GREEN commit:** feat(node-battery) — implementation only.

## Changes

### Backend
- `cmd/server/node_battery.go` (new):
- `DB.GetNodeBatteryHistory(pubkey, since)` — pulls `(timestamp,
battery_mv)` rows from `observer_metrics WHERE LOWER(observer_id) =
LOWER(public_key) AND battery_mv IS NOT NULL`. Case-insensitive join
tolerates historical pubkey casing variation (observers persist
uppercase, nodes lowercase in this DB).
- `Server.handleNodeBattery` — `GET /api/nodes/{pubkey}/battery?days=N`
(default 7, max 365). Returns `{public_key, days, samples[], latest_mv,
latest_ts, status, thresholds}`.
- `Config.LowBatteryMv()` / `CriticalBatteryMv()` — defaults 3300 / 3000
mV.
- `cmd/server/config.go` — `BatteryThresholds *BatteryThresholdsConfig`
field.
- `cmd/server/routes.go` — route registration alongside existing
`/health`, `/analytics`.

### Frontend
- `public/node-analytics.js` — new "Battery Voltage" chart card with
status badge (🔋 OK / ⚠️ Low / 🪫 Critical / No data). Renders dashed
threshold lines at `lowMv` and `criticalMv`. Empty-state message when no
samples in window.

### Config
- `config.example.json` — `batteryThresholds: { lowMv: 3300, criticalMv:
3000 }` with `_comment` per Config Documentation Rule.

## Status semantics

| latest_mv                   | status     |
|-----------------------------|------------|
| no samples in window        | `unknown`  |
| `>= lowMv`                  | `ok`       |
| `< lowMv`, `>= criticalMv`  | `low`      |
| `< criticalMv`              | `critical` |
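
The table maps directly onto a small function. A sketch under assumed names (`batteryStatus` and its parameters are hypothetical; the real handler's signature is not shown in this PR body):

```go
package main

import "fmt"

// batteryStatus maps the latest sample onto the four status strings.
// hasSamples is false when the window contains no readings.
func batteryStatus(latestMv int, hasSamples bool, lowMv, criticalMv int) string {
	switch {
	case !hasSamples:
		return "unknown"
	case latestMv >= lowMv:
		return "ok"
	case latestMv >= criticalMv:
		return "low"
	default:
		return "critical"
	}
}

func main() {
	// Default thresholds from the config section: 3300 / 3000 mV.
	for _, mv := range []int{3700, 3300, 3299, 3000, 2999} {
		fmt.Println(mv, batteryStatus(mv, true, 3300, 3000))
	}
	fmt.Println("none", batteryStatus(0, false, 3300, 3000))
}
```

Both boundaries are inclusive on the healthier side: exactly `lowMv` is `ok` and exactly `criticalMv` is `low`.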

## What this PR does NOT do (deferred)

The issue's full Phase 1 (writing decoded sensor advert telemetry into
`nodes.battery_mv` / `temperature_c` from server-side decoder) and Phase
4 (firmware/active polling for repeaters without observers) are out of
scope here. This PR delivers the requested Phase 2/3 surfacing for the
data path that already lands rows: `observer_metrics`. Repeaters that
are also observers (i.e. publish status to MQTT) will get a voltage
trend immediately; pure passive nodes won't until Phase 1 lands.

## Tests

- `TestGetNodeBatteryHistory_FromObserverMetrics` — case-insensitive
join, NULL skipping, ordering.
- `TestNodeBatteryEndpoint` — full happy path with thresholds + status.
- `TestNodeBatteryEndpoint_NoData` — 200 + status=unknown.
- `TestNodeBatteryEndpoint_404` — unknown node.
- `TestBatteryThresholds_ConfigOverride` — config getters + defaults.

`cd cmd/server && go test ./...` — green.

## Performance

Endpoint is per-pubkey (called once on analytics page open), indexed by
`(observer_id, timestamp)` PK on `observer_metrics`. No hot-path impact.

---------

Co-authored-by: bot <bot@corescope>
2026-05-05 01:41:00 -07:00
Kpa-clawbot f33801ecb4 feat(repeater): usefulness score — traffic axis (#672) (#1079)
## Summary

Implements the **Traffic axis** of the repeater usefulness score (#672).
Does NOT close #672 — Bridge, Coverage, and Redundancy axes are deferred
to follow-up PRs.

Adds `usefulness_score` (0..1) to repeater/room node API responses
representing what fraction of non-advert traffic passes through this
repeater as a relay hop.

## Why traffic-axis-first

The issue proposes a 4-axis composite (Bridge, Coverage, Traffic,
Redundancy). Bridge/Coverage/Redundancy require betweenness centrality
and neighbor graph infrastructure (#773 Neighbor Graph V2). Traffic axis
can ship independently using existing path-hop data.

## Remaining work for #672

- Bridge axis (betweenness centrality — depends on #773)
- Coverage axis (observer reach comparison)
- Redundancy axis (node-removal simulation — depends on #687)
- Composite score combining all 4 axes

Partial fix for #672.

---------

Co-authored-by: meshcore-bot <bot@meshcore.local>
2026-05-05 01:34:08 -07:00
Kpa-clawbot d05e468598 feat(memlimit): GOMEMLIMIT support, derive from packetStore.maxMemoryMB (#836) (#1077)
## Summary

Implements **part 1** of #836 — `GOMEMLIMIT` support so the Go runtime
self-throttles GC under cgroup memory pressure instead of getting
SIGKILLed.

(Parts 2 & 3 — bounded cold-load batching + README ops docs — land in
follow-up PRs.)

## Behavior

On startup `cmd/server/main.go` now calls `applyMemoryLimit(maxMemoryMB,
envSet)`:

| Condition | Action | Log |
|---|---|---|
| `GOMEMLIMIT` env set | Honor the runtime's parse, do nothing | `[memlimit] using GOMEMLIMIT from environment (...)` |
| env unset, `packetStore.maxMemoryMB > 0` | `debug.SetMemoryLimit(maxMB * 1.5 MiB)` | `[memlimit] derived from packetStore.maxMemoryMB=512 → 768 MiB (1.5x headroom)` |
| env unset, `maxMemoryMB == 0` | No-op | `[memlimit] no soft memory limit set ... recommend setting one to avoid container OOM-kill` |

The 1.5x headroom covers Go's NextGC trigger at ~2× live heap (per #836
heap profile: 680 MB live → 1.38 GB NextGC).
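
The three-way decision can be sketched as below. The return values (a byte count plus a source string) are an assumption made for illustration; only `debug.SetMemoryLimit` and `os.LookupEnv` are real standard-library calls, and the project's actual `applyMemoryLimit` may differ in signature and logging.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// applyMemoryLimit sketches the decision table: env wins, otherwise derive
// 1.5x of maxMemoryMB, otherwise leave the runtime default untouched.
func applyMemoryLimit(maxMemoryMB int, envSet bool) (limitBytes int64, source string) {
	switch {
	case envSet:
		// The runtime has already parsed GOMEMLIMIT; do not override it.
		return 0, "env"
	case maxMemoryMB > 0:
		limit := int64(maxMemoryMB) * 1024 * 1024 * 3 / 2 // 1.5x headroom, in bytes
		debug.SetMemoryLimit(limit)
		return limit, "derived"
	default:
		return 0, "none"
	}
}

func main() {
	_, envSet := os.LookupEnv("GOMEMLIMIT")
	limit, source := applyMemoryLimit(512, envSet)
	fmt.Printf("[memlimit] source=%s limit=%d MiB\n", source, limit/(1024*1024))
}
```

For `maxMemoryMB=512` the derived limit is 768 MiB (805306368 bytes), matching the log line in the table.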

## Tests (TDD red→green visible in commit history)

- `TestApplyMemoryLimit_FromEnv` — env wins, function does not override
- `TestApplyMemoryLimit_DerivedFromMaxMemoryMB` — verifies bytes
computation + `debug.SetMemoryLimit` actually applied at runtime
- `TestApplyMemoryLimit_None` — no env, no config → reports `"none"`, no
side effect

Red commit: `7de3c62` (assertion failures, builds clean)
Green commit: `454516d`

## Config docs

`config.example.json` `packetStore._comment_gomemlimit` documents
env/derived/override behavior.

## Out of scope

- Cold-load transient bounding (item 2 in #836)
- README container-size table (item 3)
- QA §1.1 rewrite

Closes part 1 of #836.

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-05 01:33:23 -07:00
Kpa-clawbot 45f30fcadc feat(repeater): liveness detection — distinguish actively relaying from advert-only (#662) (#1073)
## Summary

Implements repeater liveness detection per #662 — distinguishes a
repeater that is **actively relaying traffic** from one that is **alive
but idle** (only sending its own adverts).

## Approach

The backend already maintains a `byPathHop` index keyed by lowercase
hop/pubkey for every transmission. Decode-window writes also key it by
**resolved pubkey** for relay hops. We just weren't surfacing it.

`GetRepeaterRelayInfo(pubkey, windowHours)`:
- Reads `byPathHop[pubkey]`.
- Skips packets whose `payload_type == 4` (advert) — a self-advert
proves liveness, not relaying.
- Returns the most recent `FirstSeen` as `lastRelayed`, plus
`relayActive` (within window) and the `windowHours` actually used.
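
The steps above can be sketched as a pure function over the hop entries for one pubkey. The `packet` and `relayInfo` types and the function names are hypothetical, and the index lookup and locking are omitted; this is an illustration of the skip-adverts and window logic, not the project's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// packet is a hypothetical, simplified byPathHop entry.
type packet struct {
	PayloadType int
	FirstSeen   time.Time
}

// relayInfo mirrors the three values described above.
type relayInfo struct {
	LastRelayed time.Time // zero when never observed as a relay hop
	RelayActive bool
	WindowHours int
}

// relayInfoFor scans the hop entries for one pubkey, ignoring adverts.
func relayInfoFor(hops []packet, now time.Time, windowHours int) relayInfo {
	info := relayInfo{WindowHours: windowHours}
	for _, p := range hops {
		if p.PayloadType == 4 {
			continue // a self-advert proves liveness, not relaying
		}
		if p.FirstSeen.After(info.LastRelayed) {
			info.LastRelayed = p.FirstSeen
		}
	}
	if !info.LastRelayed.IsZero() {
		info.RelayActive = now.Sub(info.LastRelayed) <= time.Duration(windowHours)*time.Hour
	}
	return info
}

// demoRelay returns one example per state: active, idle, advert-only.
func demoRelay() (active, idle, advertOnly relayInfo) {
	now := time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)
	active = relayInfoFor([]packet{{2, now.Add(-2 * time.Hour)}}, now, 24)
	idle = relayInfoFor([]packet{{2, now.Add(-48 * time.Hour)}}, now, 24)
	advertOnly = relayInfoFor([]packet{{4, now.Add(-1 * time.Hour)}}, now, 24)
	return
}

func main() {
	active, idle, advertOnly := demoRelay()
	fmt.Println(active.RelayActive, idle.RelayActive, advertOnly.LastRelayed.IsZero())
}
```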

## Three states (per issue)

| State | Indicator | Condition |
|---|---|---|
| 🟢 Relaying | green | `last_relayed` within `relayActiveHours` |
| 🟡 Alive (idle) | yellow | repeater is in the DB but `relay_active=false` (no recent path-hop appearance, or none ever) |
| Stale | existing | falls out of the existing `getNodeStatus` logic |

## API

- `GET /api/nodes` — repeater/room rows now include `last_relayed`
(omitted if never observed) and `relay_active`.
- `GET /api/nodes/{pubkey}` — same fields plus `relay_window_hours`.

## Config

New optional field under `healthThresholds`:

```json
"healthThresholds": {
  ...,
  "relayActiveHours": 24
}
```

Default 24h. Documented in `config.example.json`.

## Frontend

Node detail page gains a **Last Relayed** row for repeaters/rooms with
the 🟢/🟡 state badge. Tooltip explains the distinction from "Last Heard".

## TDD

- **Red commit** `4445f91`: `repeater_liveness_test.go` + stub
`GetRepeaterRelayInfo` returning zero. Active and Stale tests fail on
assertion (LastRelayed empty / mismatched). Idle and IgnoresAdverts
already match the desired behavior under the stub. Compiles, runs, fails
on assertions — not on imports.
- **Green commit** `5fcfb57`: Implementation. All four tests pass. Full
`cmd/server` suite green (~22s).

## Performance

`O(N)` over `byPathHop[pubkey]` per call. The index is bounded by store
eviction; a single repeater has at most a few hundred entries on real
data. The `/api/nodes` loop adds one map read + scan per repeater row —
negligible against the existing enrichment work.

## Limitations (per issue body)

1. Observer coverage gaps — if no observer hears a repeater's relay,
it'll show as idle even when actively relaying. This is inherent to
passive observation.
2. Low-traffic networks — a repeater in a quiet area legitimately shows
idle. The 🟡 indicator copy makes that explicit ("alive (idle)").
3. Hash collisions are mitigated by the existing `resolveWithContext`
path before pubkeys land in `byPathHop`.

Fixes #662

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-05 01:17:52 -07:00
Kpa-clawbot 83881e6b71 fix(#688): auto-discover hashtag channels from message text (#1071)
## Summary

Auto-discovers previously-unknown hashtag channels by scanning decoded
channel message text for `#name` mentions and surfacing them via
`GetChannels`.

Workflow (per the issue):
1. New channel message arrives on a known channel
2. Decoded text is scanned for `#hashtag` mentions
3. Any mention that doesn't match an existing channel is surfaced as a
discovered channel (`discovered: true`, `messageCount: 0`)
4. Future traffic on that channel will populate the entry once it has
its own packets

## Changes

- `cmd/server/discovered_channels.go` — new file.
`extractHashtagsFromText` parses `#name` mentions from free text,
deduped, order-preserving. Trailing punctuation is excluded by the
character class.
- `cmd/server/store.go` — `GetChannels` now scans CHAN packet text for
hashtags after building the primary channel map, and appends any unseen
hashtag mentions as discovered entries.
- `cmd/server/discovered_channels_test.go` — new tests covering parser
edge cases (single, multi, dedup, punctuation, none, bare `#`) and
end-to-end discovery via `GetChannels`.
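
A parser with the properties listed above (deduped, order-preserving, trailing punctuation excluded) might look like the sketch below. The function name `extractHashtags` and the exact character class are assumptions for illustration; the real `extractHashtagsFromText` may accept a different set of characters or return names without the leading `#`.

```go
package main

import (
	"fmt"
	"regexp"
)

// hashtagRe matches '#' followed by at least one name character.
// The character class is an assumption; punctuation such as ',' or '.'
// falls outside it, so trailing punctuation is never captured.
var hashtagRe = regexp.MustCompile(`#[A-Za-z0-9_-]+`)

// extractHashtags returns each distinct #name mention in first-seen order.
func extractHashtags(text string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, tag := range hashtagRe.FindAllString(text, -1) {
		if _, dup := seen[tag]; dup {
			continue
		}
		seen[tag] = struct{}{}
		out = append(out, tag)
	}
	return out
}

func main() {
	fmt.Println(extractHashtags("join #mesh and #Test, then #mesh again. bare # here"))
}
```

A bare `#` produces no match because the pattern requires at least one name character after it.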

## TDD

- Red: `34f1817` — stub returns `nil`, both new tests fail on assertion
(verified).
- Green: `d27b3ed` — real implementation, full `cmd/server` test suite
passes (21.7s).

## Notes

- Discovered channels carry `messageCount: 0` and `lastActivity` set to
the most recent mention's `firstSeen`, so they sort naturally alongside
real channels.
- Names are matched against existing entries by both `#name` and bare
`name` so a channel that already has decoded traffic isn't
double-listed.
- The existing `channelsCache` (15s) covers the new code path; no
separate invalidation needed since the source data (`byPayloadType[5]`)
drives both maps.

Fixes #688

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-05 01:16:57 -07:00
Kpa-clawbot d144764d38 fix(analytics): multiByteCapability missing under region filter → all rows 'unknown' (#1049)
## Bug

`https://meshcore.meshat.se/#/analytics`:

- Unfiltered → 0 adopter rows show "unknown" (correct).
- Region filter `JKG` → 14 rows show "unknown" (wrong — same nodes, all
confirmed when unfiltered).

Multi-byte capability is a property of the NODE, derived from its own
adverts (the full pubkey is in the advert payload, no prefix collision
risk). The observing region should only control which nodes appear in
the analytics list — it must not change a node's cap evidence.

## Root cause

`PacketStore.GetAnalyticsHashSizes(region)` only attached
`result["multiByteCapability"]` when `region == ""`. Under any region
filter the field was absent. The frontend (`public/analytics.js:1011`)
does `data.multiByteCapability || []`, so every adopter row falls
through the merge with no cap status and renders as "unknown".

## Fix

Always populate `multiByteCapability`. When a region filter is active,
source the global adopter hash-size set from a no-region compute pass so
out-of-region observers' adverts still count as evidence.

## TDD

Red commit (`0968137`): adds
`cmd/server/multibyte_region_filter_test.go`, asserts that
`GetAnalyticsHashSizes("JKG")` returns a populated `multiByteCapability`
with Node A as `confirmed`. Fails on the assertion (field missing)
before the fix.

Green commit (`6616730`): always compute capability against the global
advert dataset.

## Files changed

- `cmd/server/store.go` — `GetAnalyticsHashSizes`: drop the `region ==
""` gate, always populate `multiByteCapability`.
- `cmd/server/multibyte_region_filter_test.go` — new red→green test.

## Verification

```
go test ./... -count=1   # all server tests pass (21s)
```

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-05 06:42:58 +00:00
Kpa-clawbot 227f375b4a test(ingestor): regression test for observer metadata persistence (#1044) (#1047)
Adds end-to-end test proving that `extractObserverMeta` +
`UpsertObserver` correctly stores model, firmware, battery_mv,
noise_floor, uptime_secs from a real MQTT status payload.

Test passes — confirms the code path works. #1044 was caused by upstream
observers not including metadata fields in their status payloads (older
`meshcoretomqtt` client versions), not a code bug.

Closes #1044

Co-authored-by: meshcore-bot <bot@meshcore.local>
2026-05-05 06:18:47 +00:00
Kpa-clawbot c9301fee9c fix(ingestor): extract per-hop SNR for TRACE packets at ingest time (#1028)
## Problem

PR #1007 added per-hop SNR extraction (`snrValues`) for TRACE packets to
`cmd/server/decoder.go`. That code path is only hit by the on-demand
re-decode endpoint (packet detail). The actual ingest pipeline runs
`cmd/ingestor/decoder.go`, decodes the packet once, and persists
`decoded_json` into SQLite. The server then serves `decoded_json` as-is
for list/feed queries.

Net effect: `snrValues` never appears in any production response,
because the ingestor's decoder was never updated.

Confirmed empirically: `strings /app/corescope-ingestor | grep snrVal`
returns nothing.

## Fix

Port the SNR extraction logic from `cmd/server/decoder.go` (lines
410–422) into `cmd/ingestor/decoder.go`. For TRACE packets, the header
path bytes are int8 SNR values in quarter-dB encoding; extract them into
`payload.SNRValues` **before** `path.Hops` is overwritten with
payload-derived hop IDs.

Also adds the matching `SNRValues []float64` field to the ingestor's
`Payload` struct so it serializes into `decoded_json`.

## TDD

- **Red commit** (`6ae4c07`): adds `TestDecodeTraceExtractsSNRValues` +
`SNRValues` field stub. Compiles, fails on assertion (`len(SNRValues)=0,
want 2`).
- **Green commit** (`4a4f3f3`): adds extraction loop. Test passes.

Test packet: `26022FF8116A23A80000000001C0DE1000DEDE`
- header `0x26` = TRACE + DIRECT
- pathByte `0x02` = hash_size 1, hash_count 2
- header path `2F F8` → SNR `[int8(0x2F)/4, int8(0xF8)/4]` = `[11.75,
-2.0]`
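
The quarter-dB decoding of the test packet's header path can be reproduced in isolation. This is a sketch of the arithmetic only; `decodeSNR` is a hypothetical helper name, not the function in `cmd/ingestor/decoder.go`.

```go
package main

import "fmt"

// decodeSNR interprets each header path byte as a signed int8 holding
// SNR in quarter-dB units and converts it to dB.
func decodeSNR(pathBytes []byte) []float64 {
	out := make([]float64, 0, len(pathBytes))
	for _, b := range pathBytes {
		out = append(out, float64(int8(b))/4.0)
	}
	return out
}

func main() {
	// Header path bytes 2F F8 from the test packet above.
	fmt.Println(decodeSNR([]byte{0x2F, 0xF8})) // prints [11.75 -2]
}
```

`0x2F` is 47, giving 11.75 dB; `0xF8` reinterpreted as int8 is -8, giving -2.0 dB.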

## Files

- `cmd/ingestor/decoder.go` — `+16` (field + extraction)
- `cmd/ingestor/decoder_test.go` — `+29` (red test)

## Out of scope

- `cmd/server/decoder.go` is already correct (PR #1007). Untouched.
- Backfill of historical `decoded_json` rows. New TRACE packets get SNR;
old rows do not until re-decoded.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-03 21:42:14 -07:00
Kpa-clawbot 9f55ef802b fix(#804): attribute analytics by repeater home region, not observer (#1025)
Fixes #804.

## Problem
Analytics filtered purely by **observer** region: a multi-byte
repeater whose home is PDX would leak into SJC results whenever its flood
adverts were relayed past an SJC observer. Per-node groupings
(`multiByteNodes`, `distributionByRepeaters`) inherited the same bug.

## Fix

Two new helpers in `cmd/server/store.go`:

- `iataMatchesRegion(iata, regionParam)` — case-insensitive IATA→region
  match using the existing `normalizeRegionCodes` parser.
- `computeNodeHomeRegions()` — derives each node's HOME IATA from its
  zero-hop DIRECT adverts. Path byte for those packets is set locally on
  the originating radio and the packet has not been relayed, so the
  observer that hears it must be in direct RF range. Plurality vote when
  zero-hop adverts span multiple regions.
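
The plurality vote can be sketched on its own. `homeRegion` is a hypothetical helper that takes the IATA codes of the observers that heard a node's zero-hop DIRECT adverts; the alphabetical tie-break is an assumption made here for determinism and is not stated in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// homeRegion returns the IATA code that appears most often in the input.
// Ties go to the alphabetically first code (an assumption of this sketch).
// It returns "" when no usable code is present.
func homeRegion(observerIATAs []string) string {
	counts := make(map[string]int)
	for _, raw := range observerIATAs {
		code := strings.ToUpper(strings.TrimSpace(raw))
		if code == "" {
			continue // observer without an IATA tag contributes no vote
		}
		counts[code]++
	}
	best := ""
	for code, n := range counts {
		if n > counts[best] || (n == counts[best] && code < best) {
			best = code
		}
	}
	return best
}

func main() {
	fmt.Println(homeRegion([]string{"PDX", "sjc", "PDX"}))
}
```

An empty result corresponds to the "home is unknown" case, where the PR falls back to the observer-region filter.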

`computeAnalyticsHashSizes` now applies these in two ways:

1. **Observer-region filter is relaxed for ADVERT packets** when the
   originator's home region matches the requested region. A flood advert
   from a PDX repeater that's only heard by an SJC observer still
   attributes to PDX.
2. **Per-node grouping** (`multiByteNodes`, `distributionByRepeaters`)
   excludes nodes whose HOME region disagrees with the requested region.
   Falls back to the observer-region filter when home is unknown.

Adds `attributionMethod` to the response (`"observer"` or `"repeater"`)
so operators can tell which method was applied.

## Backwards compatibility

- No region filter requested → behavior unchanged (`attributionMethod`
  is `"observer"`).
- Region filter requested but no zero-hop direct adverts seen for a node
  → falls back to the prior observer-region check for that node.
- Operators without IATA-tagged observers see no change.

## TDD

- **Red commit** (`c35d349`): adds
  `TestIssue804_AnalyticsAttributesByRepeaterRegion` with three subtests
  (PDX leak into SJC, attributionMethod field present, SJC leak into PDX).
  Compiles, runs, fails on assertions.
- **Green commit** (`11b157f`): the implementation. All subtests pass,
  full `cmd/server` package green.

## Files changed
- `cmd/server/store.go` — helpers + analytics filter logic (+236/-51)
- `cmd/server/issue804_repeater_region_test.go` — new test (+147)

---------

Co-authored-by: CoreScope Bot <bot@corescope.local>
Co-authored-by: openclaw-bot <bot@openclaw.local>
2026-05-03 20:10:02 -07:00
Kpa-clawbot 1f4969c1a6 fix(#770): treat region 'All' as no-filter + document region behavior (#1026)
## Summary

Fixes #770 — selecting "All" in the region filter dropdown produced an
empty channel list.

## Root cause

`normalizeRegionCodes` (cmd/server/db.go) treated any non-empty input as
a literal IATA code. The frontend region filter labels its catch-all
option **"All"**; while `region-filter.js` normally sends an empty
string when "All" is selected, any code path that ends up sending
`?region=All` (deep-link URLs, manual queries, future callers) caused
the function to return `["ALL"]`. Downstream queries then filtered
observers for `iata = 'ALL'`, which never matches anything → empty
response.

## Fix

`normalizeRegionCodes` now treats `All` / `ALL` / `all`
(case-insensitive, with optional whitespace, mixed in CSV) as equivalent
to an empty value, returning `nil` to signal "no filter". Real IATA
codes (`SJC`, `PDX`, `sjc,PDX` → `[SJC PDX]`) still pass through
unchanged.
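
The behavior described above can be illustrated with a small standalone function. `normalizeRegions` is a hypothetical name, and the choice that any `All` token anywhere in the CSV disables the filter is an assumption about what "mixed in CSV" means; the real `normalizeRegionCodes` may handle that case differently.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegions splits a CSV region parameter into upper-cased IATA
// codes. It returns nil ("no filter") for empty input or when an "All"
// sentinel appears in any letter case.
func normalizeRegions(raw string) []string {
	var codes []string
	for _, part := range strings.Split(raw, ",") {
		code := strings.ToUpper(strings.TrimSpace(part))
		if code == "" {
			continue
		}
		if code == "ALL" {
			return nil // sentinel: behave as if no region was requested
		}
		codes = append(codes, code)
	}
	return codes
}

func main() {
	fmt.Println(normalizeRegions("sjc,PDX"))    // prints [SJC PDX]
	fmt.Println(normalizeRegions("All") == nil) // prints true
}
```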

This is a defensive server-side fix: a single chokepoint that all
region-aware endpoints already flow through (channels, packets,
analytics, encrypted channels, observer ID resolution).

## Documentation

Expanded `_comment_regions` in `config.example.json` to explain:
- How IATA codes are resolved (payload > topic > source config — set in
#1012)
- What the `regions` map controls (display labels) vs runtime-discovered
codes
- That observers without an IATA tag only appear under "All Regions"
- That the `All` sentinel is server-side safe

## TDD

- **Red commit** (`4f65bf4`): `cmd/server/region_filter_test.go` —
`TestNormalizeRegionCodes_AllIsNoFilter` asserts `All` / `ALL` / `all` /
`""` / `"All,"` all collapse to `nil`. Compiles, runs, fails on
assertion (`got [ALL], want nil`). Companion test
`TestNormalizeRegionCodes_RealCodesPreserved` locks in that `sjc,PDX`
still returns `[SJC PDX]`.
- **Green commit** (`c9fb965`): two-line change in
`normalizeRegionCodes` + docs update.

## Verification

```
$ go test -run TestNormalizeRegionCodes -count=1 ./cmd/server
ok      github.com/corescope/server     0.023s

$ go test -count=1 ./cmd/server
ok      github.com/corescope/server    21.454s
```

Full suite green; no existing region tests regressed.

Fixes #770

---------

Co-authored-by: Kpa-clawbot <bot@corescope>
2026-05-03 19:50:01 -07:00
Kpa-clawbot b06adf9f2a feat: /api/backup — one-click SQLite database export (#474) (#1022)
## Summary

Implements `GET /api/backup` — one-click SQLite database export per
#474.

Operators can now grab a complete, consistent snapshot of the analyzer
DB with a single authenticated request — no SSH, no scripts, no DB
tooling.

## Endpoint

```
GET /api/backup
X-API-Key: <key>            # required
→ 200 OK
  Content-Type: application/octet-stream
  Content-Disposition: attachment; filename="corescope-backup-<unix>.db"
  <body: complete SQLite database file>
```

## Approach

Uses SQLite's `VACUUM INTO 'path'` to produce an atomic, defragmented
copy of the database into a fresh file:

- **Consistent**: VACUUM INTO runs at read isolation — the snapshot
reflects a single point in time even while the ingestor is writing to
the WAL.
- **Non-blocking**: writers continue uninterrupted; we never hold a
write lock.
- **Works on read-only connections**: verified manually against a
WAL-mode source DB (`mode=ro` connection successfully produces a
snapshot).
- **No corruption risk**: even if the live on-disk DB has issues, VACUUM
INTO surfaces what the server can read rather than copying broken pages
byte-for-byte.

The snapshot is staged in `os.MkdirTemp(...)` and removed after the
response body is fully streamed (deferred cleanup). Requesting client IP
is logged for audit.

The issue suggested an alternative in-memory rebuild path; `VACUUM INTO`
is simpler, faster, and produces a strictly more accurate copy of what
the server actually sees, so this PR uses it.

## Security

- Mounted under `requireAPIKey` middleware — same gate as other admin
endpoints (`/api/admin/prune`, `/api/perf/reset`).
- Returns 401 without a valid `X-API-Key` header.
- Returns 403 if no API key is configured server-side.
- `X-Content-Type-Options: nosniff` set on the response.

## TDD

- **Red** (`99548f2`): `cmd/server/backup_test.go` adds
`TestBackupRequiresAPIKey` + `TestBackupReturnsValidSQLiteSnapshot`.
Stub handler returns 200 with no body so the tests fail on assertions
(Content-Type / Content-Disposition / SQLite magic header), not on
import or build errors.
- **Green** (`837b2fe`): real implementation lands; both tests pass;
full `go test ./...` suite stays green.

## Files

- `cmd/server/backup.go` — handler implementation
- `cmd/server/backup_test.go` — red-then-green tests
- `cmd/server/routes.go` — route registration under `requireAPIKey`
- `cmd/server/openapi.go` — OpenAPI metadata so `/api/openapi`
advertises the endpoint

## Out of scope (follow-ups)

- Rate limiting (issue suggested 1 req/min). Not added here —
admin-key-gated endpoint with a fast snapshot path is acceptable for v1;
happy to add a token-bucket limiter in a follow-up if operators report
hammering.
- UI button to trigger the download (frontend work — separate PR).

Fixes #474

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-03 17:56:42 -07:00
Kpa-clawbot 51b9fed15e feat(roles): /#/roles page + /api/analytics/roles endpoint (Fixes #818) (#1023)
## Summary

Implements `/#/roles` per QA #809 §5.4 / issue #818. The page previously
showed "Page not yet implemented."

### Backend
- New `GET /api/analytics/roles` returns `{ totalNodes, roles: [{ role,
nodeCount, withSkew, meanAbsSkewSec, medianAbsSkewSec, okCount,
warningCount, criticalCount, absurdCount, noClockCount }] }`.
- Pure `computeRoleAnalytics(nodesByPubkey, skewByPubkey)` does the
bucketing/aggregation — no store/lock dependency, fully unit-testable.
- Roles are normalised (lowercased + trimmed; empty bucketed as
`unknown`).

### Frontend
- New `public/roles-page.js` renders a distribution table: count, share,
distribution bar, w/ skew, median |skew|, mean |skew|, severity
breakdown (OK / Warning / Critical / Absurd / No-clock).
- Registered as the `roles` page in the SPA router and linked from the
main nav.
- Auto-refreshes every 60 s, with a manual refresh button.

### Tests (TDD)
- **Red commit** (`9726d5b`): two assertion-failing tests against a stub
`computeRoleAnalytics` that returns an empty result. Compiles, runs,
fails on `TotalNodes = 0, want 5` and `len(Roles) = 0, want 1`.
- **Green commit** (`7efb76a`): full implementation, route wiring,
frontend page + nav, plus E2E test in `test-e2e-playwright.js` covering
both the empty-state contract (no "Page not yet implemented"
placeholder) and the populated-table case (header columns, body rows,
API response shape).

### Verification
- `go test ./cmd/server/...` green.
- Local server with the e2e fixture: `GET /api/analytics/roles` returns
`{"totalNodes":200,"roles":[{"role":"repeater","nodeCount":168,...},{"role":"room","nodeCount":23,...},{"role":"companion","nodeCount":9,...}]}`.

Fixes #818

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-03 17:56:12 -07:00
Kpa-clawbot a56ee5c4fe feat(analytics): selectable timeframes via ?window/?from/?to (#842) (#1018)
## Summary
Selectable analytics timeframes (#842). Adds backend support for
`?window=1h|24h|7d|30d` and `?from=&to=` on the three main analytics
endpoints (`/api/analytics/rf`, `/api/analytics/topology`,
`/api/analytics/channels`), and a time-window picker in the Analytics
page UI that drives them. Default behavior with no query params is
unchanged.

## TDD trail
- Red: `bbab04d` — adds `TimeWindow` + `ParseTimeWindow` stub and tests;
tests fail on assertions because the stub returns the zero window.
- Green: `75d27f9` — implements `ParseTimeWindow`, threads `TimeWindow`
through `compute*` loops + caches, wires HTTP handlers, adds frontend
picker + E2E.

## Backend changes
- `cmd/server/time_window.go` — full `ParseTimeWindow` (`?window=`
aliases + `?from=&to=` RFC3339 absolute range; invalid input → zero
window for backwards compatibility).
- `cmd/server/store.go` — new
`GetAnalytics{RF,Topology,Channels}WithWindow` wrappers; `compute*`
loops skip transmissions whose `FirstSeen` (or per-obs `Timestamp` for
the region+observer slice) falls outside the window. Cache key composes
`region|window` so different windows do not poison each other.
- `cmd/server/routes.go` — handlers call `ParseTimeWindow(r)` and
dispatch to the `*WithWindow` methods.
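
The parsing rules can be sketched independently of the HTTP layer. This is an illustration under assumptions: the `From`/`To` field names, the string-based `parseTimeWindow` signature, and the requirement that both `from` and `to` be present are all hypothetical; the real `ParseTimeWindow` takes the request.

```go
package main

import (
	"fmt"
	"time"
)

// TimeWindow is a closed interval; the zero value means "no filter".
type TimeWindow struct {
	From, To time.Time
}

func (w TimeWindow) IsZero() bool { return w.From.IsZero() && w.To.IsZero() }

// Includes reports whether t falls inside the window. A zero window
// includes everything, which preserves the no-params behavior.
func (w TimeWindow) Includes(t time.Time) bool {
	if !w.From.IsZero() && t.Before(w.From) {
		return false
	}
	if !w.To.IsZero() && t.After(w.To) {
		return false
	}
	return true
}

var windowAliases = map[string]time.Duration{
	"1h":  time.Hour,
	"24h": 24 * time.Hour,
	"7d":  7 * 24 * time.Hour,
	"30d": 30 * 24 * time.Hour,
}

// parseTimeWindow resolves the ?window alias first, then the absolute
// ?from/?to range. Anything invalid collapses to the zero window.
func parseTimeWindow(window, from, to string, now time.Time) TimeWindow {
	if d, ok := windowAliases[window]; ok {
		return TimeWindow{From: now.Add(-d), To: now}
	}
	start, errFrom := time.Parse(time.RFC3339, from)
	end, errTo := time.Parse(time.RFC3339, to)
	if errFrom != nil || errTo != nil || end.Before(start) {
		return TimeWindow{}
	}
	return TimeWindow{From: start, To: end}
}

func main() {
	now := time.Now().UTC()
	w := parseTimeWindow("24h", "", "", now)
	fmt.Println(w.Includes(now.Add(-time.Hour)), w.Includes(now.Add(-48*time.Hour)))
	fmt.Println(parseTimeWindow("bogus", "", "", now).IsZero())
}
```

Returning the zero window on bad input, rather than an error, is what keeps unknown parameters from changing existing responses.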

## Frontend changes
- `public/analytics.js` — new `<select id="analyticsTimeWindow">`
rendered under the region filter (All / 1h / 24h / 7d / 30d). Selecting
an option triggers `loadAnalytics()` which appends `&window=…` to every
analytics fetch.

## Tests
- `cmd/server/time_window_test.go` — covers all aliases, absolute range,
no-params backwards compatibility, `Includes()` bounds, and `CacheKey()`
distinctness.
- `cmd/server/topology_dedup_test.go`,
`cmd/server/channel_analytics_test.go` — updated callers to pass
`TimeWindow{}`.

## E2E (rule 18)
`test-e2e-playwright.js:592-611` — opens `/#/analytics`, asserts the
picker is rendered with a `24h` option, then asserts that selecting
`24h` triggers a network request to `/api/analytics/rf?…window=24h`.

## Backwards compatibility
No params → zero `TimeWindow` → original code paths (no filter,
region-only cache key). Verified by
`TestParseTimeWindow_NoParams_BackwardsCompatible` and by the existing
analytics tests still passing unchanged on `_wt-fix-842`.

Fixes #842

---------

Co-authored-by: you <you@example.com>
Co-authored-by: corescope-bot <bot@corescope>
2026-05-03 17:41:22 -07:00
Kpa-clawbot df69a17718 feat(#772): short pubkey-prefix URLs for mesh sharing (#1016)
## Summary

Fixes #772 — adds a short-URL form for node detail pages so operators
can paste node links into a mesh chat without bringing along a
64-hex-char public key.

## Approach

**Pubkey-prefix resolution** (no allocator, no lookup table).

- The SPA hash route `#/nodes/<key>` already accepts whatever
pubkey-shaped string the user pastes; the front end forwards it to `GET
/api/nodes/<key>`.
- When that lookup misses **and** the path is 8..63 hex chars, the
backend now calls `DB.GetNodeByPrefix` and:
  - returns the matching node when exactly one node has that prefix,
  - returns **409 Conflict** when multiple nodes share the prefix (with a
    "use a longer prefix" hint),
  - falls through to the existing 404 otherwise.
- 8 hex chars = 32 bits of entropy, which is enough for fleets in the
low thousands. Operators can extend to 10–12 chars if collisions become
common.
- The full-screen node detail card gets a new **📡 Copy short URL**
button that copies `…/#/nodes/<first 8 hex chars>`.
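
The resolution rules can be sketched against an in-memory slice. The real `GetNodeByPrefix` queries the database with `LIMIT 2`; `resolvePrefix` and `isHex` below are hypothetical stand-ins, and the early exit on a second match plays the same role as that limit.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isHex reports whether s consists only of lowercase hex digits.
func isHex(s string) bool {
	for _, r := range s {
		if !(r >= '0' && r <= '9') && !(r >= 'a' && r <= 'f') {
			return false
		}
	}
	return true
}

// resolvePrefix returns the single pubkey that starts with prefix.
// ambiguous is true when two or more keys match; an empty match with
// ambiguous == false means "not found".
func resolvePrefix(pubkeys []string, prefix string) (match string, ambiguous bool, err error) {
	prefix = strings.ToLower(prefix)
	if len(prefix) < 8 || len(prefix) > 63 || !isHex(prefix) {
		return "", false, errors.New("prefix must be 8..63 hex characters")
	}
	for _, pk := range pubkeys {
		if !strings.HasPrefix(strings.ToLower(pk), prefix) {
			continue
		}
		if match != "" {
			return "", true, nil // second hit: stop early, like LIMIT 2
		}
		match = pk
	}
	return match, false, nil
}

func main() {
	// Shortened keys for readability; real pubkeys are 64 hex chars.
	keys := []string{"0735bc6d0000aaaa", "0735bc6d1111bbbb", "a1b2c3d4e5f60718"}
	fmt.Println(resolvePrefix(keys, "a1b2c3d4"))
	fmt.Println(resolvePrefix(keys, "0735bc6d"))
}
```

A handler built on this would map a match to 200, `ambiguous` to 409, and an empty match to the existing 404.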

### Why not an opaque ID table (`/s/<id>`)?

Considered and rejected:

- Needs persistence + an allocator + cleanup story.
- IDs aren't self-describing — operators can't sanity-check them.
- IDs don't survive a DB rebuild.
- 32 bits of pubkey already buys us collision resistance with zero
moving parts.

If the directory grows past the point where 8-char prefixes routinely
collide, we can extend the minimum length without changing the URL
shape.
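
As a rough sanity check on the prefix length, the standard birthday approximation estimates the chance that any two nodes in a fleet share a prefix. This sketch assumes pubkey prefixes are uniformly distributed; `collisionProbability` is a hypothetical helper, not part of the PR.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of
// `nodes` uniformly random keys share the same prefix of `prefixBits`
// bits, using p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
func collisionProbability(nodes, prefixBits int) float64 {
	n := float64(nodes)
	space := math.Pow(2, float64(prefixBits))
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	for _, hexChars := range []int{8, 10, 12} {
		p := collisionProbability(2000, hexChars*4)
		fmt.Printf("%2d hex chars, 2000 nodes: p = %.2e\n", hexChars, p)
	}
}
```

For 2000 nodes and 8 hex characters the estimate is well under 0.1%, and each extra pair of hex characters shrinks it by a factor of about 256.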

## Changes

- `cmd/server/db.go` — new `GetNodeByPrefix(prefix)` returning `(node,
ambiguous, error)`. Validates hex; rejects <8 chars; `LIMIT 2` to detect
collisions cheaply.
- `cmd/server/routes.go` — `handleNodeDetail` falls back to prefix
resolution; canonicalizes pubkey downstream; emits 409 on ambiguity;
honors blacklist on the resolved pubkey.
- `public/nodes.js` — adds **📡 Copy short URL** button + handler on the
full-screen node detail card.
- `cmd/server/short_url_test.go` — Go tests (red-then-green).
- `test-e2e-playwright.js` — E2E: navigates via prefix-only URL and
asserts the new button surfaces.

## TDD evidence

- Red commit: `2dea97a` — tests added with a stub `GetNodeByPrefix`
returning `(nil, false, nil)`. All four assertions failed (assertion
failures, not build errors): expected node got nil; expected
ambiguous=true got false; route 404 vs expected 200/409.
- Green commit: `9b8f146` — implementation lands; `go test ./...` passes
locally in `cmd/server`.

## Compatibility

- Existing 64-char pubkey URLs are untouched (exact lookup runs first).
- Blacklist is enforced both on the raw input and on the resolved
pubkey.
- No new config knobs.

## What I did **not** touch

- `cmd/server/db_test.go`, other route tests — unchanged.
- Packet-detail short URLs (issue scopes nodes; revisit in a follow-up
if asked).

Fixes #772

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-03 17:40:54 -07:00
Kpa-clawbot 5e01de0d52 fix: make path_json backfill async to unblock MQTT startup (#1013)
## Summary

**P0 fix**: The `path_json` backfill migration (PR #983) ran
synchronously in `applySchema`, blocking the ingestor main goroutine. On
staging (~502K observations), MQTT never connected — no new packets
ingested for 15+ hours.

## Fix

Extract the backfill into `BackfillPathJSONAsync()` — a method on
`*Store` that launches the work in a background goroutine. Called from
`main.go` before MQTT connect, it runs concurrently without blocking
subscription.

**Pattern**: identical to `backfillResolvedPathsAsync` in the server
(same lesson learned).

## Safety

- Idempotent: checks `_migrations` table, skips if already recorded
- Only touches `path_json IS NULL` rows — no conflict with live ingest
(new observations get `path_json` at write time)
- Panic-recovered goroutine with start/completion logging
- Batched (1000 rows per iteration) to avoid memory pressure

## TDD

- **Red commit**: `c6e1375` — test asserts `BackfillPathJSONAsync`
method exists + OpenStore doesn't block
- **Green commit**: `015871f` — implements async method, all tests pass

## Files changed

- `cmd/ingestor/db.go` — removed sync backfill from `applySchema`, added
`BackfillPathJSONAsync()`
- `cmd/ingestor/main.go` — call `store.BackfillPathJSONAsync()` after
store creation
- `cmd/ingestor/db_test.go` — new async tests + updated existing test to
use async API

---------

Co-authored-by: you <you@example.com>
2026-05-03 11:29:56 -07:00
Kpa-clawbot b0e4d2fa18 feat: add optional MQTT region field (#788) (#1012)
## Summary

Add optional `region` field to MQTT source config and JSON payload,
enabling publishers to explicitly provide region data without relying
solely on topic path structure.

## Changes

- **`MQTTSource.Region`** — new optional config field. When set, acts as
default region for all messages from that source (useful when a broker
serves a single region).
- **`MQTTPacketMessage.Region`** — new optional JSON payload field.
Publishers can include `"region": "PDX"` in their MQTT messages.
- **`PacketData.Region`** — carries the resolved region through to
storage.
- **Priority resolution**: payload `region` > topic-derived region >
source config `region`
- Observer IATA is updated with the effective region on every packet.
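
The priority order can be expressed as a first-non-empty lookup. `effectiveRegion` is a hypothetical helper written for illustration; whether the ingestor also trims or upper-cases the value is not stated in this PR, so the sketch only trims whitespace.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveRegion returns the first non-blank value in priority order:
// payload field, then topic-derived region, then source config default.
func effectiveRegion(payload, topic, source string) string {
	for _, r := range []string{payload, topic, source} {
		if r = strings.TrimSpace(r); r != "" {
			return r
		}
	}
	return ""
}

func main() {
	fmt.Println(effectiveRegion("", "SJC", "PDX"))    // topic beats source config
	fmt.Println(effectiveRegion("SEA", "SJC", "PDX")) // payload beats both
}
```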

## Config example

```json
{
  "mqttSources": [
    {
      "name": "cascadia",
      "broker": "tcp://cascadia-broker:1883",
      "topics": ["meshcore/#"],
      "region": "PDX"
    }
  ]
}
```

## Payload example

```json
{"raw": "0a1b2c...", "SNR": 5.2, "region": "PDX"}
```

## TDD

- Red commit: `980304c` (tests fail at compile — fields don't exist)
- Green commit: `4caf88b` (implementation, all tests pass)

## Unblocks

- #804, #770, #730 (all depend on region being available on
observations)

Fixes #788

---------

Co-authored-by: you <you@example.com>
2026-05-03 11:21:54 -07:00
Kpa-clawbot c186129d47 feat: parse and display per-hop SNR values for TRACE packets (#1007)
## Summary

Parse and display per-hop SNR values from TRACE packets in the Packet
Byte Breakdown panel.

## Changes

### Backend (`cmd/server/decoder.go`)
- Added `SNRValues []float64` field to Payload struct
(`json:"snrValues,omitempty"`)
- In the TRACE-specific block, extract SNR from header path bytes before
they're overwritten with route hops
- Each header path byte is `int8(SNR_dB * 4.0)` per firmware — decode by
dividing by 4.0

### Frontend (`public/packets.js`)
- Added "SNR Path" section in `buildFieldTable()` showing per-hop SNR
values in dB when packet type is TRACE
- Added TRACE-specific payload rendering (trace tag, auth code, flags
with hash_size, route hops)

## TDD

- Red commit: `4dba4e8` — test asserts `Payload.SNRValues` field
(compile fails, field doesn't exist)
- Green commit: `5a496bd` — implementation passes all tests

## Testing

- `go test ./...` passes (all existing + 2 new TRACE SNR tests)
- No frontend test changes needed (no existing TRACE UI tests; rendering
is additive)

Fixes #979

---------

Co-authored-by: you <you@example.com>
2026-05-03 11:17:25 -07:00
Kpa-clawbot 153308134e feat: add global observer IATA whitelist config (#1001)
## Summary

Adds a global `observerIATAWhitelist` config field that restricts which
observer IATA regions are processed by the ingestor.

## Problem

Operators running regional instances (e.g., Sweden) want to ensure only
observers physically in their region contribute data. The existing
per-source `iataFilter` only filters packet messages but still allows
status messages through, meaning observers from other regions appear in
the database.

## Solution

New top-level config field `observerIATAWhitelist`:
- When non-empty, **all** messages (status + packets) from observers
outside the whitelist are silently dropped
- Case-insensitive matching
- Empty list = all regions allowed (fully backwards compatible)
- Lazy O(1) lookup via cached uppercase set (same pattern as
`observerBlacklist`)
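
A lazily built uppercase set gives the O(1) lookup described above. The sketch below is an assumption about the shape, not the code in `cmd/ingestor/config.go`: `ingestConfig` is a hypothetical type, and it assumes observers with no IATA tag are dropped once a whitelist is configured.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ingestConfig is a hypothetical stand-in for the ingestor's config type.
type ingestConfig struct {
	ObserverIATAWhitelist []string

	once    sync.Once
	allowed map[string]struct{} // uppercase set, built on first use
}

// IsObserverIATAAllowed reports whether messages from an observer in the
// given region should be processed. An empty whitelist allows everything.
func (c *ingestConfig) IsObserverIATAAllowed(iata string) bool {
	if len(c.ObserverIATAWhitelist) == 0 {
		return true
	}
	c.once.Do(func() {
		c.allowed = make(map[string]struct{}, len(c.ObserverIATAWhitelist))
		for _, code := range c.ObserverIATAWhitelist {
			c.allowed[strings.ToUpper(strings.TrimSpace(code))] = struct{}{}
		}
	})
	_, ok := c.allowed[strings.ToUpper(strings.TrimSpace(iata))]
	return ok
}

func main() {
	cfg := &ingestConfig{ObserverIATAWhitelist: []string{"ARN", "GOT"}}
	fmt.Println(cfg.IsObserverIATAAllowed("arn"), cfg.IsObserverIATAAllowed("SJC"))
}
```

`sync.Once` keeps the set construction safe if several MQTT handler goroutines hit the check at the same time.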

### Config example
```json
{
  "observerIATAWhitelist": ["ARN", "GOT"]
}
```

## TDD

- **Red commit:** `f19c2b2` — tests for `ObserverIATAWhitelist` field
and `IsObserverIATAAllowed` method (build fails)
- **Green commit:** `782f516` — implementation + integration test

## Files changed
- `cmd/ingestor/config.go` — new field, new method
`IsObserverIATAAllowed`
- `cmd/ingestor/main.go` — whitelist check in `handleMessage` before
status processing
- `cmd/ingestor/config_test.go` — unit tests for config parsing and
matching
- `cmd/ingestor/main_test.go` — integration test for handleMessage
filtering

Fixes #914

---------

Co-authored-by: you <you@example.com>
2026-05-03 10:23:35 -07:00
Kpa-clawbot e86b5a3a0c feat: show multi-byte hash support indicator on map markers (#1002)
## Summary

Show 2-byte hash support indicator on map markers. Fixes #903.

## What changed

### Backend (`cmd/server/store.go`, `cmd/server/routes.go`)

- **`EnrichNodeWithMultiByte()`** — new enrichment function that adds
`multi_byte_status` (confirmed/suspected/unknown), `multi_byte_evidence`
(advert/path), and `multi_byte_max_hash_size` fields to node API
responses
- **`GetMultiByteCapMap()`** — cached (15s TTL) map of pubkey →
`MultiByteCapEntry`, reusing the existing `computeMultiByteCapability()`
logic that combines advert-based and path-hop-based evidence
- Wired into both `/api/nodes` (list) and `/api/nodes/{pubkey}` (detail)
endpoints

### Frontend (`public/map.js`)

- Added **"Multi-byte support"** checkbox in the map Display controls
section
- When toggled on, repeater markers change color:
  - 🟢 Green (`#27ae60`) — **confirmed** (advertised with hash_size ≥ 2)
  - 🟡 Yellow (`#f39c12`) — **suspected** (seen as hop in multi-byte path)
  - 🔴 Red (`#e74c3c`) — **unknown** (no multi-byte evidence)
- Popup tooltip shows multi-byte status and evidence for repeaters
- State persisted in localStorage (`meshcore-map-multibyte-overlay`)

## TDD

- Red commit: `2f49cbc` — failing test for `EnrichNodeWithMultiByte`
- Green commit: `4957782` — implementation + passing tests

## Performance

- `GetMultiByteCapMap()` uses a 15s TTL cache (same pattern as
`GetNodeHashSizeInfo`)
- Enrichment is O(n) over nodes, no per-item API calls
- Frontend color override is computed inline during existing marker
render loop — no additional DOM rebuilds

---------

Co-authored-by: you <you@example.com>
2026-05-03 08:56:09 -07:00
Kpa-clawbot 2e3a94b86d chore(db): one-time cleanup of legacy packets with empty hash or null timestamp (closes #994) (#997)
## Summary

One-time startup migration that deletes legacy packets (transmissions +
observations) with empty hash or empty `first_seen` timestamp. This is
the write-side cleanup following #993's read-side filter.

### Migration: `cleanup_legacy_null_hash_ts`

- Checks `_migrations` table for marker
- If not present: deletes observations referencing bad transmissions,
then deletes the transmissions themselves
- Logs count of deleted rows
- Records marker for idempotency

### TDD

- **Red commit:** `b1a24a1` — test asserts migration deletes bad rows
(fails without implementation)
- **Green commit:** `2b94522` — implements the migration, all tests pass

Fixes #994

---------

Co-authored-by: you <you@example.com>
2026-05-02 23:15:20 -07:00
Kpa-clawbot 564d93d6aa fix: dedup topology analytics by resolved pubkey (#998)
## Fix topology analytics double-counting repeaters/pairs (#909)

### Problem

`computeAnalyticsTopology()` aggregates by raw hop hex string. When
firmware emits variable-length path hashes (1-3 bytes per hop), the same
physical node appears multiple times with different prefix lengths (e.g.
`"07"`, `"0735bc"`, `"0735bc6d"` all referring to the same node). This
inflates repeater counts and creates duplicate pair entries.

### Solution

Added a confidence-gated dedup pass after frequency counting:

1. **For each hop prefix**, check if it resolves unambiguously (exactly
1 candidate in the prefix map)
2. **Unambiguous prefixes** → group by resolved pubkey, sum counts, keep
longest prefix as display identifier
3. **Ambiguous prefixes** (multiple candidates for that prefix) → left
as separate entries (conservative)
4. **Same treatment for pairs**: canonicalize by sorted pubkey pair

### Addressing @efiten's collision concern

At scale (~2000+ repeaters), 1-byte prefixes (256 buckets) WILL collide.
This fix explicitly checks the prefix map candidate count. Ambiguous
prefixes (where `len(pm.m[hop]) > 1`) are never merged — they remain as
separate entries. Only prefixes with a single matching node are eligible
for dedup.
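
The repeater half of the dedup pass can be sketched as below. The names `hopCount` and `dedupHops`, and the plain `map[string][]string` standing in for the prefix map, are assumptions for illustration; pair canonicalization is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// hopCount is one row of the deduplicated result. Pubkey is empty for
// prefixes that were left unmerged.
type hopCount struct {
	Prefix string
	Pubkey string
	Count  int
}

// dedupHops merges per-prefix counts by resolved pubkey. Only prefixes
// with exactly one candidate are merged; everything else stays separate.
func dedupHops(counts map[string]int, candidates map[string][]string) []hopCount {
	merged := make(map[string]*hopCount)
	var out []hopCount
	for prefix, n := range counts {
		c := candidates[prefix]
		if len(c) != 1 {
			// Ambiguous or unresolved: keep as its own entry.
			out = append(out, hopCount{Prefix: prefix, Count: n})
			continue
		}
		e, ok := merged[c[0]]
		if !ok {
			e = &hopCount{Pubkey: c[0]}
			merged[c[0]] = e
		}
		e.Count += n
		if len(prefix) > len(e.Prefix) {
			e.Prefix = prefix // longest prefix becomes the display identifier
		}
	}
	for _, e := range merged {
		out = append(out, *e)
	}
	// Sort for deterministic output: highest count first, then by prefix.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Prefix < out[j].Prefix
	})
	return out
}

func main() {
	counts := map[string]int{"07": 7, "0735bc": 3, "0735bc6d": 2}
	candidates := map[string][]string{
		"07":       {"pkA", "pkB"}, // 1-byte prefix collides
		"0735bc":   {"pkA"},
		"0735bc6d": {"pkA"},
	}
	for _, h := range dedupHops(counts, candidates) {
		fmt.Printf("%-10s %-4s %d\n", h.Prefix, h.Pubkey, h.Count)
	}
}
```

In the example the two longer prefixes merge into one `pkA` row with count 5, while the colliding `07` row keeps its own count.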

### TDD

- **Red commit**: `4dbf9c0` — added 3 failing tests
- **Green commit**: `d6cae9a` — implemented dedup, all tests pass

### Tests added

- `TestTopologyDedup_RepeatersMergeByPubkey` — verifies entries with
different prefix lengths for same node merge to single entry with summed
count
- `TestTopologyDedup_AmbiguousPrefixNotMerged` — verifies colliding
short prefix stays separate from unambiguous longer prefix
- `TestTopologyDedup_PairsMergeByPubkey` — verifies pair entries merge
by resolved pubkey pair

Fixes #909

---------

Co-authored-by: you <you@example.com>
2026-05-02 22:19:49 -07:00
Kpa-clawbot b7c280c20a fix: drop/filter packets with null hash or timestamp (closes #871) (#993)
## Summary

Closes #871

The `/api/packets` endpoint could return packets with `null` hash or
timestamp fields. This was caused by legacy data in SQLite (rows with
empty `hash` or `NULL`/empty `first_seen`) predating the ingestor's
existing validation guard (`if hash == "" { return false, nil }` at
`cmd/ingestor/db.go:610`).

## Root Cause

`cmd/server/store.go` `filterPackets()` had no data-integrity guard.
Legacy rows with empty `hash` or `first_seen` were loaded into the
in-memory store and returned verbatim. The `strOrNil("")` helper then
serialized these as JSON `null`.

## Fix

Added a data-integrity predicate at the top of `filterPackets`'s scan
callback (`cmd/server/store.go:2278`):

```go
if tx.Hash == "" || tx.FirstSeen == "" {
    return false
}
```

This filters bad legacy rows at query time. The write path (ingestor)
already rejects empty hashes, so no new bad data enters.

## TDD Evidence

- **Red commit:** `15774c3` — test `TestIssue871_NoNullHashOrTimestamp`
asserts no packet in API response has null/empty hash or timestamp
- **Green commit:** `281fd6f` — adds the filter guard, test passes

## Testing

- `go test ./...` in `cmd/server` passes (full suite)
- Client-side defensive filter from PR #868 remains as defense-in-depth

---------

Co-authored-by: you <you@example.com>
2026-05-02 20:35:15 -07:00
Kpa-clawbot d43c95a4bb fix(ingestor): warn when TRACE payload decode fails but observation stored (closes #889) (#992)
## Summary

Closes #889.

When a TRACE packet's payload is too short to decode (< 9 bytes),
`decodeTrace` returns an error in `Payload.Error` but the observation is
still stored with empty `Path.Hops`. Previously this was completely
silent — no log, no anomaly flag, no indication the row is degraded.

This fix populates `DecodedPacket.Anomaly` with the decode error message
(e.g., `"TRACE payload decode failed: too short"`) so operators and
downstream consumers can identify degraded observations.

## TDD Commit History

1. **Red commit** `04e0165` — failing test asserting `Anomaly` is set
when TRACE payload decode fails
2. **Green commit** `d3e72d1` — 3-line fix in `decoder.go` lines 601-603:
check `payload.Error != ""` for TRACE packets and set anomaly

## What Changed

`cmd/ingestor/decoder.go` (lines 601-603): Added a check before the
existing TRACE path-parsing block. If `payload.Error` is non-empty for a
TRACE packet, `anomaly` is set to `"TRACE payload decode failed:
<error>"`.

`cmd/ingestor/decoder_test.go`: Added
`TestDecodeTracePayloadFailSetsAnomaly` — constructs a TRACE packet with
a 4-byte payload (too short), asserts the packet is still returned
(observation stored) and `Anomaly` is populated.
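
A minimal sketch of the check described in this section, assuming simplified struct shapes. The `Payload` and `DecodedPacket` types here are trimmed-down stand-ins, and `flagTraceDecodeFailure` and the `Type` field are hypothetical names rather than the identifiers used in `decoder.go`.

```go
package main

import "fmt"

// Payload and DecodedPacket are simplified, hypothetical shapes;
// the real structs carry more fields.
type Payload struct {
	Type  string
	Error string
}

type DecodedPacket struct {
	Payload Payload
	Hops    []string
	Anomaly string
}

// flagTraceDecodeFailure records a decode failure on the packet instead of
// dropping it, so the stored observation is visibly degraded.
func flagTraceDecodeFailure(pkt *DecodedPacket) {
	if pkt.Payload.Type == "TRACE" && pkt.Payload.Error != "" {
		pkt.Anomaly = "TRACE payload decode failed: " + pkt.Payload.Error
	}
}

func main() {
	pkt := &DecodedPacket{Payload: Payload{Type: "TRACE", Error: "too short"}}
	flagTraceDecodeFailure(pkt)
	fmt.Println(pkt.Anomaly) // TRACE payload decode failed: too short
}
```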

## Verification

- `go build ./...` ✓
- `go test ./...` ✓ (all pass including new test)
- Anti-tautology: reverting the fix causes the new test to fail (the test
reports an error when `pkt.Anomaly == ""`)

---------

Co-authored-by: you <you@example.com>
2026-05-02 20:34:27 -07:00
Kpa-clawbot dd2f044f2b fix: cache RW SQLite connection + dedup DBConfig (closes #921) (#982)
Closes #921

## Summary

Follow-up to #920 (incremental auto-vacuum). Addresses both items from
the adversarial review:

### 1. RW connection caching

Previously, every call to `openRW(dbPath)` opened a new SQLite RW
connection and closed it after use. This happened in:
- `runIncrementalVacuum` (~4x/hour)
- `PruneOldPackets`, `PruneOldMetrics`, `RemoveStaleObservers`
- `buildAndPersistEdges`, `PruneNeighborEdges`
- All neighbor persist operations

Now a single `*sql.DB` handle (with `SetMaxOpenConns(1)`) is cached
process-wide via `cachedRW(dbPath)`. The underlying connection pool
manages serialization. The original `openRW()` function is retained for
one-shot test usage.
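
The caching pattern can be sketched as a mutex-guarded map keyed by database path. This is an illustration under assumptions, not the project's `cachedRW`: the `handle` type stands in for `*sql.DB` so the example runs without a SQLite driver, and `rwCache`, `newRWCache`, and `get` are hypothetical names. In real code the opener would presumably call `sql.Open` and then `SetMaxOpenConns(1)` on the result.

```go
package main

import (
	"fmt"
	"sync"
)

// handle stands in for *sql.DB so the sketch needs no database driver.
type handle struct{ path string }

// rwCache keeps one handle per database path for the life of the process.
type rwCache struct {
	mu      sync.Mutex
	handles map[string]*handle
	open    func(path string) (*handle, error)
	opens   int // how many times the opener actually ran
}

func newRWCache(open func(string) (*handle, error)) *rwCache {
	return &rwCache{handles: map[string]*handle{}, open: open}
}

// get returns the cached handle for path, opening it on first use only.
func (c *rwCache) get(path string) (*handle, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if h, ok := c.handles[path]; ok {
		return h, nil
	}
	h, err := c.open(path)
	if err != nil {
		return nil, err // failures are not cached, so the next call retries
	}
	c.opens++
	c.handles[path] = h
	return h, nil
}

func main() {
	cache := newRWCache(func(p string) (*handle, error) { return &handle{path: p}, nil })
	for i := 0; i < 100; i++ {
		if _, err := cache.get("/tmp/example.db"); err != nil {
			panic(err)
		}
	}
	fmt.Println(cache.opens, len(cache.handles)) // 1 1
}
```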

### 2. DBConfig dedup

`DBConfig` was defined identically in both `cmd/server/config.go` and
`cmd/ingestor/config.go`. Extracted to `internal/dbconfig/` as a shared
package; both binaries now use a type alias (`type DBConfig =
dbconfig.DBConfig`).

## Tests added

| Test | File |
|------|------|
| `TestCachedRW_ReturnsSameHandle` | `cmd/server/rw_cache_test.go` |
| `TestCachedRW_100Calls_SingleConnection` | `cmd/server/rw_cache_test.go` |
| `TestGetIncrementalVacuumPages_Default` | `internal/dbconfig/dbconfig_test.go` |
| `TestGetIncrementalVacuumPages_Configured` | `internal/dbconfig/dbconfig_test.go` |

## Verification

```
ok  github.com/corescope/server    20.069s
ok  github.com/corescope/ingestor  47.117s
ok  github.com/meshcore-analyzer/dbconfig  0.003s
```

Both binaries build cleanly. 100 sequential `cachedRW()` calls return
the same handle with exactly 1 entry in the cache map.

---------

Co-authored-by: you <you@example.com>
2026-05-02 20:15:30 -07:00
Kpa-clawbot 58484ad924 feat(ingestor): backfill observations.path_json from raw_hex (closes #888) (#983)
## Summary

Adds an idempotent startup migration to the ingestor that backfills
`observations.path_json` from per-observation `raw_hex` (added in #882).

**Approach: Server-side migration (Option B)** — runs automatically at
startup, chunked in batches of 1000, tracked via `_migrations` table.
Chosen over a standalone script because:
1. Follows existing migration pattern (channel_hash, last_packet_at,
etc.)
2. Zero operator action required — just deploy
3. Idempotent — safe to restart mid-migration (uncommitted rows get
picked up next run)

## What it does

- Selects observations where `raw_hex` is populated but `path_json` is
NULL/empty/`[]`
- Excludes TRACE packets (`payload_type = 9`) at the SQL level — their
header bytes are SNR values, not hops
- Decodes hops via `packetpath.DecodePathFromRawHex` (reuses existing
helper)
- Updates `path_json` with the decoded JSON array
- Marks rows with undecoded/empty hops as `'[]'` to prevent infinite
re-scanning
- Records `backfill_path_json_from_raw_hex_v1` in `_migrations` when
complete

## Safety

- **Never overwrites** existing non-empty `path_json` — only fills where
missing
- **Batched** (1000 rows per iteration) — won't OOM on large DBs
- **TRACE-safe** — excluded at query level per
`packetpath.PathBytesAreHops` semantics

## Test

`TestBackfillPathJsonFromRawHex` — creates synthetic observations with:
- Empty path_json + valid raw_hex → verifies backfill populates
correctly
- NULL path_json → verifies backfill populates
- Existing path_json → verifies NO overwrite
- TRACE packet → verifies skip

Anti-tautology: test asserts specific decoded values (`["AABB","CCDD"]`)
from known raw_hex input, not just "something changed."

Closes #888

Co-authored-by: you <you@example.com>
2026-05-02 19:52:43 -07:00
Kpa-clawbot fc57433f27 fix(analytics): merge channel buckets by hash byte; reject rainbow-table mismatches (closes #978) (#980)
## Summary

Closes #978 — analytics channels duplicated by encrypted/decrypted split
+ rainbow-table collisions.

## Root cause

Two distinct bugs in `computeAnalyticsChannels` (`cmd/server/store.go`):

1. **Encrypted/decrypted split**: The grouping key included the decoded
channel name (`hash + "_" + channel`), so packets from observers that
could decrypt a channel created a separate bucket from packets where
decryption failed. Same physical channel, two entries.

2. **Rainbow-table collisions**: Some observers' lookup tables map hash
bytes to wrong channel names. E.g., hash `72` incorrectly claimed to be
`#wardriving` (real hash is `129`). This created ghost 1-message
entries.

## Fix

1. **Always group by hash byte alone** (drop `_channel` suffix from
`chKey`). When any packet decrypts successfully, upgrade the bucket's
display name from placeholder (`chN`) to the real name
(first-decrypter-wins for stability).

2. **Validate channel names** against the firmware hash invariant:
`SHA256(SHA256("#name")[:16])[0] == channelHash`. Mismatches are treated
as encrypted (placeholder name, no trust in decoded channel). Guard is
in the analytics handler (not the ingestor) to avoid breaking other
surfaces that use the decoded field for display.
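
The hash invariant in item 2 can be sketched with the standard `crypto/sha256` package. The helper names and signatures below (`channelHashByte`, `channelNameMatchesHash`, `displayName`) are assumptions for illustration; only the formula and the `chN` placeholder convention come from the description above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// channelHashByte derives the one-byte channel hash from a channel name
// (including its leading '#') as SHA256(SHA256(name)[:16])[0].
func channelHashByte(name string) byte {
	inner := sha256.Sum256([]byte(name))
	outer := sha256.Sum256(inner[:16])
	return outer[0]
}

// channelNameMatchesHash reports whether a decoded name is consistent
// with the hash byte carried by the packet.
func channelNameMatchesHash(name string, hash byte) bool {
	return channelHashByte(name) == hash
}

// displayName trusts the decoded name only when it passes validation;
// otherwise it falls back to the "chN" placeholder.
func displayName(decoded string, hash byte) string {
	if decoded != "" && channelNameMatchesHash(decoded, hash) {
		return decoded
	}
	return fmt.Sprintf("ch%d", hash)
}

func main() {
	h := channelHashByte("#wardriving")
	fmt.Println(displayName("#wardriving", h))   // name accepted
	fmt.Println(displayName("#wardriving", h+1)) // mismatch: placeholder
	fmt.Println(displayName("", 72))             // never decrypted: ch72
}
```

Keeping the validation in a small pure function makes the mismatch case easy to unit-test without a database fixture.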

## Verification (e2e-fixture.db)

| Metric | BEFORE | AFTER |
|--------|--------|-------|
| Total channels | 22 | 19 |
| Duplicate hash bytes | 3 (hashes 217, 202, 17) | 0 |

## Tests added

- `TestComputeAnalyticsChannels_MergesEncryptedAndDecrypted` — same
hash, mixed encrypted/decrypted → ONE bucket
- `TestComputeAnalyticsChannels_RejectsRainbowTableMismatch` — hash 72
claimed as `#wardriving` (real=129) → rejected, stays `ch72`
- `TestChannelNameMatchesHash` — unit test for hash validation helper
- `TestIsPlaceholderName` — unit test for placeholder detection

Anti-tautology gate: both main tests fail when their respective fix
lines are reverted.

Co-authored-by: you <you@example.com>
2026-05-02 16:05:56 -07:00
Kpa-clawbot 5aa8f795cd feat(ingestor): per-source MQTT connect timeout (#931) (#977)
## Summary

Per-source MQTT connect timeout, correctly targeting the `WaitTimeout`
startup gate (#931).

## What changed

- Added `connectTimeoutSec` field to `MQTTSource` struct (per-source,
not global) — `config.go:24`
- Added `ConnectTimeoutOrDefault()` helper returning configured value or
30 (default from #926) — `config.go:29`
- Replaced hardcoded `WaitTimeout(30 * time.Second)` with
`WaitTimeout(time.Duration(connectTimeout) * time.Second)` —
`main.go:173`
- Updated `config.example.json` with field at source level
- Unit tests for default (30) and custom values
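
A minimal sketch of the per-source timeout helper, assuming a trimmed-down `MQTTSource` struct. The `Name` field and the JSON tags are assumptions, and treating any non-positive value as "unset" is a guess at the intended behavior rather than something stated above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// MQTTSource is reduced to the fields this sketch needs.
type MQTTSource struct {
	Name              string `json:"name"`
	ConnectTimeoutSec int    `json:"connectTimeoutSec"`
}

// ConnectTimeoutOrDefault returns the configured timeout in seconds,
// or 30 when the field is unset (zero) or not positive.
func (s MQTTSource) ConnectTimeoutOrDefault() int {
	if s.ConnectTimeoutSec > 0 {
		return s.ConnectTimeoutSec
	}
	return 30
}

func main() {
	raw := `[{"name":"fast","connectTimeoutSec":5},{"name":"default"}]`
	var sources []MQTTSource
	if err := json.Unmarshal([]byte(raw), &sources); err != nil {
		panic(err)
	}
	for _, s := range sources {
		// This duration is what would be passed to the startup wait.
		wait := time.Duration(s.ConnectTimeoutOrDefault()) * time.Second
		fmt.Printf("MQTT [%s] connect timeout: %s\n", s.Name, wait)
	}
}
```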

## Why this supersedes #976

PR #976 made paho's `SetConnectTimeout` (per-TCP-dial, was 10s)
configurable via a **global** `mqttConnectTimeoutSeconds` field. Issue
#931 explicitly references the **30s timeout** — which is
`WaitTimeout(30s)`, the startup gate from #926. It also requests
**per-source** config, not global.

This PR targets the correct timeout at the correct granularity.

## Live verification (Rule 18)

Two sources pointed at unreachable brokers:
- `fast` (`connectTimeoutSec: 5`): timed out in 5s 
- `default` (unset): timed out in 30s 

```
19:00:35 MQTT [fast] connect timeout: 5s
19:00:40 MQTT [fast] initial connection timed out — retrying in background
19:00:40 MQTT [default] connect timeout: 30s
19:01:10 MQTT [default] initial connection timed out — retrying in background
```

Closes #931
Supersedes #976

Co-authored-by: you <you@example.com>
2026-05-02 12:08:25 -07:00
Kpa-clawbot 1e7c187521 fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics) [v2] (#974)
## fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics)

Supersedes #970. Rebased onto current master to resolve merge conflicts.

### Changes (same as #970)
- **BL1 (goroutine leak):** Call `client.Disconnect(0)` on the error
path after `Connect()` fails with `ConnectRetry=true`, preventing Paho's
internal retry goroutines from leaking.
- **BL2 (guard semantics):** Use `connectedCount == 0` instead of
`len(clients) == 0` to detect zero-connected state, since timed-out
clients are appended to the slice.
- **Tests:** `TestBL1_GoroutineLeakOnHardFailure` and
`TestBL2_ZeroConnectedFatals` covering both blockers.

### Context
- Fixes blockers raised in review of #926
- Related: #910 (original hang bug)

Co-authored-by: you <you@example.com>
2026-05-02 12:05:02 -07:00
Kpa-clawbot 4b8d8143f4 feat(server): explicit CORS policy with configurable origin allowlist (#883) (#971)
## Summary

Adds explicit CORS policy support to the CoreScope API server, closing
#883.

### Problem

The API relied on browser same-origin defaults with no way for operators
to configure cross-origin access. Operators running dashboards or
third-party frontends on different origins had no supported way to make
API calls.

### Solution

**New config option:** `corsAllowedOrigins` (string array, default `[]`)

**Middleware behavior:**
| Config | Behavior |
|--------|----------|
| `[]` (default) | No `Access-Control-*` headers added; browsers enforce same-origin. **Preserves current behavior.** |
| `["https://dashboard.example.com"]` | Echoes matching `Origin`, sets `Allow-Methods`/`Allow-Headers` |
| `["*"]` | Sets `Access-Control-Allow-Origin: *` (explicit opt-in only) |

**Headers set when origin matches:**
- `Access-Control-Allow-Origin: <origin>` (or `*`)
- `Access-Control-Allow-Methods: GET, POST, OPTIONS`
- `Access-Control-Allow-Headers: Content-Type, X-API-Key`
- `Vary: Origin` (non-wildcard only)

**Preflight handling:** `OPTIONS` → `204 No Content` with CORS headers
(or `403` if origin not in allowlist).
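
The behavior above can be sketched as a standard `net/http` middleware. This is an illustration, not the contents of `cmd/server/cors.go`: `corsMiddleware` and `probe` are hypothetical names, and letting `OPTIONS` pass through untouched when the allowlist is empty or no `Origin` header is present is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// corsMiddleware applies an origin allowlist in front of next.
func corsMiddleware(allowed []string, next http.Handler) http.Handler {
	wildcard := false
	set := map[string]bool{}
	for _, o := range allowed {
		if o == "*" {
			wildcard = true
		}
		set[o] = true
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		origin := r.Header.Get("Origin")
		// Empty allowlist or no Origin header: behave exactly as before.
		if len(allowed) == 0 || origin == "" {
			next.ServeHTTP(w, r)
			return
		}
		ok := wildcard || set[origin]
		if ok {
			h := w.Header()
			if wildcard {
				h.Set("Access-Control-Allow-Origin", "*")
			} else {
				h.Set("Access-Control-Allow-Origin", origin)
				h.Add("Vary", "Origin") // responses differ per origin
			}
			h.Set("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
			h.Set("Access-Control-Allow-Headers", "Content-Type, X-API-Key")
		}
		if r.Method == http.MethodOptions {
			if ok {
				w.WriteHeader(http.StatusNoContent)
			} else {
				w.WriteHeader(http.StatusForbidden)
			}
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one in-memory request through the middleware and returns
// the status code and response headers.
func probe(allowed []string, method, origin string) (int, http.Header) {
	next := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(method, "/api/health", nil)
	if origin != "" {
		req.Header.Set("Origin", origin)
	}
	rec := httptest.NewRecorder()
	corsMiddleware(allowed, next).ServeHTTP(rec, req)
	return rec.Code, rec.Header()
}

func main() {
	code, h := probe([]string{"https://good.example"}, http.MethodOptions, "https://good.example")
	fmt.Println(code, h.Get("Access-Control-Allow-Origin"))
}
```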

### Config example

```json
{
  "corsAllowedOrigins": ["https://dashboard.example.com", "https://monitor.internal"]
}
```

### Files changed

| File | Change |
|------|--------|
| `cmd/server/cors.go` | New CORS middleware |
| `cmd/server/cors_test.go` | 7 unit tests covering all branches |
| `cmd/server/config.go` | `CORSAllowedOrigins` field |
| `cmd/server/routes.go` | Wire middleware before all routes |

### Testing

**Unit tests (7):**
- Default config → no CORS headers
- Allowlist match → headers present with `Vary: Origin`
- Allowlist miss → no CORS headers
- Preflight allowed → 204 with headers
- Preflight rejected → 403
- Wildcard → `*` without `Vary`
- No `Origin` header → pass-through

**Live verification (Rule 18):**

```
# Default (empty corsAllowedOrigins):
$ curl -I -H "Origin: https://evil.example" localhost:19883/api/health
HTTP/1.1 200 OK
# No Access-Control-* headers ✓

# With corsAllowedOrigins: ["https://good.example"]:
$ curl -I -H "Origin: https://good.example" localhost:19884/api/health
Access-Control-Allow-Origin: https://good.example
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-API-Key
Vary: Origin ✓

$ curl -I -H "Origin: https://evil.example" localhost:19884/api/health
# No Access-Control-* headers ✓

$ curl -I -X OPTIONS -H "Origin: https://good.example" localhost:19884/api/health
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://good.example ✓
```

Closes #883

Co-authored-by: you <you@example.com>
2026-05-02 12:04:37 -07:00
Kpa-clawbot 3364eed303 feat: separate "Last Status Update" from "Last Packet Observation" for observers (v3 rebase) (#969)
Rebased version of #968 (which was itself a rebase of #905) — resolves
merge conflict with #906 (clock-skew UI) that landed on master.

## Conflict resolution

**`public/observers.js`** — master (#906) added "Clock Offset" column to
observer table; #968 split "Last Seen" into "Last Status" + "Last
Packet" columns. Combined both: the table now has Status | Name | Region
| Last Status | Last Packet | Packets | Packets/Hour | Clock Offset |
Uptime.

## What this PR adds (unchanged from #968/#905)

- `last_packet_at` column in observers DB table
- Separate "Last Status Update" and "Last Packet Observation" display in
observers list and detail page
- Server-side migration to add the column automatically
- Backfill heuristic for existing data
- Tests for ingestor and server

## Verification

- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- Frontend tests pass (`test-packets.js`, `test-hash-color.js`)
- Built server, hit `/api/observers` — `last_packet_at` field present in
JSON
- Observer table header has all 9 columns including both Last Packet and
Clock Offset

## Prior PRs

- #905 — original (conflicts with master)
- #968 — first rebase (conflicts after #906 landed)
- This PR — second rebase, resolves #906 conflict

Supersedes #968. Closes #905.

---------

Co-authored-by: you <you@example.com>
2026-05-02 12:03:42 -07:00
efiten d65122491e fix(ingestor): unblock startup when one of multiple MQTT sources is unreachable (#926)
## Summary

- With `ConnectRetry=true`, paho's `token.Wait()` only returns on
success — it blocks forever for unreachable brokers, stalling the entire
startup loop before any other source connects
- Switches to `token.WaitTimeout(30s)`: on timeout the client is still
tracked so `ConnectRetry` keeps retrying in background; `OnConnect`
fires and subscribes when it eventually connects
- Adds `TestMQTTConnectRetryTimeoutDoesNotBlock` to confirm
`WaitTimeout` returns within deadline for unreachable brokers
(regression guard for this exact failure mode)

Fixes #910

## Test plan

- [x] Two MQTT sources configured, one unreachable: ingestor reaches
`Running` status and ingests from the reachable source immediately on
startup
- [x] Unreachable source logs `initial connection timed out — retrying
in background` and reconnects automatically when the broker comes back
- [x] Single source, reachable: behaviour unchanged (`Running — 1 MQTT
source(s) connected`)
- [x] Single source, unreachable: `Running — 0 MQTT source(s) connected,
1 retrying in background`; ingestion starts once broker is available
- [x] `go test ./...` passes (excluding pre-existing
`TestOpenStoreInvalidPath` failure on master)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:31:51 -07:00
efiten 40c3aa13f9 fix(paths): exclude false-positive paths from short-prefix collisions (#930)
Fixes #929

## Summary

- `handleNodePaths` pulls candidates from `byPathHop` using 2-char and
4-char prefix keys (e.g. `"7a"` for a node using 1-byte adverts)
- When two nodes share the same short prefix, paths through the *other*
node are included as candidates
- The `resolved_path` post-filter covers decoded packets but falls
through conservatively (`inIndex = true`) when `resolved_path` is NULL,
letting false positives reach the response

**Fix:** during the aggregation phase (which already calls `resolveHop`
per hop), add a `containsTarget` check. If every hop resolves to a
different node's pubkey, skip the path. Packets confirmed via the
full-pubkey index key or via SQL bypass the check. Unresolvable hops are
kept conservatively.
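
The `containsTarget` idea can be sketched as a small predicate over a path's hops. The names `keepPath` and the resolver signature are hypothetical, the real `resolveHop` presumably takes more context, and keeping a path with zero hops is an assumption made here for conservatism.

```go
package main

import "fmt"

// keepPath reports whether a candidate path should stay in the result for
// target. resolve returns the pubkey a hop resolves to, or ok=false when
// the hop cannot be resolved.
func keepPath(hops []string, target string, resolve func(string) (string, bool)) bool {
	for _, hop := range hops {
		pk, ok := resolve(hop)
		if !ok || pk == target {
			// Unresolvable hops are kept conservatively; a hop that
			// resolves to the target confirms the path.
			return true
		}
	}
	// Every hop resolved to some other node: a short-prefix false positive.
	// A path with no hops at all is kept (assumption).
	return len(hops) == 0
}

func main() {
	const target = "7a11aa" // node A; node B shares the "7a" prefix
	known := map[string]string{"7a": "7a22bb", "3c": "3c00dd"}
	resolve := func(hop string) (string, bool) {
		pk, ok := known[hop]
		return pk, ok
	}
	fmt.Println(keepPath([]string{"7a", "3c"}, target, resolve)) // false
	fmt.Println(keepPath([]string{"7a", "ff"}, target, resolve)) // true
}
```

Paths confirmed through the full-pubkey index key or via SQL would bypass this predicate entirely, as described above.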

## Test plan
- [x] `TestHandleNodePaths_PrefixCollisionExclusion`: two nodes sharing
`"7a"` prefix; verifies the path with no `resolved_path` (false
positive) is excluded and the SQL-confirmed path (true positive) is
included
- [x] Full test suite: `go test github.com/corescope/server` — all pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:15:25 -07:00