Commit Graph

224 Commits

Author SHA1 Message Date
Kpa-clawbot 74dffa2fb7 feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123)
## Summary

Implements per-component disk I/O + write source metrics on the Perf
page so operators can self-diagnose write-volume anomalies (cf. the
BackfillPathJSON loop debugged in #1119) without SSHing in to run
iotop/fatrace.

Partial fix for #1120

## What's done (4/6 ACs)
-  `/api/perf/io` — server-process `/proc/self/io` delta rates
(read/write bytes per sec, syscalls)
-  `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate
-  `/api/perf/write-sources` — per-component counters from ingestor
(tx/obs/upserts/backfill_*)
-  Frontend Perf page — three new sections with anomaly thresholds +
per-second rate columns

## What's NOT done (deferred to follow-up)
-  `cancelledWriteBytesPerSec` field — issue #1120 lists this under
server-process I/O ("writes the kernel discarded — interesting signal");
not exposed in this PR
-  Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and
server"; only server-process I/O lands here. Adding ingestor I/O
requires either a unix socket back to the server, or surfacing the
ingestor pid through the stats file. Doable without changing the
existing API shape.
-  Adaptive baselining — anomaly thresholds remain static (10×, 100 MB,
90%); steady-state baselining can come once we have enough deployed
Perf-page telemetry

Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than
"Fixes #1120" so the issue stays open until the remaining ACs land.

## Backend

**Server (`cmd/server/perf_io.go`)**
- `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate
`{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since
last call (in-memory tracker, no allocation per sample).
- `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount,
pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the
in-process row cache (closest available signal under the modernc sqlite
driver).
- `GET /api/perf/write-sources` — reads the ingestor's stats JSON file
and returns a flat `{sources: {...}, sampleAt}` payload.

**Ingestor (`cmd/ingestor/`)**
- `DBStats` gains `WALCommits atomic.Int64` (incremented on every
successful `tx.Commit()` and on every auto-commit `InsertTransmission`
write) and `BackfillUpdates sync.Map` keyed by backfill name with
`IncBackfill(name)` / `SnapshotBackfills()` helpers.
- `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]`
per row write — the BackfillPathJSON-style infinite loop becomes
immediately visible at `backfill_path_json` in the Write Sources table.
- New `StartStatsFileWriter` publishes a JSON snapshot to
`/tmp/corescope-ingestor-stats.json` (override via
`CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The
tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode
`0o600` so a pre-planted symlink in a world-writable `/tmp` cannot
redirect the write to an arbitrary file.

## Frontend (`public/perf.js`)

Three new sections on the Perf page, all auto-refreshed via the existing
5s interval:

- **Disk I/O (server process)** — read/write rates (formatted
B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️.
- **Write Sources** — sorted table of per-component counters with a
per-second rate column derived from snapshot deltas. Backfill rows show
⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the
backfill's per-second rate exceeds 10× the live tx rate. This avoids the
spurious startup alarm where comparing two cumulative counters was a
tautology.
- **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count,
page size, cache hit rate (⚠️ when <90%).

## Tests

- **Backend** (`cmd/server/perf_io_test.go`) —
`TestPerfIOEndpoint_ReturnsValidJSON`,
`TestPerfSqliteEndpoint_ReturnsValidJSON`,
`TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new
endpoints. Skips the `/proc/self/io` non-zero-rate assertion when
`/proc` is unavailable.
- **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js`
with stubbed `fetch`, asserts the three new sections render with their
headings + values.

E2E assertion added: test-perf-disk-io-1120.js:91

## TDD

1. Red commit (`21abd22`) — added the three handlers as no-op stubs
returning empty values; tests fail on assertion mismatches (non-zero
rate, `pageSize > 0`, headings present).
2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser,
PRAGMA queries, ingestor stats writer, and Perf page rendering.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>
2026-05-05 17:56:56 -07:00
Kpa-clawbot 5fa3b56ccb fix(#662): GetRepeaterRelayInfo also looks up byPathHop by 1-byte prefix (#1086)
## Summary

Partial fix for #662.

`GetRepeaterRelayInfo` was reporting "never observed as relay hop" /
`RelayCount24h=0` for nodes that clearly DO have packets passing through
them — visible on the same node detail page in the "Paths seen through
node" view.

## Root cause

The `byPathHop` index is keyed by **both**:
- full resolved pubkey (populated when neighbor-affinity resolution
succeeds), and
- raw 1-byte hop prefix from the wire (e.g. `"a3"`)

`GetRepeaterRelayInfo` only looked up the full-pubkey key. Many ingested
non-advert packets only carry the raw 1-byte hop — so any repeater whose
path appearances are all raw-hop entries returned 0, even though the
path-listing endpoint (which prefix-matches) renders them.

Example node: an `a3…` repeater on staging shows dozens of paths through
it in the UI, but the relay-info function returns 0.

## Fix

Look up under both keys (full pubkey + 1-byte prefix) and de-dup by tx
ID before counting.
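The two-bucket lookup with tx-ID de-dup can be sketched like this (illustrative types and names; the real index values carry full transmission records):

```go
package main

import "fmt"

// tx is a minimal stand-in for an indexed transmission.
type tx struct{ id string }

// relayAppearances reads both byPathHop buckets for a node, the full
// resolved pubkey and the raw 1-byte hop prefix (two hex chars), and
// de-duplicates by transmission ID before counting.
func relayAppearances(byPathHop map[string][]tx, pubkey string) int {
	seen := map[string]bool{}
	keys := []string{pubkey}
	if len(pubkey) >= 2 {
		keys = append(keys, pubkey[:2])
	}
	for _, k := range keys {
		for _, t := range byPathHop[k] {
			seen[t.id] = true
		}
	}
	return len(seen)
}

func main() {
	idx := map[string][]tx{
		"a3f0cafe": {{"t1"}, {"t2"}}, // full resolved-pubkey bucket
		"a3":       {{"t2"}, {"t3"}}, // raw 1-byte hop bucket; t2 is shared
	}
	fmt.Println(relayAppearances(idx, "a3f0cafe")) // 3
}
```

Without the de-dup, a transmission indexed under both keys would be counted twice.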

## Trade-off

The 1-byte prefix CAN over-count when multiple nodes share a first byte.
This trades a possible over-count for clearly false zeros. The richer
disambiguation done by the path-listing endpoint (resolved-path SQL
post-filter via `confirmResolvedPathContains`) is out of scope for this
partial fix — adding it here would mean disk I/O inside what is
currently a pure in-memory lookup. Worth a follow-up if over-counting
shows up in practice.

## TDD

- Red commit (`test: failing test for relay-info prefix-hop mismatch`):
adds `TestRepeaterRelayActivity_PrefixHop` that builds a non-advert
packet with `PathJSON: ["a3"]`, indexes it via `addTxToPathHopIndex`,
then asserts `RelayCount24h>=1` for the full pubkey starting with `a3…`.
Fails on the assertion (got 0), not a build error.
- Green commit (`fix: GetRepeaterRelayInfo also looks up byPathHop by
1-byte prefix`): the lookup change. All five
`TestRepeaterRelayActivity_*` tests pass.

## Scope

This is a **partial** fix — addresses the read-side prefix mismatch
only. Issue #662 is a 4-axis epic (also covers ingest indexing
consistency, UI surfacing, and schema). Leaving #662 open.

---------

Co-authored-by: corescope-bot <bot@corescope>
Co-authored-by: clawbot <clawbot@users.noreply.github.com>
2026-05-05 02:33:27 -07:00
Kpa-clawbot 136e1d23c8 feat(#730): foreign-advert detection — flag instead of silent drop (#1084)
## Summary

**Partial fix for #730 (M1 only — M2 frontend and M3 alerting
deferred).**

Today the ingestor **silently drops** ADVERTs whose GPS lies outside the
configured `geo_filter` polygon. That's the wrong default for an
analytics tool — operators get zero visibility into bridged or leaked
meshes.

This PR makes the new default **flag, don't drop**: foreign adverts are
stored, the node row is tagged `foreign_advert=1`, and the API surfaces
`"foreign": true` so dashboards / map overlays can be built on top.

## Behavior

| Mode | What happens to an ADVERT outside `geo_filter` |
|---|---|
| (default) flag | Stored, marked `foreign_advert=1`, exposed via API |
| drop (legacy) | Silently dropped (preserves old behavior for ops who
want it) |
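The mode dispatch in the table reduces to a small decision, sketched here with a hypothetical helper name:

```go
package main

import "fmt"

// foreignAdvertAction sketches the geofilter.action dispatch: "drop"
// preserves the legacy silent-drop, while anything else (the "flag"
// default) stores the advert and tags the node foreign_advert=1.
func foreignAdvertAction(action string) (store, markForeign bool) {
	if action == "drop" {
		return false, false
	}
	return true, true
}

func main() {
	s, f := foreignAdvertAction("flag")
	fmt.Println(s, f) // true true
	s, f = foreignAdvertAction("drop")
	fmt.Println(s, f) // false false
}
```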

## What's done (M1 — Backend)
- ingestor stores foreign adverts instead of dropping
- `nodes.foreign_advert` column added (migration)
- `/api/nodes` and `/api/nodes/{pk}` expose `foreign: true` field
- Config: `geofilter.action: "flag"|"drop"` (default `flag`)
- Tests + config docs

## What's NOT done (deferred to M2 + M3)

- **M2 — Frontend:** Map overlay showing foreign adverts as distinct
markers, foreign-advert filter on packets/nodes pages, dedicated
foreign-advert dashboard
- **M3 — Alerting:** Time-series detection of bridging events, alert
when foreign advert rate spikes, identify bridge entry-point nodes

Issue #730 remains open for M2 and M3.

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-05 01:58:52 -07:00
Kpa-clawbot 3ab404b545 feat(node-battery): voltage trend chart + /api/nodes/{pubkey}/battery (#663) (#1082)
## Summary

Closes #663 (Phase 2 + 3 partial — time-series tracking + thresholds for
nodes that are also observers).

Adds a per-node battery voltage trend chart and
`/api/nodes/{pubkey}/battery` endpoint, sourced from the existing
`observer_metrics.battery_mv` samples populated by observer status
messages. No new ingest or schema changes — purely surfaces data we were
already collecting.

## Scope (TDD red→green)

**RED commit:** test(node-battery) — DB query, endpoint shape
(200/404/no-data), and config getters all asserted.
**GREEN commit:** feat(node-battery) — implementation only.

## Changes

### Backend
- `cmd/server/node_battery.go` (new):
- `DB.GetNodeBatteryHistory(pubkey, since)` — pulls `(timestamp,
battery_mv)` rows from `observer_metrics WHERE LOWER(observer_id) =
LOWER(public_key) AND battery_mv IS NOT NULL`. Case-insensitive join
tolerates historical pubkey casing variation (observers persist
uppercase, nodes lowercase in this DB).
- `Server.handleNodeBattery` — `GET /api/nodes/{pubkey}/battery?days=N`
(default 7, max 365). Returns `{public_key, days, samples[], latest_mv,
latest_ts, status, thresholds}`.
- `Config.LowBatteryMv()` / `CriticalBatteryMv()` — defaults 3300 / 3000
mV.
- `cmd/server/config.go` — `BatteryThresholds *BatteryThresholdsConfig`
field.
- `cmd/server/routes.go` — route registration alongside existing
`/health`, `/analytics`.

### Frontend
- `public/node-analytics.js` — new "Battery Voltage" chart card with
status badge (🔋 OK / ⚠️ Low / 🪫 Critical / No data). Renders dashed
threshold lines at `lowMv` and `criticalMv`. Empty-state message when no
samples in window.

### Config
- `config.example.json` — `batteryThresholds: { lowMv: 3300, criticalMv:
3000 }` with `_comment` per Config Documentation Rule.

## Status semantics

| latest_mv             | status     |
|-----------------------|------------|
| no samples in window  | `unknown`  |
| `>= lowMv`            | `ok`       |
| `< lowMv`, `>= critMv`| `low`      |
| `< criticalMv`        | `critical` |
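The status table above maps directly onto a switch; a sketch of that mapping, with the documented defaults (3300 / 3000 mV) passed in as parameters:

```go
package main

import "fmt"

// batteryStatus derives the status field from the latest sample per the
// documented semantics: no samples → unknown, then ok/low/critical by
// threshold band.
func batteryStatus(latestMv int, hasSamples bool, lowMv, criticalMv int) string {
	switch {
	case !hasSamples:
		return "unknown"
	case latestMv >= lowMv:
		return "ok"
	case latestMv >= criticalMv:
		return "low"
	default:
		return "critical"
	}
}

func main() {
	fmt.Println(batteryStatus(3400, true, 3300, 3000)) // ok
	fmt.Println(batteryStatus(3100, true, 3300, 3000)) // low
	fmt.Println(batteryStatus(2900, true, 3300, 3000)) // critical
	fmt.Println(batteryStatus(0, false, 3300, 3000))   // unknown
}
```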

## What this PR does NOT do (deferred)

The issue's full Phase 1 (writing decoded sensor advert telemetry into
`nodes.battery_mv` / `temperature_c` from server-side decoder) and Phase
4 (firmware/active polling for repeaters without observers) are out of
scope here. This PR delivers the requested Phase 2/3 surfacing for the
data path that already lands rows: `observer_metrics`. Repeaters that
are also observers (i.e. publish status to MQTT) will get a voltage
trend immediately; pure passive nodes won't until Phase 1 lands.

## Tests

- `TestGetNodeBatteryHistory_FromObserverMetrics` — case-insensitive
join, NULL skipping, ordering.
- `TestNodeBatteryEndpoint` — full happy path with thresholds + status.
- `TestNodeBatteryEndpoint_NoData` — 200 + status=unknown.
- `TestNodeBatteryEndpoint_404` — unknown node.
- `TestBatteryThresholds_ConfigOverride` — config getters + defaults.

`cd cmd/server && go test ./...` — green.

## Performance

Endpoint is per-pubkey (called once on analytics page open), indexed by
`(observer_id, timestamp)` PK on `observer_metrics`. No hot-path impact.

---------

Co-authored-by: bot <bot@corescope>
2026-05-05 01:41:00 -07:00
Kpa-clawbot f33801ecb4 feat(repeater): usefulness score — traffic axis (#672) (#1079)
## Summary

Implements the **Traffic axis** of the repeater usefulness score (#672).
Does NOT close #672 — Bridge, Coverage, and Redundancy axes are deferred
to follow-up PRs.

Adds `usefulness_score` (0..1) to repeater/room node API responses
representing what fraction of non-advert traffic passes through this
repeater as a relay hop.

## Why traffic-axis-first

The issue proposes a 4-axis composite (Bridge, Coverage, Traffic,
Redundancy). Bridge/Coverage/Redundancy require betweenness centrality
and neighbor graph infrastructure (#773 Neighbor Graph V2). Traffic axis
can ship independently using existing path-hop data.

## Remaining work for #672

- Bridge axis (betweenness centrality — depends on #773)
- Coverage axis (observer reach comparison)
- Redundancy axis (node-removal simulation — depends on #687)
- Composite score combining all 4 axes

Partial fix for #672.

---------

Co-authored-by: meshcore-bot <bot@meshcore.local>
2026-05-05 01:34:08 -07:00
Kpa-clawbot d05e468598 feat(memlimit): GOMEMLIMIT support, derive from packetStore.maxMemoryMB (#836) (#1077)
## Summary

Implements **part 1** of #836 — `GOMEMLIMIT` support so the Go runtime
self-throttles GC under cgroup memory pressure instead of getting
SIGKILLed.

(Parts 2 & 3 — bounded cold-load batching + README ops docs — land in
follow-up PRs.)

## Behavior

On startup `cmd/server/main.go` now calls `applyMemoryLimit(maxMemoryMB,
envSet)`:

| Condition | Action | Log |
|---|---|---|
| `GOMEMLIMIT` env set | Honor the runtime's parse, do nothing |
`[memlimit] using GOMEMLIMIT from environment (...)` |
| env unset, `packetStore.maxMemoryMB > 0` | `debug.SetMemoryLimit(maxMB
* 1.5 MiB)` | `[memlimit] derived from packetStore.maxMemoryMB=512 → 768
MiB (1.5x headroom)` |
| env unset, `maxMemoryMB == 0` | No-op | `[memlimit] no soft memory
limit set ... recommend setting one to avoid container OOM-kill` |

The 1.5x headroom covers Go's NextGC trigger at ~2× live heap (per #836
heap profile: 680 MB live → 1.38 GB NextGC).
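The decision table condenses to a short function. This sketch returns the limit it would set (or -1 for "no limit") instead of logging; the real `applyMemoryLimit` also emits the `[memlimit]` log lines shown above.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyMemoryLimitSketch mirrors the decision table: env wins, otherwise
// derive 1.5x headroom from packetStore.maxMemoryMB, otherwise no-op.
func applyMemoryLimitSketch(envSet bool, maxMemoryMB int) int64 {
	switch {
	case envSet:
		return -1 // GOMEMLIMIT env set; the runtime already parsed it
	case maxMemoryMB > 0:
		limit := int64(maxMemoryMB) * 3 / 2 << 20 // 1.5x headroom, in bytes
		debug.SetMemoryLimit(limit)
		return limit
	default:
		return -1 // no soft limit configured
	}
}

func main() {
	// 512 MB configured → 768 MiB soft limit (1.5x headroom).
	fmt.Printf("%d MiB\n", applyMemoryLimitSketch(false, 512)>>20) // 768 MiB
}
```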

## Tests (TDD red→green visible in commit history)

- `TestApplyMemoryLimit_FromEnv` — env wins, function does not override
- `TestApplyMemoryLimit_DerivedFromMaxMemoryMB` — verifies bytes
computation + `debug.SetMemoryLimit` actually applied at runtime
- `TestApplyMemoryLimit_None` — no env, no config → reports `"none"`, no
side effect

Red commit: `7de3c62` (assertion failures, builds clean)
Green commit: `454516d`

## Config docs

`config.example.json` `packetStore._comment_gomemlimit` documents
env/derived/override behavior.

## Out of scope

- Cold-load transient bounding (item 2 in #836)
- README container-size table (item 3)
- QA §1.1 rewrite

Closes part 1 of #836.

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-05 01:33:23 -07:00
Kpa-clawbot 45f30fcadc feat(repeater): liveness detection — distinguish actively relaying from advert-only (#662) (#1073)
## Summary

Implements repeater liveness detection per #662 — distinguishes a
repeater that is **actively relaying traffic** from one that is **alive
but idle** (only sending its own adverts).

## Approach

The backend already maintains a `byPathHop` index keyed by lowercase
hop/pubkey for every transmission. Decode-window writes also key it by
**resolved pubkey** for relay hops. We just weren't surfacing it.

`GetRepeaterRelayInfo(pubkey, windowHours)`:
- Reads `byPathHop[pubkey]`.
- Skips packets whose `payload_type == 4` (advert) — a self-advert
proves liveness, not relaying.
- Returns the most recent `FirstSeen` as `lastRelayed`, plus
`relayActive` (within window) and the `windowHours` actually used.

## Three states (per issue)

| State | Indicator | Condition |
|---|---|---|
| 🟢 Relaying | green | `last_relayed` within `relayActiveHours` |
| 🟡 Alive (idle) | yellow | repeater is in the DB but
`relay_active=false` (no recent path-hop appearance, or none ever) |
|  Stale | existing | falls out of the existing `getNodeStatus` logic |

## API

- `GET /api/nodes` — repeater/room rows now include `last_relayed`
(omitted if never observed) and `relay_active`.
- `GET /api/nodes/{pubkey}` — same fields plus `relay_window_hours`.

## Config

New optional field under `healthThresholds`:

```json
"healthThresholds": {
  ...,
  "relayActiveHours": 24
}
```

Default 24h. Documented in `config.example.json`.

## Frontend

Node detail page gains a **Last Relayed** row for repeaters/rooms with
the 🟢/🟡 state badge. Tooltip explains the distinction from "Last Heard".

## TDD

- **Red commit** `4445f91`: `repeater_liveness_test.go` + stub
`GetRepeaterRelayInfo` returning zero. Active and Stale tests fail on
assertion (LastRelayed empty / mismatched). Idle and IgnoresAdverts
already match the desired behavior under the stub. Compiles, runs, fails
on assertions — not on imports.
- **Green commit** `5fcfb57`: Implementation. All four tests pass. Full
`cmd/server` suite green (~22s).

## Performance

`O(N)` over `byPathHop[pubkey]` per call. The index is bounded by store
eviction; a single repeater has at most a few hundred entries on real
data. The `/api/nodes` loop adds one map read + scan per repeater row —
negligible against the existing enrichment work.

## Limitations (per issue body)

1. Observer coverage gaps — if no observer hears a repeater's relay,
it'll show as idle even when actively relaying. This is inherent to
passive observation.
2. Low-traffic networks — a repeater in a quiet area legitimately shows
idle. The 🟡 indicator copy makes that explicit ("alive (idle)").
3. Hash collisions are mitigated by the existing `resolveWithContext`
path before pubkeys land in `byPathHop`.

Fixes #662

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-05 01:17:52 -07:00
Kpa-clawbot 83881e6b71 fix(#688): auto-discover hashtag channels from message text (#1071)
## Summary

Auto-discovers previously-unknown hashtag channels by scanning decoded
channel message text for `#name` mentions and surfacing them via
`GetChannels`.

Workflow (per the issue):
1. New channel message arrives on a known channel
2. Decoded text is scanned for `#hashtag` mentions
3. Any mention that doesn't match an existing channel is surfaced as a
discovered channel (`discovered: true`, `messageCount: 0`)
4. Future traffic on that channel will populate the entry once it has
its own packets

## Changes

- `cmd/server/discovered_channels.go` — new file.
`extractHashtagsFromText` parses `#name` mentions from free text,
deduped, order-preserving. Trailing punctuation is excluded by the
character class.
- `cmd/server/store.go` — `GetChannels` now scans CHAN packet text for
hashtags after building the primary channel map, and appends any unseen
hashtag mentions as discovered entries.
- `cmd/server/discovered_channels_test.go` — new tests covering parser
edge cases (single, multi, dedup, punctuation, none, bare `#`) and
end-to-end discovery via `GetChannels`.
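A sketch of the parser's shape, assuming a word-character class after `#` (the shipped character class may differ in detail):

```go
package main

import (
	"fmt"
	"regexp"
)

// hashtagRe approximates the described character class: word characters
// after '#', so trailing punctuation is naturally excluded and a bare
// '#' never matches.
var hashtagRe = regexp.MustCompile(`#([A-Za-z0-9_]+)`)

// extractHashtagsSketch returns deduped, order-preserving #name mentions.
func extractHashtagsSketch(text string) []string {
	var out []string
	seen := map[string]bool{}
	for _, m := range hashtagRe.FindAllStringSubmatch(text, -1) {
		name := m[1]
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	return out
}

func main() {
	fmt.Println(extractHashtagsSketch("join #mesh and #relay! also #mesh, not bare #"))
	// [mesh relay]
}
```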

## TDD

- Red: `34f1817` — stub returns `nil`, both new tests fail on assertion
(verified).
- Green: `d27b3ed` — real implementation, full `cmd/server` test suite
passes (21.7s).

## Notes

- Discovered channels carry `messageCount: 0` and `lastActivity` set to
the most recent mention's `firstSeen`, so they sort naturally alongside
real channels.
- Names are matched against existing entries by both `#name` and bare
`name` so a channel that already has decoded traffic isn't
double-listed.
- The existing `channelsCache` (15s) covers the new code path; no
separate invalidation needed since the source data (`byPayloadType[5]`)
drives both maps.

Fixes #688

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-05 01:16:57 -07:00
Kpa-clawbot d144764d38 fix(analytics): multiByteCapability missing under region filter → all rows 'unknown' (#1049)
## Bug

`https://meshcore.meshat.se/#/analytics`:

- Unfiltered → 0 adopter rows show "unknown" (correct).
- Region filter `JKG` → 14 rows show "unknown" (wrong — same nodes, all
confirmed when unfiltered).

Multi-byte capability is a property of the NODE, derived from its own
adverts (the full pubkey is in the advert payload, no prefix collision
risk). The observing region should only control which nodes appear in
the analytics list — it must not change a node's cap evidence.

## Root cause

`PacketStore.GetAnalyticsHashSizes(region)` only attached
`result["multiByteCapability"]` when `region == ""`. Under any region
filter the field was absent. The frontend (`public/analytics.js:1011`)
does `data.multiByteCapability || []`, so every adopter row falls
through the merge with no cap status and renders as "unknown".

## Fix

Always populate `multiByteCapability`. When a region filter is active,
source the global adopter hash-size set from a no-region compute pass so
out-of-region observers' adverts still count as evidence.

## TDD

Red commit (`0968137`): adds
`cmd/server/multibyte_region_filter_test.go`, asserts that
`GetAnalyticsHashSizes("JKG")` returns a populated `multiByteCapability`
with Node A as `confirmed`. Fails on the assertion (field missing)
before the fix.

Green commit (`6616730`): always compute capability against the global
advert dataset.

## Files changed

- `cmd/server/store.go` — `GetAnalyticsHashSizes`: drop the `region ==
""` gate, always populate `multiByteCapability`.
- `cmd/server/multibyte_region_filter_test.go` — new red→green test.

## Verification

```
go test ./... -count=1   # all server tests pass (21s)
```

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-05 06:42:58 +00:00
Kpa-clawbot 9f55ef802b fix(#804): attribute analytics by repeater home region, not observer (#1025)
Fixes #804.

## Problem
Analytics filtered region purely by **observer** region: a multi-byte
repeater whose home is PDX would leak into SJC results whenever its
flood
adverts were relayed past an SJC observer. Per-node groupings
(`multiByteNodes`, `distributionByRepeaters`) inherited the same bug.

## Fix

Two new helpers in `cmd/server/store.go`:

- `iataMatchesRegion(iata, regionParam)` — case-insensitive IATA→region
  match using the existing `normalizeRegionCodes` parser.
- `computeNodeHomeRegions()` — derives each node's HOME IATA from its
  zero-hop DIRECT adverts. Path byte for those packets is set locally on
  the originating radio and the packet has not been relayed, so the
  observer that hears it must be in direct RF range. Plurality vote when
  zero-hop adverts span multiple regions.

`computeAnalyticsHashSizes` now applies these in two ways:

1. **Observer-region filter is relaxed for ADVERT packets** when the
   originator's home region matches the requested region. A flood advert
   from a PDX repeater that's only heard by an SJC observer still
   attributes to PDX.
2. **Per-node grouping** (`multiByteNodes`, `distributionByRepeaters`)
   excludes nodes whose HOME region disagrees with the requested region.
   Falls back to the observer-region filter when home is unknown.

Adds `attributionMethod` to the response (`"observer"` or `"repeater"`)
so operators can tell which method was applied.
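The plurality vote at the heart of `computeNodeHomeRegions` can be sketched like this (hypothetical helper name; the real function first selects the zero-hop DIRECT adverts per node):

```go
package main

import "fmt"

// homeRegionByPlurality: each zero-hop DIRECT advert contributes the
// hearing observer's IATA code; the most frequent code wins. Returns ""
// when no zero-hop adverts were seen, i.e. home region unknown.
func homeRegionByPlurality(zeroHopObserverIATAs []string) string {
	counts := map[string]int{}
	best, bestN := "", 0
	for _, iata := range zeroHopObserverIATAs {
		counts[iata]++
		if counts[iata] > bestN {
			best, bestN = iata, counts[iata]
		}
	}
	return best
}

func main() {
	fmt.Println(homeRegionByPlurality([]string{"PDX", "SJC", "PDX"})) // PDX
	fmt.Println(homeRegionByPlurality(nil) == "")                     // true
}
```

The `""` result is what triggers the documented fallback to the observer-region filter.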

## Backwards compatibility

- No region filter requested → behavior unchanged (`attributionMethod`
  is `"observer"`).
- Region filter requested but no zero-hop direct adverts seen for a node
  → falls back to the prior observer-region check for that node.
- Operators without IATA-tagged observers see no change.

## TDD

- **Red commit** (`c35d349`): adds
`TestIssue804_AnalyticsAttributesByRepeaterRegion`
with three subtests (PDX leak into SJC, attributionMethod field present,
  SJC leak into PDX). Compiles, runs, fails on assertions.
- **Green commit** (`11b157f`): the implementation. All subtests pass,
  full `cmd/server` package green.

## Files changed
- `cmd/server/store.go` — helpers + analytics filter logic (+236/-51)
- `cmd/server/issue804_repeater_region_test.go` — new test (+147)

---------

Co-authored-by: CoreScope Bot <bot@corescope.local>
Co-authored-by: openclaw-bot <bot@openclaw.local>
2026-05-03 20:10:02 -07:00
Kpa-clawbot 1f4969c1a6 fix(#770): treat region 'All' as no-filter + document region behavior (#1026)
## Summary

Fixes #770 — selecting "All" in the region filter dropdown produced an
empty channel list.

## Root cause

`normalizeRegionCodes` (cmd/server/db.go) treated any non-empty input as
a literal IATA code. The frontend region filter labels its catch-all
option **"All"**; while `region-filter.js` normally sends an empty
string when "All" is selected, any code path that ends up sending
`?region=All` (deep-link URLs, manual queries, future callers) caused
the function to return `["ALL"]`. Downstream queries then filtered
observers for `iata = 'ALL'`, which never matches anything → empty
response.

## Fix

`normalizeRegionCodes` now treats `All` / `ALL` / `all`
(case-insensitive, with optional whitespace, mixed in CSV) as equivalent
to an empty value, returning `nil` to signal "no filter". Real IATA
codes (`SJC`, `PDX`, `sjc,PDX` → `[SJC PDX]`) still pass through
unchanged.

This is a defensive server-side fix: a single chokepoint that all
region-aware endpoints already flow through (channels, packets,
analytics, encrypted channels, observer ID resolution).
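The normalization can be sketched as below (a hypothetical reimplementation, not the shipped two-line diff): `All` tokens collapse to nothing, real codes are trimmed and uppercased, and an all-`All`/empty input yields `nil` to signal "no filter".

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegionCodesSketch treats "All" (any case, with optional
// whitespace, possibly mixed into CSV) as no-filter; real IATA codes
// pass through uppercased. A nil result means "no region filter".
func normalizeRegionCodesSketch(param string) []string {
	var out []string
	for _, part := range strings.Split(param, ",") {
		code := strings.ToUpper(strings.TrimSpace(part))
		if code == "" || code == "ALL" {
			continue
		}
		out = append(out, code)
	}
	return out
}

func main() {
	fmt.Println(normalizeRegionCodesSketch("All,") == nil) // true
	fmt.Println(normalizeRegionCodesSketch("sjc, PDX"))    // [SJC PDX]
}
```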

## Documentation

Expanded `_comment_regions` in `config.example.json` to explain:
- How IATA codes are resolved (payload > topic > source config — set in
#1012)
- What the `regions` map controls (display labels) vs runtime-discovered
codes
- That observers without an IATA tag only appear under "All Regions"
- That the `All` sentinel is server-side safe

## TDD

- **Red commit** (`4f65bf4`): `cmd/server/region_filter_test.go` —
`TestNormalizeRegionCodes_AllIsNoFilter` asserts `All` / `ALL` / `all` /
`""` / `"All,"` all collapse to `nil`. Compiles, runs, fails on
assertion (`got [ALL], want nil`). Companion test
`TestNormalizeRegionCodes_RealCodesPreserved` locks in that `sjc,PDX`
still returns `[SJC PDX]`.
- **Green commit** (`c9fb965`): two-line change in
`normalizeRegionCodes` + docs update.

## Verification

```
$ go test -run TestNormalizeRegionCodes -count=1 ./cmd/server
ok      github.com/corescope/server     0.023s

$ go test -count=1 ./cmd/server
ok      github.com/corescope/server    21.454s
```

Full suite green; no existing region tests regressed.

Fixes #770

---------

Co-authored-by: Kpa-clawbot <bot@corescope>
2026-05-03 19:50:01 -07:00
Kpa-clawbot b06adf9f2a feat: /api/backup — one-click SQLite database export (#474) (#1022)
## Summary

Implements `GET /api/backup` — one-click SQLite database export per
#474.

Operators can now grab a complete, consistent snapshot of the analyzer
DB with a single authenticated request — no SSH, no scripts, no DB
tooling.

## Endpoint

```
GET /api/backup
X-API-Key: <key>            # required
→ 200 OK
  Content-Type: application/octet-stream
  Content-Disposition: attachment; filename="corescope-backup-<unix>.db"
  <body: complete SQLite database file>
```

## Approach

Uses SQLite's `VACUUM INTO 'path'` to produce an atomic, defragmented
copy of the database into a fresh file:

- **Consistent**: VACUUM INTO runs at read isolation — the snapshot
reflects a single point in time even while the ingestor is writing to
the WAL.
- **Non-blocking**: writers continue uninterrupted; we never hold a
write lock.
- **Works on read-only connections**: verified manually against a
WAL-mode source DB (`mode=ro` connection successfully produces a
snapshot).
- **No corruption risk**: even if the live on-disk DB has issues, VACUUM
INTO surfaces what the server can read rather than copying broken pages
byte-for-byte.

The snapshot is staged in `os.MkdirTemp(...)` and removed after the
response body is fully streamed (deferred cleanup). Requesting client IP
is logged for audit.

The issue suggested an alternative in-memory rebuild path; `VACUUM INTO`
is simpler, faster, and produces a strictly more accurate copy of what
the server actually sees, so we went with it.

## Security

- Mounted under `requireAPIKey` middleware — same gate as other admin
endpoints (`/api/admin/prune`, `/api/perf/reset`).
- Returns 401 without a valid `X-API-Key` header.
- Returns 403 if no API key is configured server-side.
- `X-Content-Type-Options: nosniff` set on the response.

## TDD

- **Red** (`99548f2`): `cmd/server/backup_test.go` adds
`TestBackupRequiresAPIKey` + `TestBackupReturnsValidSQLiteSnapshot`.
Stub handler returns 200 with no body so the tests fail on assertions
(Content-Type / Content-Disposition / SQLite magic header), not on
import or build errors.
- **Green** (`837b2fe`): real implementation lands; both tests pass;
full `go test ./...` suite stays green.

## Files

- `cmd/server/backup.go` — handler implementation
- `cmd/server/backup_test.go` — red-then-green tests
- `cmd/server/routes.go` — route registration under `requireAPIKey`
- `cmd/server/openapi.go` — OpenAPI metadata so `/api/openapi`
advertises the endpoint

## Out of scope (follow-ups)

- Rate limiting (issue suggested 1 req/min). Not added here —
admin-key-gated endpoint with a fast snapshot path is acceptable for v1;
happy to add a token-bucket limiter in a follow-up if operators report
hammering.
- UI button to trigger the download (frontend work — separate PR).

Fixes #474

---------

Co-authored-by: corescope-bot <bot@corescope.local>
2026-05-03 17:56:42 -07:00
Kpa-clawbot 51b9fed15e feat(roles): /#/roles page + /api/analytics/roles endpoint (Fixes #818) (#1023)
## Summary

Implements `/#/roles` per QA #809 §5.4 / issue #818. The page previously
showed "Page not yet implemented."

### Backend
- New `GET /api/analytics/roles` returns `{ totalNodes, roles: [{ role,
nodeCount, withSkew, meanAbsSkewSec, medianAbsSkewSec, okCount,
warningCount, criticalCount, absurdCount, noClockCount }] }`.
- Pure `computeRoleAnalytics(nodesByPubkey, skewByPubkey)` does the
bucketing/aggregation — no store/lock dependency, fully unit-testable.
- Roles are normalised (lowercased + trimmed; empty bucketed as
`unknown`).
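The normalization and bucketing step can be sketched like this (the real `computeRoleAnalytics` also aggregates the per-role skew statistics listed above):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRole mirrors the described normalisation: lowercase, trim,
// empty bucketed as "unknown".
func normalizeRole(role string) string {
	r := strings.ToLower(strings.TrimSpace(role))
	if r == "" {
		return "unknown"
	}
	return r
}

// roleCounts is a minimal sketch of the pure bucketing inside
// computeRoleAnalytics: count nodes per normalized role.
func roleCounts(rolesByPubkey map[string]string) map[string]int {
	out := map[string]int{}
	for _, role := range rolesByPubkey {
		out[normalizeRole(role)]++
	}
	return out
}

func main() {
	counts := roleCounts(map[string]string{
		"aa": "Repeater", "bb": "repeater ", "cc": "", "dd": "room",
	})
	fmt.Println(counts["repeater"], counts["room"], counts["unknown"]) // 2 1 1
}
```

Because the function takes plain maps, it needs no store or lock, which is what makes it directly unit-testable.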

### Frontend
- New `public/roles-page.js` renders a distribution table: count, share,
distribution bar, w/ skew, median |skew|, mean |skew|, severity
breakdown (OK / Warning / Critical / Absurd / No-clock).
- Registered as the `roles` page in the SPA router and linked from the
main nav.
- Auto-refreshes every 60 s, with a manual refresh button.

### Tests (TDD)
- **Red commit** (`9726d5b`): two assertion-failing tests against a stub
`computeRoleAnalytics` that returns an empty result. Compiles, runs,
fails on `TotalNodes = 0, want 5` and `len(Roles) = 0, want 1`.
- **Green commit** (`7efb76a`): full implementation, route wiring,
frontend page + nav, plus E2E test in `test-e2e-playwright.js` covering
both the empty-state contract (no "Page not yet implemented"
placeholder) and the populated-table case (header columns, body rows,
API response shape).

### Verification
- `go test ./cmd/server/...` green.
- Local server with the e2e fixture: `GET /api/analytics/roles` returns
`{"totalNodes":200,"roles":[{"role":"repeater","nodeCount":168,...},{"role":"room","nodeCount":23,...},{"role":"companion","nodeCount":9,...}]}`.

Fixes #818

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-03 17:56:12 -07:00
Kpa-clawbot a56ee5c4fe feat(analytics): selectable timeframes via ?window/?from/?to (#842) (#1018)
## Summary
Selectable analytics timeframes (#842). Adds backend support for
`?window=1h|24h|7d|30d` and `?from=&to=` on the three main analytics
endpoints (`/api/analytics/rf`, `/api/analytics/topology`,
`/api/analytics/channels`), and a time-window picker in the Analytics
page UI that drives them. Default behavior with no query params is
unchanged.

## TDD trail
- Red: `bbab04d` — adds `TimeWindow` + `ParseTimeWindow` stub and tests;
tests fail on assertions because the stub returns the zero window.
- Green: `75d27f9` — implements `ParseTimeWindow`, threads `TimeWindow`
through `compute*` loops + caches, wires HTTP handlers, adds frontend
picker + E2E.

## Backend changes
- `cmd/server/time_window.go` — full `ParseTimeWindow` (`?window=`
aliases + `?from=/&to=` RFC3339 absolute range; invalid input → zero
window for backwards compatibility).
- `cmd/server/store.go` — new
`GetAnalytics{RF,Topology,Channels}WithWindow` wrappers; `compute*`
loops skip transmissions whose `FirstSeen` (or per-obs `Timestamp` for
the region+observer slice) falls outside the window. Cache key composes
`region|window` so different windows do not poison each other.
- `cmd/server/routes.go` — handlers call `ParseTimeWindow(r)` and
dispatch to the `*WithWindow` methods.

## Frontend changes
- `public/analytics.js` — new `<select id="analyticsTimeWindow">`
rendered under the region filter (All / 1h / 24h / 7d / 30d). Selecting
an option triggers `loadAnalytics()` which appends `&window=…` to every
analytics fetch.

## Tests
- `cmd/server/time_window_test.go` — covers all aliases, absolute range,
no-params backwards compatibility, `Includes()` bounds, and `CacheKey()`
distinctness.
- `cmd/server/topology_dedup_test.go`,
`cmd/server/channel_analytics_test.go` — updated callers to pass
`TimeWindow{}`.

## E2E (rule 18)
`test-e2e-playwright.js:592-611` — opens `/#/analytics`, asserts the
picker is rendered with a `24h` option, then asserts that selecting
`24h` triggers a network request to `/api/analytics/rf?…window=24h`.

## Backwards compatibility
No params → zero `TimeWindow` → original code paths (no filter,
region-only cache key). Verified by
`TestParseTimeWindow_NoParams_BackwardsCompatible` and by the existing
analytics tests still passing unchanged on `_wt-fix-842`.

Fixes #842

---------

Co-authored-by: you <you@example.com>
Co-authored-by: corescope-bot <bot@corescope>
2026-05-03 17:41:22 -07:00
Kpa-clawbot df69a17718 feat(#772): short pubkey-prefix URLs for mesh sharing (#1016)
## Summary

Fixes #772 — adds a short-URL form for node detail pages so operators
can paste node links into a mesh chat without bringing along a
64-hex-char public key.

## Approach

**Pubkey-prefix resolution** (no allocator, no lookup table).

- The SPA hash route `#/nodes/<key>` already accepts whatever
pubkey-shaped string the user pastes; the front end forwards it to `GET
/api/nodes/<key>`.
- When that lookup misses **and** the path is 8..63 hex chars, the
backend now calls `DB.GetNodeByPrefix` and:
  - returns the matching node when exactly one node has that prefix,
  - returns **409 Conflict** when multiple nodes share the prefix (with a
    "use a longer prefix" hint),
  - falls through to the existing 404 otherwise.
- 8 hex chars = 32 bits of entropy, which is enough for fleets in the
low thousands. Operators can extend to 10–12 chars if collisions become
common.
- The full-screen node detail card gets a new **📡 Copy short URL**
button that copies `…/#/nodes/<first 8 hex chars>`.

### Why not an opaque ID table (`/s/<id>`)?

Considered and rejected:

- Needs persistence + an allocator + cleanup story.
- IDs aren't self-describing — operators can't sanity-check them.
- IDs don't survive a DB rebuild.
- 32 bits of pubkey already buys us collision resistance with zero
moving parts.

If the directory grows past the point where 8-char prefixes routinely
collide, we can extend the minimum length without changing the URL
shape.

## Changes

- `cmd/server/db.go` — new `GetNodeByPrefix(prefix)` returning `(node,
ambiguous, error)`. Validates hex; rejects <8 chars; `LIMIT 2` to detect
collisions cheaply.
- `cmd/server/routes.go` — `handleNodeDetail` falls back to prefix
resolution; canonicalizes pubkey downstream; emits 409 on ambiguity;
honors blacklist on the resolved pubkey.
- `public/nodes.js` — adds **📡 Copy short URL** button + handler on the
full-screen node detail card.
- `cmd/server/short_url_test.go` — Go tests (red-then-green).
- `test-e2e-playwright.js` — E2E: navigates via prefix-only URL and
asserts the new button surfaces.

## TDD evidence

- Red commit: `2dea97a` — tests added with a stub `GetNodeByPrefix`
returning `(nil, false, nil)`. All four assertions failed (assertion
failures, not build errors): expected node got nil; expected
ambiguous=true got false; route 404 vs expected 200/409.
- Green commit: `9b8f146` — implementation lands; `go test ./...` passes
locally in `cmd/server`.

## Compatibility

- Existing 64-char pubkey URLs are untouched (exact lookup runs first).
- Blacklist is enforced both on the raw input and on the resolved
pubkey.
- No new config knobs.

## What I did **not** touch

- `cmd/server/db_test.go`, other route tests — unchanged.
- Packet-detail short URLs (issue scopes nodes; revisit in a follow-up
if asked).

Fixes #772

---------

Co-authored-by: clawbot <bot@corescope.local>
2026-05-03 17:40:54 -07:00
Kpa-clawbot c186129d47 feat: parse and display per-hop SNR values for TRACE packets (#1007)
## Summary

Parse and display per-hop SNR values from TRACE packets in the Packet
Byte Breakdown panel.

## Changes

### Backend (`cmd/server/decoder.go`)
- Added `SNRValues []float64` field to Payload struct
(`json:"snrValues,omitempty"`)
- In the TRACE-specific block, extract SNR from header path bytes before
they're overwritten with route hops
- Each header path byte is `int8(SNR_dB * 4.0)` per firmware — decode by
dividing by 4.0
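The decode step can be sketched as (function name is illustrative, not the actual decoder.go helper):

```go
package main

import "fmt"

// decodeSNRPath decodes TRACE header path bytes: firmware encodes each
// hop as int8(SNR_dB * 4.0), so reinterpreting the byte as signed and
// dividing by 4.0 recovers the per-hop SNR in dB.
func decodeSNRPath(raw []byte) []float64 {
	out := make([]float64, len(raw))
	for i, b := range raw {
		out[i] = float64(int8(b)) / 4.0
	}
	return out
}

func main() {
	// 0x14 = 20 -> 5.0 dB; 0xF8 = int8(-8) -> -2.0 dB
	fmt.Println(decodeSNRPath([]byte{0x14, 0xF8}))
}
```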

### Frontend (`public/packets.js`)
- Added "SNR Path" section in `buildFieldTable()` showing per-hop SNR
values in dB when packet type is TRACE
- Added TRACE-specific payload rendering (trace tag, auth code, flags
with hash_size, route hops)

## TDD

- Red commit: `4dba4e8` — test asserts `Payload.SNRValues` field
(compile fails, field doesn't exist)
- Green commit: `5a496bd` — implementation passes all tests

## Testing

- `go test ./...` passes (all existing + 2 new TRACE SNR tests)
- No frontend test changes needed (no existing TRACE UI tests; rendering
is additive)

Fixes #979

---------

Co-authored-by: you <you@example.com>
2026-05-03 11:17:25 -07:00
Kpa-clawbot e86b5a3a0c feat: show multi-byte hash support indicator on map markers (#1002)
## Summary

Show 2-byte hash support indicator on map markers. Fixes #903.

## What changed

### Backend (`cmd/server/store.go`, `cmd/server/routes.go`)

- **`EnrichNodeWithMultiByte()`** — new enrichment function that adds
`multi_byte_status` (confirmed/suspected/unknown), `multi_byte_evidence`
(advert/path), and `multi_byte_max_hash_size` fields to node API
responses
- **`GetMultiByteCapMap()`** — cached (15s TTL) map of pubkey →
`MultiByteCapEntry`, reusing the existing `computeMultiByteCapability()`
logic that combines advert-based and path-hop-based evidence
- Wired into both `/api/nodes` (list) and `/api/nodes/{pubkey}` (detail)
endpoints

### Frontend (`public/map.js`)

- Added **"Multi-byte support"** checkbox in the map Display controls
section
- When toggled on, repeater markers change color:
  - 🟢 Green (`#27ae60`) — **confirmed** (advertised with hash_size ≥ 2)
  - 🟡 Yellow (`#f39c12`) — **suspected** (seen as hop in multi-byte path)
  - 🔴 Red (`#e74c3c`) — **unknown** (no multi-byte evidence)
- Popup tooltip shows multi-byte status and evidence for repeaters
- State persisted in localStorage (`meshcore-map-multibyte-overlay`)

## TDD

- Red commit: `2f49cbc` — failing test for `EnrichNodeWithMultiByte`
- Green commit: `4957782` — implementation + passing tests

## Performance

- `GetMultiByteCapMap()` uses a 15s TTL cache (same pattern as
`GetNodeHashSizeInfo`)
- Enrichment is O(n) over nodes, no per-item API calls
- Frontend color override is computed inline during existing marker
render loop — no additional DOM rebuilds

---------

Co-authored-by: you <you@example.com>
2026-05-03 08:56:09 -07:00
Kpa-clawbot 564d93d6aa fix: dedup topology analytics by resolved pubkey (#998)
## Fix topology analytics double-counting repeaters/pairs (#909)

### Problem

`computeAnalyticsTopology()` aggregates by raw hop hex string. When
firmware emits variable-length path hashes (1-3 bytes per hop), the same
physical node appears multiple times with different prefix lengths (e.g.
`"07"`, `"0735bc"`, `"0735bc6d"` all referring to the same node). This
inflates repeater counts and creates duplicate pair entries.

### Solution

Added a confidence-gated dedup pass after frequency counting:

1. **For each hop prefix**, check if it resolves unambiguously (exactly
1 candidate in the prefix map)
2. **Unambiguous prefixes** → group by resolved pubkey, sum counts, keep
longest prefix as display identifier
3. **Ambiguous prefixes** (multiple candidates for that prefix) → left
as separate entries (conservative)
4. **Same treatment for pairs**: canonicalize by sorted pubkey pair

### Addressing @efiten's collision concern

At scale (~2000+ repeaters), 1-byte prefixes (256 buckets) WILL collide.
This fix explicitly checks the prefix map candidate count. Ambiguous
prefixes (where `len(pm.m[hop]) > 1`) are never merged — they remain as
separate entries. Only prefixes with a single matching node are eligible
for dedup.
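The confidence gate can be sketched as follows (merged entries are keyed by resolved pubkey here for brevity; the real pass keeps the longest prefix as the display identifier, and `prefixMap` stands in for the store's prefix index):

```go
package main

import "fmt"

// dedupHops merges per-hop counts under the resolved pubkey only when a
// prefix maps to exactly one candidate; ambiguous prefixes stay separate.
func dedupHops(counts map[string]int, prefixMap map[string][]string) map[string]int {
	merged := make(map[string]int)
	for prefix, n := range counts {
		if cands := prefixMap[prefix]; len(cands) == 1 {
			merged[cands[0]] += n // unambiguous: sum under resolved pubkey
		} else {
			merged[prefix] += n // ambiguous or unresolvable: conservative
		}
	}
	return merged
}

func main() {
	counts := map[string]int{"07": 3, "0735bc": 2, "aa": 1}
	pm := map[string][]string{
		"07":     {"0735bc6d"},
		"0735bc": {"0735bc6d"},
		"aa":     {"aa11", "aa22"}, // collision: never merged
	}
	m := dedupHops(counts, pm)
	fmt.Println(m["0735bc6d"], m["aa"])
}
```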

### TDD

- **Red commit**: `4dbf9c0` — added 3 failing tests
- **Green commit**: `d6cae9a` — implemented dedup, all tests pass

### Tests added

- `TestTopologyDedup_RepeatersMergeByPubkey` — verifies entries with
different prefix lengths for same node merge to single entry with summed
count
- `TestTopologyDedup_AmbiguousPrefixNotMerged` — verifies colliding
short prefix stays separate from unambiguous longer prefix
- `TestTopologyDedup_PairsMergeByPubkey` — verifies pair entries merge
by resolved pubkey pair

Fixes #909

---------

Co-authored-by: you <you@example.com>
2026-05-02 22:19:49 -07:00
Kpa-clawbot b7c280c20a fix: drop/filter packets with null hash or timestamp (closes #871) (#993)
## Summary

Closes #871

The `/api/packets` endpoint could return packets with `null` hash or
timestamp fields. This was caused by legacy data in SQLite (rows with
empty `hash` or `NULL`/empty `first_seen`) predating the ingestor's
existing validation guard (`if hash == "" { return false, nil }` at
`cmd/ingestor/db.go:610`).

## Root Cause

`cmd/server/store.go` `filterPackets()` had no data-integrity guard.
Legacy rows with empty `hash` or `first_seen` were loaded into the
in-memory store and returned verbatim. The `strOrNil("")` helper then
serialized these as JSON `null`.

## Fix

Added a data-integrity predicate at the top of `filterPackets`'s scan
callback (`cmd/server/store.go:2278`):

```go
if tx.Hash == "" || tx.FirstSeen == "" {
    return false
}
```

This filters bad legacy rows at query time. The write path (ingestor)
already rejects empty hashes, so no new bad data enters.

## TDD Evidence

- **Red commit:** `15774c3` — test `TestIssue871_NoNullHashOrTimestamp`
asserts no packet in API response has null/empty hash or timestamp
- **Green commit:** `281fd6f` — adds the filter guard, test passes

## Testing

- `go test ./...` in `cmd/server` passes (full suite)
- Client-side defensive filter from PR #868 remains as defense-in-depth

---------

Co-authored-by: you <you@example.com>
2026-05-02 20:35:15 -07:00
Kpa-clawbot dd2f044f2b fix: cache RW SQLite connection + dedup DBConfig (closes #921) (#982)
Closes #921

## Summary

Follow-up to #920 (incremental auto-vacuum). Addresses both items from
the adversarial review:

### 1. RW connection caching

Previously, every call to `openRW(dbPath)` opened a new SQLite RW
connection and closed it after use. This happened in:
- `runIncrementalVacuum` (~4x/hour)
- `PruneOldPackets`, `PruneOldMetrics`, `RemoveStaleObservers`
- `buildAndPersistEdges`, `PruneNeighborEdges`
- All neighbor persist operations

Now a single `*sql.DB` handle (with `MaxOpenConns(1)`) is cached
process-wide via `cachedRW(dbPath)`. The underlying connection pool
manages serialization. The original `openRW()` function is retained for
one-shot test usage.
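A minimal sketch of the caching shape (a stub `handle` type stands in for `*sql.DB` so the example runs without a SQLite driver; the real `cachedRW` opens the DB and calls `SetMaxOpenConns(1)`):

```go
package main

import (
	"fmt"
	"sync"
)

// handle is a stand-in for *sql.DB in this sketch.
type handle struct{ path string }

var (
	rwOnce sync.Once
	rwDB   *handle
)

// cachedRW returns one process-wide RW handle; the sync.Once guarantees
// the open happens exactly once regardless of caller count.
func cachedRW(dbPath string) *handle {
	rwOnce.Do(func() {
		rwDB = &handle{path: dbPath} // real code: sql.Open + SetMaxOpenConns(1)
	})
	return rwDB
}

func main() {
	a := cachedRW("/tmp/x.db")
	b := cachedRW("/tmp/x.db")
	fmt.Println(a == b) // true: same handle every call
}
```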

### 2. DBConfig dedup

`DBConfig` was defined identically in both `cmd/server/config.go` and
`cmd/ingestor/config.go`. Extracted to `internal/dbconfig/` as a shared
package; both binaries now use a type alias (`type DBConfig =
dbconfig.DBConfig`).

## Tests added

| Test | File |
|------|------|
| `TestCachedRW_ReturnsSameHandle` | `cmd/server/rw_cache_test.go` |
| `TestCachedRW_100Calls_SingleConnection` | `cmd/server/rw_cache_test.go` |
| `TestGetIncrementalVacuumPages_Default` | `internal/dbconfig/dbconfig_test.go` |
| `TestGetIncrementalVacuumPages_Configured` | `internal/dbconfig/dbconfig_test.go` |

## Verification

```
ok  github.com/corescope/server    20.069s
ok  github.com/corescope/ingestor  47.117s
ok  github.com/meshcore-analyzer/dbconfig  0.003s
```

Both binaries build cleanly. 100 sequential `cachedRW()` calls return
the same handle with exactly 1 entry in the cache map.

---------

Co-authored-by: you <you@example.com>
2026-05-02 20:15:30 -07:00
Kpa-clawbot fc57433f27 fix(analytics): merge channel buckets by hash byte; reject rainbow-table mismatches (closes #978) (#980)
## Summary

Closes #978 — analytics channels duplicated by encrypted/decrypted split
+ rainbow-table collisions.

## Root cause

Two distinct bugs in `computeAnalyticsChannels` (`cmd/server/store.go`):

1. **Encrypted/decrypted split**: The grouping key included the decoded
channel name (`hash + "_" + channel`), so packets from observers that
could decrypt a channel created a separate bucket from packets where
decryption failed. Same physical channel, two entries.

2. **Rainbow-table collisions**: Some observers' lookup tables map hash
bytes to wrong channel names. E.g., hash `72` incorrectly claimed to be
`#wardriving` (real hash is `129`). This created ghost 1-message
entries.

## Fix

1. **Always group by hash byte alone** (drop `_channel` suffix from
`chKey`). When any packet decrypts successfully, upgrade the bucket's
display name from placeholder (`chN`) to the real name
(first-decrypter-wins for stability).

2. **Validate channel names** against the firmware hash invariant:
`SHA256(SHA256("#name")[:16])[0] == channelHash`. Mismatches are treated
as encrypted (placeholder name, no trust in decoded channel). Guard is
in the analytics handler (not the ingestor) to avoid breaking other
surfaces that use the decoded field for display.
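The invariant can be implemented directly with the standard library (function names here are illustrative; the PR's helpers are `ChannelNameMatchesHash` and friends):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// channelHashOf computes the firmware hash invariant quoted above:
// SHA256(SHA256("#name")[:16])[0].
func channelHashOf(name string) byte {
	inner := sha256.Sum256([]byte(name))
	outer := sha256.Sum256(inner[:16])
	return outer[0]
}

// channelNameMatchesHash rejects rainbow-table entries whose decoded
// name does not re-derive the observed hash byte.
func channelNameMatchesHash(name string, hash byte) bool {
	return channelHashOf(name) == hash
}

func main() {
	h := channelHashOf("#test")
	fmt.Println(channelNameMatchesHash("#test", h), channelNameMatchesHash("#test", h+1))
}
```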

## Verification (e2e-fixture.db)

| Metric | BEFORE | AFTER |
|--------|--------|-------|
| Total channels | 22 | 19 |
| Duplicate hash bytes | 3 (hashes 217, 202, 17) | 0 |

## Tests added

- `TestComputeAnalyticsChannels_MergesEncryptedAndDecrypted` — same
hash, mixed encrypted/decrypted → ONE bucket
- `TestComputeAnalyticsChannels_RejectsRainbowTableMismatch` — hash 72
claimed as `#wardriving` (real=129) → rejected, stays `ch72`
- `TestChannelNameMatchesHash` — unit test for hash validation helper
- `TestIsPlaceholderName` — unit test for placeholder detection

Anti-tautology gate: both main tests fail when their respective fix
lines are reverted.

Co-authored-by: you <you@example.com>
2026-05-02 16:05:56 -07:00
Kpa-clawbot 4b8d8143f4 feat(server): explicit CORS policy with configurable origin allowlist (#883) (#971)
## Summary

Adds explicit CORS policy support to the CoreScope API server, closing
#883.

### Problem

The API relied on browser same-origin defaults with no way for operators
to configure cross-origin access. Operators running dashboards or
third-party frontends on different origins had no supported way to make
API calls.

### Solution

**New config option:** `corsAllowedOrigins` (string array, default `[]`)

**Middleware behavior:**
| Config | Behavior |
|--------|----------|
| `[]` (default) | No `Access-Control-*` headers added — browsers enforce same-origin. **Preserves current behavior.** |
| `["https://dashboard.example.com"]` | Echoes matching `Origin`, sets `Allow-Methods`/`Allow-Headers` |
| `["*"]` | Sets `Access-Control-Allow-Origin: *` (explicit opt-in only) |

**Headers set when origin matches:**
- `Access-Control-Allow-Origin: <origin>` (or `*`)
- `Access-Control-Allow-Methods: GET, POST, OPTIONS`
- `Access-Control-Allow-Headers: Content-Type, X-API-Key`
- `Vary: Origin` (non-wildcard only)

**Preflight handling:** `OPTIONS` → `204 No Content` with CORS headers
(or `403` if origin not in allowlist).

### Config example

```json
{
  "corsAllowedOrigins": ["https://dashboard.example.com", "https://monitor.internal"]
}
```

### Files changed

| File | Change |
|------|--------|
| `cmd/server/cors.go` | New CORS middleware |
| `cmd/server/cors_test.go` | 7 unit tests covering all branches |
| `cmd/server/config.go` | `CORSAllowedOrigins` field |
| `cmd/server/routes.go` | Wire middleware before all routes |

### Testing

**Unit tests (7):**
- Default config → no CORS headers
- Allowlist match → headers present with `Vary: Origin`
- Allowlist miss → no CORS headers
- Preflight allowed → 204 with headers
- Preflight rejected → 403
- Wildcard → `*` without `Vary`
- No `Origin` header → pass-through

**Live verification (Rule 18):**

```
# Default (empty corsAllowedOrigins):
$ curl -I -H "Origin: https://evil.example" localhost:19883/api/health
HTTP/1.1 200 OK
# No Access-Control-* headers ✓

# With corsAllowedOrigins: ["https://good.example"]:
$ curl -I -H "Origin: https://good.example" localhost:19884/api/health
Access-Control-Allow-Origin: https://good.example
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-API-Key
Vary: Origin ✓

$ curl -I -H "Origin: https://evil.example" localhost:19884/api/health
# No Access-Control-* headers ✓

$ curl -I -X OPTIONS -H "Origin: https://good.example" localhost:19884/api/health
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://good.example ✓
```

Closes #883

Co-authored-by: you <you@example.com>
2026-05-02 12:04:37 -07:00
Kpa-clawbot 3364eed303 feat: separate "Last Status Update" from "Last Packet Observation" for observers (v3 rebase) (#969)
Rebased version of #968 (which was itself a rebase of #905) — resolves
merge conflict with #906 (clock-skew UI) that landed on master.

## Conflict resolution

**`public/observers.js`** — master (#906) added "Clock Offset" column to
observer table; #968 split "Last Seen" into "Last Status" + "Last
Packet" columns. Combined both: the table now has Status | Name | Region
| Last Status | Last Packet | Packets | Packets/Hour | Clock Offset |
Uptime.

## What this PR adds (unchanged from #968/#905)

- `last_packet_at` column in observers DB table
- Separate "Last Status Update" and "Last Packet Observation" display in
observers list and detail page
- Server-side migration to add the column automatically
- Backfill heuristic for existing data
- Tests for ingestor and server

## Verification

- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- Frontend tests pass (`test-packets.js`, `test-hash-color.js`)
- Built server, hit `/api/observers` — `last_packet_at` field present in
JSON
- Observer table header has all 9 columns including both Last Packet and
Clock Offset

## Prior PRs

- #905 — original (conflicts with master)
- #968 — first rebase (conflicts after #906 landed)
- This PR — second rebase, resolves #906 conflict

Supersedes #968. Closes #905.

---------

Co-authored-by: you <you@example.com>
2026-05-02 12:03:42 -07:00
efiten 40c3aa13f9 fix(paths): exclude false-positive paths from short-prefix collisions (#930)
Fixes #929

## Summary

- `handleNodePaths` pulls candidates from `byPathHop` using 2-char and
4-char prefix keys (e.g. `"7a"` for a node using 1-byte adverts)
- When two nodes share the same short prefix, paths through the *other*
node are included as candidates
- The `resolved_path` post-filter covers decoded packets but falls
through conservatively (`inIndex = true`) when `resolved_path` is NULL,
letting false positives reach the response

**Fix:** during the aggregation phase (which already calls `resolveHop`
per hop), add a `containsTarget` check. If every hop resolves to a
different node's pubkey, skip the path. Packets confirmed via the
full-pubkey index key or via SQL bypass the check. Unresolvable hops are
kept conservatively.

## Test plan
- [x] `TestHandleNodePaths_PrefixCollisionExclusion`: two nodes sharing
`"7a"` prefix; verifies the path with no `resolved_path` (false
positive) is excluded and the SQL-confirmed path (true positive) is
included
- [x] Full test suite: `go test github.com/corescope/server` — all pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:15:25 -07:00
Kpa-clawbot b47587f031 feat(#690): expose observer skew + per-hash evidence in clock UI (#906)
## Summary

UI completion of #690 — surfaces observer clock skew and per-hash
evidence that the backend already computes but wasn't exposed in the
frontend.

**Not related to #845/PR #894** (bimodal detection) — this is the UI
surface for the original #690 scope.

## Changes

### Backend: per-hash evidence in node clock-skew API (commit 1)
- Extended `GET /api/nodes/{pubkey}/clock-skew` to return
`recentHashEvidence` (most recent 10 hashes with per-observer
raw/corrected skew and observer offset) and `calibrationSummary`
(total/calibrated/uncalibrated counts).
- Evidence is cached during `ClockSkewEngine.Recompute()` — route
handler is cheap.
- Fleet endpoint omits evidence to keep payload small.
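The per-hash math the evidence test verifies (corrected = raw + observer offset, then median) can be sketched as (function name is illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// correctedMedian applies the per-observer offset to each raw skew sample
// and returns the median of the corrected values.
func correctedMedian(raw, offsets []float64) float64 {
	corrected := make([]float64, len(raw))
	for i := range raw {
		corrected[i] = raw[i] + offsets[i]
	}
	sort.Float64s(corrected)
	n := len(corrected)
	if n%2 == 1 {
		return corrected[n/2]
	}
	return (corrected[n/2-1] + corrected[n/2]) / 2
}

func main() {
	// corrected samples: 1+0, 5-1, 3+1 -> {1, 4, 4}; median 4
	fmt.Println(correctedMedian([]float64{1, 5, 3}, []float64{0, -1, 1}))
}
```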

### Frontend: observer list page — clock offset column (commit 2)
- Added "Clock Offset" column to observers table.
- Fetches `/api/observers/clock-skew` once on page load, joins by
ObserverID.
- Color-coded severity badge + sample count tooltip.
- Singleton observers show "—" not "0".

### Frontend: observer-detail clock card (commit 3)
- Added clock offset card mirroring node clock card style.
- Shows: offset value, sample count, severity badge.
- Inline explainer describing how offset is computed from multi-observer
packets.

### Frontend: node clock card evidence panel (commit 4)
- Collapsible "Evidence" section in existing node clock skew card.
- Per-hash breakdown: observer count, median corrected skew,
per-observer raw/corrected/offset.
- Calibration summary line and plain-English severity reason at top.

## Test Results

```
go test ./... (cmd/server) — PASS (19.3s)
go test ./... (cmd/ingestor) — PASS (31.6s)
Frontend helpers: 610 passed, 0 failed
```

New test: `TestNodeClockSkew_EvidencePayload` — 3-observer scenario
verifying per-hash array shape, corrected = raw + offset math, and
median.

No frontend JS smoke test added — no existing test harness for
clock/observer rendering. Noted for future.

## Screenshots

Screenshots TBD

## Perf justification

Evidence is computed inside the existing `Recompute()` cycle (already
O(n) on samples). The `hashEvidence` map adds ~32 bytes per sample of
memory. Evidence is stripped from fleet responses. Per-node endpoint
returns at most 10 evidence entries — bounded payload.

---------

Co-authored-by: you <you@example.com>
2026-05-02 10:30:54 -07:00
Kpa-clawbot b3a9677c52 feat(ingestor + server): observerBlacklist config (#962) (#963)
## Summary

Implements `observerBlacklist` config — mirrors the existing
`nodeBlacklist` pattern for observers. Drop observers by pubkey at
ingest, with defense-in-depth filtering on the server side.

Closes #962

## Changes

### Ingestor (`cmd/ingestor/`)
- **`config.go`**: Added `ObserverBlacklist []string` field +
`IsObserverBlacklisted()` method (case-insensitive, whitespace-trimmed)
- **`main.go`**: Early return in `handleMessage` when `parts[2]`
(observer ID from MQTT topic) matches blacklist — before status
handling, before IATA filter. No UpsertObserver, no observations, no
metrics insert. Log line: `observer <pubkey-short> blacklisted,
dropping`

### Server (`cmd/server/`)
- **`config.go`**: Same `ObserverBlacklist` field +
`IsObserverBlacklisted()` with `sync.Once` cached set (same pattern as
`nodeBlacklist`)
- **`routes.go`**: Defense-in-depth filtering in `handleObservers` (skip
blacklisted in list) and `handleObserverDetail` (404 for blacklisted ID)
- **`main.go`**: Startup `softDeleteBlacklistedObservers()` marks
matching rows `inactive=1` so historical data is hidden
- **`neighbor_persist.go`**: `softDeleteBlacklistedObservers()`
implementation
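The case-insensitive, whitespace-trimmed lookup with a `sync.Once`-built set can be sketched as (a simplified `Config` mirroring the described `nodeBlacklist` pattern, not the actual config.go):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type Config struct {
	ObserverBlacklist []string
	once              sync.Once
	set               map[string]bool
}

// IsObserverBlacklisted builds the normalized set on first use, then
// answers lookups in O(1).
func (c *Config) IsObserverBlacklisted(pubkey string) bool {
	c.once.Do(func() {
		c.set = make(map[string]bool, len(c.ObserverBlacklist))
		for _, pk := range c.ObserverBlacklist {
			c.set[strings.ToLower(strings.TrimSpace(pk))] = true
		}
	})
	return c.set[strings.ToLower(strings.TrimSpace(pubkey))]
}

func main() {
	cfg := &Config{ObserverBlacklist: []string{" EE550DE5 "}}
	fmt.Println(cfg.IsObserverBlacklisted("ee550de5"))
}
```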

### Tests
- `cmd/ingestor/observer_blacklist_test.go`: config method tests
(case-insensitive, empty, nil)
- `cmd/server/observer_blacklist_test.go`: config tests + HTTP handler
tests (list excludes blacklisted, detail returns 404, no-blacklist
passes all, concurrent safety)

## Config

```json
{
  "observerBlacklist": [
    "EE550DE547D7B94848A952C98F585881FCF946A128E72905E95517475F83CFB1"
  ]
}
```

## Verification (Rule 18 — actual server output)

**Before blacklist** (no config):
```
Total: 31
DUBLIN in list: True
```

**After blacklist** (DUBLIN Observer pubkey in `observerBlacklist`):
```
[observer-blacklist] soft-deleted 1 blacklisted observer(s)
Total: 30
DUBLIN in list: False
```

Detail endpoint for blacklisted observer returns **404**.

All existing tests pass (`go test ./...` for both server and ingestor).

---------

Co-authored-by: you <you@example.com>
2026-05-01 23:11:27 -07:00
Kpa-clawbot e1a1be1735 fix(server): add observers.inactive column at startup if missing (root cause of CI flake) (#961)
## The actual root cause

PR #954 added `WHERE inactive IS NULL OR inactive = 0` to the server's
observer queries, but the `inactive` column is only added by the
**ingestor** migration (`cmd/ingestor/db.go:344-354`). When the server
runs against a DB the ingestor never touched (e.g. the e2e fixture), the
column doesn't exist:

```
$ sqlite3 test-fixtures/e2e-fixture.db "SELECT COUNT(*) FROM observers WHERE inactive IS NULL OR inactive = 0;"
Error: no such column: inactive
```

The server's `db.QueryRow().Scan()` swallows that error →
`totalObservers` stays 0 → `/api/observers` returns empty → map test
fails with "No map markers/overlays found".

This explains all the failing CI runs since #954 merged. PR #957
(freshen fixture) helped with the `nodes` time-rot but couldn't fix the
missing-column problem. PR #960 (freshen observers) added the right
timestamps but the column was still missing. PR #959 (data-loaded in
finally) fixed a different real bug. None of those touched the actual
mechanism.

## Fix

Mirror the existing `ensureResolvedPathColumn` pattern: add
`ensureObserverInactiveColumn` that runs at server startup, checks if
the column exists via `PRAGMA table_info`, adds it with `ALTER TABLE
observers ADD COLUMN inactive INTEGER DEFAULT 0` if missing.

Wired into `cmd/server/main.go` immediately after
`ensureResolvedPathColumn`.
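The migration decision reduces to a pure check over the `PRAGMA table_info` result, which can be sketched as (hypothetical helper; the real `ensureObserverInactiveColumn` runs the PRAGMA and ALTER against the live DB):

```go
package main

import "fmt"

// ensureColumnSQL returns the ALTER statement to run when col is missing
// from the columns PRAGMA table_info reported, or "" when it is present.
func ensureColumnSQL(existing []string, col, ddl string) string {
	for _, c := range existing {
		if c == col {
			return ""
		}
	}
	return "ALTER TABLE observers ADD COLUMN " + ddl
}

func main() {
	cols := []string{"id", "name", "last_seen"} // fixture DB: no inactive
	fmt.Println(ensureColumnSQL(cols, "inactive", "inactive INTEGER DEFAULT 0"))
}
```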

## Verification

End-to-end on a freshened fixture:

```
$ sqlite3 /tmp/e2e-verify.db "PRAGMA table_info(observers);" | grep inactive
(no output — column absent)

$ ./cs-fixed -port 13702 -db /tmp/e2e-verify.db -public public &
[store] Added inactive column to observers

$ curl 'http://localhost:13702/api/observers'
returned=31    # was 0 before fix
```

`go test ./...` passes (19.8s).

## Lessons

I should have run `sqlite3 fixture "SELECT ... WHERE inactive ..."`
directly the first time the map test failed after #954 instead of
writing four "fix" PRs that didn't address the actual mechanism.
Apologies for the wild goose chase.

Co-authored-by: Kpa-clawbot <bot@example.invalid>
2026-05-01 19:04:23 -07:00
Kpa-clawbot 568de4b441 fix(observers): exclude soft-deleted observers from /api/observers and totalObservers (#954)
## Bug

`/api/observers` returned soft-deleted (inactive=1) observers. Operators
saw stale observers in the UI even after the auto-prune marked them
inactive on schedule. Reproduced on staging: 14 observers older than 14
days returned by the API; all of them had `inactive=1` in the DB.

## Root cause

`DB.GetObservers()` (`cmd/server/db.go:974`) ran `SELECT ... FROM
observers ORDER BY last_seen DESC` with no WHERE filter. The
`RemoveStaleObservers` path correctly soft-deletes by setting
`inactive=1`, but the read path didn't honor it.

`statsRow` (`cmd/server/db.go:234`) had the same bug — `totalObservers`
count included soft-deleted rows.

## Fix

Add `WHERE inactive IS NULL OR inactive = 0` to both:

```go
// GetObservers
"SELECT ... FROM observers WHERE inactive IS NULL OR inactive = 0 ORDER BY last_seen DESC"

// statsRow.TotalObservers
"SELECT COUNT(*) FROM observers WHERE inactive IS NULL OR inactive = 0"
```

`NULL` check preserves backward compatibility with rows from before the
`inactive` migration.

## Tests

Added regression `TestGetObservers_ExcludesInactive`:
- Seed two observers, mark one inactive, assert `GetObservers()` returns
only the other.
- **Anti-tautology gate verified**: reverting the WHERE clause causes
the test to fail with `expected 1 observer, got 2` and `inactive
observer obs2 should be excluded`.

`go test ./...` passes (19.6s).

## Out of scope

- `GetObserverByID` lookup at line 1009 still returns inactive observers
— this is intentional, so an old deep link to `/observers/<id>` shows
"inactive" rather than 404.
- Frontend may also have its own caching layer; this fix is server-side
only.

---------

Co-authored-by: Kpa-clawbot <bot@example.invalid>
Co-authored-by: you <you@example.com>
Co-authored-by: KpaBap <kpabap@gmail.com>
2026-05-01 17:51:08 +00:00
Kpa-clawbot 57e272494d feat(server): /api/healthz readiness endpoint gated on store load (#955) (#956)
## Summary

Fixes RCA #2 from #955: the HTTP listener and `/api/stats` go live
before background goroutines (pickBestObservation, neighbor graph build)
finish, causing CI readiness checks to pass prematurely.

## Changes

1. **`cmd/server/healthz.go`** — New `GET /api/healthz` endpoint:
   - Returns `503 {"ready":false,"reason":"loading"}` while background init
     is running
   - Returns `200 {"ready":true,"loadedTx":N,"loadedObs":N}` once ready

2. **`cmd/server/main.go`** — Added `sync.WaitGroup` tracking
pickBestObservation and neighbor graph build goroutines. A coordinator
goroutine sets `readiness.Store(1)` when all complete.
`backfillResolvedPathsAsync` is NOT gated (async by design, can take 20+
min).

3. **`cmd/server/routes.go`** — Wired `/api/healthz` before system
endpoints.

4. **`.github/workflows/deploy.yml`** — CI wait-for-ready loop now polls
`/api/healthz` instead of `/api/stats`.

5. **`cmd/server/healthz_test.go`** — Tests for 503-before-ready,
200-after-ready, JSON shape, and anti-tautology gate.

## Rule 18 Verification

Built and ran against `test-fixtures/e2e-fixture.db` (499 tx):
- With the small fixture DB, init completes in <300ms so both immediate
and delayed curls return 200
- Unit tests confirm 503 behavior when `readiness=0` (simulating slow
init)
- On production DBs with 100K+ txs, the 503 window would be 5-15s
(pickBestObservation processes in 5000-tx chunks with 10ms yields)

## Test Results

```
=== RUN   TestHealthzNotReady    --- PASS
=== RUN   TestHealthzReady       --- PASS  
=== RUN   TestHealthzAntiTautology --- PASS
ok  github.com/corescope/server  19.662s (full suite)
```

Co-authored-by: you <you@example.com>
2026-05-01 07:55:57 -07:00
efiten e460932668 fix(store): apply retentionHours cutoff in Load() to prevent OOM on cold start (#917)
## Problem

`Load()` loaded all transmissions from the DB regardless of
`retentionHours`, so `buildSubpathIndex()` processed the full DB history
on every startup. On a DB with ~280K paths this produces ~13.5M subpath
index entries, OOM-killing the process before it ever starts listening —
causing a supervisord crash loop with no useful error message.

## Fix

Apply the same `retentionHours` cutoff to `Load()`'s SQL that
`EvictStale()` already uses at runtime. Both conditions
(`retentionHours` window and `maxPackets` cap) are combined with AND so
neither safety limit is bypassed.

Startup now builds indexes only over the retention window, making
startup time and memory proportional to recent activity rather than
total DB history.

## Docs

- `config.example.json`: adds `retentionHours` to the `packetStore`
block with recommended value `168` (7 days) and a warning about `0` on
large DBs
- `docs/user-guide/configuration.md`: documents the field and adds an
explicit OOM warning

## Test plan

- [x] `cd cmd/server && go test ./... -run TestRetentionLoad` — covers
the retention-filtered load: verifies packets outside the window are
excluded, and that `retentionHours: 0` still loads everything
- [x] Deploy on an instance with a large DB (>100K paths) and
`retentionHours: 168` — server reaches "listening" in seconds instead of
OOM-crashing
- [x] Verify `config.example.json` has `retentionHours: 168` in the
`packetStore` block
- [x] Verify `docs/user-guide/configuration.md` documents the field and
warning

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com>
2026-05-01 06:47:55 +00:00
Kpa-clawbot aeae7813bc fix: enable SQLite incremental auto-vacuum so DB shrinks after retention (#919) (#920)
Closes #919

## Summary

Enables SQLite incremental auto-vacuum so the database file actually
shrinks after retention reaper deletes old data. Previously, `DELETE`
operations freed pages internally but never returned disk space to the
OS.

## Changes

### 1. Auto-vacuum on new databases
- `PRAGMA auto_vacuum = INCREMENTAL` set via DSN pragma before
`journal_mode(WAL)` in the ingestor's `OpenStoreWithInterval`
- Must be set before any tables are created; DSN ordering ensures this

### 2. Post-reaper incremental vacuum
- `PRAGMA incremental_vacuum(N)` runs after every retention reaper cycle
(packets, metrics, observers, neighbor edges)
- N defaults to 1024 pages, configurable via `db.incrementalVacuumPages`
- Noop on `auto_vacuum=NONE` databases (safe before migration)
- Added to both server and ingestor
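A minimal sketch of the pragma ordering and the post-reaper step; `buildDSN` and `vacuumSQL` are hypothetical helper names, only the pragma syntax matches the description above:

```go
package main

import "fmt"

// buildDSN sketches the DSN ordering: auto_vacuum must precede
// journal_mode so it takes effect before any table is created.
func buildDSN(path string) string {
	return "file:" + path +
		"?_pragma=auto_vacuum(INCREMENTAL)" + // must come first
		"&_pragma=journal_mode(WAL)"
}

// vacuumSQL sketches the post-reaper step; pages<=0 falls back to the
// documented default of 1024 (db.incrementalVacuumPages).
func vacuumSQL(pages int) string {
	if pages <= 0 {
		pages = 1024
	}
	return fmt.Sprintf("PRAGMA incremental_vacuum(%d);", pages)
}

func main() {
	fmt.Println(buildDSN("mesh.db"))
	fmt.Println(vacuumSQL(0))
}
```

On an `auto_vacuum=NONE` database the incremental vacuum statement is a harmless no-op, which is what makes running it before migration safe.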

### 3. Opt-in full VACUUM for existing databases
- Startup check logs a clear warning if `auto_vacuum != INCREMENTAL`
- `db.vacuumOnStartup: true` config triggers one-time `PRAGMA
auto_vacuum = INCREMENTAL; VACUUM`
- Logs start/end time for operator visibility

### 4. Documentation
- `docs/user-guide/configuration.md`: retention section notes that
lowering retention doesn't immediately shrink the DB
- `docs/user-guide/database.md`: new guide covering WAL, auto-vacuum,
migration, manual VACUUM

### 5. Tests
- `TestNewDBHasIncrementalAutoVacuum` — fresh DB gets `auto_vacuum=2`
- `TestExistingDBHasAutoVacuumNone` — old DB stays at `auto_vacuum=0`
- `TestVacuumOnStartupMigratesDB` — full VACUUM sets `auto_vacuum=2`
- `TestIncrementalVacuumReducesFreelist` — DELETE + vacuum shrinks
freelist
- `TestCheckAutoVacuumLogs` — handles both modes without panic
- `TestConfigIncrementalVacuumPages` — config defaults and overrides

## Migration path for existing databases

1. On startup, CoreScope logs: `[db] auto_vacuum=NONE — DB needs
one-time VACUUM...`
2. Set `db.vacuumOnStartup: true` in config.json
3. Restart — VACUUM runs (blocks startup, minutes on large DBs)
4. Remove `vacuumOnStartup` after migration

## Test results

```
ok  github.com/corescope/server    19.448s
ok  github.com/corescope/ingestor  30.682s
```

---------

Co-authored-by: you <you@example.com>
2026-04-30 23:45:00 -07:00
Kpa-clawbot 54f7f9d35b feat: path-prefix candidate inspector with map view (#944) (#945)
## feat: path-prefix candidate inspector with map view (#944)

Implements the locked spec from #944: a beam-search-based path prefix
inspector that enumerates candidate full-pubkey paths from short hex
prefixes and scores them.

### Server (`cmd/server/path_inspect.go`)

- **`POST /api/paths/inspect`** — accepts 1-64 hex prefixes (1-3 bytes,
uniform length per request)
- Beam search (width 20) over cached `prefixMap` + `NeighborGraph`
- Per-hop scoring: edge weight (35%), GPS plausibility (20%), recency
(15%), prefix selectivity (30%)
- Geometric mean aggregation with 0.05 floor per hop
- Speculative threshold: score < 0.7
- Score cache: 30s TTL, keyed by (prefixes, observer, window)
- Cold-start: synchronous NeighborGraph rebuild with 2s hard timeout →
503 `{retry:true}`
- Body limit: 4096 bytes via `http.MaxBytesReader`
- Zero SQL queries in handler hot path
- Request validation: rejects empty, odd-length, >3 bytes, mixed
lengths, >64 hops
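The scoring scheme above can be sketched as follows; component inputs are assumed to be pre-normalized to 0..1, and the function names are illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// scoreHop combines the four documented signals with the spec'd weights
// (35/20/15/30) and applies the 0.05 per-hop floor so one dead hop
// cannot zero the whole path under the geometric mean.
func scoreHop(edgeWeight, gpsPlausibility, recency, selectivity float64) float64 {
	s := 0.35*edgeWeight + 0.20*gpsPlausibility + 0.15*recency + 0.30*selectivity
	return math.Max(s, 0.05)
}

// pathScore aggregates hop scores by geometric mean; candidates scoring
// below 0.7 are flagged speculative.
func pathScore(hops []float64) (score float64, speculative bool) {
	if len(hops) == 0 {
		return 0, true
	}
	logSum := 0.0
	for _, h := range hops {
		logSum += math.Log(h)
	}
	score = math.Exp(logSum / float64(len(hops)))
	return score, score < 0.7
}

func main() {
	s, spec := pathScore([]float64{
		scoreHop(0.9, 0.8, 0.7, 0.9),
		scoreHop(0.2, 0.1, 0.1, 0.2),
	})
	fmt.Printf("%.3f speculative=%v\n", s, spec)
}
```

The geometric mean (rather than arithmetic) means a single weak hop drags the whole candidate down, which is the desired behavior for route plausibility.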

### Frontend (`public/path-inspector.js`)

- New page under Tools route with input field (comma/space separated hex
prefixes)
- Client-side validation with error feedback
- Results table: rank, score (color-coded speculative), path names,
per-hop evidence (collapsed)
- "Show on Map" button calls `drawPacketRoute` (one path at a time,
clears prior)
- Deep link: `#/tools/path-inspector?prefixes=2c,a1,f4`

### Nav reorganization

- `Traces` nav item renamed to `Tools`
- Backward-compat: `#/traces/<hash>` redirects to `#/tools/trace/<hash>`
- Tools sub-routing dispatches to traces or path-inspector

### Store changes

- Added `LastSeen time.Time` to `nodeInfo` struct, populated from
`nodes.last_seen`
- Added `inspectMu` + `inspectCache` fields to `PacketStore`

### Tests

- **Go unit tests** (`path_inspect_test.go`): scoreHop components, beam
width cap, speculative flag, all validation error cases, valid request
integration
- **Frontend tests** (`test-path-inspector.js`): parse
comma/space/mixed, validation (empty, odd, >3 bytes, mixed lengths,
invalid hex, valid)
- Anti-tautology gate verified: removing beam pruning fails width test;
removing validation fails reject tests

### CSS

- `--path-inspector-speculative` variable in both themes (amber, WCAG AA
on both dark/light backgrounds)
- All colors via CSS variables (no hardcoded hex in production code)

Closes #944

---------

Co-authored-by: you <you@example.com>
2026-04-30 23:28:16 -07:00
Kpa-clawbot 5678874128 fix: exclude non-repeater nodes from path-hop resolution (#935) (#936)
Fixes #935

## Problem

`buildPrefixMap()` indexed ALL nodes regardless of role, causing
companions/sensors to appear as repeater hops when their pubkey prefix
collided with a path-hop hash byte.

## Fix

### Server (`cmd/server/store.go`)
- Added `canAppearInPath(role string) bool` — allowlist of roles that
can forward packets (repeater, room_server, room)
- `buildPrefixMap` now skips nodes that fail this check

### Client (`public/hop-resolver.js`)
- Added matching `canAppearInPath(role)` helper
- `init()` now only populates `prefixIdx` for path-eligible nodes
- `pubkeyIdx` remains complete — `resolveFromServer()` still resolves
any node type by full pubkey (for server-confirmed `resolved_path`
arrays)
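A sketch of the shared allowlist; the role strings come straight from the description above, while the helper body itself is an assumption:

```go
package main

import "fmt"

// canAppearInPath mirrors the server-side allowlist: only roles that
// forward packets (repeater, room_server, room) are indexed as possible
// path hops; companions and sensors are excluded.
func canAppearInPath(role string) bool {
	switch role {
	case "repeater", "room_server", "room":
		return true
	}
	return false
}

func main() {
	for _, r := range []string{"repeater", "companion", "sensor", "room"} {
		fmt.Println(r, canAppearInPath(r))
	}
}
```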

## Tests

- `cmd/server/prefix_map_role_test.go`: 7 new tests covering role
filtering in prefix map and resolveWithContext
- `test-hop-resolver-affinity.js`: 4 new tests verifying client-side
role filter + pubkeyIdx completeness
- All existing tests updated to include `Role: "repeater"` where needed
- `go test ./cmd/server/...` — PASS
- `node test-hop-resolver-affinity.js` — 16/17 pass (1 pre-existing
centroid failure unrelated to this change)

## Commits

1. `fix: filter prefix map to only repeater/room roles (#935)` — server
implementation
2. `test: prefix map role filter coverage (#935)` — server tests
3. `ui: filter HopResolver prefix index to repeater/room roles (#935)` —
client implementation
4. `test: hop-resolver role filter coverage (#935)` — client tests

---------

Co-authored-by: you <you@example.com>
2026-04-30 09:25:51 -07:00
Kpa-clawbot 6ca5e86df6 fix: compute hex-dump byte ranges client-side from per-obs raw_hex (#891)
## Symptom
The colored byte strip in the packet detail pane is offset from the
labeled byte breakdown below it. Off by N bytes where N is the
difference between the top-level packet's path length and the displayed
observation's path length.

## Root cause
Server computes `breakdown.ranges` once from the top-level packet's
raw_hex (in `BuildBreakdown`) and ships it in the API response. After
#882 we render each observation's own raw_hex, but we keep using the
top-level breakdown — so a 7-hop top-level packet shipped "Path: bytes
2-8", and when we rendered an 8-hop observation we coloured 7 of the 8
path bytes and bled into the payload.

The labeled rows below (which use `buildFieldTable`) parse the displayed
raw_hex on the client, so they were correct — they just didn't match the
strip above.

## Fix
Port `BuildBreakdown()` to JS as `computeBreakdownRanges()` in `app.js`.
Use it in `renderDetail()` from the actually-rendered (per-obs) raw_hex.

## Test
Manually verified the JS function output matches the Go implementation
for FLOOD/non-transport, transport, ADVERT, and direct-advert (zero
hops) cases.

Closes nothing (caught in post-tag bug bash).

---------

Co-authored-by: you <you@example.com>
2026-04-21 22:17:14 -07:00
Kpa-clawbot 56ec590bc4 fix(#886): derive path_json from raw_hex at ingest (#887)
## Problem

Per-observation `path_json` disagrees with `raw_hex` path section for
TRACE packets.

**Reproducer:** packet `af081a2c41281b1e`, observer `lutin🏡`
- `path_json`: `["67","33","D6","33","67"]` (5 hops — from TRACE
payload)
- `raw_hex` path section: `30 2D 0D 23` (4 bytes — SNR values in header)

## Root Cause

`DecodePacket` correctly parses TRACE packets by replacing `path.Hops`
with hop IDs from the payload's `pathData` field (the actual route).
However, the header path bytes for TRACE packets contain **SNR values**
(one per completed hop), not hop IDs.

`BuildPacketData` used `decoded.Path.Hops` to build `path_json`, which
for TRACE packets contained the payload-derived hops — not the header
path bytes that `raw_hex` stores. This caused `path_json` and `raw_hex`
to describe completely different paths.

## Fix

- Added `DecodePathFromRawHex(rawHex)` — extracts header path hops
directly from raw hex bytes, independent of any TRACE payload
overwriting.
- `BuildPacketData` now calls `DecodePathFromRawHex(msg.Raw)` instead of
using `decoded.Path.Hops`, guaranteeing `path_json` always matches the
`raw_hex` path section.
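A hedged sketch of the idea, assuming the simple wire layout implied elsewhere in this log (header byte, path-length byte, then path bytes) and ignoring transport-route framing; the real `DecodePathFromRawHex` handles hash sizes and transport codes:

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// decodePathFromRawHex extracts header path bytes directly from the
// wire bytes, so path_json can never disagree with raw_hex even when a
// TRACE payload has overwritten the decoded hop list. Offsets here are
// assumptions for illustration (path starts at byte 2).
func decodePathFromRawHex(rawHex string) ([]byte, error) {
	raw, err := hex.DecodeString(rawHex)
	if err != nil {
		return nil, err
	}
	if len(raw) < 2 {
		return nil, errors.New("too short for header")
	}
	pathLen := int(raw[1])
	if 2+pathLen > len(raw) {
		return nil, errors.New("path length exceeds packet")
	}
	// header path bytes, never the TRACE payload hops
	return raw[2 : 2+pathLen], nil
}

func main() {
	// reproducer bytes from above: 30 2D 0D 23 are SNR values stored in
	// the header path section
	hops, err := decodePathFromRawHex("11" + "04" + "302d0d23" + "ff")
	fmt.Println(hops, err)
}
```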

## Tests (8 new)

**`DecodePathFromRawHex` unit tests:**
- hash_size 1, 2, 3, 4
- zero-hop direct packets
- transport route (4-byte transport codes before path)

**`BuildPacketData` integration tests:**
- TRACE packet: asserts path_json matches raw_hex header path (not
payload hops)
- Non-TRACE packet: asserts path_json matches raw_hex header path

All existing tests continue to pass (`go test ./...` for both ingestor
and server).

Fixes #886

---------

Co-authored-by: you <you@example.com>
2026-04-21 21:13:58 -07:00
Kpa-clawbot a605518d6d fix(#881): per-observation raw_hex — each observer sees different bytes on air (#882)
## Problem

Each MeshCore observer receives a physically distinct over-the-air byte
sequence for the same transmission (different path bytes, flags/hops
remaining). The `observations` table stored only `path_json` per
observer — all observations pointed at one `transmissions.raw_hex`. This
prevented the hex pane from updating when switching observations in the
packet detail view.

## Changes

| Layer | Change |
|-------|--------|
| **Schema** | `ALTER TABLE observations ADD COLUMN raw_hex TEXT`
(nullable). Migration: `observations_raw_hex_v1` |
| **Ingestor** | `stmtInsertObservation` now stores per-observer
`raw_hex` from MQTT payload |
| **View** | `packets_v` uses `COALESCE(o.raw_hex, t.raw_hex)` —
backward compatible with NULL historical rows |
| **Server** | `enrichObs` prefers `obs.RawHex` when non-empty, falls
back to `tx.RawHex` |
| **Frontend** | No changes — `effectivePkt.raw_hex` already flows
through `renderDetail` |
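The COALESCE rule reduces to a one-line precedence check on the server side; a sketch (helper name is illustrative):

```go
package main

import "fmt"

// effectiveRawHex mirrors COALESCE(o.raw_hex, t.raw_hex): prefer the
// observer's own over-the-air bytes, fall back to the transmission's
// bytes for historical rows where the per-obs column is NULL/empty.
func effectiveRawHex(obsRawHex, txRawHex string) string {
	if obsRawHex != "" {
		return obsRawHex
	}
	return txRawHex
}

func main() {
	fmt.Println(effectiveRawHex("", "aabb"))     // historical row: falls back
	fmt.Println(effectiveRawHex("ccdd", "aabb")) // per-obs bytes win
}
```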

## Tests

- **Ingestor**: `TestPerObservationRawHex` — two MQTT packets for same
hash from different observers → both stored with distinct raw_hex
- **Server**: `TestPerObservationRawHexEnrich` — enrichObs returns
per-obs raw_hex when present, tx fallback when NULL
- **E2E**: Playwright assertion in `test-e2e-playwright.js` for hex pane
update on observation switch

E2E assertion added: `test-e2e-playwright.js:1794`

## Scope

- Historical observations: raw_hex stays NULL, UI falls back to
transmission raw_hex silently
- No backfill, no path_json reconstruction, no frontend changes

Closes #881

---------

Co-authored-by: you <you@example.com>
2026-04-21 13:45:29 -07:00
Kpa-clawbot 42ff5a291b fix(#866): full-page obs-switch — update hex + path + direction per observation (#870)
## Problem

On `/#/packets/<hash>?obs=<id>`, clicking a different observation
updated summary fields (Observer, SNR/RSSI, Timestamp) but **not** hex
payload or path details. Sister bug to #849 (fixed in #851 for the
detail dialog).

## Root Causes

| Cause | Impact |
|-------|--------|
| `selectPacket` called `renderDetail` without `selectedObservationId` |
Initial render missed observation context on some code paths |
| `ObservationResp` missing `direction`, `resolved_path`, `raw_hex` |
Frontend obs-switch lost direction and resolved_path context |
| `obsPacket` construction omitted `direction` field | Direction not
preserved when switching observations |

## Fix

- `selectPacket` explicitly passes `selectedObservationId` to
`renderDetail`
- `ObservationResp` gains `Direction`, `ResolvedPath`, `RawHex` fields
- `mapSliceToObservations` copies the three new fields
- `obsPacket` spreads include `direction` from the observation

## Tests

7 new tests in `test-frontend-helpers.js`:
- Observation switch updates `effectivePkt` path
- `raw_hex` preserved from packet when obs has none
- `raw_hex` from obs overrides when API provides it
- `direction` carried through observation spread
- `resolved_path` carried through observation spread
- `getPathLenOffset` cross-check for transport routes
- URL hash `?obs=` round-trip encoding

All 584 frontend + 62 filter + 29 aging tests pass. Go server tests
pass.

Fixes #866

Co-authored-by: you <you@example.com>
2026-04-21 10:40:52 -07:00
Kpa-clawbot 441409203e feat(#845): bimodal_clock severity — surface flaky-RTC nodes instead of hiding as 'No Clock' (#850)
## Problem

Nodes with flaky RTC (firmware emitting interleaved good and nonsense
timestamps) were classified as `no_clock` because the broken samples
poisoned the recent median. Operators lost visibility into these nodes —
they showed "No Clock" even though ~60% of their adverts had valid
timestamps.

Observed on staging: a node with 31K samples where recent adverts
interleave good skew (-6.8s, -13.6s) with firmware nonsense (-56M, -60M
seconds). Under the old logic, median of the mixed window → `no_clock`.

## Solution

New `bimodal_clock` severity tier that surfaces flaky-RTC nodes with
their real (good-sample) skew value.

### Classification order (first match wins)

| Severity | Good Fraction | Description |
|----------|--------------|-------------|
| `no_clock` | < 10% | Essentially no real clock |
| `bimodal_clock` | 10–80% (and bad > 0) | Mixed good/bad — flaky RTC |
| `ok`/`warn`/`critical`/`absurd` | ≥ 80% | Normal classification |

"Good" = `|skew| <= 1 hour`; "bad" = likely uninitialized RTC nonsense.

When `bimodal_clock`, `recentMedianSkewSec` is computed from **good
samples only**, so the dashboard shows the real working-clock value
(e.g. -7s) instead of the broken median.
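The tiering can be sketched as below, with thresholds from the table; the ok/warn/critical/absurd sub-tiers are collapsed to `"normal"` here, and the function shape is an assumption:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const bimodalSkewThresholdSec = 3600 // "good" sample = |skew| <= 1 hour

// classifyClock sketches the first-match-wins tiering: <10% good means
// no_clock, 10-80% good means bimodal_clock, otherwise normal
// classification. For bimodal nodes the reported median comes from the
// good samples only.
func classifyClock(recentSkews []float64) (severity string, medianGood float64) {
	if len(recentSkews) == 0 {
		return "no_clock", 0
	}
	var good []float64
	for _, s := range recentSkews {
		if math.Abs(s) <= bimodalSkewThresholdSec {
			good = append(good, s)
		}
	}
	frac := float64(len(good)) / float64(len(recentSkews))
	bad := len(recentSkews) - len(good)
	switch {
	case frac < 0.10:
		return "no_clock", 0
	case frac < 0.80 && bad > 0:
		// report the real working-clock value, not the poisoned median
		sort.Float64s(good)
		return "bimodal_clock", good[len(good)/2]
	default:
		return "normal", 0
	}
}

func main() {
	// interleaved good skew and firmware nonsense, as in the staging node
	sev, med := classifyClock([]float64{-6.8, -56e6, -13.6, -60e6, -7.1})
	fmt.Println(sev, med)
}
```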

### Backend changes
- New constant `BimodalSkewThresholdSec = 3600`
- New severity `bimodal_clock` in classification logic
- New API fields: `goodFraction`, `recentBadSampleCount`,
`recentSampleCount`

### Frontend changes
- Amber `Bimodal` badge with tooltip showing bad-sample percentage
- Bimodal nodes render skew value like ok/warn/severe (not the "No
Clock" path)
- Warning line below sparkline: "⚠️ X of last Y adverts had nonsense
timestamps (likely RTC reset)"

### Tests
- 3 new Go unit tests: bimodal (60% good → bimodal_clock), all-bad (→
no_clock), 90%-good (→ ok)
- 1 new frontend test: bimodal badge rendering with tooltip
- Existing `TestReporterScenario_789` passes unchanged

Builds on #789 (recent-window severity).

Closes #845

---------

Co-authored-by: you <you@example.com>
2026-04-21 09:11:14 -07:00
Kpa-clawbot a371d35bfd feat(#847): dedupe Top Longest Hops by pair + add obs count and SNR cues (#848)
## Problem

The "Top 20 Longest Hops" RF analytics card shows the same repeater pair
filling most slots because the query sorts raw hop records by distance
with no pair deduplication. A single long link observed 12+ times
dominates the leaderboard.

## Fix

Dedupe by unordered `(pk1, pk2)` pair. Per pair, keep the max-distance
record and compute reliability metrics:

| Column | Description |
|--------|-------------|
| **Obs** | Total observations of this link |
| **Best SNR** | Maximum SNR seen (dB) |
| **Median SNR** | Median SNR across all observations (dB) |

Tooltip on each row shows the timestamp of the best observation.

### Before
| # | From | To | Distance | Type | SNR | Packet |
|---|------|----|----------|------|-----|--------|
| 1 | NodeX | NodeY | 200 mi | R↔R | 5 dB | abc… |
| 2 | NodeX | NodeY | 199 mi | R↔R | 6 dB | def… |
| 3 | NodeX | NodeY | 198 mi | R↔R | 4 dB | ghi… |

### After
| # | From | To | Distance | Type | Obs | Best SNR | Median SNR | Packet |
|---|------|----|----------|------|-----|----------|------------|--------|
| 1 | NodeX | NodeY | 200 mi | R↔R | 12 | 8.0 dB | 5.2 dB | abc… |
| 2 | NodeA | NodeB | 150 mi | C↔R | 3 | 6.5 dB | 6.5 dB | jkl… |
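A sketch of the pair-dedupe aggregation, under the simplifying assumption that SNR is always present (the real code also handles nil SNR); type and function names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

type hopRec struct {
	a, b string // endpoint pubkeys
	dist float64
	snr  float64
}

type linkAgg struct {
	maxDist, bestSnr float64
	snrs             []float64
	obs              int
}

// pairKey is order-independent so (A,B) and (B,A) merge into one link.
func pairKey(a, b string) string {
	if a > b {
		a, b = b, a
	}
	return a + "|" + b
}

// dedupeByPair keeps the max-distance record per link and accumulates
// obs count / best SNR / the SNR series for the median column.
func dedupeByPair(recs []hopRec) map[string]linkAgg {
	out := map[string]linkAgg{}
	for _, r := range recs {
		k := pairKey(r.a, r.b)
		agg := out[k]
		if agg.obs == 0 || r.dist > agg.maxDist {
			agg.maxDist = r.dist
		}
		if agg.obs == 0 || r.snr > agg.bestSnr {
			agg.bestSnr = r.snr
		}
		agg.snrs = append(agg.snrs, r.snr)
		agg.obs++
		out[k] = agg
	}
	return out
}

func medianOf(v []float64) float64 {
	sort.Float64s(v)
	return v[len(v)/2]
}

func main() {
	aggs := dedupeByPair([]hopRec{{"X", "Y", 200, 5}, {"Y", "X", 199, 6}, {"X", "Y", 198, 4}})
	a := aggs[pairKey("X", "Y")]
	fmt.Println(a.obs, a.maxDist, a.bestSnr, medianOf(a.snrs))
}
```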

## Changes

- **`cmd/server/store.go`**: Group `filteredHops` by unordered pair key,
accumulate obs count / best SNR / median SNR per group, sort by max
distance, take top 20
- **`cmd/server/types.go`**: Update `DistanceHop` struct — replace `SNR`
with `BestSnr`, `MedianSnr`, add `ObsCount`
- **`public/analytics.js`**: Replace single SNR column with Obs, Best
SNR, Median SNR; add row tooltip with best observation timestamp
- **`cmd/server/store_tophops_test.go`**: 3 unit tests — basic dedupe,
reverse-pair merge, nil SNR edge case

## Test Coverage

- `TestDedupeTopHopsByPair`: 5 records on pair (A,B) + 1 on (C,D) → 2
results, correct obsCount/dist/bestSnr/medianSnr
- `TestDedupeTopHopsReversePairMerges`: (B,A) and (A,B) merge into one
entry
- `TestDedupeTopHopsNilSNR`: all-nil SNR records → bestSnr and medianSnr
both nil
- Existing `TestAnalyticsRFEndpoint` and `TestAnalyticsRFWithRegion`
still pass

Closes #847

---------

Co-authored-by: you <you@example.com>
2026-04-21 09:09:39 -07:00
Kpa-clawbot 3f26dc7190 obs: surface real RSS alongside tracked store bytes in /api/stats (#832) (#835)
Closes #832.

## Root cause confirmed
`trackedMB` (`s.trackedBytes` in `store.go`) only sums per-packet
struct + payload sizes recorded at insertion. It excludes the index maps
(`byHash`, `byTxID`, `byNode`, `byObserver`, `byPathHop`,
`byPayloadType`, hash-prefix maps, name lookups), the analytics LRUs
(rfCache/topoCache/hashCache/distCache/subpathCache/chanCache/collisionCache),
WS broadcast queues, and Go runtime overhead. It's "useful packet
bytes," not RSS — typically 3–5× off on staging.

## Fix (Option C from the issue)
Expose four memory fields on `/api/stats` from a single cached
snapshot:

| Field | Source | Semantics |
|---|---|---|
| `storeDataMB` | `s.trackedBytes` | in-store packet bytes; eviction
watermark input |
| `goHeapInuseMB` | `runtime.MemStats.HeapInuse` | live Go heap |
| `goSysMB` | `runtime.MemStats.Sys` | total Go-managed memory |
| `processRSSMB` | `/proc/self/status VmRSS` (Linux), falls back to
`goSysMB` | what the kernel sees |

`trackedMB` is retained as a deprecated alias for `storeDataMB` so
existing dashboards/QA scripts keep working.

Field invariants are documented on `MemorySnapshot`: `processRSSMB ≥
goSysMB ≥ goHeapInuseMB ≥ storeDataMB` (typical).
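The VmRSS read can be sketched as a pure string parse (field layout per proc(5), `strconv`/`strings` only, no shell-out); the function name is illustrative:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVmRSSKB extracts the VmRSS value (in kB) from /proc/self/status
// content. Returns false when the field is absent (e.g. non-Linux),
// where the caller falls back to goSysMB.
func parseVmRSSKB(status string) (int64, bool) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "VmRSS:") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			return 0, false
		}
		kb, err := strconv.ParseInt(fields[1], 10, 64)
		return kb, err == nil
	}
	return 0, false
}

func main() {
	kb, ok := parseVmRSSKB("VmPeak:\t  123 kB\nVmRSS:\t  204800 kB\n")
	fmt.Println(kb/1024, "MB", ok) // prints "200 MB true"
}
```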

## Performance
Single `getMemorySnapshot` call cached for 1s —
`runtime.ReadMemStats` (stop-the-world) and the `/proc/self/status`
read are amortized across burst polling. `/proc` read is bounded to 8
KiB, parsed with `strconv` only — no shell-out, no untrusted input.

`cgoBytesMB` is omitted: the build uses pure-Go
`modernc.org/sqlite`, so there is no cgo allocator to measure.
Documented in code comment.

## Tests
`cmd/server/stats_memory_test.go` asserts presence, types, sign, and
ordering invariants. Avoids the flaky "matches RSS to ±X%" pattern.

```
$ go test ./... -count=1 -timeout 180s
ok  	github.com/corescope/server	19.410s
```

## QA plan
§1.4 now compares `processRSSMB` against procfs RSS (the right
invariant); threshold stays at 0.20.

---------

Co-authored-by: MeshCore Agent <meshcore-agent@openclaw.local>
2026-04-20 23:10:33 -07:00
Kpa-clawbot 886aabf0ae fix(#827): /api/packets/{hash} falls back to DB when in-memory store misses (#831)
Closes #827.

## Problem
`/api/packets/{hash}` only consulted the in-memory `PacketStore`. When a
packet aged out of memory, the handler 404'd — even though SQLite still
had it and `/api/nodes/{pubkey}` `recentAdverts` (which reads from the
DB) was actively surfacing the hash. Net effect: the **Analyze →** link
on older adverts in the node detail page led to a dead "Not found".

Two-store inconsistency: DB has the packet, in-memory doesn't, node
detail surfaces it from DB → packet detail can't serve it.

## Fix
In `handlePacketDetail`:
- After in-memory miss, fall back to `db.GetPacketByHash` (already
existed) for hash lookups, and `db.GetTransmissionByID` for numeric IDs.
- Track when the result came from the DB; if so and the store has no
observations, populate from DB via a new `db.GetObservationsForHash` so
the response shows real observations instead of the misleading
`observation_count = 1` fallback.
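The two-tier lookup reduces to this shape; plain maps stand in for the real `PacketStore` and DB layer, and the names are illustrative:

```go
package main

import "fmt"

type packet struct{ hash string }

// lookupPacket sketches the read path: in-memory store first, SQLite
// second, so links to aged-out packets keep working. The "src" return
// tells the caller to also load observations from the DB on fallback.
func lookupPacket(store, db map[string]packet, hash string) (packet, bool, string) {
	if p, ok := store[hash]; ok {
		return p, true, "store" // store result wins; no double-fetch
	}
	if p, ok := db[hash]; ok {
		return p, true, "db" // fall back to the durable copy
	}
	return packet{}, false, "" // absent from both -> 404
}

func main() {
	db := map[string]packet{"abc": {"abc"}}
	_, ok, src := lookupPacket(map[string]packet{}, db, "abc")
	fmt.Println(ok, src)
}
```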

## Tests
- `TestPacketDetailFallsBackToDBWhenStoreMisses` — insert a packet
directly into the DB after `store.Load()`, confirm store doesn't have
it, assert 200 + populated observations.
- `TestPacketDetail404WhenAbsentFromBoth` — neither store nor DB → 404
(no false positives).
- `TestPacketDetailPrefersStoreOverDB` — both have it; store result wins
(no double-fetch).
- `TestHandlePacketDetailNoStore` updated: it previously asserted the
old buggy 404 behavior; now asserts the correct DB-fallback 200.

All `go test ./... -run "PacketDetail|Packet|GetPacket"` and the full
`cmd/server` suite pass.

## Out of scope
The `/api/packets?hash=` filter is the live in-memory list endpoint and
intentionally store-only for performance. Not touched here — happy to
file a follow-up if you'd rather harmonise.

## Repro context
Verified against prod with a recently-adverting repeater whose recent
advert hash lives in `recentAdverts` (DB) but had been evicted from the
in-memory store; pre-fix 404, post-fix 200 with full observations.

Co-authored-by: you <you@example.com>
2026-04-20 22:50:01 -07:00
Kpa-clawbot a0fddb50aa fix(#789): severity from recent samples; Theil-Sen drift with outlier rejection (#828)
Closes #789.

## The two bugs

1. **Severity from stale median.** `classifySkew(absMedian)` used the
all-time `MedianSkewSec` over every advert ever recorded for the node. A
repeater that was off for hours and then GPS-corrected stayed pinned to
`absurd` because hundreds of historical bad samples poisoned the median.
Reporter's case: `medianSkewSec: -59,063,561.8` while `lastSkewSec:
-0.8` — current health was perfect, dashboard said catastrophic.

2. **Drift from a single correction jump.** Drift used OLS over every
`(ts, skew)` pair, with no outlier rejection. A single GPS-correction
event (skew jumps millions of seconds in ~30s) dominated the regression
and produced `+1,793,549.9 s/day` — physically nonsense; the existing
`maxReasonableDriftPerDay` cap then zeroed it (better than absurd, but
still useless).

## The two fixes

1. **Recent-window severity.** New field `recentMedianSkewSec` = median
over the last `N=5` samples or last `1h`, whichever is narrower (more
current view). Severity now derives from `abs(recentMedianSkewSec)`.
`MeanSkewSec`, `MedianSkewSec`, `LastSkewSec` are preserved unchanged so
the frontend, fleet view, and any external consumers continue to work.

2. **Theil-Sen drift with outlier filter.** Drift now uses the Theil-Sen
estimator (median of all pairwise slopes — textbook robust regression,
~29% breakdown point) on a series pre-filtered to drop samples whose
skew jumps more than `maxPlausibleSkewJumpSec = 60s` from the previous
accepted point. Real µC drift is fractions of a second per advert; clock
corrections fall well outside. Capped at `theilSenMaxPoints = 200`
(most-recent) so O(n²) stays bounded for chatty nodes.
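A sketch of the filter-then-Theil-Sen pipeline described above, with the 60s jump constant from this PR; the point cap and function names are illustrative:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const maxPlausibleSkewJumpSec = 60.0

type sample struct{ ts, skew float64 } // ts in seconds

// filterJumps drops samples whose skew jumps more than 60s from the
// previously accepted point: clock corrections, not real uC drift.
func filterJumps(in []sample) []sample {
	var out []sample
	for _, s := range in {
		if len(out) > 0 && math.Abs(s.skew-out[len(out)-1].skew) > maxPlausibleSkewJumpSec {
			continue
		}
		out = append(out, s)
	}
	return out
}

// theilSen returns the median of all pairwise slopes, in skew-seconds
// per second (multiply by 86400 for s/day). O(n^2), hence the 200-point
// cap in the real code.
func theilSen(pts []sample) float64 {
	var slopes []float64
	for i := 0; i < len(pts); i++ {
		for j := i + 1; j < len(pts); j++ {
			if dt := pts[j].ts - pts[i].ts; dt != 0 {
				slopes = append(slopes, (pts[j].skew-pts[i].skew)/dt)
			}
		}
	}
	sort.Float64s(slopes)
	return slopes[len(slopes)/2]
}

func main() {
	// 30 hours of clean 1 s/hour drift plus one 1000s correction jump
	// that the filter removes before regression
	var pts []sample
	for i := 0; i < 30; i++ {
		pts = append(pts, sample{float64(i * 3600), float64(i)})
	}
	pts = append(pts, sample{31 * 3600, 1030})
	fmt.Printf("%.1f s/day\n", theilSen(filterJumps(pts))*86400) // prints "24.0 s/day"
}
```

A single OLS fit over the same data would be dominated by the jump; the median-of-slopes estimator ignores it even before filtering, and the filter makes the result robust to repeated corrections.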

## What stays the same

- Epoch-0 / out-of-range advert filter (PR #769).
- `minDriftSamples = 5` floor.
- `maxReasonableDriftPerDay = 86400` hard backstop.
- API shape: only additions (`recentMedianSkewSec`); no fields removed
or renamed.

## Tests

All in `cmd/server/clock_skew_test.go`:

- `TestSeverityUsesRecentNotMedian` — 100 bad samples (-60s) + 5 good
(-1s) → severity = `ok`, historical median still huge.
- `TestDriftRejectsCorrectionJump` — 30 min of clean linear drift + one
1000s jump → drift small (~12 s/day).
- `TestTheilSenMatchesOLSWhenClean` — clean linear data, Theil-Sen
within ~1% of OLS.
- `TestReporterScenario_789` — exact reproducer: 1662 samples, 1657 @
-683 days then 5 @ -1s → severity `ok`, `recentMedianSkewSec ≈ 0`, drift
bounded; legacy `medianSkewSec` preserved as historical context.

`go test ./... -count=1` (cmd/server) and `node
test-frontend-helpers.js` both pass.

---------

Co-authored-by: clawbot <bot@corescope.local>
Co-authored-by: you <you@example.com>
2026-04-20 22:47:10 -07:00
efiten 7f024b7aa7 fix(#673): replace raw JSON text search with byNode index for node packet queries (#803)
## Summary

Fixes #673

- GRP_TXT packets whose message text contains a node's pubkey were
incorrectly counted as packets for that node, inflating packet counts
and type breakdowns
- Two code paths in `store.go` used `strings.Contains` on the full
`DecodedJSON` blob — this matched pubkeys appearing anywhere in the
JSON, including inside chat message text
- `filterPackets` slow path (combined node + other filters): replaced
substring search with a hash-set membership check against
`byNode[nodePK]`
- `GetNodeAnalytics`: removed the full-packet-scan + text search branch
entirely; always uses the `byNode` index (which already covers
`pubKey`/`destPubKey`/`srcPubKey` via structured field indexing)

## Test Plan

- [x] `TestGetNodeAnalytics_ExcludesGRPTXTWithPubkeyInText` — verifies a
GRP_TXT packet with the node's pubkey in its text field is not counted
in that node's analytics
- [x] `TestFilterPackets_NodeQueryDoesNotMatchChatText` — verifies the
combined-filter slow path of `filterPackets` returns only the indexed
ADVERT, not the chat packet

Both tests were written as failing tests against the buggy code and pass
after the fix.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 22:15:02 -07:00
Kpa-clawbot 2460e33f94 fix(#810): /health.recentPackets resolved_path falls back to longest sibling obs (#821)
## What + why

`fetchResolvedPathForTxBest` (used by every API path that fills the
top-level `resolved_path`, including
`/api/nodes/{pk}/health.recentPackets`) picked the observation with the
longest `path_json` and queried SQL for that single obs ID. When the
longest-path obs had `resolved_path` NULL but a shorter sibling had one,
the helper returned nil and the top-level field was dropped — even
though the data exists. QA #809 §2.1 caught it on the health endpoint
because that page surfaces it per-tx.

Fix: keep the LRU-friendly fast path (try the longest-path obs), then
fall back to scanning all observations of the tx and picking the longest
`path_json` that actually has a stored `resolved_path`.

## Changes
- `cmd/server/resolved_index.go`: extend `fetchResolvedPathForTxBest`
with a fallback through `fetchResolvedPathsForTx`.
- `cmd/server/issue810_repro_test.go`: regression test — seeds a tx
whose longest-path obs lacks `resolved_path` and a shorter sibling has
it, then asserts `/api/packets` and
`/api/nodes/{pk}/health.recentPackets` agree.

## Tests
`go test ./... -count=1` from `cmd/server` — PASS (full suite, ~19s).

## Perf
Fast path unchanged (single LRU/SQL lookup, dominant case). Fallback
only runs when the longest-path obs has NULL `resolved_path` — one
indexed query per affected tx, bounded by observations-per-tx (small).

Closes #810

---------

Co-authored-by: you <you@example.com>
2026-04-21 04:51:24 +00:00
Kpa-clawbot d7fe24e2db Fix channel filter on Packets page (UI + API) — #812 (#816)
Closes #812

## Root causes

**Server (`/api/packets?channel=…` returned identical totals):**
The handler in `cmd/server/routes.go` never read the `channel` query
parameter into `PacketQuery`, so it was silently ignored by both the
SQLite path (`db.go::buildTransmissionWhere`) and the in-memory path
(`store.go::filterPackets`). The codebase already had everything else in
place — the `channel_hash` column with an index from #762, decoded
`channel` / `channelHashHex` fields on each packet — it just wasn't
wired up.

**UI (`/#/packets` had no channel filter):**
`public/packets.js` rendered observer / type / time-window / region
filters but no channel control, and didn't read `?channel=` from the
URL.

## Fix

### Server
- New `Channel` field on `PacketQuery`; `handlePackets` reads
`r.URL.Query().Get("channel")`.
- DB path filters by the indexed `channel_hash` column (exact match).
- In-memory path: helper `packetMatchesChannel` matches
`decoded.channel` (plaintext, e.g. `#test`, `public`) or `enc_<HEX>`
against `channelHashHex` for undecryptable GRP_TXT. Uses cached
`ParsedDecoded()` so it's O(1) after first parse. Fast-path index guards
and the grouped-cache key updated to include channel.
- Regression test (`channel_filter_test.go`): `channel=#test` returns ≥1
GRP_TXT packet and fewer than baseline; `channel=nonexistentchannel`
returns `total=0`.
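The in-memory matcher can be sketched as below; `packetMatchesChannel` is the name from this PR, but the body here is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// packetMatchesChannel: a query matches either the decoded plaintext
// channel (e.g. "#test", "public") or, for undecryptable GRP_TXT, the
// "enc_<HEX>" form compared against channelHashHex.
func packetMatchesChannel(query, decodedChannel, channelHashHex string) bool {
	if query == "" {
		return true // no channel filter
	}
	if query == decodedChannel {
		return true
	}
	if hexPart, ok := strings.CutPrefix(query, "enc_"); ok {
		return strings.EqualFold(hexPart, channelHashHex)
	}
	return false
}

func main() {
	fmt.Println(packetMatchesChannel("#test", "#test", ""))
	fmt.Println(packetMatchesChannel("enc_A1", "", "a1"))
}
```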

### UI
- New `<select id="fChannel">` populated from `/api/channels`.
- Round-trips via `?channel=…` on the URL hash (read on init, written on
change).
- Pre-seeds the current value as an option so encrypted hashes not in
`/api/channels` still display as selected on reload.
- On change, calls `loadPackets()` so the server-side filter applies
before pagination.

## Perf

Filter adds at most one cached map lookup per packet (DB path uses
indexed column, store path uses `ParsedDecoded()` cache). Staging
baseline 149–190 ms for `?channel=#test&limit=50`; the new comparison is
negligible. Target ≤ 500 ms preserved.

## Tests
`cd cmd/server && go test ./... -count=1 -timeout 120s` → PASS.

---------

Co-authored-by: you <you@example.com>
2026-04-20 21:46:34 -07:00
Kpa-clawbot 9e90548637 perf(#800): remove per-StoreTx ResolvedPath, replace with membership index + on-demand decode (#806)
## Summary

Remove `ResolvedPath []*string` field from `StoreTx` and `StoreObs`
structs, replacing it with a compact membership index + on-demand SQL
decode. This eliminates the dominant heap cost identified in profiling
(#791, #799).

**Spec:** #800 (consolidated from two rounds of expert + implementer
review on #799)

Closes #800
Closes #791

## Design

### Removed
- `StoreTx.ResolvedPath []*string`
- `StoreObs.ResolvedPath []*string`
- `TransmissionResp.ResolvedPath`, `ObservationResp.ResolvedPath` struct
fields

### Added
| Structure | Purpose | Est. cost at 1M obs |
|---|---|---:|
| `resolvedPubkeyIndex map[uint64][]int` | FNV-1a(pubkey) → []txID
forward index | 50–120 MB |
| `resolvedPubkeyReverse map[int][]uint64` | txID → []hashes for clean
removal | ~40 MB |
| `apiResolvedPathLRU` (10K entries) | FIFO cache for on-demand API
decode | ~2 MB |

### Decode-window discipline
`resolved_path` JSON decoded once per packet. Consumers fed in order,
temp slice dropped — never stored on struct:
1. `addToByNode` — relay node indexing
2. `touchRelayLastSeen` — relay liveness DB updates
3. `byPathHop` resolved-key entries
4. `resolvedPubkeyIndex` + reverse insert
5. WebSocket broadcast map (raw JSON bytes)
6. Persist batch (raw JSON bytes for SQL UPDATE)

### Collision safety
When the forward index returns candidates, a batched SQL query confirms
exact pubkey presence using `LIKE '%"pubkey"%'` on the `resolved_path`
column.

### Feature flag
`useResolvedPathIndex` (default `true`). Off-path is conservative: all
candidates kept, index not consulted. For one-release rollback safety.

## Files changed

| File | Changes |
|---|---|
| `resolved_index.go` | **New** — index structures, LRU cache, on-demand
SQL helpers, collision safety |
| `store.go` | Remove RP fields, decode-window discipline in
Load/Ingest, on-demand txToMap/obsToMap/enrichObs, eviction cleanup via
SQL, memory accounting update |
| `types.go` | Remove RP fields from TransmissionResp/ObservationResp |
| `routes.go` | Replace `nodeInResolvedPath` with
`nodeInResolvedPathViaIndex`, remove RP from mapSlice helpers |
| `neighbor_persist.go` | Refactor backfill: reverse-map removal →
forward+reverse insert → LRU invalidation |

## Tests added (27 new)

**Unit:**
- `TestStoreTx_ResolvedPathFieldAbsent` — reflection guard
- `TestResolvedPubkeyIndex_BuildFromLoad` — forward+reverse consistency
- `TestResolvedPubkeyIndex_HashCollision` — SQL collision safety
- `TestResolvedPubkeyIndex_IngestUpdate` — maps reflect new ingests
- `TestResolvedPubkeyIndex_RemoveOnEvict` — clean removal via reverse
map
- `TestResolvedPubkeyIndex_PerObsCoverage` — non-best obs pubkeys
indexed
- `TestAddToByNode_WithoutResolvedPathField`
- `TestTouchRelayLastSeen_WithoutResolvedPathField`
- `TestWebSocketBroadcast_IncludesResolvedPath`
- `TestBackfill_InvalidatesLRU`
- `TestEviction_ByNodeCleanup_OnDemandSQL`
- `TestExtractResolvedPubkeys`, `TestMergeResolvedPubkeys`
- `TestResolvedPubkeyHash_Deterministic`
- `TestLRU_EvictionOnFull`

**Endpoint:**
- `TestPathsThroughNode_NilResolvedPathFallback`
- `TestPacketsAPI_OnDemandResolvedPath`
- `TestPacketsAPI_OnDemandResolvedPath_LRUHit`
- `TestPacketsAPI_OnDemandResolvedPath_Empty`

**Feature flag:**
- `TestFeatureFlag_OffPath_PreservesOldBehavior`
- `TestFeatureFlag_Toggle_NoStateLeak`

**Concurrency:**
- `TestReverseMap_NoLeakOnPartialFailure`
- `TestDecodeWindow_LockHoldTimeBounded`
- `TestLivePolling_LRUUnderConcurrentIngest`

**Regression:**
- `TestRepeaterLiveness_StillAccurate`

**Benchmarks:**
- `BenchmarkLoad_BeforeAfter`
- `BenchmarkResolvedPubkeyIndex_Memory`
- `BenchmarkPathsThroughNode_Latency`
- `BenchmarkLivePolling_UnderIngest`

## Benchmark results

```
BenchmarkResolvedPubkeyIndex_Memory/pubkeys=50K     429ms  103MB   777K allocs
BenchmarkResolvedPubkeyIndex_Memory/pubkeys=500K   4205ms  896MB  7.67M allocs
BenchmarkLoad_BeforeAfter                            65ms   20MB   202K allocs
BenchmarkPathsThroughNode_Latency                   3.9µs    0B      0 allocs
BenchmarkLivePolling_UnderIngest                    5.4µs  545B      7 allocs
```

Key: per-obs `[]*string` overhead completely eliminated. At 1M obs with
3 hops average, this saves ~72 bytes/obs × 1M = ~68 MB just from the
slice headers + pointers, plus the JSON-decoded string data (~900 MB at
scale per profiling).

## Design choices

- **FNV-1a instead of xxhash**: stdlib availability, no external
dependency. Performance is equivalent for this use case (pubkey strings
are short).
- **FIFO LRU instead of true LRU**: simpler implementation, adequate for
the access pattern (mostly sequential obs IDs from live polling).
- **Grouped packets view omits resolved_path**: cold path, not worth SQL
round-trip per page render.
- **Backfill pending check uses reverse-map presence** instead of
per-obs field: if a tx has any indexed pubkeys, its observations are
considered resolved.


Closes #807

---------

Co-authored-by: you <you@example.com>
2026-04-20 19:55:00 -07:00
Kpa-clawbot a8e1cea683 fix: use payload type bits only in content hash (not full header byte) (#787)
## Problem

The firmware computes packet content hash as:

```
SHA256(payload_type_byte + [path_len for TRACE] + payload)
```

Where `payload_type_byte = (header >> 2) & 0x0F` — just the payload type
bits (2-5).

CoreScope was using the **full header byte** in its hash computation,
which includes route type bits (0-1) and version bits (6-7). This meant
the same logical packet produced different content hashes depending on
route type — breaking dedup and packet lookup.

**Firmware reference:** `Packet.cpp::calculatePacketHash()` uses
`getPayloadType()` which returns `(header >> PH_TYPE_SHIFT) &
PH_TYPE_MASK`.

## Fix

- Extract only payload type bits: `payloadType := (headerByte >> 2) &
0x0F`
- Include `path_len` byte in hash for TRACE packets (matching firmware
behavior)
- Applied to both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go`

## Tests Added

- **Route type independence:** Same payload with FLOOD vs DIRECT route
types produces identical hash
- **TRACE path_len inclusion:** TRACE packets with different `path_len`
produce different hashes
- **Firmware compatibility:** Hash output matches manual computation of
firmware algorithm

## Migration Impact

Existing packets in the DB have content hashes computed with the old
(incorrect) formula. Options:

1. **Recompute hashes** via migration (recommended for clean state)
2. **Dual lookup** — check both old and new hash on queries (backward
compat)
3. **Accept the break** — old hashes become stale, new packets get
correct hashes

Recommend option 1 (migration) as a follow-up. The volume of affected
packets depends on how many distinct route types were seen for the same
logical packet.

Fixes #786

---------

Co-authored-by: you <you@example.com>
2026-04-18 11:52:22 -07:00
Kpa-clawbot bf674ebfa2 feat: validate advert signatures on ingest, reject corrupt packets (#794)
## Summary

Validates ed25519 signatures on ADVERT packets during MQTT ingest.
Packets with invalid signatures are rejected before storage, preventing
corrupt/truncated adverts from polluting the database.

## Changes

### Ingestor (`cmd/ingestor/`)

- **Signature validation on ingest**: After decoding an ADVERT, checks
`SignatureValid` from the decoder. Invalid signatures → packet dropped,
never stored.
- **Config flag**: `validateSignatures` (default `true`). Set to `false`
to disable validation for backward compatibility with existing installs.
- **`dropped_packets` table**: New SQLite table recording every rejected
packet with full attribution:
- `hash`, `raw_hex`, `reason`, `observer_id`, `observer_name`,
`node_pubkey`, `node_name`, `dropped_at`
  - Indexed on `observer_id` and `node_pubkey` for investigation queries
- **`SignatureDrops` counter**: New atomic counter in `DBStats`, logged
in periodic stats output as `sig_drops=N`
- **Retention**: `dropped_packets` pruned alongside metrics on the same
`retention.metricsDays` schedule

### Server (`cmd/server/`)

- **`GET /api/dropped-packets`** (API key required): Returns recent
drops with optional `?observer=` and `?pubkey=` filters, `?limit=`
(default 100, max 500)
- **`signatureDrops`** field added to `/api/stats` response (count from
`dropped_packets` table)

### Tests (8 new)

| Test | What it verifies |
|------|-----------------|
| `TestSigValidation_ValidAdvertStored` | Valid advert passes validation
and is stored |
| `TestSigValidation_TamperedSignatureDropped` | Tampered signature →
dropped, recorded in `dropped_packets` with correct fields |
| `TestSigValidation_TruncatedAppdataDropped` | Truncated appdata
invalidates signature → dropped |
| `TestSigValidation_DisabledByConfig` | `validateSignatures: false`
skips validation, stores tampered packet |
| `TestSigValidation_DropCounterIncrements` | Counter increments
correctly across multiple drops |
| `TestSigValidation_LogContainsFields` | `dropped_packets` row contains
hash, reason, observer, pubkey, name |
| `TestPruneDroppedPackets` | Old entries pruned, recent entries
retained |
| `TestShouldValidateSignatures_Default` | Config helper returns correct
defaults |

### Config example

```json
{
  "validateSignatures": true
}
```

Fixes #793

---------

Co-authored-by: you <you@example.com>
2026-04-18 11:39:13 -07:00
Kpa-clawbot d596becca3 feat: bounded cold load — limit Load() by memory budget (#790)
## Implements #748 M1 — Bounded Cold Load

### Problem
`Load()` pulls the ENTIRE database into RAM before eviction runs. On a
1GB database, this means 3+ GB peak memory at startup, regardless of
`maxMemoryMB`. This is the root cause of #743 (OOM on 2GB VMs).

### Solution
Calculate the maximum number of transmissions that fit within the
`maxMemoryMB` budget and use a SQL subquery LIMIT to load only the
newest packets.

**Two-phase approach** (a LIMIT applied directly to the JOIN would cap
joined observation rows, not transmissions):
```sql
SELECT ... FROM transmissions t
LEFT JOIN observations o ON ...
WHERE t.id IN (SELECT id FROM transmissions ORDER BY first_seen DESC LIMIT ?)
ORDER BY t.first_seen ASC, o.timestamp DESC
```

### Changes
- **`estimateStoreTxBytesTypical(numObs)`** — estimates memory cost of a
typical transmission without needing an actual `StoreTx` instance. Used
for budget calculation.
- **Budget calculation in `Load()`** — `maxPackets = (maxMemoryMB *
1048576) / avgBytesPerPacket` with a floor of 1000 packets.
- **Subquery LIMIT** — loads only the newest N transmissions when
bounded.
- **`oldestLoaded` tracking** — records the oldest packet timestamp in
memory so future SQL fallback queries (M2+) know where in-memory data
ends.
- **Perf stats** — `oldestLoaded` exposed in `/api/perf/store-stats`.
- **Logging** — bounded loads show `Loaded X/Y transmissions (limited by
ZMB budget)`.

### When `maxMemoryMB=0` (unlimited)
Behavior is completely unchanged — no LIMIT clause, all packets loaded.

### Tests (6 new)
| Test | Validates |
|------|-----------|
| `TestBoundedLoad_LimitedMemory` | With 1MB budget, loads fewer than
total (hits 1000 minimum) |
| `TestBoundedLoad_NewestFirst` | Loaded packets are the newest, not
oldest |
| `TestBoundedLoad_OldestLoadedSet` | `oldestLoaded` matches first
packet's `FirstSeen` |
| `TestBoundedLoad_UnlimitedWithZero` | `maxMemoryMB=0` loads all
packets |
| `TestBoundedLoad_AscendingOrder` | Packets remain in ascending
`first_seen` order after bounded load |
| `TestEstimateStoreTxBytesTypical` | Estimate grows with observation
count, exceeds floor |

Plus benchmarks: `BenchmarkLoad_Bounded` vs `BenchmarkLoad_Unlimited`.

### Perf justification
On a 5000-transmission test DB with 1MB budget:
- Bounded: loads 1000 packets (the minimum) in ~1.3s
- The subquery satisfies its ORDER BY + LIMIT with a descending scan of
SQLite's index on `first_seen` (no full sort), then the observations JOIN
is indexed
- No full table scan needed when bounded

### Next milestones
- **M2**: Packet list/search SQL fallback (uses `oldestLoaded` boundary)
- **M3**: Node analytics SQL fallback
- **M4-M5**: Remaining endpoint fallbacks + live-only memory store

---------

Co-authored-by: you <you@example.com>
2026-04-17 18:35:44 -07:00
Joel Claw b9ba447046 feat: add nodeBlacklist config to hide abusive/troll nodes (#742)
## Problem

Some mesh participants set offensive names, report deliberately false
GPS positions, or otherwise troll the network. Instance operators
currently have no way to hide these nodes from public-facing APIs
without deleting the underlying data.

## Solution

Add a `nodeBlacklist` array to `config.json` containing public keys of
nodes to exclude from all API responses.

### Blacklisted nodes are filtered from:

- `GET /api/nodes` — list endpoint
- `GET /api/nodes/search` — search results
- `GET /api/nodes/{pubkey}` — detail (returns 404)
- `GET /api/nodes/{pubkey}/health` — returns 404
- `GET /api/nodes/{pubkey}/paths` — returns 404
- `GET /api/nodes/{pubkey}/analytics` — returns 404
- `GET /api/nodes/{pubkey}/neighbors` — returns 404
- `GET /api/nodes/bulk-health` — filtered from results

### Config example

```json
{
  "nodeBlacklist": [
    "aabbccdd...",
    "11223344..."
  ]
}
```

### Design decisions

- **Case-insensitive** — public keys normalized to lowercase
- **Whitespace trimming** — leading/trailing whitespace handled
- **Empty entries ignored** — `""` or `" "` do not cause false positives
- **Nil-safe** — `IsBlacklisted()` on nil Config returns false
- **Backward-compatible** — empty/missing `nodeBlacklist` has zero
effect
- **Lazy-cached set** — blacklist converted to `map[string]bool` on
first lookup

### What this does NOT do (intentionally)

- Does **not** delete or modify database data — only filters API
responses
- Does **not** block packet ingestion — data still flows for analytics
- Does **not** filter `/api/packets` — only node-facing endpoints are
affected

## Testing

- Unit tests for `Config.IsBlacklisted()` (case sensitivity, whitespace,
empty entries, nil config)
- Integration tests for `/api/nodes`, `/api/nodes/{pubkey}`,
`/api/nodes/search`
- Full test suite passes with no regressions
2026-04-17 23:43:05 +00:00