## Problem
Adding a channel key in the Channels UI for a channel the server already
knows about (e.g. `#public` from rainbow / config) leaves the
localStorage entry **unremovable**:
- `mergeUserChannels` sees the name already exists in the channel list
and skips the user entry.
- The existing channel row is never marked `userAdded:true`.
- The ✕ button (`[data-remove-channel]`) is only rendered for
`userAdded` rows.
- Result: stuck localStorage key, no UI to delete it.
There was also a latent bug in the remove handler — for non-`user:`
rows, it used the raw hash (e.g. `enc_11`) as the
`ChannelDecrypt.removeKey()` argument, but the storage key is the
channel **name**.
## Fix
1. **`mergeUserChannels`**: when a stored key matches an existing
channel by name/hash, mark the existing channel `userAdded=true` so the
✕ renders on it. (No magical/auto deletion of stored keys — the user
explicitly chooses to remove.)
2. **Remove handler**:
- Look up the channel object to get the correct display name for the
localStorage key.
- Keep server-known channels in the list when their ✕ is clicked (only
the user's localStorage entry + cache are cleared, `userAdded` is
unset). The channel still exists upstream.
- Pure `user:`-prefixed channels are removed from the list as before.
## Repro
1. Open Channels.
2. Add a key for `#public` (or any rainbow-known channel).
3. Reload. Before this PR: row has no ✕, key is stuck. After this PR: ✕
appears, click clears the local key and cache.
## Files
- `public/channels.js` only.
## Notes
- No backend changes.
- No new APIs.
- Behaviour for purely user-added channels (e.g. `user:#somechannel` not
known to the server) is unchanged.
---------
Co-authored-by: you <you@example.com>
## Symptom
🔍 Details button in the nodes side panel does nothing on click.
## Root cause (4th regression of the same shape)
- Row click → `selectNode()` → `history.replaceState(null, '',
'#/nodes/' + pk)`
- Details button click → `location.hash = '#/nodes/' + pk`
- Hash is already that value → assignment is a no-op → no `hashchange`
event → no router → panel stays open.
## Fix
Mirror the analytics-link branch already inside the panel click handler:
`destroy()` then `init(appEl, pubkey)` directly (which hits the
`directNode` full-screen branch unconditionally). Also `replaceState` to
keep the URL in sync.
## Test
New Playwright E2E: open side panel via row click, click Details, assert
`.node-fullscreen` appears.
## Why this keeps regressing
Every time we tighten the row-click handler to use `replaceState`
(correct — avoids hashchange flicker), the button-click handler that
uses `location.hash` becomes a no-op for the same pubkey. Need to
remember they're coupled. Worth a follow-up to extract a
`navigateToNode(pk)` helper that always works regardless of current hash
state — filing as #890-followup if not already there.
Co-authored-by: you <you@example.com>
## Symptom
The colored byte strip in the packet detail pane is offset from the
labeled byte breakdown below it. Off by N bytes where N is the
difference between the top-level packet's path length and the displayed
observation's path length.
## Root cause
Server computes `breakdown.ranges` once from the top-level packet's
raw_hex (in `BuildBreakdown`) and ships it in the API response. After
#882 we render each observation's own raw_hex, but we keep using the
top-level breakdown — so a 7-hop top-level packet shipped "Path: bytes
2-8", and when we rendered an 8-hop observation we coloured 7 of the 8
path bytes and bled into the payload.
The labeled rows below (which use `buildFieldTable`) parse the displayed
raw_hex on the client, so they were correct — they just didn't match the
strip above.
## Fix
Port `BuildBreakdown()` to JS as `computeBreakdownRanges()` in `app.js`.
Use it in `renderDetail()` from the actually-rendered (per-obs) raw_hex.
## Test
Manually verified the JS function output matches the Go implementation
for FLOOD/non-transport, transport, ADVERT, and direct-advert (zero
hops) cases.
Closes nothing (caught in post-tag bug bash).
---------
Co-authored-by: you <you@example.com>
## Problem
Per-observation `path_json` disagrees with `raw_hex` path section for
TRACE packets.
**Reproducer:** packet `af081a2c41281b1e`, observer `lutin🏡`
- `path_json`: `["67","33","D6","33","67"]` (5 hops — from TRACE
payload)
- `raw_hex` path section: `30 2D 0D 23` (4 bytes — SNR values in header)
## Root Cause
`DecodePacket` correctly parses TRACE packets by replacing `path.Hops`
with hop IDs from the payload's `pathData` field (the actual route).
However, the header path bytes for TRACE packets contain **SNR values**
(one per completed hop), not hop IDs.
`BuildPacketData` used `decoded.Path.Hops` to build `path_json`, which
for TRACE packets contained the payload-derived hops — not the header
path bytes that `raw_hex` stores. This caused `path_json` and `raw_hex`
to describe completely different paths.
## Fix
- Added `DecodePathFromRawHex(rawHex)` — extracts header path hops
directly from raw hex bytes, independent of any TRACE payload
overwriting.
- `BuildPacketData` now calls `DecodePathFromRawHex(msg.Raw)` instead of
using `decoded.Path.Hops`, guaranteeing `path_json` always matches the
`raw_hex` path section.
## Tests (8 new)
**`DecodePathFromRawHex` unit tests:**
- hash_size 1, 2, 3, 4
- zero-hop direct packets
- transport route (4-byte transport codes before path)
**`BuildPacketData` integration tests:**
- TRACE packet: asserts path_json matches raw_hex header path (not
payload hops)
- Non-TRACE packet: asserts path_json matches raw_hex header path
All existing tests continue to pass (`go test ./...` for both ingestor
and server).
Fixes#886
---------
Co-authored-by: you <you@example.com>
## Problem
On the packet detail pane, the **path pill** (top) and the **byte
breakdown** (bottom) showed different numbers of hops for the same
packet. Example: `46cf35504a21ef0d` rendered as `1 hop` badge followed
by 8 node names in the path pill, while the byte breakdown listed only 1
hop row.
## Root cause
Mixed data sources:
- Path-pill badge used `(raw_hex path_len) & 0x3F` (= firmware truth for
one observer = 1)
- Path-pill names used `path_json.length` (= server-aggregated longest
path across observers = 8)
- Byte breakdown section header used `(raw_hex path_len) & 0x3F` (= 1)
- Byte breakdown rows were sliced from `raw_hex` (= 1 row)
- `renderPath(pathHops, ...)` iterated all `path_json` entries
For group-header view, `packet.path_json` is aggregated across observers
and therefore longer than the raw_hex of any single observer's packet.
## Fix
Both surfaces now render from `pathHops` (= effective observation's
`path_json`). The raw_hex vs path_json mismatch is still logged as a
console.warn for diagnostics, but does not drive the UI.
With per-observation `raw_hex` (#882) shipped, clicking an observation
row already swaps the effective packet so both surfaces stay consistent.
## Testing
- Adds E2E regression `Packet detail path pill and byte breakdown agree
on hop count` that asserts:
1. `pill badge count == byte breakdown section count`
2. `rendered hop names ≈ badge count` (within 1 for separators)
3. `byte breakdown rendered rows == section count`
- Manually reproduced on staging with `46cf35504a21ef0d` (8-name path +
`1 hop` badge before fix).
Related: #881#882#866
---------
Co-authored-by: you <you@example.com>
## Problem
`pickByAffinity` in `hop-resolver.js` picks wrong regional candidates
when 1-byte pubkey prefixes collide. The old implementation only
considers one adjacent hop (forward OR backward pass), leading to
suboptimal picks when both neighbors provide useful context.
Measured on staging: **61.6% of hops have ≥2 same-prefix candidates**,
making collision resolution critical.
## Fix
Replaced the separate forward/backward pass disambiguation with a
**combined iterative resolver** that scores candidates against BOTH prev
and next resolved hops:
1. **Neighbor-graph edge weight** (priority 1): Sum edge scores to prev
+ next pubkeys. Pick max sum.
2. **Geographic centroid** (priority 2): Average lat/lon of prev + next
positions. Pick closest candidate by haversine distance.
3. **Single-anchor geo** (priority 3): When only one neighbor is
resolved, use it directly.
4. **Fallback** (priority 4): First candidate when no context exists.
The iterative approach resolves cascading dependencies — resolving one
ambiguous hop may unlock context for its neighbors.
### Dev-mode trace
Multi-candidate picks now emit: `[hop-resolver] hash=46 candidates=N
scored=[...] chose=<pubkey> method=graph|centroid|fallback`
## Before/After (staging, 1539 packets, 12928 hops)
| Metric | Before | After |
|--------|--------|-------|
| Unreliable hops | 39 (0.3%) | 23 (0.2%) |
| Packets with unreliable | 33 (2.14%) | 17 (1.10%) |
~41% reduction in unreliable hops, ~48% reduction in affected packets.
## Tests
5 new tests in `test-frontend-helpers.js`:
- Graph edge scoring picks correct regional candidate
- Next hop breaks tie when prev has no edges
- Centroid fallback when no graph edges exist
- Centroid uses average of prev+next positions
- Fallback when no context at all
All 595 tests pass. No regressions in `test-packet-filter.js` (62 pass)
or `test-aging.js` (29 pass).
Closes#874
---------
Co-authored-by: you <you@example.com>
## Problem
Each MeshCore observer receives a physically distinct over-the-air byte
sequence for the same transmission (different path bytes, flags/hops
remaining). The `observations` table stored only `path_json` per
observer — all observations pointed at one `transmissions.raw_hex`. This
prevented the hex pane from updating when switching observations in the
packet detail view.
## Changes
| Layer | Change |
|-------|--------|
| **Schema** | `ALTER TABLE observations ADD COLUMN raw_hex TEXT`
(nullable). Migration: `observations_raw_hex_v1` |
| **Ingestor** | `stmtInsertObservation` now stores per-observer
`raw_hex` from MQTT payload |
| **View** | `packets_v` uses `COALESCE(o.raw_hex, t.raw_hex)` —
backward compatible with NULL historical rows |
| **Server** | `enrichObs` prefers `obs.RawHex` when non-empty, falls
back to `tx.RawHex` |
| **Frontend** | No changes — `effectivePkt.raw_hex` already flows
through `renderDetail` |
## Tests
- **Ingestor**: `TestPerObservationRawHex` — two MQTT packets for same
hash from different observers → both stored with distinct raw_hex
- **Server**: `TestPerObservationRawHexEnrich` — enrichObs returns
per-obs raw_hex when present, tx fallback when NULL
- **E2E**: Playwright assertion in `test-e2e-playwright.js` for hex pane
update on observation switch
E2E assertion added: `test-e2e-playwright.js:1794`
## Scope
- Historical observations: raw_hex stays NULL, UI falls back to
transmission raw_hex silently
- No backfill, no path_json reconstruction, no frontend changes
Closes#881
---------
Co-authored-by: you <you@example.com>
## Problem
When a packet group is expanded in the Packets table, clicking any child
row pointed the side pane at the same aggregate packet — not the clicked
observation. URL would flip between `?obs=<packet_id>` values instead of
real observation ids.
## Root cause
The expand fetch used `/api/packets?hash=X&limit=20`, which returns ONE
aggregate row keyed by packet.id. Every child therefore carried
`data-value=<packet.id>`.
## Fix
Switch the expand fetch to `/api/packets/<hash>`, which includes the
full `observations[]` array. Build `_children` as `{...pkt, ...obs}` so
each child row gets a unique observation id and observation-level fields
(observer, path, timestamp, snr/rssi) override the aggregate.
## Verified live on staging
Tested on multiple packets:
- Click group-header → side pane shows observation 1 of N (first
observer)
- Click child row → pane updates to show THAT observer's details:
observer name, path, timestamp, obs counter (K of N), URL
`?obs=<observation_id>`
## Tests
592 frontend tests pass (no new ones — this is a wiring fix, live
E2E-verified instead).
Closes#866
---------
Co-authored-by: Kpa-clawbot <agent@corescope.local>
Co-authored-by: you <you@example.com>
## Problem
The `hop-unreliable` CSS class applied `text-decoration: line-through`
and `opacity: 0.5`, making hop names look "dead" to operators. This
caused confusion — the repeater itself is fine, only the name→hash
assignment is uncertain.
## Fix
- **CSS**: Removed `line-through` and heavy opacity from
`.hop-unreliable`. Kept subtle `opacity: 0.85` for scanability. Added
`.hop-unreliable-btn` style for the new badge.
- **JS**: Added a `⚠️` warning badge button next to unreliable hops
(similar pattern to existing conflict badges). The badge is always
visible, keyboard-focusable, and has both `title` and `aria-label` with
an informative tooltip explaining geographic inconsistency.
- **Tests**: Added 2 tests in `test-frontend-helpers.js` asserting the
badge renders for unreliable hops and does NOT render for reliable ones,
and that no `line-through` is present.
### Before → After
| Before | After |
|--------|-------|
| ~~NodeName~~ (struck through, 50% opacity) | NodeName ⚠️ (normal text,
small warning badge with tooltip) |
## Scope
Resolver logic untouched — #873 covers threshold tuning, #874 covers
picker correctness. No candidate-dropdown UX (follow-up per issue
discussion).
Closes#872
Co-authored-by: you <you@example.com>
## Problem
On `/#/packets/<hash>?obs=<id>`, clicking a different observation
updated summary fields (Observer, SNR/RSSI, Timestamp) but **not** hex
payload or path details. Sister bug to #849 (fixed in #851 for the
detail dialog).
## Root Causes
| Cause | Impact |
|-------|--------|
| `selectPacket` called `renderDetail` without `selectedObservationId` |
Initial render missed observation context on some code paths |
| `ObservationResp` missing `direction`, `resolved_path`, `raw_hex` |
Frontend obs-switch lost direction and resolved_path context |
| `obsPacket` construction omitted `direction` field | Direction not
preserved when switching observations |
## Fix
- `selectPacket` explicitly passes `selectedObservationId` to
`renderDetail`
- `ObservationResp` gains `Direction`, `ResolvedPath`, `RawHex` fields
- `mapSliceToObservations` copies the three new fields
- `obsPacket` spreads include `direction` from the observation
## Tests
7 new tests in `test-frontend-helpers.js`:
- Observation switch updates `effectivePkt` path
- `raw_hex` preserved from packet when obs has none
- `raw_hex` from obs overrides when API provides it
- `direction` carried through observation spread
- `resolved_path` carried through observation spread
- `getPathLenOffset` cross-check for transport routes
- URL hash `?obs=` round-trip encoding
All 584 frontend + 62 filter + 29 aging tests pass. Go server tests
pass.
Fixes#866
Co-authored-by: you <you@example.com>
## Problem
`docker pull` on ARM devices fails because the published image is
amd64-only.
## Fix
Enable multi-arch Docker builds via `docker buildx`. **Builder stage
uses native Go cross-compilation; only the runtime-stage `RUN` steps use
QEMU emulation.**
### Changes
| File | Change |
|------|--------|
| `Dockerfile` | Pin builder stage to `--platform=$BUILDPLATFORM`
(always native), accept `ARG TARGETOS`/`ARG TARGETARCH` from buildx, set
`GOOS=$TARGETOS GOARCH=$TARGETARCH CGO_ENABLED=0` on every `go build` |
| `.github/workflows/deploy.yml` | Add `docker/setup-buildx-action@v3` +
`docker/setup-qemu-action@v3` (latter needed only for runtime-stage
RUNs), set `platforms: linux/amd64,linux/arm64` |
### Build architecture
- **Builder stage** (`FROM --platform=$BUILDPLATFORM
golang:1.22-alpine`) — runs natively on amd64. Go toolchain
cross-compiles the binaries to `$TARGETARCH` via `GOOS/GOARCH`. No
emulation, ~10× faster than emulated builds. Works because
`modernc.org/sqlite` is pure Go (no CGO).
- **Runtime stage** (`FROM alpine:3.20`) — buildx pulls the per-arch
base. RUN steps (`apk add`, `mkdir/chown`, `chmod`) execute inside the
target-arch image, so QEMU is required to interpret arm64 binaries on
the amd64 host. Only a handful of short shell commands run under
emulation, so the QEMU cost is small.
### Verify
After merge, on an ARM device:
```bash
docker pull ghcr.io/kpa-clawbot/corescope:edge
docker inspect ghcr.io/kpa-clawbot/corescope:edge --format '{{.Architecture}}'
# → arm64
```
> First arm64 image appears on the next push to master after this
merges.
Closes#768
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <agent@corescope.local>
## Summary
Four related `/nodes` page fixes batched to avoid merge conflicts (all
touch `public/nodes.js`).
---
### #855 — "Show all neighbors" link doesn't expand
**Problem:** The "View all N neighbors →" link in the side panel
navigated to the full detail page instead of expanding the truncated
list inline.
**Fix:** Replaced navigation link with an inline "Show all N neighbors
▼" button that re-renders the neighbor table without the limit.
**Acceptance:** Click the button → all neighbors appear in the same
panel without page navigation.
Closes#855
---
### #856 — "Details" button is a no-op
**Problem:** The "🔍 Details" link in the side panel was an `<a>` tag
whose `href` matched the current hash (set by `replaceState`), making
clicks a same-hash no-op.
**Fix:** Changed from `<a>` link to a `<button>` with a direct click
handler that sets `location.hash`, ensuring the router always fires.
**Acceptance:** Click "🔍 Details" → navigates to full-screen node detail
view.
Closes#856
---
### #857 — Recent Packets shows bullets but no content
**Problem:** The "Recent Packets (N)" section could render entries with
missing `hash` or `timestamp`, producing colored dots with no meaningful
content beside them.
**Fix:** Added `.filter(a => a.hash && a.timestamp)` before rendering,
and updated the count header to reflect filtered entries only.
**Acceptance:** Recent Packets section only shows entries with valid
data; count matches visible items.
Closes#857
---
### #862 — Pubkey prefix search on /#/nodes
**Problem:** Search box only matched node names. Operators couldn't
search by pubkey prefix.
**Fix:** Extended search to detect hex-only queries (`/^[0-9a-f]+$/i`)
and match them against pubkey prefix (`startsWith`). Non-hex queries
continue matching name as before. Both are composable in the same input.
**Acceptance:**
- Typing `3f` filters to nodes whose pubkey starts with `3f`
- Typing `foo` still filters by name
- Search placeholder updated to indicate pubkey support
5 new unit tests added for the search matching logic.
Closes#862
---------
Co-authored-by: you <you@example.com>
## What
Remove a single line in `makeColumnsResizable()` that set
`th.style.position = 'relative'` on every `<th>` except the last,
overriding the CSS `position: sticky` rule from `.data-table th`.
## Why
The column-resize feature added inline `position: relative` to each
header (except the last) to serve as a containing block for the
absolute-positioned resize handles. This inadvertently broke `position:
sticky` on all headers except "Details" (the last column) — visible on
mobile when scrolling the packets table.
`position: sticky` is itself a positioned value and serves as a
containing block for absolute children, so the resize handles work
identically without the override.
## Test
- Open `/#/packets` on mobile (or narrow viewport)
- Scroll down — ALL column headers now remain sticky at the top
- Column resize handles still function correctly on desktop
Fixes#861
Co-authored-by: you <you@example.com>
Two bugs in the Overview tab Packets/Hour chart:
1. **Bars not rendering**: `barW` went negative when `data.length` was
large (e.g. 720 hours for 30-day range), producing zero-width invisible
bars. Fix: `Math.max(1, ...)` floor on bar width.
2. **X-axis labels overlapping**: Every single hour label was emitted
(`02h03h04h...`). Fix: decimate labels based on time range — every 6h
for ≤24h, every 12h for ≤72h, every 24h beyond. Shows `MM-DD` on
midnight boundaries for multi-day ranges.
**Scope**: Only touches the Overview tab `Packets / Hour` section and
the shared `barChart` floor (one-line change). No modifications to
Topology, Channels, Distance, or other tabs.
Fixes#858
Co-authored-by: you <you@example.com>
Fixes#859
## What
The "Per-Observer Reachability" and "Best Path to Each Node" sections in
the Topology tab had inline `opacity` styles on each `.reach-ring` row
that decreased with hop count (`1 - hops * 0.06`, floored at 0.3). This
made text progressively darker/unreadable toward the bottom.
## Fix
Removed the inline `opacity:${opacity}` style from both
`renderPerObserverReach()` and `renderBestPath()`. The rows now render
at full opacity with text colors governed by CSS variables as intended.
## Changed
- `public/analytics.js`: removed opacity computation and inline style in
two functions (4 lines removed, 2 added)
## Scope
Only touches Per-Observer Reachability and Best Path rendering. No
changes to Overview, Channels, or shared helpers.
Co-authored-by: you <you@example.com>
## What & Why
The "Messages / Hour by Channel" chart on `/#/analytics` Channels tab
rendered all channels in both the SVG and legend, causing legend
overflow when 20+ channels are present.
## Fix
- Sort channels by total message volume (descending)
- Render only the top 8 in the chart and legend
- Show "+N more" in the legend when channels are truncated
- `maxCount` for Y-axis scaling is computed from visible channels only,
so the chart uses its full vertical range
Single-file change: `public/analytics.js` — only
`renderChannelTimeline()` modified. No shared helpers touched.
Fixes#860
Co-authored-by: you <you@example.com>
## Problem
The "Deploy Staging" job's Smoke Test always fails with `Staging
/api/stats did not return engine field`.
Root cause: the step hardcodes `http://localhost:82/api/stats`, but
`docker-compose.staging.yml:21` publishes the container on
`${STAGING_GO_HTTP_PORT:-80}:80`. Default is port 80, not 82. curl gets
ECONNREFUSED, `-sf` swallows the error, `grep -q engine` sees empty
input → failure.
Verified on staging VM: `ss -lntp` shows only `:80` listening; `docker
ps` confirms `0.0.0.0:80->80/tcp`. A `curl http://localhost:82` returns
connection-refused.
## Fix
Read `STAGING_GO_HTTP_PORT` (same default as compose) so the smoke test
tracks the port the container was actually launched on. Failure message
now includes the resolved port to make future port mismatches
self-diagnosing.
## Tested
Logic only — the curl + grep pattern is unchanged. If any CI env
override sets `STAGING_GO_HTTP_PORT`, the smoke test now follows it.
Co-authored-by: Kpa-clawbot <agent@corescope.local>
## Problem
Nodes with flaky RTC (firmware emitting interleaved good and nonsense
timestamps) were classified as `no_clock` because the broken samples
poisoned the recent median. Operators lost visibility into these nodes —
they showed "No Clock" even though ~60% of their adverts had valid
timestamps.
Observed on staging: a node with 31K samples where recent adverts
interleave good skew (-6.8s, -13.6s) with firmware nonsense (-56M, -60M
seconds). Under the old logic, median of the mixed window → `no_clock`.
## Solution
New `bimodal_clock` severity tier that surfaces flaky-RTC nodes with
their real (good-sample) skew value.
### Classification order (first match wins)
| Severity | Good Fraction | Description |
|----------|--------------|-------------|
| `no_clock` | < 10% | Essentially no real clock |
| `bimodal_clock` | 10–80% (and bad > 0) | Mixed good/bad — flaky RTC |
| `ok`/`warn`/`critical`/`absurd` | ≥ 80% | Normal classification |
"Good" = `|skew| <= 1 hour`; "bad" = likely uninitialized RTC nonsense.
When `bimodal_clock`, `recentMedianSkewSec` is computed from **good
samples only**, so the dashboard shows the real working-clock value
(e.g. -7s) instead of the broken median.
### Backend changes
- New constant `BimodalSkewThresholdSec = 3600`
- New severity `bimodal_clock` in classification logic
- New API fields: `goodFraction`, `recentBadSampleCount`,
`recentSampleCount`
### Frontend changes
- Amber `Bimodal` badge with tooltip showing bad-sample percentage
- Bimodal nodes render skew value like ok/warn/severe (not the "No
Clock" path)
- Warning line below sparkline: "⚠️ X of last Y adverts had nonsense
timestamps (likely RTC reset)"
### Tests
- 3 new Go unit tests: bimodal (60% good → bimodal_clock), all-bad (→
no_clock), 90%-good (→ ok)
- 1 new frontend test: bimodal badge rendering with tooltip
- Existing `TestReporterScenario_789` passes unchanged
Builds on #789 (recent-window severity).
Closes#845
---------
Co-authored-by: you <you@example.com>
## Problem
The "Top 20 Longest Hops" RF analytics card shows the same repeater pair
filling most slots because the query sorts raw hop records by distance
with no pair deduplication. A single long link observed 12+ times
dominates the leaderboard.
## Fix
Dedupe by unordered `(pk1, pk2)` pair. Per pair, keep the max-distance
record and compute reliability metrics:
| Column | Description |
|--------|-------------|
| **Obs** | Total observations of this link |
| **Best SNR** | Maximum SNR seen (dB) |
| **Median SNR** | Median SNR across all observations (dB) |
Tooltip on each row shows the timestamp of the best observation.
### Before
| # | From | To | Distance | Type | SNR | Packet |
|---|------|----|----------|------|-----|--------|
| 1 | NodeX | NodeY | 200 mi | R↔R | 5 dB | abc… |
| 2 | NodeX | NodeY | 199 mi | R↔R | 6 dB | def… |
| 3 | NodeX | NodeY | 198 mi | R↔R | 4 dB | ghi… |
### After
| # | From | To | Distance | Type | Obs | Best SNR | Median SNR | Packet
|
|---|------|----|----------|------|-----|----------|------------|--------|
| 1 | NodeX | NodeY | 200 mi | R↔R | 12 | 8.0 dB | 5.2 dB | abc… |
| 2 | NodeA | NodeB | 150 mi | C↔R | 3 | 6.5 dB | 6.5 dB | jkl… |
## Changes
- **`cmd/server/store.go`**: Group `filteredHops` by unordered pair key,
accumulate obs count / best SNR / median SNR per group, sort by max
distance, take top 20
- **`cmd/server/types.go`**: Update `DistanceHop` struct — replace `SNR`
with `BestSnr`, `MedianSnr`, add `ObsCount`
- **`public/analytics.js`**: Replace single SNR column with Obs, Best
SNR, Median SNR; add row tooltip with best observation timestamp
- **`cmd/server/store_tophops_test.go`**: 3 unit tests — basic dedupe,
reverse-pair merge, nil SNR edge case
## Test Coverage
- `TestDedupeTopHopsByPair`: 5 records on pair (A,B) + 1 on (C,D) → 2
results, correct obsCount/dist/bestSnr/medianSnr
- `TestDedupeTopHopsReversePairMerges`: (B,A) and (A,B) merge into one
entry
- `TestDedupeTopHopsNilSNR`: all-nil SNR records → bestSnr and medianSnr
both nil
- Existing `TestAnalyticsRFEndpoint` and `TestAnalyticsRFWithRegion`
still pass
Closes#847
---------
Co-authored-by: you <you@example.com>
## Problem
The Packet Detail dialog summary (Observer, Path, Hops, SNR/RSSI,
Timestamp) used the **aggregated cross-observer view** (`_parsedPath` /
`getParsedPath(pkt)`), which contradicted the byte breakdown after #844.
A packet observed with 2 hops by one observer would show "Path: 7 hops"
in the summary because it merged all observers' paths.
## Fix
The dialog is now **per-observation**:
- `renderDetail` resolves a `currentObservation` from
`selectedObservationId` (set when clicking an observation child row) or
defaults to `observations[0]`
- All summary fields read from the current observation: Observer,
SNR/RSSI, Timestamp, Path, Direction
- Hop count badge comes from `path_len & 0x3F` of the observation's
`raw_hex` (firmware truth, same source as byte breakdown). Cross-checked
against `path_json` length — logs a console warning on mismatch
- **Observations table** rendered inside the detail panel when multiple
observations exist. Clicking a row updates `currentObservation` and
re-renders the summary in-place (no dialog close/reopen)
- `.observation-current` CSS class highlights the selected observation
row
### Cross-observer aggregate (Option B)
A read-only "Cross-observer aggregate" section below the observations
table shows the longest observed path across all observers. This is
**not** the default view — it's always visible as secondary context.
## Tests
8 new tests in `test-frontend-helpers.js`:
- Hop count extraction from raw_hex (normal, direct, transport route
types)
- Inconsistency detection between path_json and raw_hex
- Per-observation field override of aggregated packet fields
- First observation used when no specific observation selected
- Observation row click selects that observation
- Null/missing raw_hex handling
All 572 tests pass (564 frontend + 62 filter + 29 aging).
## Acceptance
- Summary shows per-observation path/hops/SNR/RSSI/timestamp
- Switching observations in the detail updates everything
- Cross-observer aggregate available as secondary section
- Byte breakdown untouched (owned by #846)
## Related
- Closes#849
- Related: #844 (#846) — byte breakdown fix (separate PR, different code
region)
---------
Co-authored-by: you <you@example.com>
## Problem
The Packet Detail dialog's "Packet Byte Breakdown" section was using the
aggregated `_parsedPath` (longest path observed across all observers) to
render hop entries, instead of deriving the hop count from the
`path_len` byte in `raw_hex`. This caused:
- Wrong hop count (e.g., "Path (7 hops)" when `raw_hex` only contains 2)
- Hop values from the aggregated path displayed at incorrect byte
offsets
- Subsequent fields (pubkey, timestamp, signature) rendered at wrong
offsets because `off` was advanced by the wrong amount
## Fix
In `buildFieldTable()` (packets.js), the Path section now:
1. Derives `hashCountVal` from `path_len & 0x3F` (firmware truth per
`Packet.h:79-83`)
2. Derives `hashSize` from `(path_len >> 6) + 1`
3. Reads each hop's hex value directly from `raw_hex` at the correct
byte offset
4. Advances `off` by `hashSize * hashCountVal`
5. Skips the Path section entirely when `hashCountVal === 0` (direct
advert)
The "Path" summary section above the breakdown (which uses the
aggregated path for route visualization) is unchanged — only the byte
breakdown is fixed.
## Tests
3 new tests in `test-frontend-helpers.js`:
- Verifies 2 hops rendered (not 7) when `path_len=0x42` despite 7-hop
aggregated path
- Verifies pubkey offset is 6 (not 16) after a 2-hop path
- Verifies direct advert (`hashCount=0`) skips Path section
Also fixed pre-existing `HopDisplay is not defined` failures in the
`#765` transport offset test sandbox (added mock).
All 559 tests pass.
Closes#844
---------
Co-authored-by: you <you@example.com>
Doc-only reconciliation of v3.6.0-rc plan with what actually shipped.
## Changes
- **§8.2** — desktop deep link now opens full-screen view
(post-#823/#824), not split panel as the plan still asserted.
- **§5.3** — pin that severity now derives from `recentMedianSkewSec`
(#789), not the all-time `medianSkewSec` — a re-tester needs to know
which field drives the badge.
- **§6.2** — pin the existing observer-graph element location
(`public/analytics.js:2048-2051`).
- **New §8.7** — side-panel "Recent Packets" entries must navigate to a
valid packet detail (DB-fallback per #827) AND text must be readable in
the current theme (explicit color per #829). Both bugs surfaced this
session.
No CI gates.
Co-authored-by: Kpa-clawbot <agent@corescope.local>
Closes#840
## What
Switch the map-popup "Show Neighbors" link from `<a href="#">` to `<a
href="javascript:void(0)" role="button">` so iOS Safari doesn't navigate
when the document-level click delegation fails to fire.
## Why
On iOS Safari, when a user taps the link inside a Leaflet popup:
- The document-level click delegation at `public/map.js:927` calls
`e.preventDefault()` and triggers `selectReferenceNode`.
- BUT inside a Leaflet popup, `L.DomEvent.disableClickPropagation()` is
internally applied to popup content — on iOS Safari the click sometimes
doesn't bubble to `document`.
- When that happens, the browser's default `<a href="#">` action runs:
- hash becomes empty (`#`)
- `navigate()` in `app.js:458` sees empty hash → defaults to `'packets'`
- map page is destroyed mid-tap → user perceives "nothing happened" (or
a brief flash if they back-button)
`href="javascript:void(0)"` removes the navigation fall-through
entirely. The `role="button"` keeps a11y semantics, `cursor:pointer`
keeps the visual cue.
## Tested
- Headless Chromium desktop + iPhone 13 emulation: tap fires
`/api/nodes/{pk}/neighbors?min_count=3`, marker count drops from 441 →
44, `#mcNeighbors` checkbox toggles on, URL stays on `/#/map`. Same as
before.
- Frontend helpers: 556/0
- Real iOS Safari fix verification needs a physical-device test
post-deploy
## Out of scope (follow-up)
- Same `<a href="#">` pattern exists for the topright "Close route"
control at `public/map.js:389` — uses `L.DomEvent.preventDefault` so
should work, but worth auditing if the symptom recurs.
Co-authored-by: Kpa-clawbot <agent@corescope.local>
Apply v3.6.0-rc QA learnings to the plan.
## Changes
- **§1.1** — 1 GB cap is unrealistic on real DBs without `GOMEMLIMIT` +
bounded cold-load. Raised target to 3 GB and pointed to follow-up
**#836**. (Investigation showed cold-load transient blows past any
sub-2GB cap regardless of `maxMemoryMB` setting because
`runtime.MemStats.NextGC` ignores cgroup ceilings.)
- **§1.4** — `trackedBytes`/`trackedMB` is in-store packet bytes only
and under-reports RSS by ~3–5× (no indexes, caches, runtime overhead,
cgo). Switched the assertion to use `processRSSMB` exposed by **#832**
(PR **#835**).
- **§11.1** — noted the Playwright deep-link E2E assertion was updated
by **#833** (PR **#834**) to match the post-#823 full-screen behavior.
## Why
Three real findings from the QA ops sweep ([§1.4 fail
comment](https://github.com/Kpa-clawbot/CoreScope/issues/809#issuecomment-4286113141)).
Updating the plan so the next run doesn't replay the same
false-fail/false-pass conditions.
Co-authored-by: Kpa-clawbot <agent@corescope.local>
Closes#825
## Root cause
PR #815 added a `#`-prefix branch in `selectChannel` that
unconditionally rendered the lock affordance whenever the channel object
wasn't in the loaded `channels` list. With the encrypted toggle off,
unencrypted channels like `#test` are also absent from the list, so the
new branch wrongly locked them instead of falling through to the REST
fetch.
## Fix
When no stored key matches, refetch `/channels?includeEncrypted=true`
and check `ch.encrypted` before locking. Only render the lock when we
positively know the channel is encrypted; otherwise fall through to the
existing REST messages fetch.
This regresses #815's behavior **only for the unencrypted case** (which
is the bug). The encrypted-no-key (#811) and encrypted-with-stored-key
(#815) paths are preserved.
## Tests
3 new regression tests in `test-frontend-helpers.js`:
- `#test` (unencrypted) deep link → REST fetched, no lock
- `#private` (encrypted, no key) deep link → lock, no REST (#811
preserved)
- `#private` (encrypted, with stored key) deep link → decrypt path (#815
preserved)
`node test-frontend-helpers.js` → 556 passed, 0 failed.
## Perf
One extra REST call per cold deep link to a `#`-named channel that's not
in the toggle-off list — same endpoint already cached via
`CLIENT_TTL.channels`, so subsequent navigations are free.
---------
Co-authored-by: you <you@example.com>
Closes#832.
## Root cause confirmed
\`trackedMB\` (\`s.trackedBytes\` in \`store.go\`) only sums per-packet
struct + payload sizes recorded at insertion. It excludes the index maps
(\`byHash\`, \`byTxID\`, \`byNode\`, \`byObserver\`, \`byPathHop\`,
\`byPayloadType\`, hash-prefix maps, name lookups), the analytics LRUs
(rfCache/topoCache/hashCache/distCache/subpathCache/chanCache/collisionCache),
WS broadcast queues, and Go runtime overhead. It's \"useful packet
bytes,\" not RSS — typically 3–5× off on staging.
## Fix (Option C from the issue)
Expose four memory fields on \`/api/stats\` from a single cached
snapshot:
| Field | Source | Semantics |
|---|---|---|
| \`storeDataMB\` | \`s.trackedBytes\` | in-store packet bytes; eviction
watermark input |
| \`goHeapInuseMB\` | \`runtime.MemStats.HeapInuse\` | live Go heap |
| \`goSysMB\` | \`runtime.MemStats.Sys\` | total Go-managed memory |
| \`processRSSMB\` | \`/proc/self/status VmRSS\` (Linux), falls back to
\`goSysMB\` | what the kernel sees |
\`trackedMB\` is retained as a deprecated alias for \`storeDataMB\` so
existing dashboards/QA scripts keep working.
Field invariants are documented on \`MemorySnapshot\`: \`processRSSMB ≥
goSysMB ≥ goHeapInuseMB ≥ storeDataMB\` (typical).
## Performance
Single \`getMemorySnapshot\` call cached for 1s —
\`runtime.ReadMemStats\` (stop-the-world) and the \`/proc/self/status\`
read are amortized across burst polling. \`/proc\` read is bounded to 8
KiB, parsed with \`strconv\` only — no shell-out, no untrusted input.
\`cgoBytesMB\` is omitted: the build uses pure-Go
\`modernc.org/sqlite\`, so there is no cgo allocator to measure.
Documented in code comment.
## Tests
\`cmd/server/stats_memory_test.go\` asserts presence, types, sign, and
ordering invariants. Avoids the flaky \"matches RSS to ±X%\" pattern.
\`\`\`
$ go test ./... -count=1 -timeout 180s
ok github.com/corescope/server 19.410s
\`\`\`
## QA plan
§1.4 now compares \`processRSSMB\` against procfs RSS (the right
invariant); threshold stays at 0.20.
---------
Co-authored-by: MeshCore Agent <meshcore-agent@openclaw.local>
Closes#827.
## Problem
`/api/packets/{hash}` only consulted the in-memory `PacketStore`. When a
packet aged out of memory, the handler 404'd — even though SQLite still
had it and `/api/nodes/{pubkey}` `recentAdverts` (which reads from the
DB) was actively surfacing the hash. Net effect: the **Analyze →** link
on older adverts in the node detail page led to a dead "Not found".
Two-store inconsistency: DB has the packet, in-memory doesn't, node
detail surfaces it from DB → packet detail can't serve it.
## Fix
In `handlePacketDetail`:
- After in-memory miss, fall back to `db.GetPacketByHash` (already
existed) for hash lookups, and `db.GetTransmissionByID` for numeric IDs.
- Track when the result came from the DB; if so and the store has no
observations, populate from DB via a new `db.GetObservationsForHash` so
the response shows real observations instead of the misleading
`observation_count = 1` fallback.
## Tests
- `TestPacketDetailFallsBackToDBWhenStoreMisses` — insert a packet
directly into the DB after `store.Load()`, confirm store doesn't have
it, assert 200 + populated observations.
- `TestPacketDetail404WhenAbsentFromBoth` — neither store nor DB → 404
(no false positives).
- `TestPacketDetailPrefersStoreOverDB` — both have it; store result wins
(no double-fetch).
- `TestHandlePacketDetailNoStore` updated: it previously asserted the
old buggy 404 behavior; now asserts the correct DB-fallback 200.
All `go test ./... -run "PacketDetail|Packet|GetPacket"` and the full
`cmd/server` suite pass.
## Out of scope
The `/api/packets?hash=` filter is the live in-memory list endpoint and
intentionally store-only for performance. Not touched here — happy to
file a follow-up if you'd rather harmonise.
## Repro context
Verified against prod with a recently-adverting repeater whose recent
advert hash lives in `recentAdverts` (DB) but had been evicted from the
in-memory store; pre-fix 404, post-fix 200 with full observations.
Co-authored-by: you <you@example.com>
Closes#789.
## The two bugs
1. **Severity from stale median.** `classifySkew(absMedian)` used the
all-time `MedianSkewSec` over every advert ever recorded for the node. A
repeater that was off for hours and then GPS-corrected stayed pinned to
`absurd` because hundreds of historical bad samples poisoned the median.
Reporter's case: `medianSkewSec: -59,063,561.8` while `lastSkewSec:
-0.8` — current health was perfect, dashboard said catastrophic.
2. **Drift from a single correction jump.** Drift used OLS over every
`(ts, skew)` pair, with no outlier rejection. A single GPS-correction
event (skew jumps millions of seconds in ~30s) dominated the regression
and produced `+1,793,549.9 s/day` — physically nonsense; the existing
`maxReasonableDriftPerDay` cap then zeroed it (better than absurd, but
still useless).
## The two fixes
1. **Recent-window severity.** New field `recentMedianSkewSec` = median
over the last `N=5` samples or last `1h`, whichever is narrower (more
current view). Severity now derives from `abs(recentMedianSkewSec)`.
`MeanSkewSec`, `MedianSkewSec`, `LastSkewSec` are preserved unchanged so
the frontend, fleet view, and any external consumers continue to work.
2. **Theil-Sen drift with outlier filter.** Drift now uses the Theil-Sen
estimator (median of all pairwise slopes — textbook robust regression,
~29% breakdown point) on a series pre-filtered to drop samples whose
skew jumps more than `maxPlausibleSkewJumpSec = 60s` from the previous
accepted point. Real µC drift is fractions of a second per advert; clock
corrections fall well outside. Capped at `theilSenMaxPoints = 200`
(most-recent) so O(n²) stays bounded for chatty nodes.
## What stays the same
- Epoch-0 / out-of-range advert filter (PR #769).
- `minDriftSamples = 5` floor.
- `maxReasonableDriftPerDay = 86400` hard backstop.
- API shape: only additions (`recentMedianSkewSec`); no fields removed
or renamed.
## Tests
All in `cmd/server/clock_skew_test.go`:
- `TestSeverityUsesRecentNotMedian` — 100 bad samples (-60s) + 5 good
(-1s) → severity = `ok`, historical median still huge.
- `TestDriftRejectsCorrectionJump` — 30 min of clean linear drift + one
1000s jump → drift small (~12 s/day).
- `TestTheilSenMatchesOLSWhenClean` — clean linear data, Theil-Sen
within ~1% of OLS.
- `TestReporterScenario_789` — exact reproducer: 1662 samples, 1657 @
-683 days then 5 @ -1s → severity `ok`, `recentMedianSkewSec ≈ 0`, drift
bounded; legacy `medianSkewSec` preserved as historical context.
`go test ./... -count=1` (cmd/server) and `node
test-frontend-helpers.js` both pass.
---------
Co-authored-by: clawbot <bot@corescope.local>
Co-authored-by: you <you@example.com>
Closes#833
## What
Update Playwright E2E assertion for desktop deep link to
`/#/nodes/{pubkey}`. Now expects `.node-fullscreen` to be present
(matches the spec set by PR #824 / issue #823).
## Why
The previous assertion encoded the old pre-#823 behavior — "split panel
on desktop deep link." PR #824 intentionally removed the
`window.innerWidth <= 640` gate so desktop deep links open the
full-screen view (matching the Details link path that #779/#785/#824
ultimately made work). The test failed on every PR that rebased onto
master, blocking `Deploy Staging`.
## Verified
- 1-test diff, no other behavior change
- Mobile-viewport assertions elsewhere already exercise the same
`.node-fullscreen` selector
Co-authored-by: Kpa-clawbot <agent@corescope.local>
Closes#829
## What
Add explicit `color: var(--text)` to `.advert-info` (and `var(--accent)`
to its links) so the side-panel "Recent Packets" entries stay readable
in all themes.
## Why
`.advert-info` had only `font-size` + `line-height` rules — text
inherited from ancestors. In default light/dark themes the inherited
color happens to differ enough from `--card-bg`. Under custom themes
where they collide, text becomes invisible — only the colored
`.advert-dot` shows. Operator screenshot confirmed the symptom.
Same class of bug as the existing fix at `style.css:660` ("Bug 7 fix:
neighbor table text inherits accent color — force readable text") which
forced `color: var(--text)` on `.node-detail-section .data-table td`.
The advert timeline doesn't use a data-table, so it fell through.
## Verified
- DOM contains correct text — only the rendered color was wrong
- `getComputedStyle(.advert-info).color` previously matched `--card-bg`
under affected themes
- After fix: `.advert-info` resolves to `var(--text)` regardless of
inherited chain
- Frontend helpers: 553/0
- Full-screen `node-full-card` view (separate `.node-activity-item`
markup) unaffected
Co-authored-by: Kpa-clawbot <agent@corescope.local>
## Problems
Two independent ingestor bugs identified in #694:
### 1. IATA filter drops status messages from out-of-region observers
The IATA filter ran at the top of `handleMessage()` before any
message-type discrimination. Status messages carrying observer metadata
(`noise_floor`, battery, airtime) from observers outside the configured
IATA regions were silently discarded before `UpsertObserver()` and
`InsertMetrics()` ran.
**Impact:** Observers running `meshcoretomqtt/1.0.8.0` in BFL and LAX —
the only client versions that include `noise_floor` in status messages —
had their health data dropped entirely on prod instances filtering to
SJC.
**Fix:** Moved the IATA filter to the packet path only (after the
`parts[3] == "status"` branch). Status messages now always populate
observer health data regardless of configured region filter.
### 2. `INSERT OR IGNORE` discards SNR/RSSI on late arrival
When the same `(transmission_id, observer_idx, path_json)` observation
arrived twice — first without RF fields, then with — `INSERT OR IGNORE`
silently discarded the SNR/RSSI from the second arrival.
**Fix:** Changed to `ON CONFLICT(...) DO UPDATE SET snr =
COALESCE(excluded.snr, snr), rssi = ..., score = ...`. A later arrival
with SNR fills in a `NULL`; a later arrival without SNR does not
overwrite an existing value.
## Tests
- `TestIATAFilterDoesNotDropStatusMessages` — verifies BFL status
message is processed when IATA filter includes only SJC, and that BFL
packet is still filtered
- `TestInsertObservationSNRFillIn` — verifies SNR fills in on second
arrival, and is not overwritten by a subsequent null arrival
## Related
Partially addresses #694 (upstream client issue of missing SNR in packet
messages is out of scope)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Fixes#673
- GRP_TXT packets whose message text contains a node's pubkey were
incorrectly counted as packets for that node, inflating packet counts
and type breakdowns
- Two code paths in `store.go` used `strings.Contains` on the full
`DecodedJSON` blob — this matched pubkeys appearing anywhere in the
JSON, including inside chat message text
- `filterPackets` slow path (combined node + other filters): replaced
substring search with a hash-set membership check against
`byNode[nodePK]`
- `GetNodeAnalytics`: removed the full-packet-scan + text search branch
entirely; always uses the `byNode` index (which already covers
`pubKey`/`destPubKey`/`srcPubKey` via structured field indexing)
## Test Plan
- [x] `TestGetNodeAnalytics_ExcludesGRPTXTWithPubkeyInText` — verifies a
GRP_TXT packet with the node's pubkey in its text field is not counted
in that node's analytics
- [x] `TestFilterPackets_NodeQueryDoesNotMatchChatText` — verifies the
combined-filter slow path of `filterPackets` returns only the indexed
ADVERT, not the chat packet
Both tests were written as failing tests against the buggy code and pass
after the fix.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#823
## What
Remove the `window.innerWidth <= 640` gate on the `directNode`
full-screen branch in `init()` so the 🔍 Details link works on desktop.
## Why
- #739 (`e6ace95`) gated full-screen to mobile so desktop **deep links**
would land on the split panel.
- But the same gate broke the **Details link** flow (#779/#785): the
click handler calls `init(app, pubkey)` directly. On desktop the gated
branch was skipped, the list re-rendered with `selectedKey = pubkey`,
and the side panel was already open → no visible change.
- Dropping the gate makes the directNode branch the single, unambiguous
path to full-screen for both the Details link and any deep link.
## Why the desktop split-panel UX is still preserved
Row clicks call `selectNode()`, which uses `history.replaceState` — no
`hashchange` event, no router re-init, no `directNode` set. Only the
Details link handler (which calls `init()` explicitly) and a fresh
deep-link load reach this branch.
## Repro / verify
1. Desktop, viewport > 640px, open `/#/nodes`.
2. Click a node row → split panel opens (unchanged).
3. Click 🔍 Details inside the panel → full-screen single-node view (was
broken; now works).
4. Back button / Escape → back to list view.
5. Paste `/#/nodes/{pubkey}` directly → full-screen on both desktop and
mobile.
## Tests
`node test-frontend-helpers.js` → 553 passed, 0 failed.
Co-authored-by: you <you@example.com>
## Summary
- **Node detail `top: 60px` → `64px`**: aligns with other overlay
panels, gives proper clearance from the 52px fixed nav bar
- **Mobile bottom sheet `z-index: 1050`**: node detail now renders above
the VCR bar (`z-index: 1000`), close button never obscured
- **Mobile `max-height: 60vh` → `60dvh`**: respects iOS Safari browser
chrome correctly
- **`.live-toggles` horizontal scroll**: `overflow-x: auto; flex-wrap:
nowrap` — all 8 checkboxes reachable via horizontal swipe
Fixes#797
## Test plan
- [x] Mobile portrait (<640px): tap a map node → bottom sheet slides up,
close button (✕) visible and tappable above VCR bar
- [x] Mobile portrait: scroll the live-header toggles horizontally → all
checkboxes reachable
- [x] Desktop/tablet (>640px): node detail panel top-right corner fully
below the nav bar
- [x] Desktop: close button functional, panel hides correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Automates QA plan §10.1 (nodeBlacklist hide) and §10.2 (DB retain),
flipping both rows from `human` to `auto`. Stacks on top of #808.
**What**
- New `qa/scripts/blacklist-test.sh` — env-driven harness:
- Args: `BASELINE_URL TARGET_URL TEST_PUBKEY`
- Env: `TARGET_SSH_HOST`, `TARGET_SSH_KEY` (default
`/root/.ssh/id_ed25519`), `TARGET_CONFIG_PATH`, `TARGET_CONTAINER`,
optional `TARGET_DB_PATH` / `ADMIN_API_TOKEN`.
- Edits `nodeBlacklist` on target via remote `jq` (python3 fallback),
atomic move with preserved perms.
- Restarts container, waits up to 120 s for `/api/stats == 200`.
- §10.1 asserts `/api/nodes/{pk}` is 404 **or** absent from `/api/nodes`
listing, and `/api/topology` does not reference the pubkey.
- §10.2 prefers `/api/admin/transmissions` if `ADMIN_API_TOKEN` set,
else falls back to `sqlite3` inside the container (and host as last
resort).
- **Teardown is mandatory** (`trap … EXIT INT TERM`): removes pubkey,
restarts, verifies the node is visible again. Teardown failures count
toward exit code.
- Exit code = number of failures; per-step ✅/❌ with classified failure
modes (`ssh-failed`, `restart-stuck`, `hide-failed`, `retain-failed`,
`teardown-failed`).
- `qa/plans/v3.6.0-rc.md` §10.1 / §10.2 mode → `auto
(qa/scripts/blacklist-test.sh)`.
**Why**
Manual blacklist verification was the slowest item in the §10 block and
the easiest to get wrong (forgetting teardown leaks state into the next
QA pass). Now it's a single command, public-repo-safe (zero PII /
hardcoded hosts), and the trap guarantees the target is restored.
`bash -n` passes locally. Live run requires staging credentials.
---------
Co-authored-by: meshcore-agent <agent@meshcore>
Co-authored-by: meshcore-agent <meshcore@openclaw.local>
## What + why
`fetchResolvedPathForTxBest` (used by every API path that fills the
top-level `resolved_path`, including
`/api/nodes/{pk}/health.recentPackets`) picked the observation with the
longest `path_json` and queried SQL for that single obs ID. When the
longest-path obs had `resolved_path` NULL but a shorter sibling had one,
the helper returned nil and the top-level field was dropped — even
though the data exists. QA #809 §2.1 caught it on the health endpoint
because that page surfaces it per-tx.
Fix: keep the LRU-friendly fast path (try the longest-path obs), then
fall back to scanning all observations of the tx and picking the longest
`path_json` that actually has a stored `resolved_path`.
## Changes
- `cmd/server/resolved_index.go`: extend `fetchResolvedPathForTxBest`
with a fallback through `fetchResolvedPathsForTx`.
- `cmd/server/issue810_repro_test.go`: regression test — seeds a tx
whose longest-path obs lacks `resolved_path` and a shorter sibling has
it, then asserts `/api/packets` and
`/api/nodes/{pk}/health.recentPackets` agree.
## Tests
`go test ./... -count=1` from `cmd/server` — PASS (full suite, ~19s).
## Perf
Fast path unchanged (single LRU/SQL lookup, dominant case). Fallback
only runs when the longest-path obs has NULL `resolved_path` — one
indexed query per affected tx, bounded by observations-per-tx (small).
Closes#810
---------
Co-authored-by: you <you@example.com>
Adds the CoreScope-side artifacts that pair with the generic [`qa-suite`
skill](https://github.com/Kpa-clawbot/ai-sdlc/pull/1).
## Layout
```
qa/
├── README.md
├── plans/
│ └── v3.6.0-rc.md # 34-commit test plan since v3.5.1
└── scripts/
└── api-contract-diff.sh # CoreScope-tuned API contract diff
```
The skill ships the reusable engine + qa-engineer persona + an example
plan. This PR adds the CoreScope-tuned plan and the CoreScope-tuned
script (correct seed lookups for our `{packets, total}` response shape,
our endpoint list, our `resolved_path` requirement). Read by the parent
agent at runtime.
## How to use
From chat:
- `qa staging` — runs the latest `qa/plans/v*-rc.md` against staging,
files a fresh GH issue with the report
- `qa pr <N>` — uses `qa/plans/pr-<N>.md` if present, else latest RC
plan; comments on the PR
- `qa v3.6.0-rc` — runs that specific plan
The qa-engineer subagent walks every step, classifying each as `auto`
(script) / `browser` (UI assertion) / `human` (manual) / `browser+auto`.
Quantified pass criteria are mandatory — banned phrases: 'visually
aligned' / 'fast' / 'no regression'.
## Plan v3.6.0-rc contents
Covers the 34 commits since v3.5.1:
- §1 Memory & Load (#806, #790, #807) — heap thresholds, sawtooth
pattern
- §2 API contract (#806) — every endpoint that should carry
`resolved_path`, auto-checked by `api-contract-diff.sh`
- §3 Decoder & hashing (#787, #732, #747, #766, #794, #761)
- §4 Channels (#725 series M1–M5)
- §5 Clock skew (#690 series M1–M3)
- §6 Observers (#764, #774)
- §7 Multi-byte hash adopters (#758, #767)
- §8 Frontend nav & deep linking (#739, #740, #779, #785, #776, #745)
- §9 Geofilter (#735, #734)
- §10 Node blacklist (#742)
- §11 Deploy/ops
Release blockers: §1.2, §2, §3. §4 is the headline-feature gate.
## Adding new plans
Per release: copy `plans/v<last>-rc.md` to `plans/v<new>-rc.md` and
update commit-range header, new sections, GO criteria.
Per PR: create `plans/pr-<N>.md` with the bare minimum for that PR's
surface area.
Co-authored-by: you <you@example.com>
Closes#812
## Root causes
**Server (`/api/packets?channel=…` returned identical totals):**
The handler in `cmd/server/routes.go` never read the `channel` query
parameter into `PacketQuery`, so it was silently ignored by both the
SQLite path (`db.go::buildTransmissionWhere`) and the in-memory path
(`store.go::filterPackets`). The codebase already had everything else in
place — the `channel_hash` column with an index from #762, decoded
`channel` / `channelHashHex` fields on each packet — it just wasn't
wired up.
**UI (`/#/packets` had no channel filter):**
`public/packets.js` rendered observer / type / time-window / region
filters but no channel control, and didn't read `?channel=` from the
URL.
## Fix
### Server
- New `Channel` field on `PacketQuery`; `handlePackets` reads
`r.URL.Query().Get("channel")`.
- DB path filters by the indexed `channel_hash` column (exact match).
- In-memory path: helper `packetMatchesChannel` matches
`decoded.channel` (plaintext, e.g. `#test`, `public`) or `enc_<HEX>`
against `channelHashHex` for undecryptable GRP_TXT. Uses cached
`ParsedDecoded()` so it's O(1) after first parse. Fast-path index guards
and the grouped-cache key updated to include channel.
- Regression test (`channel_filter_test.go`): `channel=#test` returns ≥1
GRP_TXT packet and fewer than baseline; `channel=nonexistentchannel`
returns `total=0`.
### UI
- New `<select id="fChannel">` populated from `/api/channels`.
- Round-trips via `?channel=…` on the URL hash (read on init, written on
change).
- Pre-seeds the current value as an option so encrypted hashes not in
`/api/channels` still display as selected on reload.
- On change, calls `loadPackets()` so the server-side filter applies
before pagination.
## Perf
Filter adds at most one cached map lookup per packet (DB path uses
indexed column, store path uses `ParsedDecoded()` cache). Staging
baseline 149–190 ms for `?channel=#test&limit=50`; the new comparison is
negligible. Target ≤ 500 ms preserved.
## Tests
`cd cmd/server && go test ./... -count=1 -timeout 120s` → PASS.
---------
Co-authored-by: you <you@example.com>
Closes#813
## Root cause
The Node detail **side panel** (`renderDetail()`,
`public/nodes.js:1145`) was missing both the `#node-clock-skew`
placeholder div and the `loadClockSkew()` IIFE loader. Those exist only
in the **full-screen** detail page (`loadFullNode`, lines 498 / 632), so
any node opened via deep link or click in the listing — which uses the
side panel — showed no clock-skew UI even when
`/api/nodes/{pk}/clock-skew` returned rich data.
## Fix
Mirror the full-screen template branch and IIFE in `renderDetail`:
- Add `<div class="node-detail-section skew-detail-section"
id="node-clock-skew" style="display:none">` to the side-panel template
(right above Observers).
- Add an async `loadClockSkewPanel()` IIFE after the panel `innerHTML`
is set, using the same severity/badge/drift/sparkline rendering and the
`severity === 'no_clock'` branch the full-screen view uses.
No new helpers — reuses existing window globals (`formatSkew`,
`formatDrift`, `renderSkewBadge`, `renderSkewSparkline`).
## Verification
- Syntax check: `node -c public/nodes.js` ✓
- `node test-frontend-helpers.js` → 553/553 ✓
- Browser: staging runs master so I couldn't validate the deployed UI
yet. Manual repro after deploy:
1. Open `https://analyzer.00id.net/#/nodes`, click any node with a known
skew (e.g. Puppy Solar `a8dde6d7…` shows `⏰ -23d 8h` in listing).
2. Side panel should show a **⏰ Clock Skew** section with median skew,
severity badge, drift line, and sparkline.
3. For `severity === 'no_clock'` (e.g. SKCE_RS `14531bd2…`), section
shows "No Clock" instead of skew value.
---------
Co-authored-by: you <you@example.com>
Closes#811
## What
Deep linking to `/#/channels/%23private` (encrypted channel, no key
configured) now shows the existing 🔒 lock affordance instead of an empty
"No messages in this channel yet" pane.
## Why
`selectChannel` only rendered the lock message inside the `if (ch &&
ch.encrypted)` branch. On a cold deep link:
- `loadChannels` omits encrypted channels unless the toggle is on, so
`ch` is `undefined`.
- The hash isn't `user:`-prefixed, so that branch is skipped too.
- Code falls through to the REST fetch, returns 0 messages, and
`renderMessages` shows the generic empty state.
## Fix
Add a `#`-prefixed-hash branch immediately before the REST fetch:
- If a stored key matches the channel name → decrypt and render.
- Otherwise → reuse the existing 🔒 "encrypted and no decryption key is
configured" message.
## Trace (URL → render)
1. `#/channels/%23private` → `init(routeParam='#private')`
2. `loadChannels()` → `channels` has no `#private` entry (toggle off)
3. `selectChannel('#private')` → `ch` undefined → skips encrypted
branches → **new check fires** → lock message
4. With key stored: same check → `decryptAndRender`
## Validation
- `node test-frontend-helpers.js` → 553 passed, 0 failed
- Manual trace above; change is a 15-line localized guard before the
REST fetch, no hot-path or perf impact.
Co-authored-by: meshcore-agent <agent@corescope.local>
## Problem
The firmware computes packet content hash as:
```
SHA256(payload_type_byte + [path_len for TRACE] + payload)
```
Where `payload_type_byte = (header >> 2) & 0x0F` — just the payload type
bits (2-5).
CoreScope was using the **full header byte** in its hash computation,
which includes route type bits (0-1) and version bits (6-7). This meant
the same logical packet produced different content hashes depending on
route type — breaking dedup and packet lookup.
**Firmware reference:** `Packet.cpp::calculatePacketHash()` uses
`getPayloadType()` which returns `(header >> PH_TYPE_SHIFT) &
PH_TYPE_MASK`.
## Fix
- Extract only payload type bits: `payloadType := (headerByte >> 2) &
0x0F`
- Include `path_len` byte in hash for TRACE packets (matching firmware
behavior)
- Applied to both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go`
## Tests Added
- **Route type independence:** Same payload with FLOOD vs DIRECT route
types produces identical hash
- **TRACE path_len inclusion:** TRACE packets with different `path_len`
produce different hashes
- **Firmware compatibility:** Hash output matches manual computation of
firmware algorithm
## Migration Impact
Existing packets in the DB have content hashes computed with the old
(incorrect) formula. Options:
1. **Recompute hashes** via migration (recommended for clean state)
2. **Dual lookup** — check both old and new hash on queries (backward
compat)
3. **Accept the break** — old hashes become stale, new packets get
correct hashes
Recommend option 1 (migration) as a follow-up. The volume of affected
packets depends on how many distinct route types were seen for the same
logical packet.
Fixes#786
---------
Co-authored-by: you <you@example.com>
## Implements #748 M1 — Bounded Cold Load
### Problem
`Load()` pulls the ENTIRE database into RAM before eviction runs. On a
1GB database, this means 3+ GB peak memory at startup, regardless of
`maxMemoryMB`. This is the root cause of #743 (OOM on 2GB VMs).
### Solution
Calculate the maximum number of transmissions that fit within the
`maxMemoryMB` budget and use a SQL subquery LIMIT to load only the
newest packets.
**Two-phase approach** (avoids the JOIN-LIMIT row count problem):
```sql
SELECT ... FROM transmissions t
LEFT JOIN observations o ON ...
WHERE t.id IN (SELECT id FROM transmissions ORDER BY first_seen DESC LIMIT ?)
ORDER BY t.first_seen ASC, o.timestamp DESC
```
### Changes
- **`estimateStoreTxBytesTypical(numObs)`** — estimates memory cost of a
typical transmission without needing an actual `StoreTx` instance. Used
for budget calculation.
- **Budget calculation in `Load()`** — `maxPackets = (maxMemoryMB *
1048576) / avgBytesPerPacket` with a floor of 1000 packets.
- **Subquery LIMIT** — loads only the newest N transmissions when
bounded.
- **`oldestLoaded` tracking** — records the oldest packet timestamp in
memory so future SQL fallback queries (M2+) know where in-memory data
ends.
- **Perf stats** — `oldestLoaded` exposed in `/api/perf/store-stats`.
- **Logging** — bounded loads show `Loaded X/Y transmissions (limited by
ZMB budget)`.
### When `maxMemoryMB=0` (unlimited)
Behavior is completely unchanged — no LIMIT clause, all packets loaded.
### Tests (6 new)
| Test | Validates |
|------|-----------|
| `TestBoundedLoad_LimitedMemory` | With 1MB budget, loads fewer than
total (hits 1000 minimum) |
| `TestBoundedLoad_NewestFirst` | Loaded packets are the newest, not
oldest |
| `TestBoundedLoad_OldestLoadedSet` | `oldestLoaded` matches first
packet's `FirstSeen` |
| `TestBoundedLoad_UnlimitedWithZero` | `maxMemoryMB=0` loads all
packets |
| `TestBoundedLoad_AscendingOrder` | Packets remain in ascending
`first_seen` order after bounded load |
| `TestEstimateStoreTxBytesTypical` | Estimate grows with observation
count, exceeds floor |
Plus benchmarks: `BenchmarkLoad_Bounded` vs `BenchmarkLoad_Unlimited`.
### Perf justification
On a 5000-transmission test DB with 1MB budget:
- Bounded: loads 1000 packets (the minimum) in ~1.3s
- The subquery uses SQLite's index on `first_seen` — O(N log N) for the
LIMIT, then indexed JOIN for observations
- No full table scan needed when bounded
### Next milestones
- **M2**: Packet list/search SQL fallback (uses `oldestLoaded` boundary)
- **M3**: Node analytics SQL fallback
- **M4-M5**: Remaining endpoint fallbacks + live-only memory store
---------
Co-authored-by: you <you@example.com>
## Problem
Some mesh participants set offensive names, report deliberately false
GPS positions, or otherwise troll the network. Instance operators
currently have no way to hide these nodes from public-facing APIs
without deleting the underlying data.
## Solution
Add a `nodeBlacklist` array to `config.json` containing public keys of
nodes to exclude from all API responses.
### Blacklisted nodes are filtered from:
- `GET /api/nodes` — list endpoint
- `GET /api/nodes/search` — search results
- `GET /api/nodes/{pubkey}` — detail (returns 404)
- `GET /api/nodes/{pubkey}/health` — returns 404
- `GET /api/nodes/{pubkey}/paths` — returns 404
- `GET /api/nodes/{pubkey}/analytics` — returns 404
- `GET /api/nodes/{pubkey}/neighbors` — returns 404
- `GET /api/nodes/bulk-health` — filtered from results
### Config example
```json
{
"nodeBlacklist": [
"aabbccdd...",
"11223344..."
]
}
```
### Design decisions
- **Case-insensitive** — public keys normalized to lowercase
- **Whitespace trimming** — leading/trailing whitespace handled
- **Empty entries ignored** — `""` or `" "` do not cause false positives
- **Nil-safe** — `IsBlacklisted()` on nil Config returns false
- **Backward-compatible** — empty/missing `nodeBlacklist` has zero
effect
- **Lazy-cached set** — blacklist converted to `map[string]bool` on
first lookup
### What this does NOT do (intentionally)
- Does **not** delete or modify database data — only filters API
responses
- Does **not** block packet ingestion — data still flows for analytics
- Does **not** filter `/api/packets` — only node-facing endpoints are
affected
## Testing
- Unit tests for `Config.IsBlacklisted()` (case sensitivity, whitespace,
empty entries, nil config)
- Integration tests for `/api/nodes`, `/api/nodes/{pubkey}`,
`/api/nodes/search`
- Full test suite passes with no regressions
## Problem
Deep-linking to an encrypted channel (e.g. `#/channels/42`) when the
user has no client-side decryption key falls through to the plaintext
API fetch, displaying gibberish base64/binary content instead of a
meaningful message.
## Root Cause
In `selectChannel()`, the encrypted channel key-matching loop iterates
all stored keys. If none match, execution falls through to the normal
plaintext message fetch — which returns raw encrypted data rendered as
gibberish.
## Fix
After the key-matching loop for encrypted channels, return early with
the lock message instead of falling through.
**3 lines added** in `public/channels.js`, **108 lines** regression test
in `test-frontend-helpers.js`.
## Investigation: Sidebar Display
The sidebar filtering is already correct:
- DB path: SQL filters out `enc_` prefix channel hashes
- In-memory path: Only returns `type: CHAN` (server-decrypted) channels,
with `hasGarbageChars` validation
- Server-side decryption: MAC verification (2-byte HMAC) + UTF-8 +
non-printable character validation prevents false-positive decryptions
- Encrypted channels only appear when the toggle is explicitly enabled
## Testing
- All existing tests pass
- New regression test verifies: lock message shown, messages API NOT
called for encrypted channels without key
Fixes#781
---------
Co-authored-by: you <you@example.com>
Improves the fix for #778 (replaces #779's approach).
## Problem
When clicking "Details" in the node side panel, the hash is already
`#/nodes/{pubkey}` (set by `replaceState` in `selectNode`). The link
targets the same hash → no `hashchange` event → router never fires →
detail view never renders.
## What was wrong with #779
PR #779 used `replaceState('#/')` + `location.hash = target`
synchronously, which forces a full SPA router teardown/rebuild cycle
just to re-render the same page. This is wasteful and can cause visual
flicker.
## This fix
**Detail link** (`#/nodes/{pubkey}`): Calls `init(app, pubkey)` directly
— no router teardown, no page flash. The `init()` function already
handles rendering the detail view when `routeParam` is set.
**Analytics link** (`#/nodes/{pubkey}/analytics`): Uses `setTimeout` to
ensure reliable `hashchange` firing, since this routes to a different
page (`node-analytics`) that requires the full SPA router.
## Testing
- Frontend helper tests: 552/552 ✅
- Packet filter tests: 62/62 ✅
- Aging tests: 29/29 ✅
- Go server tests: pass ✅
- Go ingestor tests: pass ✅
---------
Co-authored-by: you <you@example.com>
## Summary
Observers that stop actively sending data now get removed after a
configurable retention period (default 14 days).
Previously, observers remained in the `observers` table forever. This
meant nodes that were once observers for an instance but are no longer
connected (even if still active in the mesh elsewhere) would continue
appearing in the observer list indefinitely.
## Key Design Decisions
- **Active data requirement**: `last_seen` is only updated when the
observer itself sends packets (via `stmtUpdateObserverLastSeen`). Being
seen by another node does NOT update this field. So an observer must
actively send data to stay listed.
- **Default: 14 days** — observers not seen in 14 days are removed
- **`-1` = keep forever** — for users who want observers to never be
removed
- **`0` = use default (14 days)** — same as not setting the field
- **Runs on startup + daily ticker** — staggered 3 minutes after metrics
prune to avoid DB contention
## Changes
| File | Change |
|------|--------|
| `cmd/ingestor/config.go` | Add `ObserverDays` to `RetentionConfig`,
add `ObserverDaysOrDefault()` |
| `cmd/ingestor/db.go` | Add `RemoveStaleObservers()` — deletes
observers with `last_seen` before cutoff |
| `cmd/ingestor/main.go` | Wire up startup + daily ticker for observer
retention |
| `cmd/server/config.go` | Add `ObserverDays` to `RetentionConfig`, add
`ObserverDaysOrDefault()` |
| `cmd/server/db.go` | Add `RemoveStaleObservers()` (server-side, uses
read-write connection) |
| `cmd/server/main.go` | Wire up startup + daily ticker, shutdown
cleanup |
| `cmd/server/routes.go` | Admin prune API now also removes stale
observers |
| `config.example.json` | Add `observerDays: 14` with documentation |
| `cmd/ingestor/coverage_boost_test.go` | 4 tests: basic removal, empty
store, keep forever (-1), default (0→14) |
| `cmd/server/config_test.go` | 4 tests: `ObserverDaysOrDefault` edge
cases |
## Config Example
```json
{
"retention": {
"nodeDays": 7,
"observerDays": 14,
"packetDays": 30,
"_comment": "observerDays: -1 = keep forever, 0 = use default (14)"
}
}
```
## Admin API
The `/api/admin/prune` endpoint now also removes stale observers (using
`observerDays` from config) and reports `observers_removed` in the
response alongside `packets_deleted`.
## Test Plan
- [x] `TestRemoveStaleObservers` — old observer removed, recent observer
kept
- [x] `TestRemoveStaleObserversNone` — empty store, no errors
- [x] `TestRemoveStaleObserversKeepForever` — `-1` keeps even year-old
observers
- [x] `TestRemoveStaleObserversDefault` — `0` defaults to 14 days
- [x] `TestObserverDaysOrDefault` (ingestor) —
nil/zero/positive/keep-forever
- [x] `TestObserverDaysOrDefault` (server) —
nil/zero/positive/keep-forever
- [x] Both binaries compile cleanly (`go build`)
- [ ] Manual: verify observer count decreases after retention period on
a live instance
Fixes#778
## Problem
The Details and Analytics links in the node side panel don't navigate
when clicked. This is a regression from #739 (desktop node deep
linking).
**Root cause:** When a node is selected, `selectNode()` uses
`history.replaceState()` to set the URL to `#/nodes/{pubkey}`. The
Details link has `href="#/nodes/{pubkey}"` — the same hash. Clicking an
anchor with the same hash as the current URL doesn't fire the
`hashchange` event, so the SPA router never triggers navigation.
## Fix
Added a click handler on the `nodesRight` panel that intercepts clicks
on `.btn-primary` navigation links:
1. `e.preventDefault()` to stop the default anchor behavior
2. If the current hash already matches the target, temporarily clear it
via `replaceState`
3. Set `location.hash` to the target, which fires `hashchange` and
triggers the SPA router
This handles both the Details link (`#/nodes/{pubkey}`) and the
Analytics link (`#/nodes/{pubkey}/analytics`).
## Testing
- All frontend helper tests pass (552/552)
- All packet filter tests pass (62/62)
- All aging tests pass (29/29)
- Go server tests pass
---------
Co-authored-by: you <you@example.com>
The deploy step used only 'docker compose down' which can't remove
containers created via 'docker run'. Now explicitly stops+removes
the named container first, then runs compose down as cleanup.
Permanent fix for the recurring CI deploy failure.
## Summary
The neighbor graph min score slider didn't persist its value to
localStorage, resetting to 0.10 on every page load. This was a poor
default for most use cases.
## Changes
- **Default changed from 0.10 to 0.70** — more useful starting point
that filters out low-confidence edges
- **localStorage persistence** — slider value saved on change, restored
on page load
- **3 new tests** in `test-frontend-helpers.js` verifying default value,
load behavior, and save behavior
## Testing
- `node test-frontend-helpers.js` — 547 passed, 0 failed
- `node test-packet-filter.js` — 62 passed, 0 failed
- `node test-aging.js` — 29 passed, 0 failed
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#753 — Milestones M1 and M2: Observer nodes in the neighbor graph
are now correctly labeled, colored, and filterable.
### M1: Label + color observers
**Backend** (`cmd/server/neighbor_api.go`):
- `buildNodeInfoMap()` now queries the `observers` table after building
from `nodes`
- Observer-only pubkeys (not already in the map as repeaters etc.) get
`role: "observer"` and their name from the observers table
- Observer-repeaters keep their repeater role (not overwritten)
**Frontend**:
- CSS variable `--role-observer: #8b5cf6` added to `:root`
- `ROLE_COLORS.observer` was already defined in `roles.js`
### M2: Observer filter checkbox (default unchecked)
**Frontend** (`public/analytics.js`):
- Observer checkbox added to the role filter section, **unchecked by
default**
- Observers create hub-and-spoke patterns (one observer can have 100+
edges) that drown out the actual repeater topology — hiding them by
default keeps the graph clean
- Fixed `applyNGFilters()` which previously always showed observers
regardless of checkbox state
### Tests
- Backend: `TestBuildNodeInfoMap_ObserverEnrichment` — verifies
observer-only pubkeys get name+role from observers table, and
observer-repeaters keep their repeater role
- All existing Go tests pass
- All frontend helper tests pass (544/544)
---------
Co-authored-by: you <you@example.com>
page.goto with hash-only change may not reliably trigger hashchange in
Playwright, causing the mobile full-screen node view to never render.
Use page.evaluate to set location.hash directly, which guarantees the
SPA router fires. Also increase timeout from 10s to 15s for CI margin.
## Fix: Multi-Byte Adopters Table — Three Bugs (#754)
### Bug 1: Companions in "Unknown"
`computeMultiByteCapability()` was repeater-only. Extended to classify
**all node types** (companions, rooms, sensors). A companion advertising
with 2-byte hash is now correctly "Confirmed".
### Bug 2: No Role Column
Added a **Role** column to the merged Multi-Byte Hash Adopters table,
color-coded using `ROLE_COLORS` from `roles.js`. Users can now
distinguish repeaters from companions without clicking through to node
detail.
### Bug 3: Data Source Disagreement
When adopter data (from `computeAnalyticsHashSizes`) shows `hashSize >=
2` but capability only found path evidence ("Suspected"), the
advert-based adopter data now takes precedence → "Confirmed". The
adopter hash sizes are passed into `computeMultiByteCapability()` as an
additional confirmed evidence source.
### Changes
- `cmd/server/store.go`: Extended capability to all node types, accept
adopter hash sizes, prioritize advert evidence
- `public/analytics.js`: Added Role column with color-coded badges
- `cmd/server/multibyte_capability_test.go`: 3 new tests (companion
confirmed, role populated, adopter precedence)
### Tests
- All 10 multi-byte capability tests pass
- All 544 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
---------
Co-authored-by: you <you@example.com>
## Summary
Shows which prefixes are colliding in the Hash Usage Matrix, making the
"PREFIX COLLISIONS: N" count actionable.
Fixes#757
## Changes
### Frontend (`public/analytics.js`)
- **Clickable collision count**: When collisions > 0, the stat card is
clickable and scrolls to the collision details section. Shows a `▼`
indicator.
- **3-byte collision table**: The collision risk section and
`renderCollisionsFromServer` now render for all hash sizes including
3-byte (was previously hidden/skipped for 3-byte).
- **Helpful hint**: 3-byte panel now says "See collision details below"
when collisions exist.
### Backend (`cmd/server/collision_details_test.go`)
- Test that collision details include correct prefix and node
name/pubkey pairs
- Test that collision details are empty when no collisions exist
### Frontend Tests (`test-frontend-helpers.js`)
- Test clickable stat card renders `onclick` and `cursor:pointer` when
collisions > 0
- Test non-clickable card when collisions = 0
- Test collision table renders correct node links (`#/nodes/{pubkey}`)
- Test no-collision message renders correctly
## What was already there
The backend already returned full collision details (prefix, nodes with
pubkeys/names/coords, distance classification) in the `hash-collisions`
API. The frontend already had `renderCollisionsFromServer` rendering a
rich table with node links. The gap was:
1. The 3-byte tab hid the collision risk section entirely
2. No visual affordance to navigate from the stat count to the details
## Perf justification
No new computation — collision data was already computed and returned by
the API. The only change is rendering it for 3-byte (same as
1-byte/2-byte). The collision list is already limited by the backend
sort+slice pattern.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#765 — packet detail field table showed wrong byte offsets for
transport routes.
## Problem
`buildFieldTable()` hardcoded `path_length` at byte 1 for ALL packet
types. For `TRANSPORT_FLOOD` (route_type=0) and `TRANSPORT_DIRECT`
(route_type=3), transport codes occupy bytes 1-4, pushing `path_length`
to byte 5.
This caused:
- Wrong offset numbers in the field table for transport packets
- Transport codes displayed AFTER path length (wrong byte order)
- `Advertised Hash Size` row referenced wrong byte
## Fix
- Use dynamic `offset` tracking that accounts for transport codes
- Render transport code rows before path length (matching actual wire
format)
- Store `pathLenOffset` for correct reference in ADVERT payload section
- Reuse already-parsed `pathByte0` for hash size calculation in path
section
## Tests
Added 4 regression tests in `test-frontend-helpers.js`:
- FLOOD (route_type=1): path_length at byte 1, no transport codes
- TRANSPORT_FLOOD (route_type=0): transport codes at bytes 1-4,
path_length at byte 5
- TRANSPORT_DIRECT (route_type=3): same offsets as TRANSPORT_FLOOD
- Field table row order matches byte layout for transport routes
All existing tests pass (538 frontend helpers, 62 packet filter, 29
aging).
Co-authored-by: you <you@example.com>
## Problem
Channel API endpoints scan entire DB — 2.4s for channel list, 30s for
messages.
## Fix
- Added `channel_hash` column to transmissions (populated on ingest,
backfilled on startup)
- `GetChannels()` rewrites to GROUP BY channel_hash (one row per channel
vs scanning every packet)
- `GetChannelMessages()` filters by channel_hash at SQL level with
proper LIMIT/OFFSET
- 60s cache for channel list
- Index: `idx_tx_channel_hash` for fast lookups
Expected: 2.4s → <100ms for list, 30s → <500ms for messages.
Fixes#762
---------
Co-authored-by: you <you@example.com>
## Fixes#759
The "Add Channel" input was a bare text field with no visible submit
button and no feedback — users didn't know how to submit or whether it
worked.
### Changes
**`public/channels.js`**
- Replaced bare `<input>` with structured form: label, input + button
row, hint text, status div
- Added `showAddStatus()` helper for visual feedback during/after
channel add
- Status messages: loading → success (with decrypted message count) /
warning (no messages) / error
- Auto-hide status after 5 seconds
- Fallback click handler on the `+` button for browsers that don't fire
form submit
**`public/style.css`**
- `.ch-add-form` — form container
- `.ch-add-label` — bold 13px label
- `.ch-add-row` — flex row for input + button
- `.ch-add-btn` — 32×32 accent-colored submit button
- `.ch-add-hint` — muted helper text
- `.ch-add-status` — feedback with success/warn/error/loading variants
**`test-channel-add-ux.js`** — 20 tests validating HTML structure, CSS
classes, and feedback logic
### Before / After
**Before:** Bare input field, no button, no hint, no feedback
**After:** Labeled section with visible `+` button, format hint, and
status messages showing decryption results
---------
Co-authored-by: you <you@example.com>
## Problem
Fixes#743 — High memory usage / OOM with relatively small dataset.
`trackedBytes` severely undercounted actual per-packet memory because it
only tracked base struct sizes and string field lengths, missing major
allocations:
| Structure | Untracked Cost | Scale Impact |
|-----------|---------------|--------------|
| `spTxIndex` (O(path²) subpath entries) | 40 bytes × path combos |
50-150MB |
| `ResolvedPath` on observations | 24 bytes × elements | ~25MB |
| Per-tx maps (`obsKeys`, `observerSet`) | 200 bytes/tx flat | ~11MB |
| `byPathHop` index entries | 50 bytes/hop | 20-40MB |
This caused eviction to trigger too late (or not at all), leading to
OOM.
## Fix
Expanded `estimateStoreTxBytes` and `estimateStoreObsBytes` to account
for:
- **Per-tx maps**: +200 bytes flat for `obsKeys` + `observerSet` map
headers
- **Path hop index**: +50 bytes per hop in `byPathHop`
- **Subpath index**: +40 bytes × `hops*(hops-1)/2` combinations for
`spTxIndex`
- **Resolved paths**: +24 bytes per `ResolvedPath` element on
observations
Updated the existing `TestEstimateStoreTxBytes` to match new formula.
All existing eviction tests continue to pass — the eviction logic itself
is unchanged.
Also exposed `avgBytesPerPacket` in the perf API (`/api/perf`) so
operators can monitor per-packet memory costs.
## Performance
Benchmark confirms negligible overhead (called on every insert):
```
BenchmarkEstimateStoreTxBytes 159M ops 7.5 ns/op 0 B/op 0 allocs
BenchmarkEstimateStoreObsBytes 1B ops 1.0 ns/op 0 B/op 0 allocs
```
## Tests
- 6 new tests in `tracked_bytes_test.go`:
- Reasonable value ranges for different packet sizes
- 10-hop packets estimate significantly more than 2-hop (subpath cost)
- Observations with `ResolvedPath` estimate more than without
- 15 observations estimate >10x a single observation
- `trackedBytes` matches sum of individual estimates after batch insert
- Eviction triggers correctly with improved estimates
- 2 benchmarks confirming sub-10ns estimate cost
- Updated existing `TestEstimateStoreTxBytes` for new formula
- Full test suite passes
---------
Co-authored-by: you <you@example.com>
## Summary
Implements **Milestone 1** of #690 — backend clock skew computation for
nodes and observers.
## What's New
### Clock Skew Engine (`clock_skew.go`)
**Phase 1 — Raw Skew Calculation:**
For every ADVERT observation: `raw_skew = advert_timestamp -
observation_timestamp`
**Phase 2 — Observer Calibration:**
Same packet seen by multiple observers → compute each observer's clock
offset as the median deviation from the per-packet median observation
timestamp. This identifies observers with their own clock drift.
**Phase 3 — Corrected Node Skew:**
`corrected_skew = raw_skew + observer_offset` — compensates for observer
clock error.
**Phase 4 — Trend Analysis:**
Linear regression over time-ordered skew samples estimates drift rate in
seconds/day. Detects crystal drift vs stable offset vs sudden jumps.
### Severity Classification
| Level | Threshold | Meaning |
|-------|-----------|---------|
| ✅ OK | < 5 min | Normal |
| ⚠️ Warning | 5 min – 1 hour | Clock drifting |
| 🔴 Critical | 1 hour – 30 days | Likely no time source |
| 🟣 Absurd | > 30 days | Firmware default or epoch 0 |
### New API Endpoints
- `GET /api/nodes/{pubkey}/clock-skew` — per-node skew data (mean,
median, last, drift, severity)
- `GET /api/observers/clock-skew` — observer calibration offsets
- Clock skew also included in `GET /api/nodes/{pubkey}/analytics`
response as `clockSkew` field
### Performance
- 30-second compute cache avoids reprocessing on every request
- Operates on in-memory `byPayloadType[ADVERT]` index — no DB queries
- O(n) in total ADVERT observations, O(m log m) for median calculations
## Tests
15 unit tests covering:
- Severity classification at all thresholds
- Median/mean math helpers
- ISO timestamp parsing
- Timestamp extraction from decoded JSON (nested and top-level)
- Observer calibration with single and multi-observer scenarios
- Observer offset correction direction (verified the sign is
`+obsOffset`)
- Drift estimation: stable, linear, insufficient data, short time span
- JSON number extraction edge cases
## What's NOT in This PR
- No UI changes (M2–M4)
- No customizer integration (M5)
- Thresholds are hardcoded constants (will be configurable in M5)
Implements #690 M1.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#744Fixes#722
Three bugs in hash_size computation caused zero-hop adverts to
incorrectly report `hash_size=1`, masking nodes that actually use
multi-byte hashes.
## Bugs Fixed
### 1. Wrong path byte offset for transport routes
(`computeNodeHashSizeInfo`)
Transport routes (types 0 and 3) have 4 transport code bytes before the
path byte. The code read the path byte from offset 1 (byte index
`RawHex[2:4]`) for all route types. For transport routes, the correct
offset is 5 (`RawHex[10:12]`).
### 2. Missing RouteTransportDirect skip (`computeNodeHashSizeInfo`)
Zero-hop adverts from `RouteDirect` (type 2) were correctly skipped, but
`RouteTransportDirect` (type 3) zero-hop adverts were not. Both have
locally-generated path bytes with unreliable hash_size bits.
### 3. Zero-hop adverts not skipped in analytics
(`computeAnalyticsHashSizes`)
`computeAnalyticsHashSizes()` unconditionally overwrote a node's
`hashSize` with whatever the latest advert reported. A zero-hop direct
advert with `hash_size=1` could overwrite a previously-correct
`hash_size=2` from a multi-hop flood advert.
Fix: skip hash_size update for zero-hop direct/transport-direct adverts
while still counting the packet and updating `lastSeen`.
## Tests Added
- `TestHashSizeTransportRoutePathByteOffset` — verifies transport routes
read path byte at offset 5, regular flood reads at offset 1
- `TestHashSizeTransportDirectZeroHopSkipped` — verifies both
RouteDirect and RouteTransportDirect zero-hop adverts are skipped
- `TestAnalyticsHashSizesZeroHopSkip` — verifies analytics hash_size is
not overwritten by zero-hop adverts
- Fixed 3 existing tests (`FlipFlop`, `Dominant`, `LatestWins`) that
used route_type 0 (TransportFlood) header bytes without proper transport
code padding
## Complexity
All changes are O(1) per packet — no new loops or data structures. The
additional offset computation and zero-hop check are constant-time
operations within the existing packet scan loop.
Co-authored-by: you <you@example.com>
## Problem
When a node acts as both a repeater and an observer (same public key —
common with powered repeaters running MQTT clients), the map shows two
separate markers at the same location: a repeater rectangle and an
observer star. This causes visual clutter and both markers get pushed
out from the real location by the deconfliction algorithm.
## Solution
Detect combined repeater+observer nodes by matching node `public_key`
against observer `id`. When matched:
- **Label mode (hash labels on):** The repeater label gets a gold ★
appended inside the rectangle
- **Icon mode (hash labels off):** The repeater diamond gets a small
star overlay in the top-right corner of the SVG
- **Popup:** Shows both REPEATER and OBSERVER badges
- **Observer markers:** Skipped when the observer is already represented
as a combined node marker
- **Legend count:** Observer count excludes combined nodes (shows
standalone observers only)
## Performance
- Observer lookup uses a `Map` keyed by lowercase pubkey — O(1) per node
check
- Legend count uses a `Set` of node pubkeys — O(n+m) instead of O(n×m)
- No additional API calls; uses existing `observers` array already
fetched
## Testing
- All 523 frontend helper tests pass
- All 62 packet filter tests pass
- Visual: combined nodes show as single marker with star indicator
Fixes#719
---------
Co-authored-by: you <you@example.com>
## Summary
- Add missing `geo_filter` block to `config.example.json` with polygon
example, `bufferKm`, and inline `_comment`
- Add `docs/user-guide/geofilter.md`: full operator guide covering
config schema, GeoFilter Builder workflow, and prune script as one-time
migration tool
- Add Geographic filtering section to `docs/user-guide/configuration.md`
with link to the full guide
Closes#669 (M1: documentation)
## Test plan
- [x] `config.example.json` parses cleanly (no JSON errors)
- [x] `docs/user-guide/geofilter.md` renders correctly in GitHub preview
- [x] Link from `configuration.md` to `geofilter.md` resolves
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Part of #669 — M2: Link the builder from the app.
- **`public/geofilter-builder.html`** — the existing
`tools/geofilter-builder.html` is now served by the static file server
at `/geofilter-builder.html`. Additions vs the original: a `← CoreScope`
back-link in the header, inline code comments explaining the output
format, and a help bar below the output panel with paste instructions
and a link to the documentation.
- **`public/customize-v2.js`** — adds a "Tools" section at the bottom of
the Export tab with a `🗺️ GeoFilter Builder →` link and a one-line
description.
- **`docs/user-guide/customization.md`** — documents the new GeoFilter
Builder entry in the Export tab.
> **Note:** `tools/geofilter-builder.html` is kept as-is for
local/offline use. The `public/` copy is what the server serves.
> **Depends on:** #734 (M1 docs) for `docs/user-guide/geofilter.md` —
the link in the help bar references that file. Can be merged
independently; the link still works once M1 lands.
## Test plan
- [x] Open the app, go to Customizer → Export tab — "Tools" section
appears with GeoFilter Builder link
- [x] Click the link — opens `/geofilter-builder.html` in a new tab
- [x] Builder loads the Leaflet map, draw 3+ points — JSON output
appears
- [x] Copy button works, output is valid `{ "geo_filter": { ... } }`
JSON
- [x] `← CoreScope` back-link navigates to `/`
- [x] Help bar shows paste instructions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Clicking a node on desktop opened the side panel but never updated the
URL hash, making nodes non-shareable/bookmarkable on desktop. Loading
`#/nodes/{pubkey}` directly on desktop also incorrectly showed the
full-screen mobile view.
## Changes
- `selectNode()` on desktop: adds `history.replaceState(null, '',
'#/nodes/' + pubkey)` so the URL updates on every click
- `init()`: full-screen path is now gated to `window.innerWidth <= 640`
(mobile only); desktop with a `routeParam` falls through to the split
panel and calls `selectNode()` to pre-select the node
- Deselect (Escape / close button): also calls `history.replaceState`
back to `#/nodes`
## Test plan
- [x] Desktop: click a node → URL updates to `#/nodes/{pubkey}`, split
panel opens
- [x] Desktop: copy URL, open in new tab → split panel opens with that
node selected (not full-screen)
- [x] Desktop: press Escape → URL reverts to `#/nodes`
- [x] Mobile (≤640px): clicking a node still navigates to full-screen
view
Closes#676🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Applying packet filters (hash, node, observer, Wireshark expression) did
not update the URL hash, so filtered views could not be shared or
bookmarked.
## Changes
**`buildPacketsQuery()`** — extended to include:
- `hash=` from `filters.hash`
- `node=` from `filters.node`
- `observer=` from `filters.observer`
- `filter=` from `filters._filterExpr` (Wireshark expression string)
**`updatePacketsUrl()`** — now called on every filter change:
- hash input (debounced)
- observer multi-select change
- node autocomplete select and clear
- Wireshark filter input (on valid expression or clear)
**URL restore on load** — `getHashParams()` now reads `hash`, `node`,
`observer`, `filter` and restores them into `filters` before the DOM is
built. Input fields pick up values from `filters` as before. Wireshark
expression is also recompiled and `filter-active` class applied.
## Test plan
- [ ] Type in hash filter → URL updates with `&hash=...`
- [ ] Copy URL, open in new tab → hash filter is pre-filled
- [ ] Select an observer → URL updates with `&observer=...`
- [ ] Select a node filter → URL updates with `&node=...`
- [ ] Type `type=ADVERT` in Wireshark filter → URL updates with
`&filter=type%3DADVERT`
- [ ] Load that URL → filter expression restored and active
Closes#682🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Channels page shows 53K 'Unknown' messages — undecryptable GRP_TXT
packets with no content. Pure noise.
## Fix
- Backend: channels API filters out undecrypted messages by default
- `?includeEncrypted=true` param to include them
- Frontend: 'Show encrypted' toggle in channels sidebar
- Unknown channels grayed out with '(no key)' label
- Toggle persists in localStorage
Fixes#727
---------
Co-authored-by: you <you@example.com>
## Summary
Pure client-side channel decryption. Users can add custom hashtag
channels or PSK channels directly in the browser. **The server never
sees the keys.**
Implements #725 M2 (revised). Does NOT close#725.
## How it works
1. User types `#channelname` or pastes a hex PSK in the channels sidebar
2. Browser derives key (`SHA256("#name")[:16]`) using Web Crypto API
3. Key stored in **localStorage** — never sent to the server
4. Browser fetches encrypted GRP_TXT packets via existing API
5. Browser decrypts client-side: AES-128-ECB + HMAC-SHA256 MAC
verification
6. Decrypted messages cached in localStorage
7. Progressive rendering — newest messages first, chunk-based
## Security
- Keys never leave the browser
- No new API endpoints
- No server-side changes whatsoever
- Channel interest partially observable via hash-based API requests
(documented, acceptable tradeoff)
## Changes
- `public/channels.js` — client-side decrypt module + UI integration
(+307 lines)
- `public/index.html` — no new script (inline in channels.js IIFE)
- `public/style.css` — add-channel input styling
---------
Co-authored-by: you <you@example.com>
## Summary
TRACE packets encode their route hash size in the flags byte (`flags &
0x03`), not the header path byte. The decoder was using `path.HashSize`
from the header, which could be wrong or zero for direct-route TRACEs,
producing incorrect hop counts in `path_json`.
## Protocol Note
Per firmware, TRACE packets are **always direct-routed** (route_type 2 =
DIRECT, or 3 = TRANSPORT_DIRECT). FLOOD-routed TRACEs (route_type 1) are
anomalous — firmware explicitly rejects TRACE via flood. The decoder
handles these gracefully without crashing.
## Changes
**`cmd/server/decoder.go` and `cmd/ingestor/decoder.go`:**
- Read `pathSz` from TRACE flags byte: `(traceFlags & 0x03) + 1`
(0→1byte, 1→2byte, 2→3byte)
- Use `pathSz` instead of `path.HashSize` for splitting TRACE payload
path data into hops
- Update `path.HashSize` to reflect the actual TRACE path size
- Added `HopsCompleted` field to ingestor `Path` struct for parity with
server
- Updated comments to clarify TRACE is always direct-routed per firmware
**`cmd/server/decoder_test.go` — 5 new tests:**
- `TraceFlags1_TwoBytePathSz`: flags=1 → 2-byte hashes via DIRECT route
- `TraceFlags2_ThreeBytePathSz`: flags=2 → 3-byte hashes via DIRECT
route
- `TracePathSzUnevenPayload`: payload not evenly divisible by path_sz
- `TraceTransportDirect`: route_type=3 with transport codes + TRACE path
parsing
- `TraceFloodRouteGraceful`: anomalous FLOOD+TRACE handled without crash
All existing TRACE tests (flags=0, 1-byte hashes) continue to pass.
Fixes#731
---------
Co-authored-by: you <you@example.com>
## Summary
Switches channel API endpoints to query SQLite instead of the in-memory
packet store, giving users access to the full message history.
Implements #725 (M1 only — DB-backed channel messages). Does NOT close
#725 — M2-M5 (custom channels, PSK, persistence, retroactive decryption)
remain.
## Problem
Channel endpoints (`/api/channels`, `/api/channels/{hash}/messages`)
preferred the in-memory packet store when available. The store is
bounded by `packetStore.maxMemoryMB` — typically showing only recent
messages. The SQLite database has the complete history (weeks/months of
channel messages) but was only used as a fallback when the store was nil
(never in production).
## Fix
Reversed the preference order: DB first, in-memory store fallback.
Region filtering added to the DB path.
Co-authored-by: you <you@example.com>
## Summary
The perf page "Memory Used" tile displayed `estimatedMB` (Go
`runtime.HeapAlloc`), which includes all Go runtime allocations — not
just packet store data. This made the displayed value misleading: it
showed ~2.4GB heap when only ~833MB was actual tracked packet data.
## Changes
### Frontend (`public/perf.js`)
- Primary tile now shows `trackedMB` as **"Tracked Memory"** — the
self-accounted packet store memory
- Added separate **"Heap (debug)"** tile showing `estimatedMB` for
runtime visibility
### Backend
- **`types.go`**: Added `TrackedMB` field to `HealthPacketStoreStats`
struct
- **`routes.go`**: Populate `TrackedMB` in `/health` endpoint response
from `GetPerfStoreStatsTyped()`
- **`routes_test.go`**: Assert `trackedMB` exists in health endpoint's
`packetStore`
- **`testdata/golden/shapes.json`**: Updated shape fixture with new
field
### What was already correct
- `/api/perf/stats` already exposed both `estimatedMB` and `trackedMB`
- `trackedMemoryMB()` method already existed in store.go
- Eviction logic already used `trackedBytes` (not HeapAlloc)
## Testing
- All Go tests pass (`go test ./... -count=1`)
- No frontend logic changes beyond template string field swap
Fixes#717
Co-authored-by: you <you@example.com>
## Summary
Exclude TRACE packets (payload_type 8) from the "suspected" multi-byte
capability inference logic. TRACE packets carry hash size in their own
flags — forwarding repeaters read it from the TRACE header, not their
compile-time `PATH_HASH_SIZE`. Pre-1.14 repeaters can forward multi-byte
TRACEs without actually supporting multi-byte hashes, creating false
positives.
Fixes#714
## Changes
### `cmd/server/store.go`
- In `computeMultiByteCapability()`, skip packets with `payload_type ==
8` (TRACE) when scanning `byPathHop` for suspected multi-byte nodes
- "Confirmed" detection (from adverts) is unaffected
### `cmd/server/multibyte_capability_test.go`
- `TestMultiByteCapability_TraceExcluded`: TRACE packet with 2-byte path
does NOT mark repeater as suspected
- `TestMultiByteCapability_NonTraceStillSuspected`: Non-TRACE packet
with 2-byte path still marks as suspected
- `TestMultiByteCapability_ConfirmedUnaffectedByTraceExclusion`:
Confirmed status from advert unaffected by TRACE exclusion
## Testing
All 7 multi-byte capability tests pass. Full `cmd/server` and
`cmd/ingestor` test suites pass.
Co-authored-by: you <you@example.com>
## Summary
Fixes#712 — Multi-byte capability filter buttons broken + needs
integration with Hash Adopters.
### Changes
**M1: Fix filter buttons breaking after first click**
- Root cause: `section.replaceWith(newSection)` replaced the entire DOM
node, but the event listener was attached to the old node. After
replacement, clicks went unhandled.
- Fix: Instead of replacing the whole section, only swap the table
content inside a stable `#mbAdoptersTableWrap` div. The event listener
on `#mbAdoptersSection` persists across filter changes.
- Button active state is now toggled via `classList.toggle` instead of
full DOM rebuild.
**M2: Better button labels**
- Changed from icon-only (`✅ 76`) to descriptive labels: `✅ Confirmed
(76)`, `⚠️ Suspected (81)`, `❓ Unknown (223)`
**M3: Integrate with Multi-Byte Hash Adopters**
- Merged capability status into the existing adopters table as a new
"Status" column
- Removed the separate "Repeater Multi-Byte Capability" section
- Filter buttons now apply to the integrated table
- Nodes without capability data default to ❓ Unknown
- Capability data is looked up by pubkey from the existing
`multiByteCapability` API response (no backend changes needed)
### Performance
- No new API calls — capability data already exists in the hash sizes
response
- Filter toggle is O(n) where n = number of adopter nodes (typically
<500)
- Event delegation on stable parent — no listener re-attachment needed
### Tests
- Updated existing `renderMultiByteCapability` tests for new label
format
- Added 5 new tests for `renderMultiByteAdopters`: empty state, status
integration, text labels with counts, unknown default, Status column
presence
- All 507 frontend tests pass, all Go tests pass
Co-authored-by: you <you@example.com>
## Problem
`HeapAlloc`-based eviction cascades on large databases — evicts down to
near-zero packets because Go runtime overhead exceeds `maxMemoryMB` even
with an empty packet store.
## Fix (per Carmack spec on #710)
1. **Self-accounting `trackedBytes`** — running counter maintained on
insert/evict, computed from actual struct sizes. No
`runtime.ReadMemStats`.
2. **High/low watermark hysteresis** (100%/85%) — evict to 85% of
budget, don't re-trigger until 100% crossed again.
3. **25% per-pass safety cap** — never evict more than a quarter of
packets in one cycle.
4. **Oldest-first** — evict from sorted head, O(1) candidate selection.
`maxMemoryMB` now means packet store budget, not total process heap.
Fixes#710
Co-authored-by: you <you@example.com>
## Summary
Fixes#701 — Live feed timestamps showed stale relative times (e.g. "2s
ago" never updated to "5m ago").
## Root Cause
`formatLiveTimestampHtml()` was called once when each feed item was
created and never refreshed. The dedup path (when a duplicate hash moves
an item to the top) also didn't update the timestamp.
## Changes
### `public/live.js`
- **`data-ts` attribute on `.feed-time` spans**: All three feed item
creation paths (VCR replay, `addFeedItemDOM`, `addFeedItem`) now store
the packet timestamp as `data-ts` on the `.feed-time` span element
- **10-second refresh interval**: A `setInterval` queries all
`.feed-time[data-ts]` elements and re-renders their content via
`formatLiveTimestampHtml()`, keeping relative times accurate
- **Dedup path timestamp update**: When a duplicate hash observation
moves an existing feed item to the top, the `.feed-time` span is updated
with the new observation's timestamp
- **Cleanup**: The interval is cleared on page teardown alongside other
intervals
### `test-live.js`
- 3 new tests: formatting idempotency, numeric timestamp acceptance,
`data-ts` round-trip correctness
## Performance
- The refresh interval runs every 10s, iterating over at most 25
`.feed-time` DOM elements (feed is capped at 25 items via `while
(feed.children.length > 25)`). Negligible overhead.
- Uses `querySelectorAll` with attribute selector — O(n) where n ≤ 25.
## Testing
- All 3 new tests pass
- All pre-existing test suites pass (70 live.js tests, 62 packet-filter,
501 frontend-helpers)
- 8 pre-existing failures in `test-live.js` are unrelated
(`getParsedDecoded` missing from sandbox)
Co-authored-by: you <you@example.com>
## Problem
Nodes that only appear as relay hops in packet paths (via
`resolved_path`) were never indexed in `byNode`, so `last_heard` was
never computed for them. This made relay-only nodes show as dead/stale
even when actively forwarding traffic.
Fixes#660
## Root Cause
`indexByNode()` only indexed pubkeys from decoded JSON fields (`pubKey`,
`destPubKey`, `srcPubKey`). Relay nodes appearing in `resolved_path`
were ignored entirely.
## Fix
`indexByNode()` now also iterates:
1. `ResolvedPath` entries from each observation
2. `tx.ResolvedPath` (best observation's resolved path, used for
DB-loaded packets)
A per-call `indexed` set prevents double-indexing when the same pubkey
appears in both decoded JSON and resolved path.
Extracted `addToByNode()` helper to deduplicate the nodeHashes/byNode
append logic.
## Scope
**Phase 1 only** — server-side in-memory indexing. No DB changes, no
ingestor changes. This makes `last_heard` reflect relay activity with
zero risk to persistence.
## Tests
5 new test cases in `TestIndexByNodeResolvedPath`:
- Resolved path pubkeys from observations get indexed
- Null entries in resolved path are skipped
- Relay-only nodes (no decoded JSON match) appear in `byNode`
- Dedup between decoded JSON and resolved path
- `tx.ResolvedPath` indexed when observations are empty
All existing tests pass unchanged.
## Complexity
O(observations × path_length) per packet — typically 1-3 observations ×
1-3 hops. No hot-path regression.
---------
Co-authored-by: you <you@example.com>
## Problem
`SQLITE_BUSY` contention between the ingestor and server's async
persistence goroutine drops `resolved_path` and `neighbor_edges`
updates. The DSN parameter `_busy_timeout=10000` may not be honored by
the modernc/sqlite driver.
## Fix
- **`openRW()` now sets `PRAGMA busy_timeout = 5000`** after opening the
connection, guaranteeing SQLite retries for up to 5 seconds before
returning `SQLITE_BUSY`
- **Refactored `PruneOldPackets` and `PruneOldMetrics`** to use
`openRW()` instead of duplicating connection setup — all RW connections
now get consistent busy_timeout handling
- Added test verifying the pragma is set correctly
## Changes
| File | Change |
|------|--------|
| `cmd/server/neighbor_persist.go` | `openRW()` sets `PRAGMA
busy_timeout = 5000` after open |
| `cmd/server/db.go` | `PruneOldPackets` and `PruneOldMetrics` use
`openRW()` instead of inline `sql.Open` |
| `cmd/server/neighbor_persist_test.go` | `TestOpenRW_BusyTimeout`
verifies pragma is set |
## Performance
No performance impact — `PRAGMA busy_timeout` is a connection-level
setting with zero overhead on uncontended writes. Under contention, it
converts immediate `SQLITE_BUSY` failures into brief retries (up to 5s),
which is strictly better than dropping data.
Fixes#705
---------
Co-authored-by: you <you@example.com>
PR #686 added internal/sigvalidate/ with replace directives in both
go.mod files but didn't update the Dockerfile to COPY it into the
Docker build context. go mod download fails with 'no such file'.
## Summary
Adds a new "Repeater Multi-Byte Capability" section to the Hash Stats
analytics tab that classifies each repeater's ability to handle
multi-byte hash prefixes (firmware >= v1.14).
Fixes#689
## What Changed
### Backend (`cmd/server/store.go`)
- New `computeMultiByteCapability()` method that infers capability for
each repeater using two evidence sources:
- **Confirmed** (100% reliable): node has advertised with `hash_size >=
2`, leveraging existing `computeNodeHashSizeInfo()` data
- **Suspected** (<100%): node's prefix appears as a hop in packets with
multi-byte path headers, using the `byPathHop` index. Prefix collisions
mean this isn't definitive.
- **Unknown**: no multi-byte evidence — could be pre-1.14 or 1.14+ with
default settings
- Extended `/api/analytics/hash-sizes` response with
`multiByteCapability` array
### Frontend (`public/analytics.js`)
- New `renderMultiByteCapability()` function on the Hash Stats tab
- Color-coded table: green confirmed, yellow suspected, gray unknown
- Filter buttons to show all/confirmed/suspected/unknown
- Column sorting by name, role, status, evidence, max hash size, last
seen
- Clickable rows link to node detail pages
### Tests (`cmd/server/multibyte_capability_test.go`)
- `TestMultiByteCapability_Confirmed`: advert with hash_size=2 →
confirmed
- `TestMultiByteCapability_Suspected`: path appearance only → suspected
- `TestMultiByteCapability_Unknown`: 1-byte advert only → unknown
- `TestMultiByteCapability_PrefixCollision`: two nodes sharing prefix,
one confirmed via advert, other correctly marked suspected (not
confirmed)
## Performance
- `computeMultiByteCapability()` runs once per cache cycle (15s TTL via
hash-sizes cache)
- Leverages existing `GetNodeHashSizeInfo()` cache (also 15s TTL) — no
redundant advert scanning
- Path hop scan is O(repeaters × prefix lengths) lookups in the
`byPathHop` map, with early break on first match per prefix
- Only computed for global (non-regional) requests to avoid unnecessary
work
---------
Co-authored-by: you <you@example.com>
For BYOP mode in the packet analyzer, perform signature validation on
advert packets and display whether successful or not. This is added as
we observed many corrupted advert packets that would be easily
detectable as such if signature validation checks were performed.
At present this MR is just to add this status in BYOP mode so there is
minimal impact to the application and no performance penalty for having
to perform these checks on all packets. Moving forward it probably makes
sense to do these checks on all advert packets so that corrupt packets
can be ignored in several contexts (like node lists for example).
Let me know what you think and I can adjust as needed.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#702 — `.env` file `DISABLE_MOSQUITTO`/`DISABLE_CADDY` ignored
when using `docker run`.
## Changes
### Entrypoint sources `/app/data/.env`
The entrypoint now sources `/app/data/.env` (if present) before the
`DISABLE_*` checks. This works regardless of how the container is
started — `docker run`, compose, or `manage.sh`.
```bash
if [ -f /app/data/.env ]; then
set -a
. /app/data/.env
set +a
fi
```
### `DISABLE_CADDY` added to compose files
Both `docker-compose.yml` and `docker-compose.staging.yml` now forward
`DISABLE_CADDY` to the container environment (was missing — only
`DISABLE_MOSQUITTO` was wired).
### Deployment docs updated
- `docs/deployment.md`: bare `docker run` is now the primary/recommended
approach with a full parameter reference table
- Documents the `/app/data/.env` convenience feature
- Compose and `manage.sh` marked as legacy alternatives
- `DISABLE_CADDY` added to the environment variable reference
### README quick start updated
Shows the full `docker run` command with `--restart`, ports, and
volumes. Includes HTTPS variant. Documents `-e` flags and `.env` file.
### v3.5.0 release notes
Updated the env var documentation to mention the `.env` file support.
## Testing
- All Go server tests pass
- All Go ingestor tests pass
- No logic changes to Go code — entrypoint shell script + docs only
---------
Co-authored-by: you <you@example.com>
## Summary
Implements M1 of the draggable panels spec from #608: the `DragManager`
class with core drag mechanics.
Fixes#608 (M1: DragManager core drag mechanics)
## What's New
### `public/drag-manager.js` (~215 lines)
- **State machine:** `IDLE → PENDING → DRAGGING → IDLE`
- **5px dead zone** on `.panel-header` to disambiguate click vs drag —
prevents hijacking corner toggle and close button clicks
- **Pointer events** with `setPointerCapture` for reliable tracking
- **`transform: translate()`** during drag — zero layout reflow
- **Snap-to-edge** on release: 20px threshold snaps to 12px margin
- **Z-index management** — dragged panel comes to front (counter from
1000)
- **`_detachFromCorner()`** — transitions panel from M0 corner CSS to
fixed positioning
- **Escape key** cancels drag and reverts to pre-drag position
- **`restorePositions()`** — applies saved viewport percentages on init
- **`handleResize()`** — clamps dragged panels inside viewport on window
resize
- **`enable()`/`disable()`** — responsive gate control
### `public/live.js` integration
- Instantiates `DragManager` after `initPanelPositions()`
- Registers `liveFeed`, `liveLegend`, `liveNodeDetail` panels
- **Responsive gate:** `matchMedia('(pointer: fine) and (min-width:
768px)')` — disables drag on touch/small screens, reverts to M0 corner
toggle
- **Resize clamping** debounced at 200ms
### `public/live.css` additions
- `cursor: grab/grabbing` on `.panel-header` (desktop only via `@media
(pointer: fine)`)
- `.is-dragging` class: opacity 0.92, elevated box-shadow, `will-change:
transform`, transitions disabled
- `[data-dragged="true"]` disables corner transition animations
- `prefers-reduced-motion` support
### Persistence
- **Format:** `panel-drag-{id}` → `{ xPct, yPct }` (viewport
percentages)
- **Survives resize:** positions recalculated from percentages
- **Corner toggle still works:** clicking corner button after drag
clears drag state (handled by existing M0 code)
## Tests
14 new unit tests in `test-drag-manager.js`:
- State machine transitions (IDLE → PENDING → DRAGGING → IDLE)
- Dead zone enforcement
- Button click guard (no drag on button pointerdown)
- Snap-to-edge behavior
- Position persistence as viewport percentages
- Restore from localStorage
- Resize clamping
- Disable/enable
## Performance
- `transform: translate()` during drag — compositor-only, no layout
reflow
- `will-change: transform` only during active drag (`.is-dragging`),
removed on drop
- `localStorage` write only on `pointerup`, never during `pointermove`
- Resize handler debounced at 200ms
- Single `style.transform` assignment per pointermove frame — negligible
cost
---------
Co-authored-by: you <you@example.com>
## Problem
The neighbor graph creates separate entries for the same physical node
when observed with different prefix lengths. For example, a 1-byte
prefix `B0` (ambiguous, unresolved) and a 2-byte prefix `B05B` (resolved
to Busbee) appear as two separate neighbors of the same node.
Fixes#698
## Solution
### Part 1: Post-build resolution pass (Phase 1.5)
New function `resolveAmbiguousEdges(pm, graph)` in `neighbor_graph.go`:
- Called after `BuildFromStore()` completes the full graph, before any
API use
- Iterates all ambiguous edges and attempts resolution via
`resolveWithContext` with full graph context
- Only accepts high-confidence resolutions (`neighbor_affinity`,
`geo_proximity`, `unique_prefix`) — rejects
`first_match`/`gps_preference` fallbacks to avoid false positives
- Merges with existing resolved edges (count accumulation, max LastSeen)
or updates in-place
- Phase 1 edge collection loop is **unchanged**
### Part 2: API-layer dedup (defense-in-depth)
New function `dedupPrefixEntries()` in `neighbor_api.go`:
- Scans neighbor response for unresolved prefix entries matching
resolved pubkey entries
- Merges counts, timestamps, and observers; removes the unresolved entry
- O(n²) on ~50 neighbors per node — negligible cost
### Performance
Phase 1.5 runs O(ambiguous_edges × candidates). Per Carmack's analysis:
~50ms at 2K nodes on the 5-min rebuild cycle. Hot ingest path untouched.
## Tests
9 new tests in `neighbor_dedup_test.go`:
1. **Geo proximity resolution** — ambiguous edge resolved when candidate
has GPS near context node
2. **Merge with existing** — ambiguous edge merged into existing
resolved edge (count accumulation)
3. **No-match preservation** — ambiguous edge left as-is when prefix has
no candidates
4. **API dedup** — unresolved prefix merged with resolved pubkey in
response
5. **Integration** — node with both 1-byte and 2-byte prefix
observations shows single neighbor entry
6. **Phase 1 regression** — non-ambiguous edge collection unchanged
7. **LastSeen preservation** — merge keeps higher LastSeen timestamp
8. **No-match dedup** — API dedup doesn't merge non-matching prefixes
9. **Benchmark** — Phase 1.5 with 500+ edges
All existing tests pass (server + ingestor).
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#665 — companion nodes claimed in "My Mesh" showed "Could not load
data" because they never sent an advert, so they had no `nodes` table
entry, causing the health API to return 404.
## Three-Layer Fix
### 1. API Resilience (`cmd/server/store.go`)
`GetNodeHealth()` now falls back to building a partial response from the
in-memory packet store when `GetNodeByPubkey()` returns nil. Returns a
synthetic node stub (`role: "unknown"`, `name: "Unknown"`) with whatever
stats exist from packets, instead of returning nil → 404.
### 2. Ingestor Cleanup (`cmd/ingestor/main.go`)
Removed phantom sender node creation that used `"sender-" + name` as the
pubkey. Channel messages don't carry the sender's real pubkey, so these
synthetic entries were unreachable from the claiming/health flow — they
just polluted the nodes table with unmatchable keys.
### 3. Frontend UX (`public/home.js`)
The catch block in `loadMyNodes()` now distinguishes 404 (node not in DB
yet) from other errors:
- **404**: Shows 📡 "Waiting for first advert — this node has been seen
in channel messages but hasn't advertised yet"
- **Other errors**: Shows ❓ "Could not load data" (unchanged)
## Tests
- Added `TestNodeHealthPartialFromPackets` — verifies a node with
packets but no DB entry returns 200 with synthetic node stub and stats
- Updated `TestHandleMessageChannelMessage` — verifies channel messages
no longer create phantom sender nodes
- All existing tests pass (`cmd/server`, `cmd/ingestor`)
Co-authored-by: you <you@example.com>
## Summary
Fixes#683 — TRACE packets on the live map were showing the full path
instead of distinguishing completed vs remaining hops.
## Root Cause
Both WebSocket broadcast builders in `store.go` constructed the
`decoded` map with only `header` and `payload` keys — `path` was never
included. The frontend reads `decoded.path.hopsCompleted` to split trace
routes into solid (completed) and dashed (remaining) segments, but that
field was always `undefined`.
## Fix
For TRACE packets (payload type 9), call `DecodePacket()` on the raw hex
during broadcast and include the resulting `Path` struct in
`decoded["path"]`. This populates `hopsCompleted` which the frontend
already knows how to consume.
Both broadcast builders are patched:
- `IngestNewFromDB()` — new transmissions path (~line 1419)
- `IngestNewObservations()` — new observations path (~line 1680)
TRACE packets are infrequent, so the per-packet decode overhead is
negligible.
## Testing
- Added `TestIngestTraceBroadcastIncludesPath` — verifies that TRACE
broadcast maps include `decoded.path` with correct `hopsCompleted` value
- All existing tests pass (`cmd/server` + `cmd/ingestor`)
Co-authored-by: you <you@example.com>
Fixes#685
## Problem
Corner positioning CSS (from PR #608) sets `bottom: 12px` for
bottom-positioned panels (`bl`, `br`), but the VCR bar at the bottom of
the live page is ~50px tall. This causes the legend (and any
bottom-positioned panel) to overlap the VCR controls.
## Fix
Changed `bottom: 12px` → `bottom: 58px` for both
`.live-overlay[data-position="bl"]` and
`.live-overlay[data-position="br"]`, matching the legend's original
`bottom: 58px` value that properly clears the VCR bar.
The VCR bar height is fixed (`.vcr-bar` class with consistent padding),
so a hardcoded value is appropriate here.
## Testing
- All existing tests pass (`npm test` — 13/13)
- CSS-only change, no logic affected
Co-authored-by: you <you@example.com>
## Summary
Closes the symmetry gap flagged as a nit in PR #653 review:
> The ingestor decoder tests omit `RouteTransportDirect` zero-hop tests
— only the server decoder has those. Since the logic is identical, this
is not a blocker, but adding them would make the test suites symmetric.
- Adds `TestZeroHopTransportDirectHashSize` — `pathByte=0x00`, expects
`HashSize=0`
- Adds `TestZeroHopTransportDirectHashSizeWithNonZeroUpperBits` —
`pathByte=0xC0` (hash_size bits set, hash_count=0), expects `HashSize=0`
Both mirror the equivalent tests already present in
`cmd/server/decoder_test.go`.
## Test plan
- [ ] `cd cmd/ingestor && go test -run TestZeroHopTransportDirect -v` →
both new tests pass
- [ ] `cd cmd/ingestor && go test ./...` → no regressions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- Release notes for 95 commits since v3.4.1
- OpenAPI/Swagger docs: /api/spec and /api/docs called out everywhere
- Deployment guide: new API Documentation section
- README: API docs link added
- FAQ: 'Where is the API documentation?' entry
- Test plans for v3.4.2 validation
## Problem
All table sorting on the Nodes page was broken — clicking column headers
did nothing. Affected:
- Nodes list table
- Node detail → Neighbors table
- Node detail → Observers table
## Root Cause
**Not a race condition** — the actual bug was a **data attribute
mismatch**.
`TableSort.init()` (in `table-sort.js`) queries for `th[data-sort-key]`
to find sortable columns. But all table headers in `nodes.js` used
`data-sort="..."` instead of `data-sort-key="..."`. The selector never
matched any headers, so no click handlers were attached and sorting
silently failed.
Additionally, `data-type="number"` was used but TableSort's built-in
comparator is named `numeric`, causing numeric columns to fall back to
text comparison.
The packets table (`packets.js`) was unaffected because it already used
the correct `data-sort-key` and `data-type="numeric"` attributes.
## Fix
1. **`public/nodes.js`**: Changed all `data-sort="..."` to
`data-sort-key="..."` on `<th>` elements (nodes list, neighbors table,
observers table)
2. **`public/nodes.js`**: Changed `data-type="number"` to
`data-type="numeric"` to match TableSort's comparator names
3. **`public/packets.js`**: Added timestamp tiebreaker to packet sort
for stable ordering when primary column values are equal
## Testing
- All existing tests pass (`npm test`)
- No changes to test infrastructure needed — this was a pure HTML
attribute fix
Fixes#679
---------
Co-authored-by: you <you@example.com>
## Fix: Channel Color Picker — Data Shape Mismatch + Redesign (#674)
### Problem
The channel color picker was completely non-functional — dead code.
Three locations in `live.js` attempted to read
`decoded.header.payloadTypeName` and `decoded.payload.channelName`, but:
1. The decoded payload structure is flat
(`decoded.payload.channelHash`), not nested with separate
`header`/`payload` objects within the payload
2. The field is `channelHash` (an integer), not `channelName`
3. `_ccChannel` was **never set** on any DOM element, so all picker
handlers exited early
Additionally, the picker had zero discoverability — hidden behind
right-click/long-press with no visual affordance.
### Changes
**M1 — Fix the data shape bug:**
- Fixed `_ccChannel` assignment in 3 locations in `live.js` to use
`decoded.payload.channelHash` (converted to string)
- Fixed `_getChannelStyle()` to use the same flat structure
- Channel colors now key on the hash string (e.g. `"5"`) matching the
channels API
**M2 — Redesign for discoverability:**
- Reduced palette from 10 to **8 maximally-distinct colors** (removed
teal/rose — too close to cyan/red)
- Removed `<input type="color">` custom picker, "Apply" button, title
bar, close button
- Popover is now just 8 circle swatches + "Clear color" — click outside
to dismiss
- Added **12px clickable color dots** next to channel names on the
channels page (primary configuration surface)
- Unassigned channels show a dashed-border empty circle; assigned show
filled
- Channel list items get `border-left: 3px solid` when colored
- **Removed long-press handler entirely** — dots handle mobile
interaction
- Mobile: bottom-sheet with 36px touch targets via `@media (pointer:
coarse)`
**M3 — Visual encoding:**
- Left border only (3px) — no background tint (per Tufte spec: minimum
effective dose)
- Consistent encoding across live feed items, channel list, packets
table
### Tests
17 new tests in `test-channel-color-picker.js`:
- `_ccChannel` correctly set for GRP_TXT with various `channelHash`
values (including 0)
- `_ccChannel` not set for non-GRP_TXT packets
- `getRowStyle` returns `border-left:3px` only (no background)
- Palette is exactly 8 colors, no teal/rose
- All existing tests pass (62 + 29 + 490)
Fixes#674
---------
Co-authored-by: you <you@example.com>
Removes linux/arm64 from multi-platform build and drops QEMU setup.
All infra (prod + staging) is x86. QEMU emulation was adding ~12min
to every CI run for an unused architecture.
The buildFieldTable test expected hash_size=4 for path byte 0xC0 with
hash_count=0. After #653, zero hash_count shows 'hash_count=0 (direct
advert)' instead. Updated test and added new test verifying hash_size
IS shown when hash_count > 0.
## Noise Floor: Line Chart → Color-Coded Column Chart
Implements M3a from the [RF Health Dashboard
spec](https://github.com/Kpa-clawbot/CoreScope/issues/600#issuecomment-2784399622)
— replacing the noise floor line chart with discrete color-coded
columns.
### What changed
**`public/analytics.js`** — replaced `rfNFLineChart()` with
`rfNFColumnChart()`:
- **Color-coded bars by threshold**: green (`< -100 dBm`), yellow (`-100
to -85 dBm`), red (`≥ -85 dBm`)
- **Instant hover tooltips**: exact dBm value + UTC timestamp via native
SVG `<title>` — no delay
- **Column highlighting on hover**: CSS `:hover` with opacity change +
border stroke
- **Inline legend**: green/yellow/red threshold key in chart header
- **Removed reference lines**: the `-100 warning` and `-85 critical`
dashed lines are eliminated — threshold info is now encoded directly in
bar color (data-ink ratio improvement)
- **No gap detection**: column charts render discrete bars — each data
point is an independent observation, so line-chart-style gap detection
doesn't apply. Every sample gets a bar.
- **Reboot markers**: vertical dashed lines with "reboot" labels at
reboot timestamps (shared `rfRebootMarkers` helper, same as other RF
charts)
- **Division-by-zero guard**: constant values or single data points use
a ±5 dBm window so bars render with visible height
- **Sparklines unchanged**: fleet overview sparklines remain as
polylines (correct at 140×24px scale)
### Why columns instead of lines
A polyline connecting discrete 5-minute noise floor samples creates
false visual continuity — it implies interpolation between measurements
that doesn't exist. When readings jump between -115 and -95 irregularly,
the line becomes a jagged mess. Column bars encode each sample as a
discrete, independent observation: one bar = one measurement.
### Testing
- 12 unit tests in `test-frontend-helpers.js` covering: SVG output,
threshold color coding, tooltips, empty/single/constant data, legend
rendering, reboot markers, shared time axis
- All existing tests pass (packet-filter: 62, aging: 29,
frontend-helpers: 490)
### No backend changes
Pure frontend change — ~150 lines in `analytics.js`.
Fixes#600
---------
Co-authored-by: you <you@example.com>
## Problem
The "Paths Through This Node" API endpoint (`/api/nodes/{pubkey}/paths`)
returns unrelated packets when two nodes share a hex prefix. For
example, querying paths for "Kpa Roof Solar" (`c0dedad4...`) returns 316
packets that actually belong to "C0ffee SF" (`C0FFEEC7...`) because both
share the `c0` prefix in the `byPathHop` index.
Fixes#655
## Root Cause
`handleNodePaths()` in `routes.go` collects candidates from the
`byPathHop` index using 2-char and 4-char hex prefixes for speed, but
never verifies that the target node actually appears in each candidate's
resolved path. The broad index lookup is intentional, but the
**post-filter was missing**.
## Fix
Added `nodeInResolvedPath()` helper in `store.go` that checks whether a
transmission's `resolved_path` (from the neighbor affinity graph via
`resolveWithContext`) contains the target node's full pubkey. The
filter:
- **Includes** packets where `resolved_path` contains the target node's
full pubkey
- **Excludes** packets where `resolved_path` resolved to a different
node (prefix collision)
- **Excludes** packets where `resolved_path` is nil/empty (ambiguous —
avoids false positives)
The check examines both the best observation's resolved_path
(`tx.ResolvedPath`) and all individual observations, so packets are
included if *any* observation resolved the target.
## Tests
- `TestNodeInResolvedPath` — unit test for the helper with 5 cases
(match, different node, nil, all-nil elements, match in observation
only)
- `TestNodePathsPrefixCollisionFilter` — integration test: two nodes
sharing `aa` prefix, verifies the collision packet is excluded from one
and included for the other
- Updated test DB schema to include `resolved_path` column and seed data
with resolved pubkeys
- All existing tests pass (165 additions, 8 modifications)
## Performance
No impact on hot paths. The filter runs once per API call on the
already-collected candidate set (typically small). `nodeInResolvedPath`
is O(observations × hops) per candidate — negligible since observations
per transmission are typically 1–5.
---------
Co-authored-by: you <you@example.com>
## Summary
TRACE packets on the live map previously animated the **full intended
route** regardless of how far the trace actually reached. This made it
impossible to distinguish a completed route from a failed one —
undermining the primary diagnostic purpose of trace packets.
## Changes
### Backend — `cmd/server/decoder.go`
- Added `HopsCompleted *int` field to the `Path` struct
- For TRACE packets, the header path contains SNR bytes (one per hop
that actually forwarded). Before overwriting `path.Hops` with the full
intended route from the payload, we now capture the header path's
`HashCount` as `hopsCompleted`
- This field is included in API responses and WebSocket broadcasts via
the existing JSON serialization
### Frontend — `public/live.js`
- For TRACE packets with `hopsCompleted < totalHops`:
- Animate only the **completed** portion (solid line + pulse)
- Draw the **unreached** remainder as a dashed/ghosted line (25%
opacity, `6,8` dash pattern) with ghost markers
- Dashed lines and ghost markers auto-remove after 10 seconds
- When `hopsCompleted` is absent or equals total hops, behavior is
unchanged
### Tests — `cmd/server/decoder_test.go`
- `TestDecodePacket_TraceHopsCompleted` — partial completion (2 of 4
hops)
- `TestDecodePacket_TraceNoSNR` — zero completion (trace not forwarded
yet)
- `TestDecodePacket_TraceFullyCompleted` — all hops completed
## How it works
The MeshCore firmware appends an SNR byte to `pkt->path[]` at each hop
that forwards a TRACE packet. The count of these SNR bytes (`path_len`)
indicates how far the trace actually got. CoreScope's decoder already
parsed the header path, but the TRACE-specific code overwrote it with
the payload hops (full intended route) without preserving the progress
information. Now we save that count first.
Fixes#651
---------
Co-authored-by: you <you@example.com>
## Summary
The "By Repeaters" section on the Hash Stats analytics page was counting
**all** node types (companions, room servers, sensors, etc.) instead of
only repeaters. This made the "By Repeaters" distribution identical to
"Multi-Byte Hash Adopters", defeating the purpose of the breakdown.
Fixes#652
## Root Cause
`computeAnalyticsHashSizes()` in `cmd/server/store.go` built its
`byNode` map from advert packet data without cross-referencing node
roles from the node store. Both `distributionByRepeaters` and
`multiByteNodes` consumed this unfiltered map.
## Changes
### `cmd/server/store.go`
- Build a `nodeRoleByPK` lookup map from `getCachedNodesAndPM()` at the
start of the function
- Store `role` in each `byNode` entry when processing advert packets
- **`distributionByRepeaters`**: filter to only count nodes whose role
contains "repeater"
- **`multiByteNodes`**: include `role` field in output so the frontend
can filter/group by node type
### `cmd/server/coverage_test.go`
- Add `TestHashSizesDistributionByRepeatersFiltersRole`: verifies that
companion nodes are excluded from `distributionByRepeaters` but included
in `multiByteNodes` with correct role
### `cmd/server/routes_test.go`
- Fix `TestHashAnalyticsZeroHopAdvert`: invalidate node cache after DB
insert so role lookup works
- Fix `TestAnalyticsHashSizeSameNameDifferentPubkey`: insert node
records as repeaters + invalidate cache
## Testing
All `cmd/server` tests pass (68 insertions, 3 deletions across 3 files).
Co-authored-by: you <you@example.com>
## Fix: Zero-hop DIRECT packets report bogus hash_size
Closes#649
### Problem
When a DIRECT packet has zero hops (pathByte lower 6 bits = 0), the
generic `hash_size = (pathByte >> 6) + 1` formula produces a bogus value
(1-4) instead of 0/unknown. This causes incorrect hash size displays and
analytics for zero-hop direct adverts.
### Solution
**Frontend (JS):**
- `packets.js` and `nodes.js` now check `(pathByte & 0x3F) === 0` to
detect zero-hop packets and suppress bogus hash_size display.
**Backend (Go):**
- Both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go` reset
`HashSize=0` for DIRECT packets where `pathByte & 0x3F == 0` (hash_count
is zero).
- TRACE packets are excluded since they use hashSize to parse hop data
from the payload.
- The condition uses `pathByte & 0x3F == 0` (not `pathByte == 0x00`) to
correctly handle the case where hash_size bits are non-zero but
hash_count is zero — matching the JS frontend approach.
### Testing
**Backend:**
- Added 4 tests each in `cmd/server/decoder_test.go` and
`cmd/ingestor/decoder_test.go`:
- DIRECT + pathByte 0x00 → HashSize=0 ✅
- DIRECT + pathByte 0x40 (hash_size bits set, hash_count=0) → HashSize=0
✅
- Non-DIRECT + pathByte 0x00 → HashSize=1 (unchanged) ✅
- DIRECT + pathByte 0x01 (1 hop) → HashSize=1 (unchanged) ✅
- All existing tests pass (`go test ./...` in both cmd/server and
cmd/ingestor)
**Frontend:**
- Verified hash size display is suppressed for zero-hop direct adverts
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- **`app.js`**: `getDistanceUnit()`, `formatDistance(km)`,
`formatDistanceRound(km)` helpers. Auto mode uses `navigator.language` —
miles for `en-US`, `en-GB`, `my`, `lr`; km everywhere else.
- **`customize-v2.js`**: Distance Unit preference (km / mi / auto) in
Display Settings panel. Stored in
`localStorage['meshcore-distance-unit']` via the existing apply
pipeline. Override dot and reset work. Display tab badge counts it.
- **`nodes.js`**: Neighbor table distance cell uses `formatDistance()`.
- **`analytics.js`**: All rendered km values use `formatDistance()` or
`formatDistanceRound()`. Column headers (`km`/`mi`) respond to the
active unit. Collision classification thresholds (Local < 50 km /
Regional 50–200 km / Distant > 200 km) also adapt.
Default is `auto` — no change for existing users unless their locale
maps to miles.
## Test plan
- [x] `node test-frontend-helpers.js` — 456 passed, 0 failed (10 new
formatDistance tests)
- [ ] Set unit to **mi** in customize → Neighbors table shows `7.6 mi`
instead of `12.3 km`
- [ ] Analytics → Distance tab → stat cards, leaderboard, and column
headers all show miles
- [ ] Collision tool → Local/Regional/Distant thresholds show `31 mi` /
`124 mi`
- [ ] Route patterns popup shows miles per hop and total
- [ ] Reset override dot → unit returns to auto
Closes#621🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `originLat` was declared with `const` inside two block-scoped
`if`/`else` branches in `resolveHopPositions` (lines 1914 and 1921) but
referenced at line 1945 outside both blocks → `ReferenceError: originLat
is not defined` thrown on every packet render on the live page.
- Fix: introduce `senderLat` derived directly from
`payload.lat`/`payload.lon` at the point of use, using the same
null/zero guard as the existing declarations.
## Test plan
- [x] Live page no longer shows `ReferenceError: originLat is not
defined` in the console
- [x] Packet path animations still render correctly for packets with GPS
coords
- [x] Packets without GPS coords still handled (senderLat === null,
anchor not added)
Closes#647🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes remaining text inconsistencies in the Prefix Tool after #643 added
the repeater filter.
The Torvalds review on #643 flagged:
1. **Must-fix (already addressed in #643):** "About these numbers" text
— fixed
2. **Out-of-scope:** Empty state says "No nodes" should say "No
repeaters"
This PR fixes ALL remaining "nodes" references in the Prefix Tool to say
"repeaters":
- Empty state: "No nodes in the network yet" → "No repeaters in the
network yet"
- Stat card label: "Total nodes" → "Total repeaters"
- Region note link: "Check all nodes →" → "Check all repeaters →"
- Recommendation text: "With N nodes" → "With N repeaters"
Verified: zero occurrences of stale "all nodes", "Total nodes", or "No
nodes" remain in the Prefix Tool section.
Closes#642
Co-authored-by: you <you@example.com>
## Summary
- **nodes.js**: `#/nodes?tab=repeater` and `#/nodes?search=foo` — role
tab and search query are now URL-addressable; state resets to defaults
on re-navigation
- **packets.js**: `#/packets?timeWindow=60` and
`#/packets?region=US-SFO` — time window and region filter survive
refresh and are shareable
- **channels.js**: `#/channels/{hash}?node=Name` — node detail panel is
URL-addressable; auto-opens on load, URL updates on open/close
- **region-filter.js**: adds `RegionFilter.setSelected(codesArray)` to
public API (needed for URL-driven init)
All changes use `history.replaceState` (not `pushState`) to avoid
polluting browser history. URL params override localStorage on load;
localStorage remains fallback.
## Implementation notes
- Router strips query string before computing `routeParam`, so all pages
read URL params directly from `location.hash`
- `buildNodesQuery(tab, searchStr)` and `buildPacketsUrl(timeWindowMin,
regionParam)` are pure functions exposed on `window` for testability
- Region URL param is applied after `RegionFilter.init()` via a
`_pendingUrlRegion` module-level var to keep ordering explicit
- `showNodeDetail` captures `selectedHash` before the async `lookupNode`
call to avoid stale URL construction
## Test plan
- [x] `node test-frontend-helpers.js` — 459 passed, 0 failed (includes 6
`buildNodesQuery` + 5 `buildPacketsUrl` unit tests)
- [x] Navigate to `#/nodes?tab=repeater` — Repeaters tab active on load
- [x] Click a tab, verify URL updates to `#/nodes?tab=room`
- [x] Navigate to `#/packets?timeWindow=60` — time window dropdown shows
60 min
- [x] Change time window, verify URL updates
- [x] Navigate to `#/channels/{hash}` and click a sender name — URL
updates to `?node=Name`
- [x] Reload that URL — node panel re-opens
Closes#536🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Hash Issues and Prefix Tool tabs showed different collision counts
because the Prefix Tool was including all node types (companions, rooms,
sensors) while Hash Issues correctly filtered to repeaters only.
**Only repeaters matter for prefix collisions** — they're the nodes that
relay packets using hash-based addressing. Non-repeater collisions are
harmless noise.
## Changes
1. **Filtered Prefix Tool to repeaters only** — matches Hash Issues'
scope
2. **Updated explanatory text** — both tabs now clearly state they cover
repeaters
3. **Added cross-reference links** between the two tabs
4. **Added hash_size badges** in Prefix Tool results
Both tabs should now agree on collision counts for each byte size.
## Review Status
- ✅ Self-review
- ✅ Torvalds review — caught stale 'regardless of role' text, fixed
- ✅ All tests pass
Fixes#642
---------
Co-authored-by: you <you@example.com>
## Summary
Replace source-grep virtual scroll tests with behavioral tests that
exercise actual logic. Fixes#405, Fixes#409.
## What changed
### packets.js
- **Extracted `_calcVisibleRange()`** — pure function containing the
binary-search range calculation logic previously inline in
`renderVisibleRows()`. Takes offsets, scroll position, viewport
dimensions, row height, thead offset, and buffer as parameters. Returns
`{ startIdx, endIdx, firstEntry, lastEntry }`.
- `renderVisibleRows()` now calls `_calcVisibleRange()` instead of
inline math — no behavioral change.
- Exported via `_packetsTestAPI` for direct testing.
### test-frontend-helpers.js
- **Removed 8 source-grep tests** that used
`packetsSource.includes(...)` to check strings exist in source code (not
behavior):
- "renderVisibleRows uses cumulative offsets not flat entry count"
- "renderVisibleRows skips DOM rebuild when range unchanged"
- "lazy row generation — HTML built only for visible slice"
- "observer filter Set is hoisted, not recreated per-packet"
- "packets.js display filter checks _children for observer match"
- "packets.js WS filter checks _children for observer match"
- "buildFlatRowHtml has null-safe decoded_json"
- "pathHops null guard in buildFlatRowHtml / detail pane"
- "destroy cleans up virtual scroll state"
- **Added 11 behavioral tests for `_calcVisibleRange()`** loaded from
the actual packets.js via sandbox:
- Top of list (scroll = 0)
- Middle of list (scroll to row 50)
- Bottom of list (scroll past end)
- Empty array (0 entries)
- Single item
- Exact row boundary
- Large dataset (30K items)
- Various row heights (24px instead of 36px)
- Thead offset shifting visible range
- Expanded groups with variable row counts
- Buffer clamped at boundaries
- **Kept all existing behavioral tests**: `cumulativeRowOffsets`,
`getRowCount`, observer filter logic (#537).
## Test count
- Removed: 8 source-grep tests
- Added: 11 behavioral tests
- Net: +3 tests (446 total, 0 failures)
## Why
Source-grep tests (`packetsSource.includes('...')`) are brittle — they
break on refactors even when behavior is preserved, and they pass even
when the tested code is buggy. Behavioral tests exercise real
inputs/outputs and catch actual regressions.
Co-authored-by: you <you@example.com>
## Summary
Implements M2 of the table sorting spec (#620): sortable nodes list +
neighbor/observer tables.
### Changes
**Shared utility (`public/table-sort.js`)**
- IIFE pattern, no dependencies, no build step
- DOM-reorder sorting (no innerHTML rebuild) — preserves event listeners
- `data-value` attributes for raw sortable values, `data-type` on `<th>`
for type detection
- Built-in comparators: text (`localeCompare`), number, date, dBm
- `aria-sort` attributes, keyboard support (Enter/Space), sort arrows
- localStorage persistence with `storageKey` option
- `onSort` callback for custom re-render triggers
**Nodes list table**
- Wired via `TableSort.init` with `onSort` callback that triggers
`renderRows()`
- Keeps JS-array-level sorting for claimed/favorites pinning (TableSort
can't handle pinned rows)
- Replaces old `sortState`, `toggleSort()`, `sortArrow()` with TableSort
controller
- Test hooks preserved for backward compatibility (fallback state for
non-DOM tests)
**Neighbor table**
- Added `data-sort` and `data-value` attributes to all columns (name,
role, score, count, last_seen, distance)
- Default sort: count descending
- `TableSort.init` called after neighbor data renders
**Observer table (full detail page)**
- Converted from plain `<table>` to sortable table with data attributes
- Sortable columns: observer, region, packets, avg SNR, avg RSSI
- Default sort: packets descending
### Testing
- 18 new unit tests for `table-sort.js` (custom DOM mock, no jsdom
dependency)
- All 445 existing frontend tests pass unchanged
- All packet-filter (62) and aging (29) tests pass
### Note
This branch includes `table-sort.js` since M1 hasn't merged yet. The
utility code is identical to the M1 spec.
---------
Co-authored-by: you <you@example.com>
## Consolidate CI Pipeline — Build + Publish to GHCR + Deploy Staging
### What
Merges the separate `publish.yml` workflow into `deploy.yml`, creating a
single CI/CD pipeline:
**`go-test → e2e-test → build-and-publish → deploy → publish-badges`**
### Why
- Two workflows doing overlapping builds was wasteful and error-prone
- `publish.yml` had a bug: `BUILD_TIME=$(date ...)` in a `with:` block
never executed (literal string)
- The old build job had duplicate/conflicting `APP_VERSION` assignments
### Changes
- **`build-and-publish` job** replaces old `build` job — builds locally
for staging, then does multi-arch GHCR push (gated to push events only,
PRs skip)
- **Build metadata** computed in a dedicated step, passed via
`GITHUB_OUTPUT` — no more shell expansion bugs
- **`APP_VERSION`** is `v1.2.3` on tag push, `edge` on master push
- **Deploy** now pulls the `edge` image from GHCR and tags for compose
compatibility, with fallback to local build
- **`publish.yml` deleted** — no duplicate workflow
- **Top-level `permissions`** block with `packages:write` for GHCR auth
- **Triggers** now include `tags: ['v*']` for release publishing
### Status
- ✅ Rebased onto master
- ✅ Self-reviewed (all checklist items pass)
- ✅ Ready for merge
Co-authored-by: you <you@example.com>
## Summary
Implements M1 of the table sorting spec (#620): a shared `TableSort`
utility module and integration with the packets table.
### What's included
**1. `public/table-sort.js` — Shared sort utility (IIFE, no
dependencies)**
- `TableSort.init(tableEl, options)` — attaches click-to-sort on `<th
data-sort-key="...">` elements
- Built-in comparators: text (localeCompare), numeric, date (ISO), dBm
(strips suffix)
- NaN/null values sort last consistently
- Visual: ▲/▼ `<span class="sort-arrow">` appended to active column
header
- Accessibility: `aria-sort="ascending|descending|none"`, keyboard
support (Enter/Space)
- DOM reorder via `appendChild` loop (no innerHTML rebuild)
- `domReorder: false` option for virtual scroll tables (packets)
- `storageKey` option for localStorage persistence
- Custom comparator override per column
- `onSort(column, direction)` callback
- `destroy()` for clean teardown
**2. Packets table integration**
- All columns sortable: region, time, hash, size, HB, type, observer,
path, rpt
- Default sort: time descending (matches existing behavior)
- Uses `domReorder: false` + `onSort` callback to sort the data array,
then re-render via virtual scroll
- Works with both grouped and ungrouped views
- WebSocket updates respect active sort column
- Sort preference persisted in localStorage (`meshcore-packets-sort`)
**3. Tests — 22 unit tests (`test-table-sort.js`)**
- All 4 built-in comparators (text, numeric, date, dBm)
- NaN/null edge cases
- Direction toggle on click
- aria-sort attribute correctness
- Visual indicator (▲/▼) presence and updates
- onSort callback
- domReorder: false behavior
- destroy() cleanup
- Custom comparator override
### Performance
Packets table sorting works at the data array level (single `Array.sort`
call), not DOM level. Virtual scroll then renders only visible rows. No
new DOM nodes are created during sort — it's purely a data reorder +
re-render of the existing visible window. Expected sort time for 30K
packets: ~50-100ms (array sort) + existing virtual scroll render time.
Closes#620 (M1)
Co-authored-by: you <you@example.com>
## Summary
Implements the `DISABLE_CADDY` environment variable in the Docker
entrypoint, fixing #629.
## Problem
The `DISABLE_CADDY` env var was documented but had no effect — the
entrypoint only handled `DISABLE_MOSQUITTO`.
## Changes
### New supervisord configs
- **`supervisord-go-no-caddy.conf`** — mosquitto + ingestor + server (no
Caddy)
- **`supervisord-go-no-mosquitto-no-caddy.conf`** — ingestor + server
only
### Updated entrypoint (`docker/entrypoint-go.sh`)
Handles all 4 combinations:
| DISABLE_MOSQUITTO | DISABLE_CADDY | Config used |
|---|---|---|
| false | false | `supervisord.conf` (default) |
| true | false | `supervisord-no-mosquitto.conf` |
| false | true | `supervisord-no-caddy.conf` |
| true | true | `supervisord-no-mosquitto-no-caddy.conf` |
### Dockerfiles
Added COPY lines for the new configs in both `Dockerfile` and
`Dockerfile.go`.
## Testing
```bash
# Verify correct config selection
docker run -e DISABLE_CADDY=true corescope
# Should log: [config] Caddy reverse proxy disabled (DISABLE_CADDY=true)
docker run -e DISABLE_CADDY=true -e DISABLE_MOSQUITTO=true corescope
# Should log both disabled messages
```
Fixes#629
Co-authored-by: you <you@example.com>
## Summary
Fixes critical and major mobile accessibility items from #630, focused
on small phone viewports (320px–375px).
### Critical fixes
1. **Touch targets ≥ 44px** — All interactive elements (filter buttons,
tab buttons, search inputs, nav buttons, region pills, dropdowns) get
`min-height: 44px; min-width: 44px` via `@media (pointer: coarse)` —
desktop/mouse users are unaffected.
2. **ARIA live regions** — Added `aria-live="polite"` to: packet list
(`#pktLeft`), node list (`#nodesLeft`), analytics content
(`#analyticsContent`), live feed (`#liveFeed` with `role="log"`). Screen
readers now announce dynamic content updates.
3. **Color-only status indicators** — Status dots in live view marked
`aria-hidden="true"` (text labels like "Online"/"Degraded"/"Offline"
already present alongside).
4. **Detail panel on mobile** — Side panel (`panel-right`) renders as a
full-screen fixed overlay on ≤640px. Close button (✕) added to nodes
detail panel. Escape key closes both nodes and packets detail panels.
### Major fixes
5. **Analytics tabs overflow** — Tabs switch to `flex-wrap: nowrap;
overflow-x: auto` on ≤640px, preventing overflow on 320px screens.
6. **Table horizontal scroll** — Added `.table-scroll-wrap` class and
`min-width: 480px` on `.data-table` at ≤640px for horizontal scrolling
when columns don't fit.
7. **SPA focus management** — On every page navigation, focus moves to
first heading (`h1`/`h2`/`h3`) or falls back to `#app`. Uses
`requestAnimationFrame` for correct DOM timing.
### Bonus
- Analytics tabs get `role="tablist"` + `aria-label` for screen reader
semantics.
### Known follow-ups (not blocking)
- Individual tab buttons should get `role="tab"` + `aria-selected` +
`aria-controls` for complete ARIA tab pattern.
- `sr-status-label` and `table-scroll-wrap` CSS classes are defined but
not yet used in JS — ready for future use when status text labels and
table wrappers are wired up.
Closes#630
Co-authored-by: you <you@example.com>
## Summary
Auto-generated OpenAPI 3.0.3 spec endpoint (`/api/spec`) and Swagger UI
(`/api/docs`) for the CoreScope API.
## What
- **`cmd/server/openapi.go`** — Route metadata map
(`routeDescriptions()`) + spec builder that walks the mux router to
generate a complete OpenAPI 3.0.3 spec at runtime. Includes:
- All 47 API endpoints grouped by tag (admin, analytics, channels,
config, nodes, observers, packets)
- Query parameter documentation for key endpoints (packets, nodes,
search, resolve-hops)
- Path parameter extraction from mux `{name}` patterns
- `ApiKeyAuth` security scheme for API-key-protected endpoints
- Swagger UI served as a self-contained HTML page using unpkg CDN
- **`cmd/server/openapi_test.go`** — Tests for spec endpoint (validates
JSON structure, required fields, path count, security schemes,
self-exclusion of `/api/spec` and `/api/docs`), Swagger UI endpoint, and
`extractPathParams` helper.
- **`cmd/server/routes.go`** — Stores router reference on `Server`
struct for spec generation; registers `/api/spec` and `/api/docs`
routes.
## Design Decisions
- **Runtime spec generation** vs static YAML: The spec walks the actual
router, so it can never drift from registered routes. Route metadata
(summaries, descriptions, tags, auth flags) is maintained in a parallel
map — the test enforces minimum path count to catch drift.
- **No external dependencies**: Uses only stdlib + existing gorilla/mux.
Swagger UI loaded from unpkg CDN (no vendored assets).
- **Security tagging**: Auth-protected endpoints (those behind
`requireAPIKey` middleware) are tagged with `security: [{ApiKeyAuth:
[]}]` in the spec, matching the actual middleware configuration.
## Testing
- `go test -run TestOpenAPI` — validates spec structure, field presence,
path count ≥ 20, security schemes
- `go test -run TestSwagger` — validates HTML response with swagger-ui
references
- `go test -run TestExtractPathParams` — unit tests for path parameter
extraction
---------
Co-authored-by: you <you@example.com>
## Zero-Config Defaults + Deployment Docs
Make CoreScope start with zero configuration — no `config.json`
required. The ingestor falls back to sensible defaults (local MQTT
broker, standard topics, default DB path) when no config file exists.
### What changed
**`cmd/ingestor/config.go`** — `LoadConfig` no longer errors on missing
config file. Instead it logs a message and uses defaults. If no MQTT
sources are configured (from file or env), defaults to
`mqtt://localhost:1883` with `meshcore/#` topic.
**`cmd/ingestor/main.go`** — Removed redundant "no MQTT sources" fatal
(now handled in config layer). Improved the "no connections established"
fatal with actionable hints.
**`README.md`** — Replaced "Docker (Recommended)" section with a
one-command quickstart using the pre-built image. No build step, no
config file, just `docker run`.
**`docs/deployment.md`** — New comprehensive deployment guide covering
Docker, Compose, config reference, MQTT setup, TLS/HTTPS, monitoring,
backup, and troubleshooting.
### Zero-config flow
```
docker run -d -p 80:80 -v corescope-data:/app/data ghcr.io/kpa-clawbot/corescope:latest
```
1. No config.json found → defaults used, log message printed
2. No MQTT sources → defaults to `mqtt://localhost:1883`
3. Internal Mosquitto broker already running in container → connection
succeeds
4. Dashboard shows empty, ready for packets
### Review fixes (commit 13b89bb)
- Removed `DISABLE_CADDY` references from all docs — this env var was
never implemented in the entrypoint
- Fixed `/api/stats` example in deployment guide — showed nonexistent
fields (`mqttConnected`, `uptimeSeconds`, `activeNodes`)
- Improved MQTT connection failure message with actionable
troubleshooting hints
Closes#610
---------
Co-authored-by: you <you@example.com>
## Summary
Mobile UX fixes for the channel color picker (addresses #619).
## Changes
### Commit 1: Mobile UX improvements
- **Bottom-sheet pattern on mobile**: Color picker renders as a fixed
bottom sheet on touch devices (`@media (pointer: coarse)`) with
`env(safe-area-inset-bottom)` for notched phones
- **40px touch targets**: Swatches enlarged from default to 40×40px on
mobile
- **Native color picker hidden on touch**: `<input type="color">` is
hidden on mobile — preset swatches only
- **Scroll lock**: `document.body.style.overflow = 'hidden'` while
popover is open, restored on close
- **CSS context menu suppression**: `-webkit-touch-callout: none` and
`user-select: none` on `.live-feed-item`
- **Long-press with `passive: true`**: touchstart listener is passive to
avoid scroll jank
### Commit 2: Remove preventDefault on touchstart
- Removed `e.preventDefault()` from the touchstart handler — it was
blocking scroll initiation on feed items
- Context menu suppression handled entirely via CSS (see above)
## Desktop behavior
Unchanged. All mobile-specific styles scoped under `@media (pointer:
coarse)`. Desktop positioning logic unchanged.
## Review Status
- ✅ Rebased onto master (no conflicts)
- ✅ Self-review complete — all checklist items verified
- ✅ Tufte analysis posted as comment
---------
Co-authored-by: you <you@example.com>
## Summary
Addresses user feedback on #600 — two improvements to RF Health detail
panel charts:
### 1. Auto-scale airtime Y-axis
Previously fixed 0-100% which made low-activity nodes unreadable (e.g.
0.1% TX barely visible). Now auto-scales to the actual data range with
20% headroom (minimum 1%), matching how the noise floor chart already
works.
### 2. Hover tooltips on all chart data points
Invisible SVG `<circle>` elements with native `<title>` tooltips on
every data point across all 4 charts:
- **Noise floor**: `NF: -112.3 dBm` + UTC timestamp
- **Airtime**: `TX: 2.1%` or `RX: 8.3%` + UTC timestamp
- **Error rate**: `Err: 0.05%` + UTC timestamp
- **Battery**: `Batt: 3.85V` + UTC timestamp
Uses native browser SVG tooltips — zero dependencies, accessible, no JS
event handlers.
### Design rationale (Tufte)
- Auto-scaling increases data-ink ratio by eliminating wasted vertical
space
- Tooltips provide detail-on-demand without cluttering the chart with
labels on every point
### Spec update
Added M2 feedback improvements section to
`docs/specs/rf-health-dashboard.md`.
---------
Co-authored-by: you <you@example.com>
## Summary
Documents the lock ordering for all five mutexes in `PacketStore`
(`store.go`) to prevent future deadlocks.
## What changed
Added a comment block above the `PacketStore` struct documenting:
- All 5 mutexes (`mu`, `cacheMu`, `channelsCacheMu`, `groupedCacheMu`,
`regionObsMu`)
- What each mutex guards
- The required acquisition order (numbered 1–5)
- The nesting relationships that exist today (`cacheMu →
channelsCacheMu` in `invalidateCachesFor` and `rebuildAnalyticsCaches`)
- Confirmation that no reverse ordering exists (no deadlock risk)
## Verification
- Grepped all lock acquisition sites to confirm no reverse nesting
exists
- `go build ./...` passes — documentation-only change
Fixes#413
---------
Co-authored-by: you <you@example.com>
## Summary
Replaces hardcoded `VSCROLL_ROW_HEIGHT = 36` and `theadHeight = 40` in
the virtual scroll logic with dynamic DOM measurement, so the values
stay correct if CSS changes.
## Changes
- `VSCROLL_ROW_HEIGHT`: measured once from the first rendered data row's
`offsetHeight` after the initial full rebuild. Falls back to 36px until
measurement occurs.
- `theadHeight`: measured from the actual `<thead>` element's
`offsetHeight` on every `renderVisibleRows` call. Falls back to 40px if
no thead is found.
- Both variables are now `let` instead of `const` to allow runtime
updates.
## Performance
No performance impact — both measurements are single `offsetHeight`
reads (no reflow triggered since the DOM was just written). Row height
measurement runs only once (guarded by `_vscrollRowHeightMeasured`
flag). Thead measurement is a single property read per scroll event.
Fixes#407
Co-authored-by: you <you@example.com>
## Summary
Fixes#420 — wires `cacheTTL` config values to server-side cache
durations that were previously hardcoded.
## Problem
`collisionCacheTTL` was hardcoded at 60s in `store.go`. The config has
`cacheTTL.analyticsHashSizes: 3600` (1 hour) but it was never read — the
`/api/config/cache` endpoint just passed the raw map to the client
without applying values server-side.
## Changes
- **`store.go`**: Add `cacheTTLSec()` helper to safely extract duration
values from the `cacheTTL` config map. `NewPacketStore` now accepts an
optional `cacheTTL` map (variadic, backward-compatible) and wires:
- `cacheTTL.analyticsHashSizes` → `collisionCacheTTL`
- `cacheTTL.analyticsRF` → `rfCacheTTL`
- **Default changed**: `collisionCacheTTL` default raised from 60s →
3600s (1 hour). Hash collision computation is expensive and data changes
rarely — 60s was causing unnecessary recomputation.
- **`main.go`**: Pass `cfg.CacheTTL` to `NewPacketStore`.
- **Tests**: Added `TestCacheTTLFromConfig` and `TestCacheTTLDefaults`
in eviction_test.go. Updated existing `TestHashCollisionsCacheTTL` for
the new default.
## Audit of other cacheTTL values
The remaining `cacheTTL` keys (`stats`, `nodeDetail`, `nodeHealth`,
`nodeList`, `bulkHealth`, `networkStatus`, `observers`, `channels`,
`channelMessages`, `analyticsTopology`, `analyticsChannels`,
`analyticsSubpaths`, `analyticsSubpathDetail`, `nodeAnalytics`,
`nodeSearch`, `invalidationDebounce`) are **client-side only** — served
via `/api/config/cache` and consumed by the frontend. They don't have
corresponding server-side caches to wire to. The only server-side caches
(`rfCache`, `topoCache`, `hashCache`, `chanCache`, `distCache`,
`subpathCache`, `collisionCache`) all use either `rfCacheTTL` or
`collisionCacheTTL`, both now configurable.
## Complexity
O(1) config lookup at store init time. No hot-path impact.
Co-authored-by: you <you@example.com>
Closes#616
## What
Adds a **Distance** column to the neighbor table on the node detail
page.
When both the viewed node and a neighbor have GPS coordinates recorded,
the table shows the haversine distance between them (e.g. `3.2 km`).
When either node lacks GPS, the cell shows `—`.
## Changes
**Backend** (`cmd/server/neighbor_api.go`):
- Added `distance_km *float64` (omitempty) to `NeighborEntry`
- In `handleNodeNeighbors`: look up source node coords from `nodeMap`,
then for each resolved (non-ambiguous) neighbor with GPS, compute
`haversineKm` and set the field
**Frontend** (`public/nodes.js`):
- Added `Distance` column header between Last Seen and Conf
- Cell renders `X.X km` or `—` (muted) when unavailable
**Tests** (`cmd/server/neighbor_api_test.go`):
- `TestNeighborAPI_DistanceKm_WithGPS`: two nodes with real coords →
`distance_km` is positive
- `TestNeighborAPI_DistanceKm_NoGPS`: two nodes at 0,0 → `distance_km`
is nil
## Verification
Test at **https://staging.on8ar.eu** — navigate to any node detail page
and scroll to the Neighbors section. Nodes with GPS coordinates show a
distance; those without show `—`.
## Summary
Adds two config knobs for controlling backfill scope and neighbor graph
data retention, plus removes the dead synchronous backfill function.
## Changes
### Config knobs
#### `resolvedPath.backfillHours` (default: 24)
Controls how far back (in hours) the async backfill scans for
observations with NULL `resolved_path`. Transmissions with `first_seen`
older than this window are skipped, reducing startup time for instances
with large historical datasets.
#### `neighborGraph.maxAgeDays` (default: 30)
Controls the maximum age of `neighbor_edges` entries. Edges with
`last_seen` older than this are pruned from both SQLite and the
in-memory graph. Pruning runs on startup (after a 4-minute stagger) and
every 24 hours thereafter.
### Dead code removal
- Removed the synchronous `backfillResolvedPaths` function that was
replaced by the async version.
### Implementation details
- `backfillResolvedPathsAsync` now accepts a `backfillHours` parameter
and filters by `tx.FirstSeen`
- `NeighborGraph.PruneOlderThan(cutoff)` removes stale edges from the
in-memory graph
- `PruneNeighborEdges(conn, graph, maxAgeDays)` prunes both DB and
in-memory graph
- Periodic pruning ticker follows the same pattern as metrics pruning
(24h interval, staggered start)
- Graceful shutdown stops the edge prune ticker
### Config example
Both knobs added to `config.example.json` with `_comment` fields.
## Tests
- Config default/override tests for both knobs
- `TestGraphPruneOlderThan` — in-memory edge pruning
- `TestPruneNeighborEdgesDB` — SQLite + in-memory pruning together
- `TestBackfillRespectsHourWindow` — verifies old transmissions are
excluded by backfill window
---------
Co-authored-by: you <you@example.com>
## Summary
Implements M2 of channel color highlighting (#271): a right-click
context menu popover for quick-assigning colors to hash channels.
Builds on M1 (PR #607) which provides `ChannelColors.set/get/remove`
storage primitives.
## What's new
### Color picker popover (`channel-color-picker.js`)
- **Right-click** any GRP_TXT/CHAN row in the **live feed** or **packets
table** → opens a color picker popover at the click point
- **Long-press** (500ms) on mobile triggers the same popover
- **10 preset swatches** — maximally distinct, ColorBrewer-inspired
palette
- **Custom hex** — native `<input type="color">` with Apply button
- **Clear button** — removes color assignment (hidden when no color
assigned)
- **Popover positioning** — auto-adjusts to avoid viewport overflow
- **Dismiss** — click outside or Escape key
### Immediate feedback
- Assigning a color instantly re-styles all visible live feed items with
that channel
- Packets table triggers `renderVisibleRows()` via exposed
`window._packetsRenderVisible`
### Wiring
- Feed items store `_ccPkt` packet reference for channel extraction
- Picker installed via `registerPage` init hooks in both `live.js` and
`packets.js`
- Single shared popover DOM element, repositioned on each open
### Styling
- Dark card with border, matching existing CoreScope dropdown patterns
- CSS in `style.css` under `.cc-picker-*` classes
- Uses CSS variables (`--surface-1`, `--border`, `--accent`, etc.) for
theme compatibility
## Files changed
| File | Change |
|------|--------|
| `public/channel-color-picker.js` | New — popover component (IIFE, no
dependencies except `ChannelColors`) |
| `public/index.html` | Script tag for picker |
| `public/live.js` | Store `_ccPkt` on feed items, install picker on
init |
| `public/packets.js` | Install picker on init, expose
`_packetsRenderVisible` |
| `public/style.css` | Popover CSS |
| `test-channel-colors.js` | 2 new tests for picker loading and graceful
degradation |
## Testing
- All 21 channel-colors tests pass (19 M1 + 2 M2)
- All 445 frontend-helpers tests pass
- All 62 packet-filter tests pass
## Performance
No hot-path impact. The popover is a single shared DOM element created
lazily on first use. Context menu handlers use event delegation on the
feed/table containers (one listener each, not per-row). The
`refreshVisibleRows` function only iterates currently-visible DOM
elements.
Closes milestone M2 of #271.
---------
Co-authored-by: you <you@example.com>
## Summary
Implements M1 of the [channel color highlighting
spec](docs/specs/channel-color-highlighting.md) for issue #271.
Allows users to assign custom highlight colors to specific hash
channels. When a `GRP_TXT` packet arrives with an assigned channel
color, the feed row and packets table row get:
- **4px colored left border** in the assigned color
- **Subtle background tint** (color at 10% opacity)
## What's included
### `public/channel-colors.js` — Storage model
- `ChannelColors.get(channel)` → hex color or null
- `ChannelColors.set(channel, color)` — assign a color
- `ChannelColors.remove(channel)` — clear assignment
- `ChannelColors.getAll()` → all assignments
- `ChannelColors.getRowStyle(typeName, channel)` → inline CSS string for
row highlighting
- Uses `localStorage` key `live-channel-colors`
- Gracefully handles corrupt/missing localStorage data
### Feed row highlighting (`public/live.js`)
- Both `addFeedItem` (live WS) and `addFeedItemDOM` (replay/DB load)
apply channel color styles
- Reads `decoded.payload.channelName` from the packet
### Packets table highlighting (`public/packets.js`)
- `buildFlatRowHtml` and `buildGroupRowHtml` apply channel color styles
to `<tr>` elements
- Reads channel from `getParsedDecoded(p).channel`
### Tests (`test-channel-colors.js`)
- 16 unit tests covering storage CRUD, edge cases (null, empty, corrupt
data), and style generation
- Tests verify only GRP_TXT/CHAN types get coloring, other types are
unaffected
## Design decisions
- **Only GRP_TXT/CHAN packets** — other types retain default
`TYPE_COLORS` styling
- **Channel color takes priority** over default type colors for row
highlighting
- **No UI for assigning colors yet** — that's M2 (right-click context
menu + color picker)
- **Storage key abstracted** behind functions to ease future migration
if customizer rework (#288) lands
- **10% opacity tint** (`#hexcolor` + `1a` suffix) ensures readability
in both dark/light modes
## Performance
- `getRowStyle()` is O(1) — single localStorage read + JSON parse per
call
- No per-packet API calls; all data is client-side
- No impact on hot rendering paths beyond one localStorage read per row
render
Closes#271 (M1 only — further milestones in separate PRs)
---------
Co-authored-by: you <you@example.com>
## Summary
Adds collapsible/minimizable UI panels on the live map page so overlay
panels don't block map content on medium-sized screens.
Fixes#279
## Changes
### Collapsible Legend Panel (all screen sizes)
- The legend toggle button (🎨/✕) is now visible at **all** screen sizes,
not just mobile
- Clicking it smoothly collapses/expands the legend with a CSS
transition
- Collapsed state persists in `localStorage` (`live-legend-hidden`)
- Feed panel already had hide/show with localStorage — no changes needed
there
### Medium Breakpoint (768px)
New `@media (max-width: 768px)` rules for tablet/small laptop screens:
- Feed panel: 360px → 280px wide, max-height 340px → 200px
- Node detail panel: 320px → 260px wide
- Legend: smaller font (10px) and tighter padding
- Header: reduced gap and padding
- Stats/toggles: smaller font sizes
### What's NOT changed
- Mobile (≤640px): existing behavior preserved (feed/legend hidden
entirely)
- Desktop (>768px): no changes — panels render at full size as before
## Testing
- `test-packet-filter.js`: 62 passed
- `test-aging.js`: 29 passed
- `test-frontend-helpers.js`: 445 passed
---------
Co-authored-by: you <you@example.com>
The button click handler used document.getElementById() which fails on
/packet/[ID] pages because renderDetail() runs before the container is
appended to the DOM. Changed to panel.querySelector() which searches
within the detached element tree.
Fixes#601
## M2: Airtime + Channel Quality + Battery Charts
Implements M2 of #600 — server-side delta computation and three new
charts in the RF Health detail view.
### Backend Changes
**Delta computation** for cumulative counters (`tx_air_secs`,
`rx_air_secs`, `recv_errors`):
- Computes per-interval deltas between consecutive samples
- **Reboot handling:** detects counter reset (current < previous), skips
that delta, records reboot timestamp
- **Gap handling:** if time between samples > 2× interval, inserts null
(no interpolation)
- Returns `tx_airtime_pct` and `rx_airtime_pct` as percentages
(delta_secs / interval_secs × 100)
- Returns `recv_error_rate` as delta_errors / (delta_recv +
delta_errors) × 100
**`resolution` query param** on `/api/observers/{id}/metrics`:
- `5m` (default) — raw samples
- `1h` — hourly aggregates (GROUP BY hour with AVG/MAX)
- `1d` — daily aggregates
**Schema additions:**
- `packets_sent` and `packets_recv` columns added to `observer_metrics`
(migration)
- Ingestor parses these fields from MQTT stats messages
**API response** now includes:
- `tx_airtime_pct`, `rx_airtime_pct`, `recv_error_rate` (computed
deltas)
- `reboots` array with timestamps of detected reboots
- `is_reboot_sample` flag on affected samples
### Frontend Changes
Three new charts in the RF Health detail view, stacked vertically below
noise floor:
1. **Airtime chart** — TX (red) + RX (blue) as separate SVG lines,
Y-axis 0-100%, direct labels at endpoints
2. **Error Rate chart** — `recv_error_rate` line, shown only when data
exists
3. **Battery chart** — voltage line with 3.3V low reference, shown only
when battery_mv > 0
All charts:
- Share X-axis and time range (aligned vertically)
- Reboot markers as vertical hairlines spanning all charts
- Direct labels on data (no legends)
- Resolution auto-selected: `1h` for 7d/30d ranges
- Charts hidden when no data exists
### Tests
- `TestComputeDeltas`: normal deltas, reboot detection, gap detection
- `TestGetObserverMetricsResolution`: 5m/1h/1d downsampling verification
- Updated `TestGetObserverMetrics` for new API signature
---------
Co-authored-by: you <you@example.com>
- Change RF Health detail view from bottom-of-page to a right-sliding side panel
- Grid stays visible and stable when detail is open (no layout shift)
- Click another observer updates panel in place; close button (×) dismisses
- On mobile (<640px): panel stacks below grid at full width
- Filter out observers with insufficient data (<2 sparkline points) from grid entirely
- Follows the same split-layout pattern used by the nodes page
## RF Health Dashboard — M1: Observer Metrics Storage, API & Small
Multiples Grid
Implements M1 of #600.
### What this does
Adds a complete RF health monitoring pipeline: MQTT stats ingestion →
SQLite storage → REST API → interactive dashboard with small multiples
grid.
### Backend Changes
**Ingestor (`cmd/ingestor/`)**
- New `observer_metrics` table via migration system (`_migrations`
pattern)
- Parse `tx_air_secs`, `rx_air_secs`, `recv_errors` from MQTT status
messages (same pattern as existing `noise_floor` and `battery_mv`)
- `INSERT OR REPLACE` with timestamps rounded to nearest 5-min interval
boundary (using ingestor wall clock, not observer timestamps)
- Missing fields stored as NULLs — partial data is always better than no
data
- Configurable retention pruning: `retention.metricsDays` (default 30),
runs on startup + every 24h
**Server (`cmd/server/`)**
- `GET /api/observers/{id}/metrics?since=...&until=...` — per-observer
time-series data
- `GET /api/observers/metrics/summary?window=24h` — fleet summary with
current NF, avg/max NF, sample count
- `parseWindowDuration()` supports `1h`, `24h`, `3d`, `7d`, `30d` etc.
- Server-side metrics retention pruning (same config, staggered 2min
after packet prune)
### Frontend Changes
**RF Health tab (`public/analytics.js`, `public/style.css`)**
- Small multiples grid showing all observers simultaneously — anomalies
pop out visually
- Per-observer cell: name, current NF value, battery voltage, sparkline,
avg/max stats
- NF status coloring: warning (amber) at ≥-100 dBm, critical (red) at
≥-85 dBm — text color only, no background fills
- Click any cell → expanded detail view with full noise floor line chart
- Reference lines with direct text labels (`-100 warning`, `-85
critical`) — not color bands
- Min/max points labeled directly on the chart
- Time range selector: preset buttons (1h/3h/6h/12h/24h/3d/7d/30d) +
custom from/to datetime picker
- Deep linking: `#/analytics?tab=rf-health&observer=...&range=...`
- All charts use SVG, matching existing analytics.js patterns
- Responsive: 3-4 columns on desktop, 1 on mobile
### Design Decisions (from spec)
- Labels directly on data, not in legends
- Reference lines with text labels, not color bands
- Small multiples grid, not card+accordion (Tufte: instant visual fleet
comparison)
- Ingestor wall clock for all timestamps (observer clocks may drift)
### Tests Added
**Ingestor tests:**
- `TestRoundToInterval` — 5 cases for rounding to 5-min boundaries
- `TestInsertMetrics` — basic insertion with all fields
- `TestInsertMetricsIdempotent` — INSERT OR REPLACE deduplication
- `TestInsertMetricsNullFields` — partial data with NULLs
- `TestPruneOldMetrics` — retention pruning
- `TestExtractObserverMetaNewFields` — parsing tx_air_secs, rx_air_secs,
recv_errors
**Server tests:**
- `TestGetObserverMetrics` — time-series query with since/until filters,
NULL handling
- `TestGetMetricsSummary` — fleet summary aggregation
- `TestObserverMetricsAPIEndpoints` — DB query verification
- `TestMetricsAPIEndpoints` — HTTP endpoint response shape
- `TestParseWindowDuration` — duration parsing for h/d formats
### Test Results
```
cd cmd/ingestor && go test ./... → PASS (26s)
cd cmd/server && go test ./... → PASS (5s)
```
### What's NOT in this PR (deferred to M2+)
- Server-side delta computation for cumulative counters
- Airtime charts (TX/RX percentage lines)
- Channel quality chart (recv_error_rate)
- Battery voltage chart
- Reboot detection and chart annotations
- Resolution downsampling (1h, 1d aggregates)
- Pattern detection / automated diagnosis
---------
Co-authored-by: you <you@example.com>
## Summary
- Adds a new **Prefix Tool** tab to the Analytics page (alongside Hash
Stats / Hash Issues)
- **Network Overview**: per-tier collision stats (1/2/3-byte) and a
network-size-based recommendation — collapsible, folded by default
- **Prefix Checker**: accepts a 1/2/3-byte hex prefix or full public
key; shows colliding nodes at each tier with severity badges (✅ / ⚠️ /
🔴); clicking a node navigates to its detail page
- **Prefix Generator**: picks a random collision-free prefix at the
chosen hash size; links to
[meshcore-web-keygen](https://agessaman.github.io/meshcore-web-keygen/)
with the prefix pre-filled
- **Hash Issues tab**: adds a "🔎 Check a prefix →" shortcut in the nav
- **Deep-link support**: `#/analytics?tab=prefix-tool&prefix=A3F1`
pre-fills and runs the checker; `?generate=2` pre-selects and runs the
generator
- **No new API endpoints** — 100% client-side using the existing
`/nodes` list
## Verification
Live on staging:
**https://staging.on8ar.eu/#/analytics?tab=prefix-tool**
## Test plan
- [x] Network Overview card is collapsed by default; expands on click;
stats are correct
- [x] Prefix Checker: 2-char input shows 1-byte results; 4-char shows
2-byte; 6-char shows 3-byte; 64-char pubkey shows all three tiers
- [x] Prefix Checker: invalid hex shows error; odd-length input shows
error
- [x] Prefix Generator: Generate picks an unused prefix; "Try another"
cycles; keygen link opens with prefix pre-filled
- [x] Deep link `?prefix=A3F1` pre-fills checker and scrolls to it
- [x] Deep link `?generate=2` pre-selects 2-byte and runs generator
- [x] Hash Issues tab shows "🔎 Check a prefix →" in the nav
- [x] FAQ link at bottom of generator opens correct MeshCore docs anchor
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
The error-state `<tbody>` row (shown when packet loading fails)
hardcoded `colspan="10"`, while the virtual scroll spacers and the
empty-state row both use `_getColCount()` (which reads from the actual
`<thead>` and falls back to 11). One-line fix: replace the hardcoded
value with `_getColCount()`.
Fixes#406
## Test plan
- [x] Trigger the error state (e.g. kill the backend mid-load) — error
row should span all columns with no gap on the right
- [x] `node test-packets.js` — 72 passed, 0 failed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Replace full \`tbody\` teardown+rebuild on every scroll frame with a
range-diff that only adds/removes the delta rows at the edges of the
visible window
- \`buildFlatRowHtml\` / \`buildGroupRowHtml\` now accept an
\`entryIdx\` parameter and emit \`data-entry-idx\` on every \`<tr>\` so
the diff can target rows precisely (including expanded group children)
- Full rebuild is retained for initial render and large scroll jumps
past the buffer (no range overlap)
- Also loads \`packet-helpers.js\` in the test sandbox, fixing 7
pre-existing test failures for the builder functions; adds 4 new tests
covering \`data-entry-idx\` output
Fixes#414
## Test plan
- [x] Open packets page with 500+ packets, scroll rapidly — DOM
inspector should show incremental \`<tr>\` adds/removes rather than full
\`tbody\` teardown
- [x] Expand a grouped packet, scroll away and back — expanded children
re-render correctly
- [x] Large scroll jump (jump to bottom via scrollbar) — full rebuild
fires, no visual glitch
- [x] \`node test-packets.js\` — 72 passed, 0 failed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
Weak passphrases with no KDF stretching are the #1 practical threat.
Timestamp in plaintext block 0 serves as known-plaintext oracle for
instant key verification from a single captured packet.
Key findings:
- decode_base64() output used directly as AES key, no KDF
- Short passphrases produce <16 byte keys (reduced key space)
- No salt means global precomputed attacks work
- 3-word passphrase crackable in ~2 min on commodity GPU
Reviewed by djb and Dijkstra personas. Corrections applied:
- GPU throughput upgraded from 10^9 to 10^10 AES/sec baseline
- Oracle strengthened: bytes 4+ (type byte, sender name) also predictable
- Dictionary size assumptions made explicit
- Zipf's law caveat added (humans don't choose uniformly)
- base64 short-passphrase key truncation issue documented
Formal analysis of MeshCore's ECB encryption for channel and direct messages.
Reviewed by djb and Dijkstra expert personas through 3 revisions.
Key findings:
- Block 0 has accidental nonce (4-byte timestamp) preventing repetition
- Blocks 1+ are pure deterministic ECB with no nonce — vulnerable to
frequency analysis for repeated message content
- Partial final block attack: zero-padding reduces search space
- HMAC key reuse: AES key is first 16 bytes of HMAC key (same material)
- Recommended fix: switch to AES-128-CTR mode
## Summary
`txToMap()` previously always allocated observation sub-maps for every
packet, even though the `/api/packets` handler immediately stripped them
via `delete(p, "observations")` unless `expand=observations` was
requested. A typical page of 50 packets with ~5 observations each caused
300+ unnecessary map allocations per request.
## Changes
- **`txToMap`**: Add variadic `includeObservations bool` parameter.
Observations are only built when `true` is passed, eliminating
allocations when they'd just be discarded.
- **`PacketQuery`**: Add `ExpandObservations bool` field to thread the
caller's intent through the query pipeline.
- **`routes.go`**: Set `ExpandObservations` based on
`expand=observations` query param. Removed the post-hoc `delete(p,
"observations")` loop — observations are simply never created when not
requested.
- **Single-packet lookups** (`GetPacketByID`, `GetPacketByHash`): Always
pass `true` since detail views need observations.
- **Multi-node/analytics queries**: Default (no flag) = no observations,
matching prior behavior.
## Testing
- Added `TestTxToMapLazyObservations` covering all three cases: no flag,
`false`, and `true`.
- All existing tests pass (`go test ./...`).
## Perf Impact
Eliminates ~250 observation map allocations per /api/packets request (at
default page size of 50 with ~5 observations each). This is a
constant-factor improvement per request — no algorithmic complexity
change.
Fixes#374
Co-authored-by: you <you@example.com>
## Summary
Optimizes `QueryGroupedPackets()` in `store.go` to eliminate two major
inefficiencies on every grouped packet list request:
### Changes
1. **Cache `UniqueObserverCount` on `StoreTx`** — Instead of iterating
all observations to count unique observers on every query
(O(total_observations) per request), we now track unique observers at
ingest time via an `observerSet` map and pre-computed
`UniqueObserverCount` field. This is updated incrementally as
observations arrive.
2. **Defer map construction until after pagination** — Previously,
`map[string]interface{}` was built for ALL 30K+ filtered results before
sorting and paginating. Now the grouped cache stores sorted `[]*StoreTx`
pointers (lightweight), and `groupedTxsToPage()` builds maps only for
the requested page (typically 50 items). This eliminates ~30K map
allocations per cache miss.
3. **Lighter cache footprint** — The grouped cache now stores
`[]*StoreTx` instead of `*PacketResult` with pre-built maps, reducing
memory pressure and GC work.
### Complexity
- Observer counting: O(1) per query (was O(total_observations))
- Map construction: O(page_size) per query (was O(n) where n = all
filtered results)
- Sort remains O(n log n) on cache miss, but the cache (3s TTL) absorbs
repeated requests
### Testing
- `cd cmd/server && go test ./...` — all tests pass
- `cd cmd/ingestor && go build ./...` — builds clean
Fixes#370
---------
Co-authored-by: you <you@example.com>
## Summary
Replace `time.Tick()` with `time.NewTicker()` in the auto-prune
goroutine so it stops cleanly during graceful shutdown.
## Problem
`time.Tick` creates a ticker that can never be garbage collected or
stopped. While the prune goroutine runs for the process lifetime, it
won't stop during graceful shutdown — the goroutine leaks past the
shutdown sequence.
## Fix
- Create a `time.NewTicker` and a done channel
- Use `select` to listen on both the ticker and done channel
- Stop the ticker and close the done channel in the shutdown path (after
`poller.Stop()`)
- Pattern matches the existing `StartEvictionTicker()` approach
## Testing
- `go build ./...` — compiles cleanly
- `go test ./...` — all tests pass
Fixes#377
Co-authored-by: you <you@example.com>
## Summary
Combines the chained `filterTxSlice` calls in `filterPackets()` into a
single pass over the packet slice.
## Problem
When multiple filter parameters are specified (e.g.,
`type=4&route=1&since=...&until=...`), each filter created a new
intermediate `[]*StoreTx` slice. With N filters, this meant N separate
scans and N-1 unnecessary allocations.
## Fix
All filter predicates (type, route, observer, hash, since, until,
region, node) are pre-computed before the loop, then evaluated in a
single `filterTxSlice` call. This eliminates all intermediate
allocations.
**Preserved behavior:**
- Fast-path index lookups for hash-only and observer-only queries remain
unchanged
- Node-only fast-path via `byNode` index preserved
- All existing filter semantics maintained (same comparison operators,
same null checks)
**Complexity:** Single `O(n)` pass regardless of how many filters are
active, vs previous `O(n * k)` where k = number of active filters (each
pass is O(n) but allocates).
## Testing
All existing tests pass (`cd cmd/server && go test ./...`).
Fixes#373
Co-authored-by: you <you@example.com>
## Summary
Sort `snrVals` and `rssiVals` once upfront in `computeAnalyticsRF()` and
read min/max/median directly from the sorted slices, instead of copying
and sorting per stat call.
## Changes
- Sort both slices once before computing stats (2 sorts total instead of
4+ copy+sorts)
- Read `min` from `sorted[0]`, `max` from `sorted[len-1]`, `median` from
`sorted[len/2]`
- Remove the now-unused `sortedF64` and `medianF64` helper closures
## Performance impact
With 100K+ observations, this eliminates multiple O(n log n) copy+sort
operations. Previously each call to `medianF64` did a full copy + sort,
and `minF64`/`maxF64` did O(n) scans on the unsorted array. Now: 2
in-place sorts total, O(1) lookups for min/max/median.
Fixes#366
Co-authored-by: you <you@example.com>
## Summary
`EvictStale()` was doing O(n) linear scans per evicted item to remove
from secondary indexes (`byObserver`, `byPayloadType`, `byNode`).
Evicting 1000 packets from an observer with 50K observations meant 1000
× 50K = 50M comparisons — all under a write lock.
## Fix
Replace per-item removal with batch single-pass filtering:
1. **Collect phase**: Walk evicted packets once, building sets of
evicted tx IDs, observation IDs, and affected index keys
2. **Filter phase**: For each affected index slice, do a single pass
keeping only non-evicted entries
**Before**: O(evicted_count × index_slice_size) per index — quadratic in
practice
**After**: O(evicted_count + index_slice_size) per affected key — linear
## Changes
- `cmd/server/store.go`: Restructured `EvictStale()` eviction loop into
collect + batch-filter pattern
## Testing
- All existing tests pass (`cd cmd/server && go test ./...`)
Fixes#368
Co-authored-by: you <you@example.com>
## Summary
`QueryMultiNodePackets()` was scanning ALL packets with
`strings.Contains` on JSON blobs — O(packets × pubkeys × json_length).
With 30K+ packets and multiple pubkeys, this caused noticeable latency
on `/api/packets?nodes=...`.
## Fix
Replace the full scan with lookups into the existing `byNode` index,
which already maps pubkeys to their transmissions. Merge results with
hash-based deduplication, then apply time filters.
**Before:** O(N × P × J) where N=all packets, P=pubkeys, J=avg JSON
length
**After:** O(M × P) where M=packets per pubkey (typically small), plus
O(R log R) sort for pagination correctness
Results are sorted by `FirstSeen` after merging to maintain the
oldest-first ordering expected by the pagination logic.
Fixes#357
Co-authored-by: you <you@example.com>
## Problem
`GetNodeAnalytics()` in `store.go` scans ALL 30K+ packets doing
`strings.Contains` on every JSON blob when the node has a name, then
filters by time range *after* the full scan. This is `O(packets ×
json_length)` on every `/api/nodes/{pubkey}/analytics` request.
## Fix
Move the `fromISO` time check inside the scan loop so old packets are
skipped **before** the expensive `strings.Contains` matching. For the
non-name path (indexed-only), the time filter is also applied inline,
eliminating the separate `allPkts` intermediate slice.
### Before
1. Scan all packets → collect matches (including old ones) → `allPkts`
2. Filter `allPkts` by time → `packets`
### After
1. Scan packets, skip `tx.FirstSeen <= fromISO` immediately → `packets`
This avoids `strings.Contains` calls on packets outside the requested
time window (typically 7 days out of months of data).
## Complexity
- **Before:** `O(total_packets × avg_json_length)` for name matching
- **After:** `O(recent_packets × avg_json_length)` — only packets within
the time window are string-matched
## Testing
- `cd cmd/server && go test ./...` — all tests pass
Fixes#367
Co-authored-by: you <you@example.com>
## Summary
Consolidates the 4 parallel `/api/analytics/subpaths` calls in the Route
Patterns tab into a single `/api/analytics/subpaths-bulk` endpoint,
eliminating 3 redundant server-side scans of the subpath index on cache
miss.
## Changes
### Backend (`cmd/server/routes.go`, `cmd/server/store.go`)
- New `GET
/api/analytics/subpaths-bulk?groups=2-2:50,3-3:30,4-4:20,5-8:15`
endpoint
- Groups format: `minLen-maxLen:limit` comma-separated
- `GetAnalyticsSubpathsBulk()` iterates `spIndex` once, bucketing
entries into per-group accumulators by hop length
- Hop name resolution is done once per raw hop and shared across groups
- Results are cached per-group for compatibility with existing
single-key cache lookups
- Region-filtered queries fall back to individual
`GetAnalyticsSubpaths()` calls (region filtering requires
per-transmission observer checks)
### Frontend (`public/analytics.js`)
- `renderSubpaths()` now makes 1 API call instead of 4
- Response shape: `{ results: [{ subpaths, totalPaths }, ...] }` —
destructured into the same `[d2, d3, d4, d5]` variables
### Tests (`cmd/server/routes_test.go`)
- `TestAnalyticsSubpathsBulk`: validates 3-group response shape, missing
params error, invalid format error
## Performance
- **Before:** 4 API calls → 4 scans of `spIndex` + 4× hop resolution on
cache miss
- **After:** 1 API call → 1 scan of `spIndex` + 1× hop resolution
(shared cache)
- Cache miss cost reduced by ~75% for this tab
- No change on cache hit (individual group caching still works)
Fixes#398
Co-authored-by: you <you@example.com>
## Summary
Fixes the N+1 API call pattern when changing observation sort mode on
the packets page. Previously, switching sort to Path or Time fired
individual `/api/packets/{hash}` requests for **every**
multi-observation group without cached children — potentially 100+
concurrent requests.
## Changes
### Backend: Batch observations endpoint
- **New endpoint:** `POST /api/packets/observations` accepts `{"hashes":
["h1", "h2", ...]}` and returns all observations keyed by hash in a
single response
- Capped at 200 hashes per request to prevent abuse
- 4 test cases covering empty input, invalid JSON, too-many-hashes, and
valid requests
### Frontend: Use batch endpoint
- `packets.js` sort change handler now collects all hashes needing
observation data and sends a single POST request instead of N individual
GETs
- Same behavior, single round-trip
## Performance
- **Before:** Changing sort with 100 visible groups → 100 concurrent API
requests, browser connection queueing (6 per host), several seconds of
lag
- **After:** Single POST request regardless of group count, response
time proportional to store lookup (sub-millisecond per hash in memory)
Fixes#389
---------
Co-authored-by: you <you@example.com>
## Summary
Coalesce WS-triggered `renderTableRows()` calls using
`requestAnimationFrame` instead of `setTimeout` debouncing.
Fixes#396
## Problem
During high WebSocket throughput, multiple WS batches could each trigger
a `renderTableRows()` call via `setTimeout(..., 200)`. With rapid
batches, this caused the 50K-row table to be fully rebuilt every few
hundred milliseconds, causing UI jank.
## Solution
Replace the `setTimeout`-based debounce with a `requestAnimationFrame`
coalescing pattern:
1. **`scheduleWSRender()`** — sets a dirty flag and schedules a single
rAF callback
2. **Dirty flag** — multiple WS batches within the same frame just set
the flag; only one render fires
3. **Cleanup** — `destroy()` cancels any pending rAF and resets the
dirty flag
This ensures at most **one `renderTableRows()` per animation frame**
(~16ms), regardless of how many WS batches arrive.
## Performance justification
- **Before:** Each WS batch → `setTimeout(renderTableRows, 200)` — N
batches in <200ms = N renders
- **After:** N batches in one frame → 1 render on next rAF (~16ms)
- Worst case goes from O(N) renders per second to O(60) renders per
second (frame-capped)
## Changes
- `public/packets.js`: Add `scheduleWSRender()` with rAF + dirty flag;
replace setTimeout in WS handler; clean up in `destroy()`
- `test-frontend-helpers.js`: Update tests to verify rAF coalescing
pattern instead of setTimeout debounce
## Testing
- All existing tests pass (`npm test` — 0 failures)
- Updated 2 test cases to verify new rAF coalescing behavior
Co-authored-by: you <you@example.com>
## Summary
Compress `public/og-image.png` from **1,159,050 bytes (1.1MB)** to
**234,899 bytes (235KB)** — an **80% reduction**.
## What Changed
- Applied lossy PNG quantization via `pngquant` (quality 45-65, speed 1)
- Image dimensions unchanged: 1200×630px (standard OG image size)
- Visual quality remains suitable for social media previews
## Why
A 1.1MB OpenGraph image is excessive. Typical OG images are 50-200KB.
This reduces deployment size and Git repo bloat without affecting
functionality (browsers don't preload OG images).
## Testing
- Unit tests pass (`npm run test:unit`)
- No code changes — image-only commit
- `index.html` reference unchanged (`<meta property="og:image"
content="/og-image.png">`)
Fixes#397
Co-authored-by: you <you@example.com>
## Summary
Reduces the analytics nodes tab from 3 parallel API calls to 2 by
computing network status (active/degraded/silent counts) client-side
instead of fetching from `/nodes/network-status`.
## What Changed
**`public/analytics.js` — `renderNodesTab()`:**
- Removed the `/nodes/network-status` API call from the `Promise.all`
batch
- Added client-side computation of active/degraded/silent counts using
the shared `getHealthThresholds()` function from `roles.js`
- Uses `nodesResp.total` and `nodesResp.counts` (already returned by
`/nodes` endpoint) for total node count and role breakdown
## Why This Works
The `/nodes` response already includes:
- `total` — count of all matching nodes (server-computed across full DB)
- `counts` — role counts across all nodes (from `GetAllRoleCounts()`)
- Per-node `last_seen`/`last_heard` timestamps
The `getHealthThresholds()` function in `roles.js` provides the same
degraded/silent thresholds used server-side, so client-side status
computation produces equivalent results for the loaded node set.
## Performance
- **Before:** 3 parallel API calls (`/nodes`, `/nodes/bulk-health`,
`/nodes/network-status`)
- **After:** 2 parallel API calls (`/nodes`, `/nodes/bulk-health`)
- Network status computation is O(n) over the 200 loaded nodes —
negligible client-side cost
- The `/nodes/network-status` endpoint scanned ALL nodes in the DB on
every call; this eliminates that server-side work entirely
## Testing
- All frontend helper tests pass (445/445)
- All packet filter tests pass (62/62)
- All aging tests pass (29/29)
- All Go backend tests pass
Fixes#392
---------
Co-authored-by: you <you@example.com>
## Summary
Eliminates visible marker flicker on zoom/resize events in the map page
when displaying 500+ nodes.
## Problem
`renderMarkers()` was called on every `zoomend` and `resize` event,
which did `markerLayer.clearLayers()` followed by a full rebuild of all
markers. With many nodes, this caused a visible flash where all markers
disappeared briefly before being re-added.
## Solution
Instead of rebuilding all markers from scratch on zoom/resize:
1. **Store Leaflet layer references** on marker data objects
(`_leafletMarker`, `_leafletLine`, `_leafletDot`) during the initial
full render
2. **Add `_repositionMarkers()`** — re-runs `deconflictLabels()` at the
new zoom level and updates existing marker positions via
`setLatLng()`/`setLatLngs()` without clearing the layer group
3. **Debounce zoom/resize handlers** (150ms) to coalesce rapid events
during animated zooms
4. **Dynamically manage offset indicators** — adds/removes deconfliction
offset lines and dots as positions change at different zoom levels
Full `renderMarkers()` is still called for filter changes, data updates,
and theme changes — only zoom/resize uses the lightweight repositioning
path.
## Complexity
- `_repositionMarkers()`: O(n) — single pass over stored marker data
- `deconflictLabels()`: O(n × k) where k is max spiral offsets (48) —
unchanged
- No new API calls, no DOM rebuilds
Fixes#393
---------
Co-authored-by: you <you@example.com>
## Summary
`replayRecent()` in `live.js` fetched observation details for 8 packet
groups **sequentially** — each `await fetch()` waited for the previous
to complete before starting the next.
## Change
Replaced the sequential `for` loop with `Promise.all()` to fetch all 8
detail API calls **concurrently**. The mapping from results to live
packets is unchanged.
**Before:** 8 sequential fetches (total time ≈ sum of all request
durations)
**After:** 8 parallel fetches (total time ≈ max of all request
durations)
## Notes
- `replayRecent()` is currently disabled (commented out at line 856), so
this is dormant code — no runtime risk
- No behavioral change: same data mapping, same rendering, same VCR
buffer population
- All existing tests pass
Fixes#394
---------
Co-authored-by: you <you@example.com>
## Summary
Eliminates the N+1 API call storm when toggling off "Group by Hash" in
the packets table.
## Problem
When ungrouped mode was active, `loadPackets()` fired individual
`/api/packets/{hash}` requests for every multi-observation packet. With
200+ multi-obs packets, this created 200+ parallel HTTP requests —
overwhelming both browser connection limits and the server.
## Fix
The server already supports `expand=observations` on the `/api/packets`
endpoint, which returns observations inline. Instead of:
1. Always fetching grouped (`groupByHash=true`)
2. Then N+1 fetching each packet's children individually
We now:
1. Fetch grouped when grouped mode is active (`groupByHash=true`)
2. Fetch with `expand=observations` when ungrouped — **single API call**
3. Flatten observations client-side
**Result: 200+ API calls → 1 API call.**
## Changes
- `public/packets.js`: Replaced N+1 observation fetching loop with
single `expand=observations` query parameter, flatten inline
observations client-side.
## Testing
- All frontend tests pass (packet-filter: 62/62, frontend-helpers:
445/445)
- All Go backend tests pass
Fixes#382
Co-authored-by: you <you@example.com>
## Summary
`handleObservers()` in `routes.go` was calling `GetNodeLocations()`
which fetches ALL nodes from the DB just to match ~10 observer IDs
against node public keys. With 500+ nodes this is wasteful.
## Changes
- **`db.go`**: Added `GetNodeLocationsByKeys(keys []string)` — queries
only the rows matching the given public keys using a parameterized
`WHERE LOWER(public_key) IN (?, ?, ...)` clause.
- **`routes.go`**: `handleObservers` now collects observer IDs and calls
the targeted method instead of the full-table scan.
- **`coverage_test.go`**: Added `TestGetNodeLocationsByKeys` covering
known key, empty keys, and unknown key cases.
## Performance
With ~10 observers and 500+ nodes, the query goes from scanning all 500
rows to fetching only ~10. The original `GetNodeLocations()` is
preserved for any other callers.
Fixes#378
Co-authored-by: you <you@example.com>
## Summary
Skip `updateTimeline()` canvas redraws in `bufferPacket()` when the
browser tab is hidden (`_tabHidden === true`). Instead, batch-update the
timeline once when the tab becomes visible again via the
`visibilitychange` handler.
Fixes#385
## What Changed
**`public/live.js`** — two surgical edits:
1. **`bufferPacket()`**: Removed `updateTimeline()` call from the
`_tabHidden` early-return path. When the tab is backgrounded, packets
are still buffered (for VCR) but no canvas work is done.
2. **`visibilitychange` handler**: Added `updateTimeline()` call when
the tab is restored, so the timeline catches up in a single repaint
instead of N repaints (one per buffered packet).
## Performance Impact
At 5+ packets/sec with a backgrounded tab, this eliminates continuous
canvas redraws (`updateTimeline()` calls `ctx.clearRect` + full canvas
redraw + `updateTimelinePlayhead()`) that are invisible to the user. CPU
usage drops to near-zero for timeline rendering while backgrounded.
## Tests
All existing tests pass:
- `test-packet-filter.js` — 62 passed
- `test-aging.js` — 29 passed
- `test-frontend-helpers.js` — 445 passed
Co-authored-by: you <you@example.com>
## Summary
Replace N+1 per-hop DB queries in `handleResolveHops` with O(1) lookups
against the in-memory prefix map that already exists in the packet
store.
## Problem
Each hop in the `resolve-hops` API triggered a separate `SELECT ... LIKE
?` query against the nodes table. With 10 hops, that's 10 DB round-trips
— unnecessary when `getCachedNodesAndPM()` already maintains an
in-memory prefix map that can resolve hops instantly.
## Changes
- **routes.go**: Replace the per-hop DB query loop with `pm.m[hopLower]`
lookups from the prefix map. Convert `nodeInfo` → `HopCandidate` inline.
Remove unused `rows`/`sql.Scan` code.
- **store.go**: Add `InvalidateNodeCache()` method to force prefix map
rebuild (needed by tests that insert nodes after store initialization).
- **routes_test.go**: Give `TestResolveHopsAmbiguous` a proper store so
hops resolve via the prefix map.
- **resolve_context_test.go**: Call `InvalidateNodeCache()` after
inserting test nodes. Fix confidence assertion — with GPS candidates and
no affinity context, `resolveWithContext` correctly returns
`gps_preference` (previously masked because the prefix map didn't have
the test nodes).
## Complexity
O(1) per hop lookup via hash map vs O(n) DB scan per hop. No hot-path
impact — this endpoint is called on-demand, not in a render loop.
Fixes#369
---------
Co-authored-by: you <you@example.com>
## Summary
Replace full `buildDistanceIndex()` rebuild with incremental
`removeTxFromDistanceIndex`/`addTxToDistanceIndex` for only the
transmissions whose paths actually changed during
`IngestNewObservations`.
## Problem
When any transmission's best path changed during observation ingestion,
the **entire distance index was rebuilt** — iterating all 30K+ packets,
resolving all hops, and computing haversine distances. This
`O(total_packets × avg_hops)` operation ran under a write lock, blocking
all API readers.
A 30-second debounce (`distRebuildInterval`) was added in #557 to
mitigate this, but it only delayed the pain — the full rebuild still
happened, just less frequently.
## Fix
- Added `removeTxFromDistanceIndex(tx)` — filters out all
`distHopRecord` and `distPathRecord` entries for a specific transmission
- Added `addTxToDistanceIndex(tx)` — computes and appends new distance
records for a single transmission
- In `IngestNewObservations`, changed path-change handling to call
remove+add for each affected tx instead of marking dirty and waiting for
a full rebuild
- Removed `distDirty`, `distLast`, and `distRebuildInterval` since
incremental updates are cheap enough to apply immediately
## Complexity
- **Before:** `O(total_packets × avg_hops)` per rebuild (30K+ packets)
- **After:** `O(changed_txs × avg_hops + total_dist_records)` — the
remove is a linear scan of the distance slices, but only for affected
txs; the add is `O(hops)` per changed tx
The remove scan over `distHops`/`distPaths` slices is linear in slice
length, but this is still far cheaper than the full rebuild which also
does JSON parsing, hop resolution, and haversine math for every packet.
## Tests
- Updated `TestDistanceRebuildDebounce` →
`TestDistanceIncrementalUpdate` to verify incremental behavior and check
for duplicate path records
- All existing tests pass (`go test ./...` in both `cmd/server` and
`cmd/ingestor`)
Fixes#365
---------
Co-authored-by: you <you@example.com>
## Summary
Cache `resolveRegionObservers()` results with a 30-second TTL to
eliminate repeated database queries for region→observer ID mappings.
## Problem
`resolveRegionObservers()` queried the database on every call despite
the observers table changing infrequently (~20 rows). It's called from
10+ hot paths including `filterPackets()`, `GetChannels()`, and multiple
analytics compute functions. When analytics caches are cold, parallel
requests each hit the DB independently.
## Solution
- Added a dedicated `regionObsMu` mutex + `regionObsCache` map with 30s
TTL
- Uses a separate mutex (not `s.mu`) to avoid deadlocks — callers
already hold `s.mu.RLock()`
- Cache is lazily populated per-region and fully invalidated after TTL
expires
- Follows the same pattern as `getCachedNodesAndPM()` (30s TTL,
on-demand rebuild)
## Changes
- **`cmd/server/store.go`**: Added `regionObsMu`, `regionObsCache`,
`regionObsCacheTime` fields; rewrote `resolveRegionObservers()` to check
cache first; added `fetchAndCacheRegionObs()` helper
- **`cmd/server/coverage_test.go`**: Added
`TestResolveRegionObserversCaching` — verifies cache population, cache
hits, and nil handling for unknown regions
## Testing
- All existing Go tests pass (`go test ./...`)
- New test verifies caching behavior (population, hits, nil for unknown
regions)
Fixes#362
---------
Co-authored-by: you <you@example.com>
## Summary
`GetStoreStats()` ran 5 sequential DB queries on every call. This
combines them into **2 concurrent queries**:
1. **Node/observer counts** — single query using subqueries: `SELECT
(SELECT COUNT(*) FROM nodes WHERE ...), (SELECT COUNT(*) FROM nodes),
(SELECT COUNT(*) FROM observers)`
2. **Observation counts** — single query using conditional aggregation:
`SUM(CASE WHEN timestamp > ? THEN 1 ELSE 0 END)` scoped to the 24h
window, avoiding a full table scan for the 1h count
Both queries run concurrently via goroutines + `sync.WaitGroup`.
## What changed
- `cmd/server/store.go`: Rewrote `GetStoreStats()` — 5 sequential
`QueryRow` calls → 2 concurrent combined queries
- Error handling now propagates query errors instead of silently
ignoring them
## Performance justification
- **Before:** 5 sequential round-trips to SQLite, with 2 potentially
expensive `COUNT(*)` scans on the `observations` table
- **After:** 2 concurrent round-trips; the observation query scans the
24h window once instead of separately scanning for 1h and 24h
- The 10s cache (`statsTTL`) remains, so this fires at most once per 10s
— but when it does fire, it's ~2.5x fewer round-trips and the
observation scan is halved
## Tests
- `go test ./...` passes for both `cmd/server` and `cmd/ingestor`
Fixes#363
---------
Co-authored-by: you <you@example.com>
## Summary
Extracts a shared `fetchNodeDetail(pubkey)` helper in `nodes.js` that
fetches both `/nodes/{pubkey}` and `/nodes/{pubkey}/health` in parallel.
Both `selectNode()` (side panel) and `loadFullNode()` (full-screen view)
now call this single function instead of duplicating the fetch logic.
## What Changed
- **New:** `fetchNodeDetail(pubkey)` — shared async function that
returns node data with `.healthData` attached
- **Modified:** `loadFullNode()` — uses `fetchNodeDetail()` instead of
inline `Promise.all`
- **Modified:** `selectNode()` — uses `fetchNodeDetail()` instead of
inline `Promise.all`
## Why
The duplicate `api()` calls weren't a major perf issue (TTL caching
mitigates most cases), but the duplicated logic was unnecessary tech
debt. On mobile, `selectNode()` redirects to `loadFullNode()` via hash
change, so the two code paths could fire sequentially with expired
cache.
## Testing
- All frontend helper tests pass (445/445)
- All packet filter tests pass (62/62)
- All aging tests pass (29/29)
- No behavioral change — only code structure improvement
Fixes#391
Co-authored-by: you <you@example.com>
## Summary
`GetSubpathDetail()` iterated ALL packets to find those containing a
specific subpath — `O(packets × hops × subpath_length)`. With 30K+
packets this caused user-visible latency on every subpath detail click.
## Changes
### `cmd/server/store.go`
- Added `spTxIndex map[string][]*StoreTx` alongside existing `spIndex` —
tracks which transmissions contain each subpath key
- Extended `addTxToSubpathIndexFull()` and
`removeTxFromSubpathIndexFull()` to maintain both indexes simultaneously
- Original `addTxToSubpathIndex()`/`removeTxFromSubpathIndex()` wrappers
preserved for backward compatibility
- `buildSubpathIndex()` now populates both `spIndex` and `spTxIndex`
during `Load()`
- All incremental update sites (ingest, path change, eviction) use the
`Full` variants
- `GetSubpathDetail()` rewritten: direct `O(1)` map lookup on
`spTxIndex[key]` instead of scanning all packets
### `cmd/server/coverage_test.go`
- Added `TestSubpathTxIndexPopulated`: verifies `spTxIndex` is
populated, counts match `spIndex`, and `GetSubpathDetail` returns
correct results for both existing and non-existent subpaths
## Complexity
- **Before:** `O(total_packets × avg_hops × subpath_length)` per request
- **After:** `O(matched_txs)` per request (direct map lookup)
## Tests
All tests pass: `cmd/server` (4.6s), `cmd/ingestor` (25.6s)
Fixes#358
---------
Co-authored-by: you <you@example.com>
## Summary
`buildPrefixMap()` was generating map entries for every prefix length
from 2 to `len(pubkey)` (up to 64 chars), creating ~31 entries per node.
With 500 nodes that's ~15K map entries; with 1K+ nodes it balloons to
31K+.
## Changes
**`cmd/server/store.go`:**
- Added `maxPrefixLen = 8` constant — MeshCore path hops use 2–6 char
prefixes, 8 gives headroom
- Capped the prefix generation loop at `maxPrefixLen` instead of
`len(pk)`
- Added full pubkey as a separate map entry when key is longer than
`maxPrefixLen`, ensuring exact-match lookups (used by
`resolveWithContext`) still work
**`cmd/server/coverage_test.go`:**
- Added `TestPrefixMapCap` with subtests for:
- Short prefix resolution still works
- Full pubkey exact-match resolution still works
- Intermediate prefixes beyond the cap correctly return nil
- Short keys (≤8 chars) have all prefix entries
- Map size is bounded
## Impact
- Map entries per node: ~31 → ~8 (one per prefix length 2–8, plus one
full-key entry)
- Total map size for 500 nodes: ~15K entries → ~4K entries (~75%
reduction)
- No behavioral change for path hop resolution (2–6 char prefixes)
- No behavioral change for exact pubkey lookups
## Tests
All existing tests pass:
- `cmd/server`: ✅
- `cmd/ingestor`: ✅Fixes#364
---------
Co-authored-by: you <you@example.com>
## Summary
Index node path lookups in `handleNodePaths()` instead of scanning all
packets on every request.
## Problem
`handleNodePaths()` iterated ALL packets in the store (`O(total_packets
× avg_hops)`) with prefix string matching on every hop. This caused
user-facing latency on every node detail page load with 30K+ packets.
## Fix
Added a `byPathHop` index (`map[string][]*StoreTx`) that maps lowercase
hop prefixes and resolved full pubkeys to their transmissions. The
handler now does direct map lookups instead of a full scan.
### Index lifecycle
- **Built** during `Load()` via `buildPathHopIndex()`
- **Incrementally updated** during `IngestNewFromDB()` (new packets) and
`IngestNewObservations()` (path changes)
- **Cleaned up** during `EvictStale()` (packet removal)
### Query strategy
The handler looks up candidates from the index using:
1. Full pubkey (matches resolved hops from `resolved_path`)
2. 2-char prefix (matches short raw hops)
3. 4-char prefix (matches medium raw hops)
4. Any longer raw hops starting with the 4-char prefix
This reduces complexity from `O(total_packets × avg_hops)` to
`O(matching_txs + unique_hop_keys)`.
## Tests
- `TestNodePathsEndpointUsesIndex` — verifies the endpoint returns
correct results using the index
- `TestPathHopIndexIncrementalUpdate` — verifies add/remove operations
on the index
All existing tests pass.
Fixes#359
Co-authored-by: you <you@example.com>
## Summary
Fixes#566 — The "Inconsistent Hash Sizes" list on the Analytics page
included all node types and had no time window, causing false positives.
## Changes
### 1. Role filter on inconsistent nodes (`cmd/server/store.go`)
Added role filter to the `inconsistentNodes` loop in
`computeHashCollisions()` so only repeaters and room servers are
included. Companions are excluded since they were never affected by the
firmware bug. This matches the existing role filter on collision
bucketing from #441.
```go
// Before:
if cn.HashSizeInconsistent {
// After:
if cn.HashSizeInconsistent && (cn.Role == "repeater" || cn.Role == "room_server") {
```
### 2. 7-day time window on hash size computation
(`cmd/server/store.go`)
Added a 7-day recency cutoff to `computeNodeHashSizeInfo()`. Adverts
older than 7 days are now skipped, preventing legitimate historical
config changes (e.g., testing different byte sizes) from creating
permanent false positives.
### 3. Frontend description text (`public/analytics.js`)
Updated the description to reflect the filtered scope: now says
"Repeaters and room servers" instead of "Nodes", mentions the 7-day
window, and notes that companions are excluded.
## Tests
- `TestInconsistentNodesExcludesCompanions` — verifies companions are
excluded while repeaters and room servers are included
- `TestHashSizeInfoTimeWindow` — verifies adverts older than 7 days are
excluded from hash size computation
- Updated existing hash size tests to use recent timestamps (compatible
with the new time window)
- All existing tests pass: `cmd/server` ✅, `cmd/ingestor` ✅
## Perf justification
The time window filter adds a single string comparison per advert in the
scan loop — O(n) with a tiny constant. No impact on hot paths.
---------
Co-authored-by: you <you@example.com>
## Summary
Replace O(n) map iteration in `MaxTransmissionID()` and
`MaxObservationID()` with O(1) field lookups.
## What Changed
- Added `maxTxID` and `maxObsID` fields to `PacketStore`
- Updated `Load()`, `IngestNewFromDB()`, and `IngestNewObservations()`
to track max IDs incrementally as entries are added
- `MaxTransmissionID()` and `MaxObservationID()` now return the tracked
field directly instead of iterating the entire map
## Performance
Before: O(n) iteration over 30K+ map entries under a read lock
After: O(1) field return
## Tests
- Added `TestMaxTransmissionIDIncremental` verifying the incremental
field matches brute-force iteration over the maps
- All existing tests pass (`cmd/server` and `cmd/ingestor`)
Fixes#356
Co-authored-by: you <you@example.com>
## Summary
Adds a byte-size filter to the map page, allowing users to filter
repeater markers by their hash prefix size (1-byte, 2-byte, or 3-byte).
## What changed
**`public/map.js`** — single file change:
1. **New filter state**: Added `byteSize` to the `filters` object
(default: `'all'`), persisted in `localStorage`
2. **New UI section**: Added a "Byte Size" fieldset with button group
(`All | 1-byte | 2-byte | 3-byte`) in the map controls panel, between
"Node Types" and "Display"
3. **Filter logic**: In `_renderMarkersInner`, when `byteSize !==
'all'`, repeater nodes are filtered by their `hash_size` field.
Non-repeater nodes (companions, rooms, sensors) are unaffected — they
pass through regardless of the byte-size filter setting
4. **Event binding**: Button click handlers update the filter, persist
to localStorage, and re-render markers
## Design decisions
- **Client-side only** — no backend changes needed. The `hash_size`
field is already included in the `/api/nodes` response
- **Repeaters only** — byte size is a repeater configuration concept;
other node roles don't have configurable path prefix sizes
- **Matches existing pattern** — uses the same button-group UI as the
Status filter (All/Active/Stale)
- **`hash_size` defaults to 1** — consistent with how the rest of the
codebase treats missing `hash_size` (`node.hash_size || 1`)
## Performance
No new API calls. Filter is a simple string comparison inside the
existing `nodes.filter()` loop in `_renderMarkersInner` — O(1) per node,
negligible overhead.
Fixes#565
Co-authored-by: you <you@example.com>
## Problem
Closes#563. Addresses the *Packet store estimated memory* item in #559.
`estimatedMemoryMB()` used a hardcoded formula:
```go
return float64(len(s.packets)*5120+s.totalObs*500) / 1048576.0
```
This ignored three data structures that grow continuously with every
ingest cycle:
| Structure | Production size | Heap not counted |
|---|---|---|
| `distHops []distHopRecord` | 1,556,833 records | ~300 MB |
| `distPaths []distPathRecord` | 93,090 records | ~25 MB |
| `spIndex map[string]int` | 4,113,234 entries | ~400 MB |
Result: formula reported ~1.2 GB while actual heap was ~5 GB. With
`maxMemoryMB: 1024`, eviction calculated it only needed to shed ~200 MB,
removed a handful of packets, and stopped. Memory kept growing until the
OOM killer fired.
## Fix
Replace `estimatedMemoryMB()` with `runtime.ReadMemStats` so all data
structures are automatically counted:
```go
func (s *PacketStore) estimatedMemoryMB() float64 {
if s.memoryEstimator != nil {
return s.memoryEstimator()
}
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
return float64(ms.HeapAlloc) / 1048576.0
}
```
Replace the eviction simulation loop (which re-used the same wrong
formula) with a proportional calculation: if heap is N× over budget,
evict enough packets to keep `(1/N) × 0.9` of the current count. The 0.9
factor adds a 10% buffer so the next ingest cycle doesn't immediately
re-trigger. All major data structures (distHops, distPaths, spIndex)
scale with packet count, so removing a fraction of packets frees roughly
the same fraction of total heap.
## Testing
- Updated `TestEvictStale_MemoryBasedEviction` to inject a deterministic
estimator via the new `memoryEstimator` field.
- Added `TestEvictStale_MemoryBasedEviction_UnderestimatedHeap`:
verifies that when actual heap is 5× over limit (the production failure
scenario), eviction correctly removes ~80%+ of packets.
```
=== RUN TestEvictStale_MemoryBasedEviction
[store] Evicted 538 packets (1076 obs)
--- PASS
=== RUN TestEvictStale_MemoryBasedEviction_UnderestimatedHeap
[store] Evicted 820 packets (1640 obs)
--- PASS
```
Full suite: `go test ./...` — ok (10.3s)
## Perf note
`runtime.ReadMemStats` runs once per eviction tick (every 60 s) and once
per `/api/perf/store` call. Cost is negligible.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Fixes critical performance issue in neighbor graph computation that
consumed 65% of CPU (30+ seconds) on a 325K packet dataset.
## Changes
### Fix 1: Cache strings.ToLower results
- Added cachedToLower() helper that caches lowercased strings in a local
map
- Pubkeys repeat across hundreds of thousands of observations
- Pre-computes fromLower once per transaction instead of once per
observation
- **Impact:** Eliminates ~8.4s (25.3% CPU)
### Fix 2: Cache parsed DecodedJSON via StoreTx.ParsedDecoded()
- Added ParsedDecoded() method on StoreTx using sync.Once for
thread-safe lazy caching
- json.Unmarshal on decoded_json now runs at most once per packet
lifetime
- Result reused by extractFromNode, indexByNode, trackAdvertPubkey
- **Impact:** Eliminates ~8.8s (26.3% CPU)
### Fix 3: Extend neighbor graph TTL from 60s to 5 minutes
- The graph depends on traffic patterns, not individual packets
- Reduces rebuild frequency 5x
- **Impact:** ~80% reduction in sustained CPU from graph rebuilds
## Tests
- 7 new tests added, all 26+ existing neighbor graph tests pass
- BenchmarkBuildFromStore: 727us/op, 237KB/op, 6030 allocs/op
Related: #559
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
detectSchema() runs at DB open time before ensureResolvedPathColumn()
adds the column during Load(). On first run (or any run where the column
was just added), hasResolvedPath stayed false, causing Load() to skip
reading resolved_path from SQLite. This forced a full backfill of all
observations on every restart, burning CPU for minutes on large DBs.
Fix: set hasResolvedPath = true after ensureResolvedPathColumn succeeds.
## Summary
Implements server-side hop prefix resolution at ingest time with a
persisted neighbor graph. Hop prefixes in `path_json` are now resolved
to full 64-char pubkeys at ingest and stored as `resolved_path` on each
observation, eliminating the need for client-side resolution via
`HopResolver`.
Fixes#555
## What changed
### New file: `cmd/server/neighbor_persist.go`
SQLite persistence layer for the neighbor graph and resolved paths:
- `neighbor_edges` table creation and management
- Load/build/persist neighbor edges from/to SQLite
- `resolved_path` column migration on observations
- `resolvePathForObs()` — resolves hop prefixes using
`resolveWithContext` with 4-tier priority (affinity → geo → GPS → first
match)
- Cold startup backfill for observations missing `resolved_path`
- Async persistence of edges and resolved paths during ingest
(non-blocking)
### Modified: `cmd/server/store.go`
- `StoreObs` gains `ResolvedPath []*string` field
- `StoreTx` gains `ResolvedPath []*string` (cached from best
observation)
- `Load()` dynamically includes `resolved_path` in SQL query when column
exists
- `IngestNewFromDB()` resolves paths at ingest time and persists
asynchronously
- `pickBestObservation()` propagates `ResolvedPath` to transmission
- `txToMap()` and `enrichObs()` include `resolved_path` in API responses
- All 7 `pm.resolve()` call sites migrated to `pm.resolveWithContext()`
with the persisted graph
- Broadcast maps include `resolved_path` per observation
### Modified: `cmd/server/db.go`
- `DB` struct gains `hasResolvedPath bool` flag
- `detectSchema()` checks for `resolved_path` column existence
- Graceful degradation when column is absent (test DBs, old schemas)
### Modified: `cmd/server/main.go`
- Startup sequence: ensure tables → load/build graph → backfill resolved
paths → re-pick best observations
### Modified: `cmd/server/routes.go`
- `mapSliceToTransmissions()` and `mapSliceToObservations()` propagate
`resolved_path`
- Node paths handler uses `resolveWithContext` with graph
### Modified: `cmd/server/types.go`
- `TransmissionResp` and `ObservationResp` gain `ResolvedPath []*string`
with `omitempty`
### New file: `cmd/server/neighbor_persist_test.go`
16 tests covering:
- Path resolution (unambiguous, empty, unresolvable prefixes)
- Marshal/unmarshal of resolved_path JSON
- SQLite table creation and column migration (idempotent)
- Edge persistence and loading
- Schema detection
- Full Load() with resolved_path
- API response serialization (present when set, omitted when nil)
## Design decisions
1. **Async persistence** — resolved paths and neighbor edges are written
to SQLite in a goroutine to avoid blocking the ingest loop. The
in-memory state is authoritative.
2. **Schema compatibility** — `DB.hasResolvedPath` flag allows the
server to work with databases that don't yet have the `resolved_path`
column. SQL queries dynamically include/exclude the column.
3. **`pm.resolve()` retained** — Not removed as dead code because
existing tests use it directly. All production call sites now use
`resolveWithContext` with the persisted graph.
4. **Edge persistence is conservative** — Only unambiguous edges (single
candidate) are persisted to `neighbor_edges`. Ambiguous prefixes are
handled by the in-memory `NeighborGraph` via Jaccard disambiguation.
5. **`null` = unresolved** — Ambiguous prefixes store `null` in the
resolved_path array. Frontend falls back to prefix display.
## Performance
- `resolveWithContext` per hop: ~1-5μs (map lookups, no DB queries)
- Typical packet has 0-5 hops → <25μs total resolution overhead per
packet
- Edge/path persistence is async → zero impact on ingest latency
- Backfill is one-time on first startup with the new column
## Test results
```
cd cmd/server && go test ./... -count=1 → ok (4.4s)
cd cmd/ingestor && go test ./... -count=1 → ok (25.5s)
```
---------
Co-authored-by: you <you@example.com>
## Summary
Implements **M4 (frontend consumers)** from the [resolved-path
spec](https://github.com/Kpa-clawbot/CoreScope/blob/resolved-path-spec/docs/specs/resolved-path.md)
for #555.
The server (PR #556, M1-M3) now returns `resolved_path` on all
packet/observation API responses and WebSocket broadcasts. This PR
updates all frontend consumers to **prefer `resolved_path`** over
client-side HopResolver, with full fallback for old packets.
## What changed
### `hop-resolver.js`
- Added `resolveFromServer(hops, resolvedPath)` — takes the short hex
prefixes and aligned array of full pubkeys from `resolved_path`, looks
up node names from the existing nodesList. Returns the same `{ [hop]: {
name, pubkey, ... } }` format as `resolve()`.
### `packet-helpers.js`
- Added `getResolvedPath(p)` — cached JSON parser for the new
`resolved_path` field (mirrors `getParsedPath`).
- Updated `clearParsedCache()` to also clear `_parsedResolvedPath`.
### `packets.js`
- **Bulk load** (`loadPackets`): calls `cacheResolvedPaths(packets)`
before the existing `resolveHops` fallback.
- **WebSocket updates**: pre-populates `hopNameCache` from
`resolved_path` on incoming packets before falling back to HopResolver
for any remaining unknown hops.
- **Group expansion** (`pktToggleGroup`): caches resolved paths from
child observations.
- **Packet detail** (`selectPacket`): prefers `resolveFromServer` when
`resolved_path` is available.
- **Show Route button**: uses `resolved_path` pubkeys directly instead
of client-side disambiguation.
- **Observation spreading**: carries `resolved_path` field when
constructing observation packets.
### `live.js`
- `resolveHopPositions` accepts optional `resolvedPath` parameter;
prefers server-resolved pubkeys, falls back to HopResolver for null
entries.
- Normalized WS packet objects now carry `resolved_path`.
### Files NOT changed (no resolution changes needed)
- **`analytics.js`** — only uses `HopResolver.haversineKm` (a utility
function). Topology, subpath, and hop distance data comes pre-resolved
from the server API (handled by M2/M3).
- **`nodes.js`** — gets pre-resolved path data from
`/nodes/:pubkey/paths` API; no client-side hop resolution.
- **`map.js`** — `drawPacketRoute` already handles full 64-char pubkeys
via exact match. The updated `packets.js` now passes full pubkeys from
`resolved_path` to the map.
## Fallback pattern
```javascript
// In hop-resolver.js
function resolveFromServer(hops, resolvedPath) {
// Returns resolved entries for non-null pubkeys
// Skips null entries (unresolved) — caller falls back to HopResolver
}
// In packets.js — bulk load
await cacheResolvedPaths(packets); // server-side first
await resolveHops([...allHops]); // client-side fallback for remaining
```
Old packets without `resolved_path` continue to work exactly as before
via the existing HopResolver. `hop-resolver.js` is NOT removed — it
remains the fallback.
## Tests
- 10 new tests for `resolveFromServer()` and `getResolvedPath()`
- All 445 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Closes#555 (M4 milestone)
---------
Co-authored-by: you <you@example.com>
## Problem
On busy meshes (325K+ transmissions, 50 observers), the distance index
rebuild runs on **every ingest poll** (~1s interval), computing
haversine distances for 1M+ hop records. Each rebuild takes 2-3 seconds
but new observations arrive faster than it can finish, creating a CPU
hot loop that starves the HTTP server.
Discovered on the Cascadia Mesh instance where `corescope-server` was
consuming 15 minutes of CPU time in 10 minutes of uptime, the API was
completely unresponsive, and health checks were timing out.
### Server logs showing the hot loop:
```
[store] Built distance index: 1797778 hop records, 207072 path records
[store] Built distance index: 1797806 hop records, 207075 path records
[store] Built distance index: 1797811 hop records, 207075 path records
[store] Built distance index: 1797820 hop records, 207075 path records
```
Every 2 seconds, nonstop.
## Root Cause
`IngestNewObservations` calls `buildDistanceIndex()` synchronously
whenever `pickBestObservation` selects a longer path. With 50 observers
sending observations every second, paths change on nearly every poll
cycle, triggering a full rebuild each time.
## Fix
- Mark distance index dirty on path changes instead of rebuilding inline
- Rebuild at most every **30 seconds** (configurable via `distLast`
timer)
- Set `distLast` after initial `Load()` to prevent immediate re-rebuild
on first ingest
- Distance data is at most 30s stale — acceptable for an analytics view
## Testing
- `go build`, `go vet`, `go test` all pass
- No behavioral change for the initial load or the analytics API
response shape
- Distance data freshness goes from real-time to 30s max staleness
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes#388 — expanded groups were fetched sequentially with O(n)
`packets.find()` lookups.
## Changes
1. **Parallel fetch**: Replaced sequential `for...of + await` loop in
`loadPackets()` with `Promise.all()` so all expanded group children are
fetched concurrently.
2. **O(1) Map lookup**: Replaced 3 instances of `packets.find(p =>
p.hash === hash)` with `hashIndex.get(hash)`:
- `loadPackets()` expanded group restore (~line 553)
- `select-observation` click handler (~line 1015)
- `pktToggleGroup()` (~line 2012)
## Perf justification
- **Before**: N expanded groups → N sequential API calls + N ×
O(packets.length) array scans
- **After**: N parallel API calls + N × O(1) Map lookups
- Typical N is 1-3 (minor severity as noted in issue), but the fix is
trivial and correct
## Tests
All existing tests pass: `test-packet-filter.js` (62), `test-aging.js`
(29), `test-frontend-helpers.js` (433).
Co-authored-by: you <you@example.com>
## Summary
Show the transport badge ("T") in the live packet feed, matching the
packets table (#337).
## Changes
- Add `transportBadge(pkt.route_type)` to all 4 feed rendering paths in
`live.js`:
- Grouped feed items (initial history load)
- `addFeedItemDOM()` (VCR replay)
- Dedup new feed items (live WebSocket updates)
- Node detail panel recent packets list
- Uses existing `transportBadge()` from `app.js` and `.badge-transport`
CSS from `style.css`
## Testing
- 2 new source-level assertions in `test-live.js` verifying
`transportBadge()` calls exist
- All existing tests pass (67 passed in test-live.js, no new failures)
Fixes#338
Co-authored-by: you <you@example.com>
## Summary
`nodeActivity` (an object tracking per-node packet counts for heatmap
intensity) grows without bound — entries are added on every packet flash
but never removed, even when stale nodes are pruned.
## Changes
- **Delete `nodeActivity[key]`** alongside `nodeMarkers[key]` and
`nodeData[key]` when removing stale WS-only nodes in `pruneStaleNodes()`
- **Prune orphaned entries** — after the main prune loop, sweep
`nodeActivity` and delete any key that has no corresponding `nodeData`
entry (catches edge cases where nodes were removed by other code paths)
- Both run every 60s via the existing `pruneStaleNodes` interval timer
## Testing
- Added 2 regression tests in `test-frontend-helpers.js` verifying stale
node cleanup and orphan removal
- All 435 frontend helper tests pass, plus packet-filter (62) and aging
(29)
Fixes#390
---------
Co-authored-by: you <you@example.com>
## Summary
Augments the shared `HopResolver` with neighbor-graph affinity data so
that when multiple nodes match a hop prefix, the resolver prefers
candidates that are known neighbors of the adjacent hop — instead of
relying solely on geo-distance.
Fixes#528
## Changes
### `public/hop-resolver.js`
- Added `affinityMap` — stores bidirectional neighbor adjacency with
scores
- Added `setAffinity(graph)` — ingests `/api/analytics/neighbor-graph`
edge data into O(1) Map lookups
- Added `getAffinity(pubkeyA, pubkeyB)` — returns affinity score between
two nodes (0 if not neighbors)
- Added `pickByAffinity(candidates, adjacentPubkey, anchor, ...)` —
picks best candidate: affinity-neighbor first (highest score), then
geo-distance fallback
- Modified forward and backward passes in `resolve()` to track the
previously-resolved pubkey and use `pickByAffinity` instead of raw
geo-sort
### `public/live.js`
- Added `fetchAffinityData()` — fetches `/api/analytics/neighbor-graph`
once and calls `HopResolver.setAffinity()`
- Added `startAffinityRefresh()` — refreshes affinity data every 60
seconds
- Both are called from `loadNodes()` after HopResolver is initialized
### `test-hop-resolver-affinity.js` (new)
- Affinity prefers neighbor candidate over geo-closest
- Cold start (no affinity data) falls back to geo-closest
- Null/undefined affinity doesn't crash
- Bidirectional score lookup
- Highest affinity score wins among multiple neighbors
- Unambiguous hops unaffected by affinity
## Performance
- API calls: 1 at load + 1 per 60s (no per-packet calls)
- Per-packet resolve: O(1) Map lookups, <0.5ms
- Memory: ~50KB for 2K-node graph
---------
Co-authored-by: you <you@example.com>
Fixes#441
## Summary
Hash collision analysis was including ALL node types, inflating
collision counts with irrelevant data. Per MeshCore firmware analysis,
**only repeaters matter for collision analysis** — they're the only role
that forwards packets and appears in routing `path[]` arrays.
## Root Causes Fixed
1. **`hash_size==0` nodes counted in all buckets** — nodes with unknown
hash size were included via `cn.HashSize == bytes || cn.HashSize == 0`,
polluting every bucket
2. **Non-repeater roles included** — companions, rooms, sensors, and
observers were counted even though their hash collisions never cause
routing ambiguity
## Fix
Changed `computeHashCollisions()` filter from:
```go
// Before: include everything except companions
if cn.HashSize == bytes && cn.Role != "companion" {
```
To:
```go
// After: only include repeaters (per firmware analysis)
if cn.HashSize == bytes && cn.Role == "repeater" {
```
## Why only repeaters?
From [MeshCore firmware
analysis](https://github.com/Kpa-clawbot/CoreScope/issues/441#issuecomment-4185218547):
- Only repeaters override `allowPacketForward()` to return `true`
- Only repeaters append their hash to `path[]` during relay
- Companions, rooms, sensors, observers never forward packets
- Cross-role collisions are benign (companion silently drops, real
repeater still forwards)
## Tests
- `TestHashCollisionsOnlyRepeaters` — verifies companions, rooms,
sensors, and hash_size==0 nodes are all excluded
---------
Co-authored-by: you <you@example.com>
## Summary
VCR replay functions (`vcrReplayFromTs`, `vcrRewind`,
`fetchNextReplayPage`) fetch up to 10K packets and process them all
synchronously on the main thread via `expandToBufferEntries`, causing
multi-second UI freezes — especially on mobile.
## Fix
- Added `expandToBufferEntriesAsync()` — processes packets in chunks of
200, yielding to the event loop via `setTimeout(0)` between chunks
- Updated all three VCR replay callers to use the async variant
- Kept the synchronous `expandToBufferEntries()` for backward
compatibility (tests, small datasets)
- Exposed `_liveExpandToBufferEntriesAsync` on window for test access
## Perf justification
- **Before:** 10K packets × ~2 observations = 20K+ objects created
synchronously, blocking the main thread for 1-3 seconds on mobile
- **After:** Same work split into chunks of 200 packets (~400 entries)
with event loop yields between chunks. Each chunk takes <5ms, keeping
the UI responsive (well under the 16ms frame budget)
- Chunk size of 200 is tunable via `VCR_CHUNK_SIZE`
## Tests
- Added regression test: sync expand correctness at scale (500 packets →
1000 entries)
- Added structural test: verifies `VCR_CHUNK_SIZE` exists and async
function yields via `setTimeout`
- All existing tests pass (`npm test`)
Fixes#395
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#410 — virtual scroll height miscalculation for expanded group
rows.
## Root Cause
When WebSocket messages add children to an already-expanded packet
group, `_rowCounts` becomes stale during the 200ms render debounce
window. Scroll events during this window call `renderVisibleRows()` with
stale row counts, causing wrong total height, spacer heights, and
visible range calculations.
## Changes
**public/packets.js:**
- Added `_rowCountsDirty` flag to track when row counts need
recomputation
- Added `_invalidateRowCounts()` — marks row counts as stale and clears
cumulative cache
- Added `_refreshRowCountsIfDirty()` — lazily recomputes `_rowCounts`
from `_displayPackets`
- Called `_invalidateRowCounts()` when WS handler adds children to
expanded groups (line ~402)
- Called `_refreshRowCountsIfDirty()` at top of `renderVisibleRows()`
before using row counts
- Reset `_rowCountsDirty` in all cleanup paths (destroy, empty display)
**test-packets.js:**
- Added 4 regression tests for `_invalidateRowCounts` /
`_refreshRowCountsIfDirty`
## Complexity
O(n) recomputation of `_rowCounts` when dirty (same as existing
`renderTableRows` path). Only triggers when WS modifies expanded group
children, which is infrequent relative to scroll events.
Co-authored-by: you <you@example.com>
## Summary
Fixes#533 — server cache hit rate always 0%.
## Root Cause
`invalidateCachesFor()` is called at the end of every
`IngestNewFromDB()` and `IngestNewObservations()` cycle (~2-5s). Since
new data arrives continuously, caches are cleared faster than any
analytics request can hit them, resulting in a permanent 0% cache hit
rate. The cache TTL (15s/60s) is irrelevant because entries are evicted
by invalidation long before they expire.
## Fix
Rate-limit cache invalidation with a 10-second cooldown:
- First call after cooldown goes through immediately
- Subsequent calls during cooldown accumulate dirty flags in
`pendingInv`
- Next call after cooldown merges pending + current flags and applies
them
- Eviction bypasses cooldown (data removal requires immediate clearing)
Analytics data may be at most ~10s stale, which is acceptable for a
dashboard.
## Changes
- **`store.go`**: Added `lastInvalidated`, `pendingInv`, `invCooldown`
fields. Refactored `invalidateCachesFor()` to rate-limit non-eviction
invalidation. Extracted `applyCacheInvalidation()` helper.
- **`cache_invalidation_test.go`**: Added 4 new tests:
- `TestInvalidationRateLimited` — verifies caches survive during
cooldown
- `TestInvalidationCooldownAccumulatesFlags` — verifies flag merging
- `TestEvictionBypassesCooldown` — verifies eviction always clears
immediately
- `BenchmarkCacheHitDuringIngestion` — confirms 100% hit rate during
rapid ingestion (was 0%)
## Perf Proof
```
BenchmarkCacheHitDuringIngestion-16 3467889 1018 ns/op 100.0 hit%
```
Before: 0% hit rate under continuous ingestion. After: 100% hit rate
during cooldown periods.
Co-authored-by: you <you@example.com>
## Summary
`GetPerfStoreStats()` and `GetPerfStoreStatsTyped()` iterated **all**
ADVERT packets and called `json.Unmarshal` on each one — under a read
lock — on every `/api/perf` and `/api/health` request. With 5K+ adverts,
each health check triggered thousands of JSON parses.
## Fix
Added a refcounted `advertPubkeys map[string]int` to `PacketStore` that
tracks distinct pubkeys incrementally during `Load()`,
`IngestNewFromDB()`, and eviction. The perf/health handlers now just
read `len(s.advertPubkeys)` — O(1) with zero allocations.
## Benchmark Results (5K adverts, 200 distinct pubkeys)
| Method | ns/op | allocs/op |
|--------|-------|-----------|
| `GetPerfStoreStatsTyped` | **78** | **0** |
| `GetPerfStoreStats` | **2,565** | **9** |
Before this change, both methods performed O(N) JSON unmarshals per
call.
## Tests Added
- `TestAdvertPubkeyTracking` — verifies incremental tracking through
add/evict lifecycle
- `TestAdvertPubkeyPublicKeyField` — covers the `public_key` JSON field
variant
- `TestAdvertPubkeyNonAdvert` — ensures non-ADVERT packets don't affect
count
- `BenchmarkGetPerfStoreStats` — 5K adverts benchmark
- `BenchmarkGetPerfStoreStatsTyped` — 5K adverts benchmark
Fixes#360
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#534 — mobile filter dropdown doesn't expand on packets page.
## Root Cause
CSS specificity battle in the mobile media query. The hide rule uses
`:not()` pseudo-classes which add specificity:
```css
/* Higher specificity due to :not() */
.filter-bar > *:not(.filter-toggle-btn):not(.col-toggle-wrap) { display: none; }
/* Lower specificity — loses even with .filters-expanded */
.filter-bar.filters-expanded > * { display: inline-flex; }
```
The JS toggle correctly adds/removes `.filters-expanded`, but the CSS
expanded rule could never win.
## Fix
Match the `:not()` selectors in the expanded rule so `.filters-expanded`
makes it strictly more specific:
```css
.filter-bar.filters-expanded > *:not(.filter-toggle-btn):not(.col-toggle-wrap) { display: inline-flex; }
```
Added a comment explaining the specificity dependency so future devs
don't repeat this.
## Tests
Added Playwright E2E test: mobile viewport (480×800), navigates to
packets page, clicks filter toggle, verifies filter inputs become
visible.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#355 — replaces O(n²) observation dedup in `Load()`,
`IngestNewFromDB()`, and `IngestNewObservations()` with an O(1)
map-based lookup.
## Changes
- Added `obsKeys map[string]bool` field to `StoreTx` for O(1) dedup
keyed on `observerID + "|" + pathJSON`
- Replaced all 3 linear-scan dedup sites in `store.go` with map lookups
- Lazy-init `obsKeys` for transmissions created before this change (in
`IngestNewFromDB` and `IngestNewObservations`)
- Added regression test (`TestObsDedupCorrectness`) verifying dedup
correctness
- Added nil-map safety test (`TestObsDedupNilMapSafety`)
- Added benchmark comparing map vs linear scan
## Benchmark Results (ARM64, 16 cores)
| Observations | Map (O(1)) | Linear (O(n)) | Speedup |
|---|---|---|---|
| 10 | 34 ns/op | 41 ns/op | 1.2x |
| 50 | 34 ns/op | 186 ns/op | 5.5x |
| 100 | 34 ns/op | 361 ns/op | 10.6x |
| 500 | 34 ns/op | 4,903 ns/op | **146x** |
Map lookup is constant time regardless of observation count. The linear
scan degrades quadratically — at 500 observations per transmission
(realistic for popular packets seen by many observers), the old code is
146x slower per dedup check.
All existing tests pass.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#354
Replaces the O(n²) selection sort in `sortedCopy()` with Go's built-in
`sort.Float64s()` (O(n log n)).
## Changes
- **`cmd/server/routes.go`**: Replaced manual nested-loop selection sort
with `sort.Float64s(cp)`
- **`cmd/server/helpers_test.go`**: Added regression test with
1000-element random input + benchmark
## Benchmark Results (ARM64)
```
BenchmarkSortedCopy/n=256 ~16μs/op 1 alloc
BenchmarkSortedCopy/n=1000 ~95μs/op 1 alloc
BenchmarkSortedCopy/n=10000 ~1.3ms/op 1 alloc
```
With the old O(n²) sort, n=10000 would take ~50ms+. The new
implementation scales as O(n log n).
## Testing
- All existing `TestSortedCopy` tests pass (unchanged behavior)
- New `TestSortedCopyLarge` validates correctness on 1000 random
elements
- `go test ./...` passes in `cmd/server`
Co-authored-by: you <you@example.com>
Fixes#537
## Problem
Observer filter in grouped mode only checked `p.observer_id` (the
primary observer), ignoring child observations. Grouped packets seen by
multiple observers would be hidden when filtering for a non-primary
observer.
## Fix
Two filter paths updated to also check `p._children`:
1. **Client-side display filter** (line ~1293): removed the
`!groupByHash` guard and added `_children` check so grouped packets are
included when any child observation matches
2. **WS real-time filter** (line ~360): added `_children` fallback check
The grouped row rendering (line ~1042) already correctly uses
`_observerFilterSet` for child filtering — no changes needed there.
## Tests
Added 5 tests in `test-frontend-helpers.js`:
- Grouped packet with matching child observer is shown
- Grouped packet with no matching observers is hidden
- WS filter passes/rejects grouped packets correctly
- Source code assertions verifying both filter paths check `_children`
Co-authored-by: you <you@example.com>
## Summary
- When `groupByHash=true`, each group only carries its representative
(best-path) `observer_id`. The client-side filter was checking only that
field, silently dropping groups that were seen by the selected observer
but had a different representative.
- `loadPackets` now passes the `observer` param to the server so
`filterPackets`/`buildGroupedWhere` do the correct "any observation
matches" check.
- Client-side observer filter in `renderTableRows` is skipped for
grouped mode (server already filtered correctly).
- Both `db.go` and `store.go` observer filtering extended to support
comma-separated IDs (multi-select UI).
## Test plan
- [ ] Set an observer filter on the Packets screen with grouping enabled
— all groups that have **any** observation from the selected observer(s)
should appear, not just groups where that observer is the representative
- [ ] Multi-select two observers — groups seen by either should appear
- [ ] Toggle to flat (ungrouped) mode — per-observation filter still
works correctly
- [ ] Existing grouped packets tests pass: `cd cmd/server && go test
./...`
Fixes#464🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
Fixes#525
The `checklist()` function in `home.js` treated steps and FAQ/checklist
as mutually exclusive — if `homeCfg.checklist` existed, steps were
skipped entirely. Adding a single FAQ via the customizer made all intro
steps disappear.
Now renders steps first, then FAQ below with a '❓ FAQ' header. Falls
back to Bay Area hardcoded defaults only when neither exists.
---------
Co-authored-by: you <you@example.com>
## Summary
Part of #523 — fixes bugs 5 and 7 (bug 6 was a duplicate of bug 7).
### Bug 5: Show Neighbors button throws `window._mapSelectRefNode is not
a function`
**Root cause:** Map popup HTML used inline `onclick` calling
`window._mapSelectRefNode`, which was deleted on SPA page destroy. If a
popup persisted after navigation, clicks would throw.
**Fix:** Replaced inline `onclick` with event delegation. A
document-level click handler catches all `[data-show-neighbors]` clicks
and calls `selectReferenceNode` directly. The global
`window._mapSelectRefNode` is still exposed for existing Playwright
tests but is no longer relied upon by the UI.
### Bug 7: Blue text on dark blue background (dark mode contrast)
**Root cause:** Neighbor table cells inside `.node-detail-section` /
`.node-full-card` inherited accent/link color instead of using
`var(--text)`, making text unreadable in dark mode.
**Fix:** Added explicit `color: var(--text)` on `.node-detail-section
.data-table td` and `.node-full-card .data-table td`. Only `<a>` tags
within those cells retain `color: var(--accent)`.
### Files changed
- `public/map.js` — event delegation for Show Neighbors
- `public/style.css` — contrast fix for neighbor table cells
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#525 — Customizer v2 home section shows empty fields and adding
FAQ kills steps.
## Root Cause
Server returned `home: null` from `/api/config/theme` when no home
config existed in config.json or theme.json. The customizer had no
built-in defaults, so all home fields appeared empty. When a user added
a single override (e.g. FAQ), `computeEffective` started from `home:
null`, created `home: {}`, and only applied the user's override — wiping
steps and everything else.
## Fix
### Server-side (primary)
In `handleConfigTheme()`, replaced the conditional `home` assignment
with `mergeMap` using built-in defaults matching what `home.js`
hardcodes:
- `heroTitle`: "CoreScope"
- `heroSubtitle`: "Real-time MeshCore LoRa mesh network analyzer"
- `steps`: 4 default getting-started steps
- `footerLinks`: Packets + Network Map links
Config/theme overrides merge on top, so customization still works.
### Client-side (defense-in-depth)
Added `DEFAULT_HOME` constant in `customize-v2.js`. `computeEffective()`
now falls back to these defaults when server returns `home: null`,
ensuring the customizer works even without server defaults.
## Tests
- **Go**: `TestConfigThemeHomeDefaults` — verifies `/api/config/theme`
returns non-null home with heroTitle, steps, footerLinks when no config
is set
- **JS**: Two new tests in `test-frontend-helpers.js` — verifies
`computeEffective` provides defaults when home is null, and that user
overrides merge correctly with defaults
Co-authored-by: you <you@example.com>
## Summary
Fixes the neighbor affinity graph returning empty results despite
abundant ADVERT data in the store.
**Root cause:** `extractFromNode()` in `neighbor_graph.go` only checked
for `"from_node"` and `"from"` fields in the decoded JSON, but real
ADVERT packets store the originator public key as `"pubKey"`. This meant
`fromNode` was always empty, so:
- Zero-hop edges (originator↔observer) were never created
- Originator↔path[0] edges were never created
- Only observer↔path[last] edges could be created (and only for
non-empty paths)
**Fix:** Check `"pubKey"` first in `extractFromNode()`, then fall
through to `"from_node"` and `"from"` for other packet types.
## Bugs Fixed
| Bug | Issue | Fix |
|-----|-------|-----|
| Empty graph results | #522 | `extractFromNode()` now reads `pubKey`
field from ADVERTs |
| 3-4s response time | #523 comment | Graph was rebuilding correctly
with 60s TTL cache — the slow response was due to iterating all packets
finding zero matches. With edges now being found, the cache works as
designed. |
| Incomplete visualization | #523 comment | Downstream of bug 1+2 —
fixed by fixing the builder |
| Accessibility | #523 comment | Added text-based neighbor list, dynamic
aria-label, keyboard focus CSS, dashed lines for ambiguous edges,
confidence symbols |
## Changes
- **`cmd/server/neighbor_graph.go`** — Fixed `extractFromNode()` to
check `pubKey` field (real ADVERT format)
- **`cmd/server/neighbor_graph_test.go`** — Added 2 new tests:
`TestBuildNeighborGraph_AdvertPubKeyField` (real ADVERT format) and
`TestBuildNeighborGraph_OneByteHashPrefixes` (1-byte prefix collision
scenario)
- **`public/analytics.js`** — Added accessible text-based neighbor list,
dynamic aria-label, dashed line pattern for ambiguous edges
- **`public/style.css`** — Added `:focus-visible` keyboard focus
indicator for canvas
## Testing
All Go tests pass (`go test ./... -count=1`). New tests verify the fix
prevents regression.
Fixes#523, Fixes#522
---------
Co-authored-by: you <you@example.com>
Fixes#518, Fixes#514, Fixes#515, Fixes#516
## Summary
Fixes all customizer v2 bugs from the consolidated tracker (#518). Both
server and client changes.
## Server Changes (`routes.go`)
- **typeColors defaults** — added all 10 type color defaults matching
`roles.js` `TYPE_COLORS`. Previously returned `{}`, causing all type
colors to render as black.
- **themeDark defaults** — added 22 dark mode color defaults matching
the Default preset. Previously returned `{}`, causing dark mode to have
no server-side defaults.
## Client Changes (`customize-v2.js`)
- [x] **P0: Phantom override cleanup on init** — new
`_cleanPhantomOverrides()` runs on startup, scanning
`cs-theme-overrides` and removing any values that match server defaults
(arrays via `JSON.stringify`, scalars via `===`).
- [x] **P1: `setOverride` auto-prunes matching defaults** — after
debounced write, iterates the delta and removes any key whose value
matches the server default. Prevents phantom overrides from
accumulating.
- [x] **P1: `_countOverrides` counts only real diffs** — now iterates
keys and calls `_isOverridden()` instead of blindly counting
`Object.keys().length`. Badge count reflects actual overrides only.
- [x] **P1: `_isOverridden` handles arrays/objects** — uses
`JSON.stringify` comparison for non-scalar values (home.steps,
home.checklist, etc.).
- [x] **P1: Type color fallback** — `_renderNodes()` falls back to
`window.TYPE_COLORS` when effective typeColors are empty, preventing
black color swatches.
- [x] **P1: Dark/light toggle re-renders panel** — MutationObserver on
`data-theme` now calls `_refreshPanel()` when panel is open, so
switching modes updates the Theme tab immediately.
## Tests
6 new unit tests added to `test-customizer-v2.js`:
- Phantom scalar overrides cleaned on init
- Phantom array overrides cleaned on init
- Real overrides preserved after cleanup
- `isOverridden` handles matching arrays (returns false)
- `isOverridden` handles differing arrays (returns true)
- `setOverride` prunes value matching server default
All 48 tests pass. Go tests pass.
---------
Co-authored-by: you <you@example.com>
Removes a stale `<<<<<<< HEAD` conflict marker that was accidentally
left in during the PR #510 rebase. This breaks Playwright E2E tests in
CI.
One-line fix — line 1311 deletion.
Co-authored-by: you <you@example.com>
## Summary
Adds a **Neighbor Graph** tab to the Analytics page — an interactive
force-directed graph visualization of the mesh network's neighbor
affinity data.
Part of #482 (Milestone 7 — Analytics Graph Visualization)
## What's New
### Neighbor Graph Tab
- New "Neighbor Graph" tab in the analytics tab bar
- Force-directed graph layout using HTML5 Canvas (vanilla JS, no
external libs)
- Nodes rendered as circles, colored by role using existing
`ROLE_COLORS`
- Edges as lines with thickness proportional to affinity score
- Ambiguous edges highlighted in yellow
### Interactions
- **Click node** → navigates to node detail page (`#/nodes/{pubkey}`)
- **Hover node** → tooltip showing name, role, neighbor count
- **Drag nodes** → rearrange layout interactively
- **Mouse wheel** → zoom in/out (towards cursor position)
- **Drag background** → pan the view
### Filters
- **Role checkboxes** — toggle repeater, companion, room, sensor
visibility
- **Minimum score slider** — filter out weak edges (0.00–1.00)
- **Confidence filter** — show all / high confidence only / hide
ambiguous
### Stats Summary
Displays above the graph: total nodes, total edges, average score,
resolved %, ambiguous count
### Data Source
Uses `GET /api/analytics/neighbor-graph` endpoint from M2, with region
filtering via the shared RegionFilter component.
## Performance
- Canvas-based rendering (not SVG) for performance with large graphs
- Force simulation uses `requestAnimationFrame` with cooling/dampening —
stops iterating when layout stabilizes
- O(n²) repulsion is acceptable for typical mesh sizes (~500 nodes); for
larger meshes, a Barnes-Hut approximation could be added later
- Animation frame is properly cleaned up on page destroy
## Tests
- Updated tab count assertion (≥10 tabs)
- New Playwright test: tab loads, canvas renders, stats shown (≥3 stat
cards)
- New Playwright test: filter changes update stats
## Files Changed
- `public/analytics.js` — new tab + full graph visualization
implementation
- `test-e2e-playwright.js` — 2 new tests + updated assertion
---------
Co-authored-by: you <you@example.com>
## Summary
Milestone 4 of #482: adds affinity-aware hop resolution to improve
disambiguation accuracy across all hop resolution in the app.
### What changed
**Backend — `prefixMap.resolveWithContext()` (store.go)**
New method that applies a 4-tier disambiguation priority when multiple
nodes match a hop prefix:
| Priority | Strategy | When it wins |
|----------|----------|-------------|
| 1 | **Affinity graph score** | Neighbor graph has data, score ratio ≥
3× runner-up |
| 2 | **Geographic proximity** | Context nodes have GPS, pick closest
candidate |
| 3 | **GPS preference** | At least one candidate has coordinates |
| 4 | **First match** | No signal — current naive fallback |
The existing `resolve()` method is unchanged for backward compatibility.
New callers that have context (originator, observer, adjacent hops) can
use `resolveWithContext()` for better results.
**API — `handleResolveHops` (routes.go)**
Enhanced `/api/resolve-hops` endpoint:
- New query params: `from_node`, `observer` — provide context for
affinity scoring
- New response fields on `HopCandidate`: `affinityScore` (float,
0.0–1.0)
- New response fields on `HopResolution`: `bestCandidate` (pubkey when
confident), `confidence` (one of `unique_prefix`, `neighbor_affinity`,
`ambiguous`)
- Backward compatible: without context params, behavior is identical to
before (just adds `confidence` field)
**Types (types.go)**
- `HopCandidate.AffinityScore *float64`
- `HopResolution.BestCandidate *string`
- `HopResolution.Confidence string`
### Tests
- 7 unit tests for `resolveWithContext` covering all 4 priority tiers +
edge cases
- 2 unit tests for `geoDistApprox`
- 4 API tests for enhanced `/api/resolve-hops` response shape
- All existing tests pass (no regressions)
### Impact
This improves ALL hop resolution across the app — analytics, route
display, subpath analysis, and any future feature that resolves hop
prefixes. The affinity graph (from M1/M2) now feeds directly into
disambiguation decisions.
Part of #482
---------
Co-authored-by: you <you@example.com>
## Summary
Replace broken client-side path walking in `selectReferenceNode()` with
server-side `/api/nodes/{pubkey}/neighbors` API call, fixing #484 where
Show Neighbors returned zero results due to hash collision
disambiguation failures.
**Fixes #484** | Part of #482
## What changed
### `public/map.js` — `selectReferenceNode()` function
**Before:** Client-side path walking — fetched
`/api/nodes/{pubkey}/paths`, walked each path to find hops adjacent to
the selected node by comparing full pubkeys. This fails on hash
collisions because path hops only contain short prefixes (1-2 bytes),
and the hop resolver can pick the wrong collision candidate.
**After:** Server-side affinity resolution — fetches
`/api/nodes/{pubkey}/neighbors?min_count=3` which uses the neighbor
affinity graph (built in M1/M2) to return disambiguated neighbors. For
ambiguous edges, all candidates are included in the neighbor set (better
to show extra markers than miss real neighbors).
**Fallback:** When the affinity API returns zero neighbors (cold start,
insufficient data), the function falls back to the original path-walking
approach. This ensures the feature works even before the affinity graph
has accumulated enough observations.
## Tests
4 new Playwright E2E tests (in both `test-show-neighbors.js` and
`test-e2e-playwright.js`):
1. **Happy path** — Verifies the `/neighbors` API is called and the
reference node UI activates
2. **Hash collision disambiguation** — Two nodes sharing prefix "C0" get
different neighbor sets via the affinity API (THE critical test for
#484)
3. **Fallback to path walking** — Empty affinity response triggers
fallback to `/paths` API
4. **Ambiguous candidates** — Ambiguous edge candidates are included in
the neighbor set
All tests use Playwright route interception to mock API responses,
testing the frontend logic independently of server state.
## Spec reference
See [neighbor-affinity-graph.md](docs/specs/neighbor-affinity-graph.md),
sections:
- "Replacing Show Neighbors on the map" (lines ~461-504)
- "Milestone 3: Show Neighbors Fix (#484)" (lines ~1136-1152)
- Test specs a & b (lines ~754-800)
---------
Co-authored-by: you <you@example.com>
## Summary
Implements the customizer v2 per the [approved
spec](docs/specs/customizer-rework.md), replacing the v1 customizer's
scattered state management with a clean event-driven architecture.
Resolves#502.
## What Changed
### New: `public/customize-v2.js`
Complete rewrite of the customizer as a self-contained IIFE with:
- **Single localStorage key** (`cs-theme-overrides`) replacing 7
scattered keys
- **Three state layers:** server defaults (immutable) → user overrides
(delta) → effective config (computed)
- **Full data flow pipeline:** `write → read-back → merge → atomic
SITE_CONFIG assign → apply CSS → dispatch theme-changed`
- **Color picker optimistic CSS** (Decision #12): `input` events update
CSS directly for responsiveness; `change` events trigger the full
pipeline
- **Override indicator dots** (●) on each field — click to reset
individual values
- **Section-level override count badges** on tabs
- **Browser-local banner** in panel header: "These settings are saved in
your browser only"
- **Auto-save status indicator** in footer: "All changes saved" /
"Saving..." / "⚠️ Storage full"
- **Export/Import** with full shape validation (`validateShape()`)
- **Presets** flow through the standard pipeline
(`writeOverrides(presetData) → pipeline`)
- **One-time migration** from 7 legacy localStorage keys (exact field
mapping per spec)
- **Validation** on all writes: color format, opacity range, timestamp
enum values
- **QuotaExceededError handling** with visible user warning
### Modified: `public/app.js`
Replaced ~80 lines of inline theme application code with a 15-line
`_customizerV2.init(cfg)` call. The customizer v2 handles all merging,
CSS application, and global state updates.
### Modified: `public/index.html`
Swapped `customize.js` → `customize-v2.js` script tag.
### Added: `docs/specs/customizer-rework.md`
The full approved spec, included in the repo for reference.
## Migration
On first page load:
1. Checks if `cs-theme-overrides` already exists → skip if yes
2. Reads all 7 legacy keys (`meshcore-user-theme`,
`meshcore-timestamp-*`, `meshcore-heatmap-opacity`,
`meshcore-live-heatmap-opacity`)
3. Maps them to the new delta format per the spec's field-by-field
mapping
4. Writes to `cs-theme-overrides`, removes all legacy keys
5. Continues with normal init
Users with existing customizations will see them preserved
automatically.
## Dark/Light Mode
- `theme` section stores light mode overrides, `themeDark` stores dark
mode overrides
- `meshcore-theme` localStorage key remains **separate** (view
preference, not customization)
- Switching modes re-runs the full pipeline with the correct section
## Testing
- All existing tests pass (`test-packet-filter.js`, `test-aging.js`,
`test-frontend-helpers.js`)
- Old `customize.js` is NOT modified — left in place for reference but
no longer loaded
## Not in Scope (per spec)
- Undo/redo stack
- Cross-tab synchronization
- Server-side admin import endpoint
- Map config / geo-filter overrides
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#483 — navigating away from the live page while matrix/hop
animations are running throws `TypeError: Cannot read properties of null
(reading 'addLayer')`.
## Root Cause
`destroy()` sets `animLayer = null` and `pathsLayer = null`, but
in-flight `requestAnimationFrame` callbacks continue executing and
attempt to call `.addTo(animLayer)` or `.removeLayer()` on the now-null
references.
The entry guards at the top of `drawMatrixLine()` and
`drawAnimatedLine()` only protect the initial call — not the rAF
continuation loops inside `tick()`, `fadeOut()`, `animateLine()`, and
`animateFade()`.
## Fix
Added null-guards (`if (!animLayer || !pathsLayer) return`) at the top
of all four rAF callback functions in `live.js`:
1. **`tick()`** (line ~2203) — matrix animation main loop
2. **`fadeOut()`** (line ~2253) — matrix animation fade-out
3. **`animateLine()`** (line ~2302) — standard line animation main loop
4. **`animateFade()`** (line ~2337) — standard line fade-out
This pattern is already used elsewhere in the file (e.g., line 1873,
1886) for the same purpose.
## Testing
- All unit tests pass (`npm test` — 0 failures)
- Go server tests pass (`cmd/server` + `cmd/ingestor`)
- Change is defensive only (early return on null) — no behavioral change
when layers exist
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#504 — Expanding a packet in the packets UI showed the same path
on every observation instead of each observation's unique path.
## Root Cause
PR #400 (fixing #387) added caching of `JSON.parse` results as
`_parsedPath` and `_parsedDecoded` properties on packet objects. When
observation packets are created via object spread (`{...parentPacket,
...obs}`), these cache properties are copied from the parent. Subsequent
calls to `getParsedPath(obsPacket)` hit the stale cache and return the
parent's path, ignoring the observation's own `path_json`.
## Fix
After every object spread that creates an observation packet from a
parent packet, delete the cache properties so they get re-parsed from
the observation's own data:
```js
delete obsPacket._parsedPath;
delete obsPacket._parsedDecoded;
```
Applied to all 5 spread sites in `public/packets.js`:
- Line 271: detail pane observation selection
- Line 504: flat view observation expansion
- Line 840: grouped view observation expansion
- Line 1012: child observation selection in grouped view
- Line 1982: WebSocket live update observation expansion
## Tests
Added 2 new tests in `test-frontend-helpers.js`:
1. Verifies observation packets get their own path after cache
invalidation (not the parent's)
2. Verifies observation path differs from parent path after cache
invalidation
All 431 frontend helper tests pass. All 62 packet filter tests pass.
---------
Co-authored-by: you <you@example.com>
## Problem
As described in #387, `JSON.parse()` is called repeatedly on the same
packet data across render cycles. With 30K packets, each render cycle
parses 60K+ JSON strings unnecessarily.
## Analysis
The server sends `decoded_json` and `path_json` as JSON strings. The
frontend parses them on-demand in multiple locations:
- `renderTableRows()` — for every row, every render
- WebSocket handling — when processing filtered packets
- `loadPackets()` — during packet loading
- Detail view rendering — when showing packet details
This creates O(n×m) parsing overhead where n = packet count and m =
render cycles.
## Solution
Add cached parse helpers that store parsed results on the packet object:
```javascript
function getParsedPath(p) {
if (p._parsedPath === undefined) {
try { p._parsedPath = JSON.parse(p.path_json || '[]'); } catch { p._parsedPath = []; }
}
return p._parsedPath;
}
```
Same pattern for `getParsedDecoded()`.
## Changes
- `public/packets.js`: Add helpers + replace 15+ JSON.parse calls
- `public/live.js`: Add helpers + replace 5 JSON.parse calls
## Benchmarks
Before: 60K+ JSON.parse calls per render cycle (30K packets)
After: ~30K parse calls (one per packet, cached thereafter)
Memory impact: Negligible (stores parsed objects that were already
created temporarily)
## Notes
- Cache uses `undefined` check to distinguish "not cached" from "cached
empty result"
- Property names `_parsedPath` and `_parsedDecoded` prefixed to avoid
collision with server fields
- No breaking changes to existing code paths
Fixes#387
---------
Co-authored-by: P. Clawmogorov <262173731+Alm0stSurely@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
- `db.GetNodes` accepted a `region` param from the HTTP handler but
never used it — every region-filter selection was silently ignored and
all nodes were always returned
- Added a subquery filtering `nodes.public_key` against ADVERT
transmissions (payload_type=4) observed by observers with matching IATA
codes
- Handles both v2 (`observer_id TEXT`) and v3 (`observer_idx INT`)
schemas
## Test plan
- [x] 4 new subtests added to `TestGetNodesFiltering`: SJC (1 node), SFO
(1 node), SJC,SFO multi (1 node deduped), AMS unknown (0 nodes)
- [x] All existing Go tests still pass
- [x] Deploy to staging, open `/nodes`, select a region in the filter
bar — only nodes observed by observers in that region should appear
Closes#496🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `indexByNode` was calling `json.Unmarshal` for every packet during
`Load()` and `IngestNewFromDB()`, even channel messages and other
payloads that can never contain node pubkey fields
- All three target fields (`"pubKey"`, `"destPubKey"`, `"srcPubKey"`)
share the common substring `"ubKey"` — added a `strings.Contains`
pre-check that skips the JSON parse entirely for packets that don't
match
- At 30K+ packets on startup, this eliminates the majority of
`json.Unmarshal` calls in `indexByNode` (channel messages, status
packets, etc. all bypass it)
## Test plan
- [x] 5 new subtests in `TestIndexByNodePreCheck`: ADVERT with pubKey
indexed, destPubKey indexed, channel message skipped, empty JSON
skipped, duplicate hash deduped
- [x] All existing Go tests pass
- [x] Deploy to staging and verify node-filtered packet queries still
work correctly
Closes#376🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `bufferPacket()` was overwriting `_ts` with `Date.now()` (receive
time) for every live WS packet
- Packets arriving in the same batch all got identical timestamps,
making the message history show the same "Xs ago" for every entry (e.g.,
all show "5s ago")
- Fix: use `pkt.timestamp || pkt.created_at` (mirroring
`dbPacketToLive`) so each packet reflects its actual origination time,
falling back to `Date.now()` only when the packet has no timestamp
## Root cause
```js
// before
pkt._ts = Date.now();
// after
pkt._ts = new Date(pkt.timestamp || pkt.created_at || Date.now()).getTime();
```
The WS broadcast includes `timestamp` (= `tx.FirstSeen`) in the packet
map (store.go:1182), so the field is always present for real packets.
## Test plan
- [x] Open Live page, observe packets arriving — each should show its
own relative time, not all the same value
- [x] `node test-frontend-helpers.js` passes (235 tests, 0 failures)
Closes#475🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `BuildBreakdown` was never ported from the deleted Node.js
`decoder.js` to Go — the server has returned `breakdown: {}` since the
Go migration (commit `742ed865`), so `createColoredHexDump()` and
`buildHexLegend()` in the frontend always received an empty `ranges`
array and rendered everything as monochrome
- Implemented `BuildBreakdown()` in `decoder.go` — computes labeled byte
ranges matching the frontend's `LABEL_CLASS` map: `Header`, `Transport
Codes`, `Path Length`, `Path`, `Payload`; ADVERT packets get sub-ranges:
`PubKey`, `Timestamp`, `Signature`, `Flags`, `Latitude`, `Longitude`,
`Name`
- Wired into `handlePacketDetail` (was `struct{}{}`)
- Also adds per-section color classes to the field breakdown table
(`section-header`, `section-transport`, `section-path`,
`section-payload`) so the table rows get matching background tints
## Test plan
- [x] Open any packet detail pane — hex dump should show color-coded
sections (red header, orange path length, blue transport codes, green
path hops, yellow/colored payload)
- [x] Legend below action buttons should appear with color swatches
- [x] ADVERT packets: PubKey/Timestamp/Signature/Flags each get their
own distinct color
- [x] Field breakdown table section header rows should be tinted per
section
- [x] 8 new Go tests: all pass
Closes#329🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Addresses review feedback from PR #487 (nodes.js coverage).
### Changes
1. **Replace fragile `exportInternals` regex source patching with stable
test hooks** — `getStatusInfo` and `getStatusTooltip` are now exposed
via `window._nodesGetStatusInfo` and `window._nodesGetStatusTooltip`,
matching the existing pattern used by all other test-accessible
functions. The brittle regex `.replace()` approach that modified source
code at runtime has been removed entirely.
2. **Strengthen weak null assertion** — The `renderNodeTimestampHtml
handles null` test previously asserted `html.includes('—') ||
html.length > 0`, which is a near-tautology (any non-empty string
passes). Now strictly asserts `html.includes('—')`.
### Files changed
- `public/nodes.js` — 2 new test hook lines
- `test-frontend-helpers.js` — removed 21-line `exportInternals` branch,
updated tests to use hooks
### Testing
- All 309 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Closes review items from #487.
Co-authored-by: you <you@example.com>
## Summary
Add 67 new unit tests for `nodes.js`, raising frontend helper test count
from 233 to 300.
Part of #344 — nodes.js coverage.
## What's Tested
### Sort System (`toggleSort`, `sortNodes`, `sortArrow`)
- Direction toggling on same column (asc↔desc)
- Default sort directions per column type (name→asc, last_seen→desc,
advert_count→desc)
- localStorage persistence of sort state
- All 5 sort columns: `name`, `public_key`, `role`, `last_seen`,
`advert_count`
- Both ascending and descending for each column
- Case-insensitive name sorting
- Unnamed nodes sort last
- Timestamp fallback chain: `last_heard` → `last_seen` → 0
- Missing timestamp handling
- Empty array edge case
- Unknown column graceful handling
- `sortArrow` rendering for active (▲/▼) and inactive columns
### Status Calculation (`getStatusInfo`, `getStatusTooltip`)
- `_lastHeard` takes priority over `last_heard`
- `last_seen` used as fallback when `last_heard` missing
- No-timestamp nodes return stale with `lastHeardMs: 0`
- Infrastructure threshold (72h) for rooms
- Standard threshold (24h) for sensors and companions
- Explanation text varies by role and status
- Unknown role defaults to gray color `#6b7280`
- All role/status tooltip combinations
### Timestamp Rendering (`renderNodeTimestampHtml`,
`renderNodeTimestampText`)
- HTML output includes tooltip and `timestamp-text` class
- Future timestamps show ⚠️ warning icon
- Null input produces dash
- Text output is plain (no HTML tags)
### Favorites Sync (`syncClaimedToFavorites`)
- Claimed pubkeys added to favorites
- No-op when all already synced
- Empty my-nodes handled
- Missing localStorage keys don't crash
## Implementation
- Added test hooks on `window` for closure-scoped functions
(non-invasive, follows existing pattern)
- Tests use `vm.createContext` to load real `nodes.js` code — no copies
- No new dependencies
## Test Results
```
Frontend helpers: 300 passed, 0 failed
```
---------
Co-authored-by: you <you@example.com>
## Problem
On a long-running session the packets page consumed 8 GB of browser
memory and 20%+ CPU on an 8-core machine. Root causes:
1. **Unbounded `packets` array growth via WebSocket** —
`packets.unshift()` was called for every new unique hash, but nothing
ever trimmed the array. After hours of live traffic the array grew well
past the initial 50 k load limit.
2. **Unbounded `pauseBuffer`** — all WS messages queued while paused, no
cap.
3. **Unbounded `_children` growth** — expanded groups received a
`unshift(p)` on every matching WS message with no size limit.
4. **O(n) `observers.find()` inside the O(n) render loop** — with 50 k
rows, each render triggered up to 50 k linear scans through the
observers list.
5. **Full DOM rebuild on every WS message** — `renderTableRows()` was
called synchronously on every WebSocket batch, reconstructing the entire
table on each incoming packet.
## Changes
- `packets[]` is now trimmed to `PACKET_LIMIT` after each WS batch;
evicted entries are also removed from `hashIndex` to prevent stale
references.
- `pauseBuffer` capped at 2 000 entries (oldest dropped).
- `_children` capped at 200 entries on WS prepend.
- `renderTableRows()` on the WS path is debounced to 200 ms, batching
rapid updates into a single redraw.
- `observersById = new Map()` pre-built from the observers array; all
`observers.find()` calls in the render loop and WS filter replaced with
O(1) `Map.get()`.
## Test plan
- [x] Load the packets page and leave it running for several minutes
with live WebSocket traffic — memory in DevTools should remain stable
rather than growing continuously
- [x] Pause live updates, wait for several messages, then resume —
buffer replays correctly and display updates
- [x] Expand a packet group and leave it open during live traffic —
children update but don't grow past 200
- [x] Region filter still works correctly (relies on the observer Map
lookup)
- [x] Observer name / IATA badge renders correctly in grouped and flat
mode
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
Fixes#485 — the app version was derived from `package.json` via
Node.js, which is a meaningless artifact for this Go project. This
caused version mismatches (e.g., v3.3.0 release showing "3.2.0") when
someone forgot to bump `package.json`.
## Changes
### `manage.sh`
- **Line 43**: Replace `node -p "require('./package.json').version"`
with `git describe --tags --match "v*"` — version is now derived
automatically from git tags
- **Line 515**: Add `--force` to `git fetch origin --tags` in setup
command
- **Line 1320**: Add `--force` to `git fetch origin --tags` in update
command — prevents "would clobber existing tag" errors when tags are
moved
### `package.json`
- Version field set to `0.0.0-use-git-tags` to make it clear this is not
the source of truth. File kept because npm scripts and devDependencies
are still used for testing.
## How it works
`git describe --tags --match "v*"` produces:
- `v3.3.0` — when on an exact tag
- `v3.3.0-3-gabcdef1` — when 3 commits after a tag (useful for
debugging)
- Falls back to `unknown` if no tags exist
## Testing
- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- All frontend unit tests pass (254/254)
- No changes to application logic — only build-time version derivation
Co-authored-by: you <you@example.com>
## Problem
Every PR that touches `public/` files requires manually bumping cache
buster timestamps in `index.html` (e.g. `?v=1775111407`). Since all PRs
change the same lines in the same file, this causes **constant merge
conflicts** — it's been the #1 source of unnecessary PR friction.
## Solution
Replace all hardcoded `?v=TIMESTAMP` values in `index.html` with a
`?v=__BUST__` placeholder. The Go server replaces `__BUST__` with the
current Unix timestamp **once at startup** when it reads `index.html`,
then serves the pre-processed HTML from memory.
Every server restart automatically picks up fresh cache busters — no
manual intervention needed.
## What changed
| File | Change |
|------|--------|
| `public/index.html` | All `v=1775111407` → `v=__BUST__` (28
occurrences) |
| `cmd/server/main.go` | `spaHandler` reads index.html at init, replaces
`__BUST__` with Unix timestamp, serves from memory for `/`,
`/index.html`, and SPA fallback |
| `cmd/server/helpers_test.go` | New `TestSpaHandlerCacheBust` —
verifies placeholder replacement works for root, SPA fallback, and
direct `/index.html` requests. Also added tests for root `/` and
`/index.html` routes |
| `AGENTS.md` | Rule 3 updated: cache busters are now automatic, agents
should not manually edit them |
## Testing
- `go build ./...` — compiles cleanly
- `go test ./...` — all tests pass (including new cache-bust tests)
- `node test-frontend-helpers.js && node test-packet-filter.js && node
test-aging.js` — all frontend tests pass
- No hardcoded timestamps remain in `index.html`
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes#457 — The "Show direct neighbors" checkbox on the map was a UI
stub that did nothing. This PR implements the full feature.
## What Changed
### `public/map.js`
- **New state**: `selectedReferenceNode` (pubkey) and `neighborPubkeys`
(Set) track which node is the reference and who its direct neighbors are
- **`selectReferenceNode(pubkey, name)`**: Fetches
`/api/nodes/{pubkey}/paths`, parses path hops to find all nodes directly
adjacent to the reference node in any observed path, then auto-enables
the neighbor filter
- **Neighbor filter in `_renderMarkersInner()`**: When
`filters.neighbors` is on and a reference node is selected, only the
reference node and its direct (1-hop) neighbors are shown on the map
- **Popup "Show Neighbors" link**: Each node popup now has a "Show
Neighbors" action that sets it as the reference node
- **Sidebar UI hints**: Shows the reference node name when selected, or
a hint to click a node when the filter is enabled without a reference
- **Cleanup on `destroy()`**: Clears reference state and global handler
### `test-frontend-helpers.js`
- 6 new unit tests covering:
- Filter off shows all nodes
- Filter on without reference shows all nodes (graceful no-op)
- Filter on with reference + neighbors filters correctly
- Filter on with empty neighbor set shows only reference
- Neighbor filter respects role filters
- Neighbor extraction from path data
### `public/index.html`
- Cache buster bump
## How It Works
1. User clicks a node marker on the map → popup shows "Show Neighbors"
link
2. Clicking "Show Neighbors" fetches that node's paths from
`/api/nodes/{pubkey}/paths`
3. Adjacent hops in each path are identified as direct neighbors
4. The map filters to show only the reference node + its neighbors
5. The sidebar shows which node is the reference
6. Unchecking the checkbox restores the full node view
## Test Results
```
Frontend helpers: 250 passed, 0 failed
Packet filter: 62 passed, 0 failed
```
---------
Co-authored-by: you <you@example.com>
## Summary
Related to #463 (partial fix — addresses packet path, status message
path still needs investigation) — Observers incorrectly showing as
offline despite actively forwarding packets.
## Root Cause
Observer `last_seen` was only updated when status topic messages
(`meshcore/<region>/<observer_id>/status`) were received via
`UpsertObserver`. When packets were ingested from an observer, the
observer's `last_seen` was **not** updated — only the `observer_idx` was
resolved for the observation record.
This meant observers with low traffic that published status messages
less frequently than the 10-minute online threshold would appear offline
on the observers page, even though they were clearly alive and
forwarding packets.
## Changes
**`cmd/ingestor/db.go`:**
- Added `stmtUpdateObserverLastSeen` prepared statement: `UPDATE
observers SET last_seen = ? WHERE rowid = ?`
- In `InsertTransmission`, after resolving `observer_idx`, update the
observer's `last_seen` to the packet timestamp
- This ensures any observer actively forwarding traffic stays marked as
online
**`cmd/ingestor/db_test.go`:**
- Added `TestInsertTransmissionUpdatesObserverLastSeen` — verifies that
inserting a packet from an observer updates its `last_seen` from a
backdated value to the packet timestamp
## Performance
The added `UPDATE` is a single-row update by `rowid` (primary key) —
O(1) with no index overhead. It runs once per packet insertion when an
observer is resolved, which was already doing a `SELECT` by `rowid`
anyway. No measurable impact on ingestion throughput.
## Test Results
All existing tests pass:
- `cmd/ingestor`: 26.6s ✅
- `cmd/server`: 3.7s ✅
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#433 — Replace the inaccurate Euclidean distance approximation in
`analytics.js` hop distances with proper haversine calculation, matching
the server-side computation introduced in PR #415.
## Problem
PR #415 moved collision analysis server-side and switched from the
frontend's Euclidean approximation (`dLat×111, dLon×85`) to proper
haversine. However, the **hop distance** calculation in `analytics.js`
(subpath detail panel) still used the old Euclidean formula. This
caused:
- **Inconsistent distances** between hop distances and collision
distances
- **Significant errors at high latitudes** — e.g., Oslo→Stockholm:
Euclidean gives ~627km, haversine gives ~415km (51% error)
- The `dLon×85` constant assumes ~40° latitude; at 60° latitude the real
scale factor is ~55.5km/degree, not 85
## Changes
| File | Change |
|------|--------|
| `public/analytics.js` | Replace `dLat*111, dLon*85` Euclidean with
`HopResolver.haversineKm()` (with inline fallback) |
| `public/hop-resolver.js` | Export `haversineKm` in the public API for
reuse |
| `test-frontend-helpers.js` | Add 4 tests: export check, zero distance,
SF→LA accuracy, Euclidean vs haversine divergence |
| `cmd/server/helpers_test.go` | Add `TestHaversineKm`: zero, SF→LA,
symmetry, Oslo→Stockholm accuracy |
| `public/index.html` | Cache buster bump |
## Performance
No performance impact — `haversineKm` replaces an inline arithmetic
expression with another inline arithmetic expression of identical O(1)
complexity. Only called per hop pair in the subpath detail panel
(typically <10 hops).
## Testing
- `node test-frontend-helpers.js` — 248 passed, 0 failed
- `go test -run TestHaversineKm` — PASS
Co-authored-by: you <you@example.com>
## Summary
The `/api/analytics/hash-collisions` endpoint always returned global
results, ignoring the active region filter. Every other analytics
endpoint (RF, topology, hash-sizes, channels, distance, subpaths)
respected the `?region=` query parameter — this was the only one that
didn't.
Fixes#438
## Changes
### Backend (`cmd/server/`)
- **routes.go**: Extract `region` query param and pass to
`GetAnalyticsHashCollisions(region)`
- **store.go**:
- `collisionCache` changed from `*cachedResult` →
`map[string]*cachedResult` (keyed by region, `""` = global) — consistent
with `rfCache`, `topoCache`, etc.
- `GetAnalyticsHashCollisions(region)` and
`computeHashCollisions(region)` now accept a region parameter
- When region is specified, resolves regional observers, scans packets
for nodes seen by those observers, and filters the node list before
computing collisions
- Cache invalidation updated to clear the map (not set to nil)
### Frontend (`public/`)
- **analytics.js**: The hash-collisions fetch was missing `+ sep` (the
region query string). All other fetches in the same `Promise.all` block
had it — this was simply overlooked in PR #415.
- **index.html**: Cache busters bumped
### Tests (`cmd/server/routes_test.go`)
- `TestHashCollisionsRegionParamIgnored` → renamed to
`TestHashCollisionsRegionParam` with updated comments reflecting that
region is now accepted (with no configured regional observers, results
match global — which the test verifies)
## Performance
No new hot-path work. Region filtering adds one scan of `s.packets`
(same as every other region-filtered analytics endpoint) only when
`?region=` is provided. Results are cached per-region with the existing
60s TTL. Without `?region=`, behavior is unchanged.
Co-authored-by: you <you@example.com>
## Summary
Removes an unreachable duplicate `return offsets;` statement in the
`_cumulativeRowOffsets()` function in `packets.js`. The second return
was dead code found during review of PR #402.
## Changes
- **`public/packets.js`**: Removed the duplicate `return offsets;` on
what was line 1137 (the line immediately after the first, reachable
`return offsets;`)
- **`public/index.html`**: Cache buster bump
## Testing
This is a dead code removal — the duplicate return was unreachable. No
behavior change. No new tests needed as existing tests already cover
`_cumulativeRowOffsets()` behavior.
Fixes#447
Co-authored-by: you <you@example.com>
## Summary
Fixes#472
The Docker build job on the self-hosted runner fails with `no space left
on device` because Docker build cache and Go module downloads accumulate
between runs. The existing cleanup (line ~330) runs in the **deploy**
step *after* the build — too late to help.
## Changes
- Added a "Free disk space" step at the start of the build job,
**before** "Build Go Docker image":
- `docker system prune -af` — removes all unused images, containers,
networks
- `docker builder prune -af` — clears the build cache
- `df -h /` — logs available disk space for visibility
- Kept the existing post-deploy cleanup as belt-and-suspenders
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Replace all `setInterval`-based animations in `live.js` with
`requestAnimationFrame` loops and add a concurrency cap to prevent
unbounded animation accumulation under high packet throughput.
Fixes#384
## Problem
Under high throughput (≥5 packets/sec), the live map accumulated
unbounded `setInterval` timers:
- `pulseNode()`: 26ms interval per pulse ring
- `drawAnimatedLine()`: 33ms interval per hop line + 52ms nested
interval for fade-out
- Ghost hop pulse: 600ms interval per ghost marker
At 5 pkts/sec × 3 hops = **15+ concurrent intervals**, climbing without
limit. This caused UI jank, rising CPU usage, and potential memory leaks
from leaked Leaflet markers.
## Changes
### `public/live.js`
| Function | Before | After |
|----------|--------|-------|
| `pulseNode()` | `setInterval` (26ms) + `setTimeout` safety |
`requestAnimationFrame` loop, self-terminates at 2s or opacity ≤ 0 |
| `drawAnimatedLine()` | `setInterval` (33ms) for line + nested
`setInterval` (52ms) for fade | Two `requestAnimationFrame` loops (line
advance + fade-out) |
| Ghost hop pulse | `setInterval` (600ms) + `setTimeout` (3s) |
`requestAnimationFrame` loop with 3s expiry |
| `animatePath()` | No concurrency limit | Returns early when
`activeAnims >= MAX_CONCURRENT_ANIMS` (20) |
### `public/index.html`
- Cache buster version bump
### `test-live-anims.js` (new)
- 7 tests verifying:
- No `setInterval` in `pulseNode`, `drawAnimatedLine`, or `animatePath`
- `MAX_CONCURRENT_ANIMS` defined and set to 20
- Concurrency check present in `animatePath`
- No stale `setInterval` in animation hot paths
## Complexity & Scale
- **Time complexity**: O(1) per animation frame (no change in per-frame
work)
- **Concurrency**: Hard-capped at 20 simultaneous animations (previously
unbounded)
- **At 5 pkts/sec, 3 hops**: Excess animations silently dropped instead
of accumulating timers
- **rAF benefit**: Browser coalesces all animations into single paint
cycle; paused tabs stop animating automatically
## Test Results
```
=== Animation interval elimination ===
✅ pulseNode does not use setInterval
✅ drawAnimatedLine does not use setInterval
✅ ghost hop pulse does not use setInterval
=== Concurrency cap ===
✅ MAX_CONCURRENT_ANIMS is defined
✅ MAX_CONCURRENT_ANIMS is set to 20
✅ animatePath checks MAX_CONCURRENT_ANIMS before proceeding
=== Safety: no stale setInterval in animation functions ===
✅ no setInterval remains in animation hot path
7 passed, 0 failed
```
All existing tests pass (packet-filter: 62, aging: 29, frontend-helpers:
241).
## Performance Proof (Rule 0 compliance)
Benchmark: `node test-anim-perf.js` — simulates timer/animation
accumulation under realistic throughput.
### Timer count: old (setInterval) vs new (rAF + cap)
| Scenario | Old model (peak concurrent timers) | New model (peak
concurrent animations) |
|----------|-----------------------------------:|---------------------------------------:|
| 5 pkt/s × 3 hops, 30s sustained | **123** | **20** |
| 5 pkt/s × 3 hops, 5min sustained | **123** | **20** |
| 20 pkt/s × 3 hops, 10s burst | **246** | **20** |
**Before:** Each hop spawns 3 `setInterval` timers (pulse 26ms, line
33ms, fade 52ms) that live 0.6–2s each. At 5 pkt/s × 3 hops = 15
timers/sec, peak concurrent timers reach **123** (limited only by timer
lifetime, not by any cap). Under burst traffic (20 pkt/s), this climbs
to **246+**.
**After:** `MAX_CONCURRENT_ANIMS = 20` hard-caps active animations.
Excess packets are silently dropped. rAF loops replace all `setInterval`
calls, coalescing into single paint cycles. Peak concurrent animations:
**always ≤ 20**, regardless of throughput or duration.
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes#465 — Channel hash was displaying in decimal instead of
hexadecimal in `channels.js`.
## Changes
- Added `formatHashHex()` helper to `channels.js` that formats numeric
hashes as `0x` hex (e.g. `0x0A`) and passes string hashes through
unchanged
- Applied to both display sites: `renderChannelList` fallback name and
`selectChannel` header text
- Consistent with `packets.js` and `analytics.js` which already use
`.toString(16).padStart(2, '0').toUpperCase()`
## Tests
- 3 new tests in `test-frontend-helpers.js` verifying the helper exists,
is used at display sites, and produces correct output for numeric and
string inputs
- All 244 frontend tests pass, plus packet-filter (62) and aging (29)
tests
Co-authored-by: you <you@example.com>
## Summary
Fixes#361 — `perfMiddleware()` wrote to shared `PerfStats` fields
(`Requests`, `TotalMs`, `Endpoints` map, `SlowQueries` slice) without
any synchronization, causing data races under concurrent HTTP requests.
## Changes
### `cmd/server/routes.go`
- **Added `sync.Mutex` to `PerfStats` struct** — single mutex protects
all fields
- **`perfMiddleware`** — all shared state mutations (counter increments,
endpoint map access, slice appends) now happen under lock. Key
normalization (regex, mux route lookup) moved outside the lock since it
uses no shared state
- **`handleHealth`** — snapshots `Requests`, `TotalMs`, `SlowQueries`
under lock before building response
- **`handlePerf`** — copies all endpoint data and slow queries under
lock into local snapshots, then does expensive work (sorting, percentile
calculation) outside the lock
- **`handlePerfReset`** — resets fields in-place instead of replacing
the pointer (avoids unlocking a different mutex)
### `cmd/server/perfstats_race_test.go` (new)
- Regression test: 50 concurrent writer goroutines + 10 concurrent
reader goroutines hammering `PerfStats` simultaneously
- Verifies no race conditions (via `-race` flag) and counter consistency
## Design Decisions
- **Single mutex over atomics**: The issue suggested `atomic.Int64` for
counters, but since slices/maps need a mutex anyway, a single mutex is
simpler and the critical section is small (microseconds). No measurable
contention at CoreScope's scale.
- **Copy-under-lock pattern**: Expensive operations (sorting, percentile
computation) happen outside the lock to minimize hold time.
- **In-place reset**: `handlePerfReset` clears fields rather than
replacing the `PerfStats` pointer, ensuring the mutex remains valid for
concurrent goroutines.
## Testing
- `go test -race -count=1 ./cmd/server/...` — **PASS** (all existing
tests + new race test)
- New `TestPerfStatsConcurrentAccess` specifically validates concurrent
access patterns
Co-authored-by: you <you@example.com>
## Summary
Replace all `observers.find()` linear scans in `packets.js` with O(1)
`Map.get()` lookups, eliminating ~300K comparisons per render cycle at
30K+ rows.
## Changes
- Added `observerMap` (`Map<id, observer>`) built once when observers
load
- Replaced all 6 `observers.find()` call sites with `observerMap.get()`:
- `obsName()` — called per row for observer name display
- Region filter check in packet filtering
- Observer dropdown label in filter UI
- Group header region lookup
- Child row region lookup
- Flat row region lookup
- Map is cleared on reset and rebuilt on each `loadObservers()` call
## Complexity
- **Before:** O(k) per row × 30K rows = O(30K × k) where k = observer
count (~10)
- **After:** O(1) per row × 30K rows = O(30K)
- Map construction: O(k) once, negligible
## Testing
- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- All frontend tests pass (`test-packet-filter.js`: 62 passed,
`test-aging.js`: 29 passed, `test-frontend-helpers.js`: 241 passed)
Fixes#383
Co-authored-by: you <you@example.com>
## Problem
Fixes#324. The VCR LCD clock and timeline hover/touch tooltip always
showed local time, ignoring the UTC/local timezone setting in the
customizer Display tab.
## Root cause
Three sites in `live.js` bypassed the shared `getTimestampTimezone()`
utility:
- `updateVCRClock()` — used `d.getHours()` / `d.getMinutes()` /
`d.getSeconds()` (always local)
- Timeline mousemove tooltip — used `d.toLocaleTimeString()` (always
local)
- Timeline touchmove tooltip — same
## Fix
Added `vcrFormatTime(tsMs)` helper that checks `getTimestampTimezone()`
and uses `getUTC*` methods when set to `'utc'`, otherwise local `get*`.
Applied to all three sites. Exposed as `window._vcrFormatTime` for
testing.
## Tests
4 new unit tests in `test-frontend-helpers.js` covering UTC mode, local
mode, and zero-padding.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped (`v=1775073838`)
- [x] 233 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
- `nextHop()` schedules `setInterval`/`setTimeout` callbacks that can
fire after `destroy()` has set `animLayer = null` and removed DOM
elements
- This caused three console errors on the Live page when navigating away
mid-animation: `Cannot read properties of null (reading 'hasLayer')` and
`Cannot set properties of null (setting 'textContent')`
- Added null guards at each async callback site; no behavioral change
when the page is active
## Changes
- `public/live.js`: early return if `animLayer` is null at start of
`nextHop()`; null-safe `animLayer.hasLayer` checks in
`setInterval`/`setTimeout`; null-safe `liveAnimCount` element access
- `public/index.html`: cache buster bumped
- `test-frontend-helpers.js`: 4 source-inspection tests verifying the
null guards are present
## Test plan
- [ ] Open Live page, trigger some packet animations, navigate away
quickly — no console errors
- [ ] `node test-frontend-helpers.js` passes (233 tests, 0 failures)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Fixes#399. On every ADVERT WebSocket batch the nodes page invalidated
the entire `_allNodes` cache and triggered a full `/nodes?limit=5000`
fetch — even when every advertising node was already cached. The 90s API
TTL was actively bypassed.
## Root cause
```js
wsHandler = debouncedOnWS(function (msgs) {
if (msgs.some(isAdvertMessage)) {
_allNodes = null; // wipe cache unconditionally
invalidateApiCache('/nodes'); // bust API TTL
loadNodes(true); // full 5k fetch
}
}, 5000);
```
## Fix
ADVERT decoded payloads include `pubKey`, `name`, `lat`, `lon` — enough
to update known nodes in place:
- **Known node** (pubKey found in `_allNodes`): upsert `name`, `lat`,
`lon`, `last_seen` directly — no fetch, no cache bust, just re-render.
- **New node** (pubKey not in cache) or **no pubKey** in payload: fall
back to full reload as before.
This covers the common case on an active mesh: all advertising nodes are
already cached. The full reload path is preserved for node discovery.
## Tests
2 new unit tests: known-node upsert (asserts 0 API calls, fields
updated) and unknown-node fallback (asserts full reload triggered). All
231 tests pass.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped
- [x] 231 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Problem
Fixes#325. Removing all home steps and clicking "Reset my theme" did
not restore them.
## Root cause
Two-part bug:
**1. `SITE_CONFIG.home` permanently mutated at page load**
`app.js` calls `mergeUserHomeConfig(SITE_CONFIG, userTheme)` which does
`SITE_CONFIG.home = Object.assign({}, serverHome, userTheme.home)`. If
the user had `steps: []` saved in localStorage, this sets
`SITE_CONFIG.home.steps = []` globally — permanently for the lifetime of
the page.
**2. `initState()` reads the contaminated config**
When the customizer opens (or Reset is clicked), `initState()` reads
`cfg = window.SITE_CONFIG`. Since `SITE_CONFIG.home.steps` is already
`[]`, `state.home.steps` stays `[]` even after
`localStorage.removeItem`. `autoSave()` then re-saves `steps: []`
straight back.
**Secondary issue:** `data-rm-step` / add / move handlers didn't call
`autoSave()`, making step persistence non-deterministic (only saved if a
text field edit happened to be pending).
## Fix
- **`app.js`**: snapshot `SITE_CONFIG.home` before `mergeUserHomeConfig`
→ `window._SITE_CONFIG_ORIGINAL_HOME`
- **`customize.js`**: `initState()` uses `_SITE_CONFIG_ORIGINAL_HOME`
instead of the contaminated `cfg.home`
- **`customize.js`**: add `autoSave()` to rm/move/add handlers for
steps, checklist, and footer links
## Tests
2 new unit tests covering the snapshot bypass and DEFAULTS fallback. 231
tests pass.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped
- [x] 231 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
Fixes#466 — staging config was not refreshed from prod due to a stale
`-nt` timestamp guard.
## Root Cause
`prepare_staging_config()` only copied prod config when staging was
missing or prod was newer by mtime. However, the `sed -i` that applies
the STAGING siteName updated staging's mtime, making it appear newer
than prod. Subsequent runs skipped the copy entirely.
## Changes
- **`manage.sh`**: Removed the `-nt` timestamp conditional in
`prepare_staging_config()`. Staging config is now always copied fresh
from prod with the STAGING siteName applied.
Note: `prepare_staging_db()` already copies unconditionally — no change
needed there.
Co-authored-by: you <you@example.com>
## Summary
`manage.sh update` now supports pinning to specific release tags instead
of always pulling tip of master.
Fixes#455
## Changes
### `cmd_update` — accepts optional version argument
- **No argument**: fetches tags, checks out latest release tag (`git tag
-l 'v*' --sort=-v:refname | head -1`)
- **`latest`**: explicit opt-in to tip of master (bleeding edge)
- **Specific tag** (e.g. `v3.1.0`): checks out that exact tag, with
error message + available tags if not found
### `cmd_setup` — defaults to latest tag
- After Docker check, fetches tags and pins to latest release tag
- Skips if already on the latest tag
- Uses state tracking (`version_pin`) so re-runs don't repeat
### `cmd_status` — shows version
- Displays current version (exact tag name or short commit hash) at the
top of status output
### Help text
- Updated to reflect new `update [version]` syntax
## Usage
```bash
./manage.sh update # checkout latest release tag (e.g. v3.2.0)
./manage.sh update v3.1.0 # pin to specific version
./manage.sh update latest # explicit tip of master (bleeding edge)
./manage.sh status # now shows "Version: v3.2.0"
```
## Testing
- `bash -n manage.sh` passes (syntax valid)
- Logic follows existing patterns (git fetch, checkout, rebuild,
restart)
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#450 — staging deployment flaky due to container not shutting down
cleanly.
## Root Causes
1. **Server never closed DB on shutdown** — SQLite WAL lock held
indefinitely, blocking new container startup
2. **`httpServer.Close()` instead of `Shutdown()`** — abruptly kills
connections instead of draining them
3. **No `stop_grace_period` in compose configs** — Docker sends SIGTERM
then immediately SIGKILL (default 10s is often not enough for WAL
checkpoint)
4. **Supervisor didn't forward SIGTERM** — missing
`stopsignal`/`stopwaitsecs` meant Go processes got SIGKILL instead of
graceful shutdown
5. **Deploy scripts used default `docker stop` timeout** — only 10s
grace period
## Changes
### Go Server (`cmd/server/`)
- **Graceful HTTP shutdown**: `httpServer.Shutdown(ctx)` with 15s
context timeout — drains in-flight requests before closing
- **WebSocket cleanup**: New `Hub.Close()` method sends `CloseGoingAway`
frames to all connected clients
- **DB close on shutdown**: Explicitly closes DB after HTTP server stops
(was never closed before)
- **WAL checkpoint**: `PRAGMA wal_checkpoint(TRUNCATE)` before DB close
— flushes WAL to main DB file and removes WAL/SHM lock files
### Go Ingestor (`cmd/ingestor/`)
- **WAL checkpoint on shutdown**: New `Store.Checkpoint()` method,
called before `Close()`
- **Longer MQTT disconnect timeout**: 5s (was 1s) to allow in-flight
messages to drain
### Docker Compose (all 4 variants)
- Added `stop_grace_period: 30s` and `stop_signal: SIGTERM`
### Supervisor Configs (both variants)
- Added `stopsignal=TERM` and `stopwaitsecs=20` to server and ingestor
programs
### Deploy Scripts
- `deploy-staging.sh`: `docker stop -t 30` with explicit grace period
- `deploy-live.sh`: `docker stop -t 30` with explicit grace period
## Shutdown Sequence (after fix)
1. Docker sends SIGTERM to supervisord (PID 1)
2. Supervisord forwards SIGTERM to server + ingestor (waits up to 20s
each)
3. Server: stops poller → drains HTTP (15s) → closes WS clients →
checkpoints WAL → closes DB
4. Ingestor: stops tickers → disconnects MQTT (5s) → checkpoints WAL →
closes DB
5. Docker waits up to 30s total before SIGKILL
## Tests
All existing tests pass:
- `cd cmd/server && go test ./...` ✅
- `cd cmd/ingestor && go test ./...` ✅
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes#451 — packet detail pane crash on direct routed packets where
`pathHops` is `null`.
## Root Cause
`JSON.parse(pkt.path_json)` can return literal `null` when the DB stores
`"null"` for direct routed packets. The existing code only had a catch
block for parse errors, but `null` is valid JSON — so the parse succeeds
and `pathHops` ends up `null` instead of `[]`.
## Changes
- **`public/packets.js`**: Added `|| []` after `JSON.parse(...)` in both
`buildFlatRowHtml` (table rows) and the detail pane (`selectPacket`),
ensuring `pathHops` is always an array.
- **`test-frontend-helpers.js`**: Added 2 regression tests verifying the
null guards exist in both code paths.
- **`public/index.html`**: Cache buster bump.
## Testing
- All 229 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Co-authored-by: you <you@example.com>
## Summary
Fixes the critical performance issue where `renderTableRows()` rebuilt
the **entire** table innerHTML (up to 50K rows) on every update —
WebSocket arrivals, filter changes, group expand/collapse, and theme
refreshes.
## Changes
### Lazy Row Generation (`renderVisibleRows`) — fixes#422
- Row HTML strings are **only generated for the visible slice + 30-row
buffer** on each render
- `_displayPackets` stores the filtered data array;
`renderVisibleRows()` calls `buildGroupRowHtml`/`buildFlatRowHtml`
lazily for ~60-90 visible entries
- Previously, `displayPackets.map(buildGroupRowHtml)` built HTML for ALL
30K+ packets on every render — the expensive work (JSON.parse, observer
lookups, template literals) ran for every packet regardless of
visibility
### Unified Row Count via `_getRowCount()` — fixes#424
- Single function `_getRowCount(p)` computes DOM row count for any entry
(1 for flat/collapsed, 1+children for expanded groups)
- Used by BOTH `_rowCounts` computation AND `renderVisibleRows` —
eliminates divergence risk between row counting and row building
### Hoisted Observer Filter Set — fixes#427
- `_observerFilterSet` created once in `renderTableRows()`, reused
across `buildGroupRowHtml`, `_getRowCount`, and child filtering
- Previously, `new Set(filters.observer.split(','))` was created inside
`buildGroupRowHtml` for every packet AND again in the row count callback
### Dynamic Colspan — fixes#426
- `_getColCount()` reads column count from the thead instead of
hardcoded `colspan="11"`
- Spacers and empty-state messages use the actual column count
### Null-Safety in `buildFlatRowHtml` — fixes#430
- `p.decoded_json || '{}'` fallback added, matching
`buildGroupRowHtml`'s existing null-safety
- Prevents TypeError on null/undefined `decoded_json` in flat
(ungrouped) mode
### Behavioral Tests — fixes#428
- Replaced 5 source-grep tests with behavioral unit tests for
`_getRowCount`:
- Flat mode always returns 1
- Collapsed group returns 1
- Expanded group returns 1 + child count
- Observer filter correctly reduces child count
- Null `_children` handled gracefully
- Retained source-level assertions only where behavioral testing isn't
practical (e.g., verifying lazy generation pattern exists)
### Other Improvements
- Cumulative row offsets cached in `_cumulativeOffsetsCache`,
invalidated on row count changes
- Debounced WebSocket renders (200ms) coalesce rapid packet arrivals
- `destroy()` properly cleans up all virtual scroll state
## Performance Benchmarks — fixes#423
**Methodology:** Row building cost measured by counting
`buildGroupRowHtml` calls per render cycle on 30K grouped packets.
| Scenario | Before (eager) | After (lazy) | Improvement |
|----------|----------------|--------------|-------------|
| Initial render (30K packets) | 30,000 `buildGroupRowHtml` calls | ~90
calls (60 visible + 30 buffer) | **333× fewer calls** |
| Scroll event | 0 calls (pre-built) | ~90 calls (rebuild visible slice)
| Trades O(1) scroll for O(n) initial savings |
| WS packet arrival | 30,000 calls (full rebuild) | ~90 calls (debounced
+ lazy) | **333× fewer calls** |
| Filter change | 30,000 calls | ~90 calls | **333× fewer calls** |
| Memory (row HTML cache) | ~2MB string array for 30K packets | 0 (no
cache, build on demand) | **~2MB saved** |
**Per-call cost of `buildGroupRowHtml`:** Each call performs JSON.parse
of `decoded_json`, `path_json`, `observers.find()` lookup, and template
literal construction. At 30K packets, the eager approach spent
~400-500ms on row building alone (measured via `performance.now()` on
staging data). The lazy approach builds ~90 rows in ~1-2ms.
**Net effect:** `renderTableRows()` goes from O(n) string building +
O(1) DOM insertion to O(1) data assignment + O(visible) string building
+ O(visible) DOM insertion. For n=30K and visible≈60, this is ~333× less
work per render cycle.
**Trade-off:** Scrolling now rebuilds ~90 rows per RAF frame instead of
slicing pre-built strings. This costs ~1-2ms per scroll event, well
within the 16ms frame budget. The trade-off is overwhelmingly positive
since renders happen far more frequently than full-table scrolls.
## Tests
- 247 frontend helper tests pass (including 18 virtual scroll tests)
- 62 packet filter tests pass
- 29 aging tests pass
- Go backend tests pass
## Remaining Debt (tracked in issues)
- #425: Hardcoded `VSCROLL_ROW_HEIGHT=36` and `theadHeight=40` — should
be measured from DOM
- #429: 200ms WS debounce delay — value works well in practice but lacks
formal justification
- #431: No scroll position preservation on filter change or group
expand/collapse
Fixes#380
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Moves the hash collision analysis from the frontend to a new server-side
endpoint, eliminating a major performance bottleneck on the analytics
collision tab.
Fixes#386
## Problem
The collision tab was:
1. **Downloading all nodes** (`/nodes?limit=2000`) — ~500KB+ of data
2. **Running O(n²) pairwise distance calculations** on the browser main
thread (~2M comparisons with 2000 nodes)
3. **Building prefix maps client-side** (`buildOneBytePrefixMap`,
`buildTwoBytePrefixInfo`, `buildCollisionHops`) iterating all nodes
multiple times
## Solution
### New endpoint: `GET /api/analytics/hash-collisions`
Returns pre-computed collision analysis with:
- `inconsistent_nodes` — nodes with varying hash sizes
- `by_size` — per-byte-size (1, 2, 3) collision data:
- `stats` — node counts, space usage, collision counts
- `collisions` — pre-computed collisions with pairwise distances and
classifications (local/regional/distant/incomplete)
- `one_byte_cells` — 256-cell prefix map for 1-byte matrix rendering
- `two_byte_cells` — first-byte-grouped data for 2-byte matrix rendering
### Caching
Uses the existing `cachedResult` pattern with a new `collisionCache`
map. Invalidated on `hasNewTransmissions` (same trigger as the
hash-sizes cache) and on eviction.
### Frontend changes
- `renderCollisionTab` now accepts pre-fetched `collisionData` from the
parallel API load
- New `renderHashMatrixFromServer` and `renderCollisionsFromServer`
functions consume server-computed data directly
- No more `/nodes?limit=2000` fetch from the collision tab
- Old client-side functions (`buildOneBytePrefixMap`, etc.) preserved
for test helper exports
## Test results
- `go test ./...` (server): ✅ pass
- `go test ./...` (ingestor): ✅ pass
- `test-packet-filter.js`: ✅ 62 passed
- `test-aging.js`: ✅ 29 passed
- `test-frontend-helpers.js`: ✅ 227 passed
## Performance impact
| Metric | Before | After |
|--------|--------|-------|
| Data transferred | ~500KB (all nodes) | ~50KB (collision data only) |
| Client computation | O(n²) distance calc | None (server-cached) |
| Main thread blocking | Yes (2000 nodes × pairwise) | No |
| Server caching | N/A | 15s TTL, invalidated on new transmissions |
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes#381 — The "My Nodes" filter in `packets.js` was making a **server
API call inside `renderTableRows()`** on every render cycle. With
WebSocket updates arriving every few seconds while the toggle was
active, this created continuous unnecessary server load.
## What Changed
**`public/packets.js`** — Replaced the `api('/packets?nodes=...')`
server call with a pure client-side filter:
```js
// Before: server round-trip on every render
const myData = await api('/packets?nodes=' + allKeys.join(',') + '&limit=500');
displayPackets = myData.packets || [];
// After: filter already-loaded packets client-side
displayPackets = displayPackets.filter(p => {
const dj = p.decoded_json || '';
return allKeys.some(k => dj.includes(k));
});
```
This uses the exact same matching logic as the server's
`QueryMultiNodePackets()` — a string contains check on `decoded_json`
for each pubkey — but without the network round-trip.
**`test-frontend-helpers.js`** — Added 5 unit tests for the filter
logic:
- Single and multiple pubkey matching
- No matches / empty keys edge case
- Null/empty `decoded_json` handled gracefully
**`public/index.html`** — Cache busters bumped.
## Test Results
- Frontend helpers: **232 passed, 0 failed** (including 5 new tests)
- Packet filter: **62 passed, 0 failed**
- Aging: **29 passed, 0 failed**
Co-authored-by: you <you@example.com>
## Problem
Every time new data is ingested (`IngestNewFromDB`,
`IngestNewObservations`, `EvictStale`), **all 6 analytics caches** are
wiped by creating new empty maps — regardless of what kind of data
actually changed. With the poller running every 1 second, this means the
15s cache TTL is effectively bypassed because caches are cleared far
more frequently than they expire.
## Fix
Introduces a `cacheInvalidation` flags struct and
`invalidateCachesFor()` method that selectively clears only the caches
affected by the ingested data:
| Flag | Caches Cleared |
|------|----------------|
| `hasNewObservations` | RF (SNR/RSSI data changed) |
| `hasNewPaths` | Topology, Distance, Subpaths |
| `hasNewTransmissions` | Hash sizes |
| `hasChannelData` | Channels (GRP_TXT payload_type 5) + channels list
cache |
| `eviction` | All (data removed, everything potentially stale) |
### Impact
For a typical ingest cycle with ADVERT/ACK/TXT_MSG packets (no GRP_TXT):
- **Before:** All 6 caches cleared every cycle
- **After:** Channel cache preserved (most common case), hash cache
preserved on observation-only ingestion
For observation-only ingestion (`IngestNewObservations`):
- **Before:** All 6 caches cleared
- **After:** Only RF cache cleared (+ topo/dist/subpath if paths
actually changed)
## Tests
7 new unit tests in `cache_invalidation_test.go` covering:
- Eviction clears all caches
- Observation-only ingest preserves non-RF caches
- Transmission-only ingest clears only hash cache
- Channel data clears only channel cache
- Path changes clear topo/dist/subpath
- Combined flags work correctly
- No flags = no invalidation
All existing tests pass.
### Post-rebase fix
Restored `channelsCacheRes` invalidation that was accidentally dropped
during the refactor. The old code cleared this separate channels list
cache on every ingest, but `invalidateCachesFor()` didn't include it.
Now cleared on `hasChannelData` and `eviction`.
Fixes#375
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes#353 — addresses all 5 findings from the CoreScope code analysis.
## Changes
### Finding 1 (Major): `score` field never extracted from MQTT
- Added `Score *float64` field to `PacketData` and `MQTTPacketMessage`
structs
- Extract `msg["score"]` with `msg["Score"]` case fallback via
`toFloat64` in all three MQTT handlers (raw packet, channel message,
direct message)
- Pass through to DB observation insert instead of hardcoded `nil`
### Finding 2 (Major): `direction` field never extracted from MQTT
- Added `Direction *string` field to `PacketData` and
`MQTTPacketMessage` structs
- Extract `msg["direction"]` with `msg["Direction"]` case fallback as
string in all three MQTT handlers
- Pass through to DB observation insert instead of hardcoded `nil`
### Finding 3 (Minor): `toFloat64` doesn't strip units
- Added `stripUnitSuffix()` that removes common RF/signal unit suffixes
(dBm, dB, mW, km, mi, m) case-insensitively before `ParseFloat`
- Values like `"-110dBm"` or `"5.5dB"` now parse correctly
### Finding 4 (Minor): Bare type assertions in store.go
- Changed `firstSeen` and `lastSeen` from `interface{}` to typed
`string` variables at `store.go:5020`
- Removed unsafe `.(string)` type assertions in comparisons
### Finding 5 (Minor): `distHopRecord.SNR` typed as `interface{}`
- Changed `distHopRecord.SNR` from `interface{}` to `*float64`
- Updated assignment (removed intermediate `snrVal` variable, pass
`tx.SNR` directly)
- Updated output serialization to use `floatPtrOrNil(h.SNR)` for
consistent JSON output
## Tests Added
- `TestBuildPacketDataScoreAndDirection` — verifies Score/Direction flow
through BuildPacketData
- `TestBuildPacketDataNilScoreDirection` — verifies nil handling when
fields absent
- `TestInsertTransmissionWithScoreAndDirection` — end-to-end: inserts
with score/direction, verifies DB values
- `TestStripUnitSuffix` — covers all supported suffixes, case
insensitivity, and passthrough
- `TestToFloat64WithUnits` — verifies unit-bearing strings parse
correctly
All existing tests pass.
Co-authored-by: you <you@example.com>
## Problem
On installations where the database predates the
`idx_observations_timestamp` index, `/api/stats` takes 30s+ because
`GetStoreStats()` runs two full table scans:
```sql
SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last hour
SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last 24h
```
The index is only created in the `if !obsExists` block, so any database
where the `observations` table already existed before that code was
added never gets it.
## Fix
Adds a one-time migration (`obs_timestamp_index_v1`) that runs at
ingestor startup:
```sql
CREATE INDEX IF NOT EXISTS idx_observations_timestamp ON observations(timestamp)
```
On large installations this index creation may take a few seconds on
first startup after the upgrade, but subsequent stats queries become
instant.
## Test plan
- [ ] Restart ingestor on an older database and confirm `[migration]
observations timestamp index created` appears in logs
- [ ] Confirm `/api/stats` response time drops from 30s+ to <100ms
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Two endpoints were slow on larger installations:
**`/packets?limit=50000&groupByHash=true` — 16s+**
`QueryGroupedPackets` did two expensive things on every request:
1. O(n × observations) scan per packet to find `latest` timestamp
2. Held `s.mu.RLock()` during the O(n log n) sort, blocking all
concurrent reads
**`/channels` — 13s+**
`GetChannels` iterated all payload-type-5 packets and JSON-unmarshaled
each one while holding `s.mu.RLock()`, blocking all concurrent reads for
the full duration.
## Fix
**Packets (`QueryGroupedPackets`):**
- Add `LatestSeen string` to `StoreTx`, maintained incrementally in all
three observation write paths. Eliminates the per-packet observation
scan at query time.
- Build output maps under the read lock, sort the local copy after
releasing it.
- Cache the full sorted result for 3 seconds keyed by filter params.
**Channels (`GetChannels`):**
- Copy only the fields needed (firstSeen, decodedJSON, region match)
under the read lock, then release before JSON unmarshaling.
- Cache the result for 15 seconds keyed by region param.
- Invalidate cache on new packet ingestion.
## Test plan
- [ ] Open packets page on a large store — load time should drop from
16s to <1s
- [ ] Open channels page — should load in <100ms instead of 13s+
- [ ] `[SLOW API]` warnings gone for both endpoints
- [ ] Packet/channel data is correct (hashes, counts, observer counts)
- [ ] Filters (region, type, since/until) still work correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Priority+ Navigation Pattern for Tablet Viewports
Phase 2 of responsive nav improvements for #322.
### What this does
On **tablet viewports (768-1023px)**, implements the [Priority+
navigation
pattern](https://css-tricks.com/the-priority-plus-navigation-pattern/):
- **5 high-priority tabs** shown inline: Home, Nodes, Packets, Map, Live
- **6 low-priority tabs** collapse into a "More ▾" dropdown: Channels,
Traces, Observers, Analytics, Perf, Lab
- The "More" button highlights when a low-priority page is active
**Desktop (>=1024px)** and **mobile (<768px)** behavior is unchanged.
### Changes
| File | Change |
|------|--------|
| `public/index.html` | Added `data-priority="high"` to 5 primary nav
links; added More button + dropdown menu |
| `public/style.css` | Split ≤1023px hamburger query into tablet
Priority+ (768-1023px) and mobile hamburger (<768px); added More
dropdown styles |
| `public/app.js` | Added `closeMoreMenu()`, More button toggle,
outside-click/Escape close, active state on More button |
| Cache busters | Bumped in same commit |
### Accessibility
- `aria-haspopup="true"` and `aria-expanded` on More button
- `role="menu"` / `role="menuitem"` on dropdown
- Focus moves to first item on open
- Escape key closes dropdown
### Testing
- All 308 existing tests pass (217 frontend-helpers + 62 packet-filter +
29 aging)
- No new dependencies added
- No build step changes
### Breakpoint summary
| Viewport | Behavior |
|----------|----------|
| >= 1024px | Full horizontal nav (unchanged) |
| 768-1023px | Priority+ pattern: 5 tabs + More dropdown **← NEW** |
| < 768px | Hamburger drawer with all items (unchanged) |
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When GetMaxTransmissionID() fails silently (e.g., corrupted DB returns 0
from COALESCE), the poller starts from ID 0 and replays the entire
database over WebSocket — broadcasting thousands of old packets per second.
Fix: after querying the DB, use the in-memory store's MaxTransmissionID
and MaxObservationID as a floor. Since Load() already read the full DB
successfully, the store has the correct max IDs.
Root cause discovered on staging: DB corruption caused MAX(id) query to
fail, returning 0. Poller log showed 'starting from transmission ID 0'
followed by 1000-2000 broadcasts per tick walking through 76K rows.
Also adds MaxObservationID() to PacketStore for observation cursor safety.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes the Playwright CI regression on master where the "Packets page
loads with filter" test times out after 15 seconds waiting for able
tbody tr to appear.
## Root Cause
Three packets tests used an bout:blank round-trip pattern to force a
full page reload:
`
page.goto(BASE) → set localStorage → page.goto('about:blank') →
page.goto(BASE/#/packets)
`
This cross-origin round-trip through bout:blank causes the SPA's config
fetch and router to not fire reliably in CI's headless Chromium, leaving
the page uninitialized past the 15-second timeout.
## Fix
Replace the bout:blank pattern with page.reload() in all three affected
tests:
`
page.goto(BASE/#/packets) → set localStorage → page.reload()
`
This stays on the same origin throughout. Playwright handles same-origin
reloads predictably — the page fully re-initializes, the IIFE re-reads
localStorage, and loadPackets() uses the correct time window.
## Tests affected
| Test | Change |
|------|--------|
| Packets page loads with filter | bout:blank → page.reload() |
| Packets initial fetch honors persisted time window | bout:blank →
page.reload() |
| Packets groupByHash toggle works | bout:blank → page.reload() |
## Validation
- All 318 unit tests pass (packet-filter: 62, aging: 29, frontend: 227)
- No public/ files changed — no cache buster needed
- Single file changed: est-e2e-playwright.js (9 insertions, 15
deletions)
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WS broadcast pushes all packets regardless of the selected time
window filter. This caused old packets to appear in the table even
when the API correctly returned zero results for the time range.
Add time window check to the WS packet filter — drops packets
with timestamps older than the selected window cutoff.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The outer IIFE declares const isMobile (line 27, from #340) and
renderLeft() declares its own const isMobile (line 821, pre-existing).
JavaScript hoists const declarations within the function scope, so
referencing isMobile at line 574 (inside renderLeft but before line 821)
throws 'Cannot access isMobile before initialization'.
Rename the inner declaration to isNarrow since it uses a different
breakpoint (640px for column hiding vs 1024px for packet limit).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes#326 — the packets page crashes mobile browsers (iOS Safari, Edge)
by loading 50K+ packets when no time filter is persisted in
localStorage.
## Root Cause
Two problems in public/packets.js:
### Bug 1: savedTimeWindowMin defaults to 0 instead of 15
localStorage.getItem('meshcore-time-window') returns
ull when never set. Number(null) = 0. The guard checked < 0 but not <=
0, so savedTimeWindowMin = 0 meant "All time" — fetching all 50K+
packets.
**Fix:** Changed < 0 to <= 0 in both the initialization guard (line 30)
and the change handler (line 758).
### Bug 2: No mobile protection against large packet loads
Even with valid large time windows, mobile browsers crash under the
weight of thousands of DOM rows and packet data (~1.4 GB WebKit memory
limit).
**Fix:**
- Detect mobile viewport: window.innerWidth <= 768
- Cap limit at 1000 on mobile (vs 50000 on desktop)
- Disable 6h/12h/24h options and hide "All time" on mobile
- Reset persisted windows >3h to 15 min on mobile
## Testing
Added 9 unit tests in est-frontend-helpers.js covering:
- savedTimeWindowMin defaults to 15 when localStorage returns null
- savedTimeWindowMin defaults to 15 when localStorage returns "0"
- Valid values (60) are preserved
- Negative and NaN values default to 15
- PACKET_LIMIT is 1000 on mobile, 50000 on desktop
- Mobile caps large time windows (1440 → 15) but allows 180
All 218 frontend helper tests pass. Packet filter (62) and aging (29)
tests also pass.
## Changes
| File | Change |
|------|--------|
| public/packets.js | Fix <= 0 guard, add mobile detection, cap limit,
restrict time options |
| public/index.html | Cache buster bump |
| est-frontend-helpers.js | 9 new regression tests for time window
defaults and mobile caps |
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Extends the hamburger menu activation breakpoint from max-width: 640px
to max-width: 1023px, making all 11 nav items accessible on tablets and
small laptops where they were previously clipped/invisible.
Fixes#322
## Changes
### public/style.css
- New @media (max-width: 1023px) block activates the hamburger menu and
vertical drawer
- Drawer has max-height: calc(100dvh - 52px) with overflow-y: auto for
scrollability
- z-index set to 1100 (consistent with nav layer)
- ody.nav-open locks background scroll when drawer is open
- Mobile-only rules (brand-text hidden, tighter nav-right gap) remain at
640px
### public/app.js
- Extracted closeNav() helper for consistent drawer close behavior
- Hamburger toggle now adds/removes ody.nav-open class
- Drawer closes on: nav link click, Escape key, and route change (SPA
navigation)
### public/index.html
- Cache busters bumped for all CSS/JS assets
## What's NOT changed
- Desktop layout (>=1024px) is completely untouched
- No Priority+ pattern (Phase 2)
- No map layout changes (Phase 3)
- No new dependencies
## Testing
- All 308 frontend tests pass ( est-frontend-helpers.js,
est-packet-filter.js, est-aging.js)
- Visual verification: hamburger activates at <=1023px, full bar at
>=1024px
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes#303 — Repeater hash stats now reflect the **latest advert**
instead of the historical mode (most frequent).
When a node is reconfigured (e.g. from 1-byte to 2-byte hash size), the
analytics and node detail pages now show the updated value immediately
after the next advert is received.
## Changes
### cmd/server/store.go
1. **computeNodeHashSizeInfo** — Changed hash size determination from
statistical mode to latest advert. The most recent advert in
chronological order now determines hash_size. The hash_sizes_seen and
hash_size_inconsistent tracking is preserved for multi-byte analytics.
2. **computeAnalyticsHashSizes** — Two fixes:
- **yNode keyed by pubKey** instead of name, so same-name nodes with
different public keys are counted separately in distributionByRepeaters.
- **Zero-hop adverts included** — advert originator tracking now happens
before the hops check, so zero-hop adverts contribute to per-node stats.
### cmd/server/routes_test.go
Added 4 new tests:
- TestGetNodeHashSizeInfoLatestWins — 4 historical 1-byte adverts + 1
recent 2-byte advert → hash size should be 2 (not 1 from mode)
- TestGetNodeHashSizeInfoNoAdverts — node with no ADVERT packets →
graceful nil, no crash
- TestAnalyticsHashSizeSameNameDifferentPubkey — two nodes named
"SameName" with different pubkeys → counted as 2 separate entries
- Updated TestGetNodeHashSizeInfoDominant comment to reflect new
behavior
## Context
Community report from contributor @kizniche: after reconfiguring a
repeater from 1-byte to 2-byte hash and sending a flood advert, the
analytics page still showed 1-byte. Root cause was the mode-based
computation which required many new adverts to shift the majority. The
upstream firmware bug causing stale path bytes
(meshcore-dev/MeshCore#2154) has been fixed, making the latest advert
reliable.
## Testing
- `go vet ./...` — clean
- `go test ./... -count=1` — all tests pass (including 4 new ones)
- `cmd/ingestor` tests — pass
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes the poor contrast in the node side pane's "Paths through this
node" section in dark mode.
## Root Cause
.node-detail-section (side pane) had no background or border — it
inherited the lighter --detail-bg (#232340) from .panel-right. The same
content on the full detail page sits inside .node-full-card which uses
the darker --card-bg (#1a1a2e) + a visible border, giving it proper
contrast.
| Context | Container | Background | Contrast |
|---------|-----------|------------|----------|
| Full detail page | .node-full-card | --card-bg (darker) | ✅ Good |
| Side pane | .node-detail-section | inherited --detail-bg (lighter) | ❌
Poor |
## Fix
Give .node-detail-section the same card treatment as .node-full-card:
`css
.node-detail-section {
background: var(--card-bg);
border: 1px solid var(--border);
border-radius: 8px;
padding: 12px;
margin-bottom: 8px;
}
`
- All colors use CSS variables — no hardcoded hex values
- Both light and dark themes benefit from the card treatment
- No JS changes needed — CSS-only fix
- Cache busters bumped in the same commit
Fixes#334
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Surfaces transport route types in the packets view by adding a **"T"
badge** next to the payload type badge for packets with
`TRANSPORT_FLOOD` (route type 0) or `TRANSPORT_DIRECT` (route type 3)
routes.
This helps mesh analysis — communities can quickly identify transported
packets and gain insights into scope usage adoption.
Closes#241
## What Changed
### Frontend (`public/`)
- **app.js**: Added `isTransportRoute(rt)` and `transportBadge(rt)`
helper functions that render a `<span class="badge
badge-transport">T</span>` badge with the full route type name as a
tooltip
- **packets.js**: Applied `transportBadge()` in all three packet row
render paths:
- Flat (ungrouped) packet rows
- Grouped packet header rows
- Grouped packet child rows
- **style.css**: Added `.badge-transport` class with amber styling and
CSS variable support (`--transport-badge-bg`, `--transport-badge-fg`)
for theme customization
### Backend (`cmd/server/`)
- **decoder_test.go**: Added 6 new tests covering:
- `TestDecodeHeader_TransportFlood` — verifies route type 0 decodes as
TRANSPORT_FLOOD
- `TestDecodeHeader_TransportDirect` — verifies route type 3 decodes as
TRANSPORT_DIRECT
- `TestDecodeHeader_Flood` — verifies route type 1 (non-transport)
decodes correctly
- `TestIsTransportRoute` — verifies the helper identifies transport vs
non-transport routes
- `TestDecodePacket_TransportFloodHasCodes` — verifies transport codes
are extracted from T_FLOOD packets
- `TestDecodePacket_FloodHasNoCodes` — verifies FLOOD packets have no
transport codes
## Visual
In the packets table Type column, transport packets now show:
```
[Channel Msg] [T] ← transport packet
[Channel Msg] ← normal flood packet
```
The "T" badge has an amber color scheme and shows the full route type
name on hover.
## Tests
- All Go tests pass (`cmd/server` and `cmd/ingestor`)
- All frontend tests pass (`test-packet-filter.js`, `test-aging.js`,
`test-frontend-helpers.js`)
- Cache busters bumped in `index.html`
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
Two data integrity bugs in the Go ingestor cause observer metadata and
signal quality data to be missing for all Go-backend users.
### #320 — Observer metadata never populated
`extractObserverMeta()` reads `battery_mv`, `uptime_secs`, and
`noise_floor` from the **top level** of the MQTT status message.
However, the actual MQTT payload nests these under a `stats` object:
```json
{
"status": "online",
"origin": "ObserverName",
"model": "Heltec V3",
"firmware_version": "v1.14.0-9f1a3ea",
"stats": {
"battery_mv": 4174,
"uptime_secs": 80277,
"noise_floor": -110
}
}
```
Result: battery, uptime, and noise floor are always NULL in the
database.
### #321 — SNR and RSSI always missing on raw packets
The raw packet handler reads `msg["SNR"]` and `msg["RSSI"]` (uppercase
only). Some MQTT bridges send these as lowercase `snr`/`rssi`. The
companion BLE handler already has a case-insensitive fallback — the raw
packet path did not.
Result: SNR/RSSI are NULL for all raw packet observations from bridges
that use lowercase keys.
## Fix
### #320 — Nested stats with top-level fallback
- Added `nestedOrTopLevel()` helper that checks `msg["stats"][key]`
first, then `msg[key]`
- `extractObserverMeta` now uses this helper for `battery_mv`,
`uptime_secs`, `noise_floor`
- Top-level fallback preserved for backward compatibility with bridges
that flatten the structure
- Safe type assertion: `stats, _ :=
msg["stats"].(map[string]interface{})` — no crash if stats is missing or
wrong type
### #321 — Lowercase SNR/RSSI fallback
- Raw packet handler now uses `else if` to check lowercase `snr`/`rssi`
when uppercase keys are absent
- Matches the pattern already used in the companion channel and direct
message handlers
## Tests
10 new test cases added:
| Test | What it verifies |
|------|-----------------|
| `TestExtractObserverMetaNestedStats` | All 5 fields populated from
nested stats object |
| `TestExtractObserverMetaNestedStatsPrecedence` | Nested stats wins
over top-level when both present |
| `TestExtractObserverMetaFlatFallback` | Flat structure still works
(backward compat) |
| `TestExtractObserverMetaEmptyStats` | Empty stats object — no crash,
model still works |
| `TestExtractObserverMetaStatsNotAMap` | stats is a string — no crash,
falls back to top-level |
| `TestExtractObserverMetaNoiseFloorFloat` | Float precision preserved
(noise_floor REAL migration) |
| `TestHandleMessageWithLowercaseSNRRSSI` | Lowercase snr/rssi both
stored correctly |
| `TestHandleMessageSNRRSSIUppercaseWins` | When both cases present,
uppercase takes precedence |
| `TestHandleMessageNoSNRRSSI` | Neither key present — nil, no crash |
| Existing `TestExtractObserverMeta` | Still passes (flat structure
backward compat) |
All tests pass: `go test ./... -count=1` and `go vet ./...` clean.
Closes#320Closes#321
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
The self-hosted runner (`meshcore-runner-2`) filled its 29GB disk to
100%, blocking all CI runs:
```
Filesystem Size Used Avail Use%
/dev/root 29G 29G 2.3M 100%
Docker Images: 67 total, 2 active, 18.83GB reclaimable (99%)
```
Root cause: no Docker image cleanup after builds. Each CI run builds a
new image but never prunes old ones.
## Fix
### 1. Docker image cleanup after deploy (`deploy` job)
- Runs with `if: always()` so it executes even if deploy fails
- `docker image prune -af --filter "until=24h"` — removes images older
than 24h (safe: current build is minutes old)
- `docker builder prune -f --keep-storage=1GB` — caps build cache
- Logs before/after `docker system df` for visibility
### 2. Runner log cleanup at start of E2E job
- Prunes runner diagnostic logs older than 3 days (was 53MB and growing)
- Reports `df -h` for disk visibility in CI output
## Impact
After manual cleanup today, disk went from 100% → 35% (19GB free). This
PR prevents recurrence.
## Test plan
- [x] Manual cleanup verified on runner via `az vm run-command`
- [ ] Next CI run should show cleanup step output in deploy job logs
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Several features and fixes from a live deployment of the Go v3.0.0
backend.
### geo_filter — full enforcement
- **Go backend config** (`cmd/server/config.go`,
`cmd/ingestor/config.go`): added `GeoFilterConfig` struct so
`geo_filter.polygon` and `bufferKm` from `config.json` are parsed by
both the server and ingestor
- **Ingestor** (`cmd/ingestor/geo_filter.go`, `cmd/ingestor/main.go`):
ADVERT packets from nodes outside the configured polygon + buffer are
dropped *before* any DB write — no transmission, node, or observation
data is stored
- **Server API** (`cmd/server/geo_filter.go`, `cmd/server/routes.go`):
`GET /api/config/geo-filter` endpoint returns the polygon + bufferKm to
the frontend; `/api/nodes` responses filter out any out-of-area nodes
already in the DB
- **Frontend** (`public/map.js`, `public/live.js`): blue polygon overlay
(solid inner + dashed buffer zone) on Map and Live pages, toggled via
"Mesh live area" checkbox, state shared via localStorage
### Automatic DB pruning
- Add `retention.packetDays` to `config.json` to delete transmissions +
observations older than N days on a daily schedule (1 min after startup,
then every 24h). Nodes and observers are never pruned.
- `POST /api/admin/prune?days=N` for manual runs (requires `X-API-Key`
header if `apiKey` is set)
```json
"retention": {
"nodeDays": 7,
"packetDays": 30
}
```
### tools/geofilter-builder.html
Standalone HTML tool (no server needed) — open in browser, click to
place polygon points on a Leaflet map, set `bufferKm`, copy the
generated `geo_filter` JSON block into `config.json`.
### scripts/prune-nodes-outside-geo-filter.py
Utility script to clean existing out-of-area nodes from the database
(dry-run + confirm). Useful after first enabling geo_filter on a
populated DB.
### HB column in packets table
Shows the hop hash size in bytes (1–4) decoded from the path byte of
each packet's raw hex. Displayed as **HB** between Size and Type
columns, hidden on small screens.
## Test plan
- [x] ADVERT from node outside polygon is not stored (no new row in
nodes or transmissions)
- [x] `GET /api/config/geo-filter` returns polygon + bufferKm when
configured, `{polygon: null, bufferKm: 0}` when not
- [x] `/api/nodes` excludes nodes outside polygon even if present in DB
- [x] Map and Live pages show blue polygon overlay when configured;
checkbox toggles it
- [x] `retention.packetDays: 30` deletes old transmissions/observations
on startup and daily
- [x] `POST /api/admin/prune?days=30` returns `{deleted: N, days: 30}`
- [x] `tools/geofilter-builder.html` opens standalone, draws polygon,
copies valid JSON
- [x] HB column shows 1–4 for all packets in grouped and flat view
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- fix packets initial load to honor persisted `meshcore-time-window`
before the filter UI is rendered
- keep the dropdown and effective query window in sync via a shared
`savedTimeWindowMin` value
- add a frontend regression test to ensure `loadPackets()` falls back to
persisted time window when `#fTimeWindow` is not yet present
- bump cache busters in `public/index.html`
## Root cause
`loadPackets()` could run before the filter bar existed, so
`document.getElementById('fTimeWindow')` was null and it fell back to
`15` minutes even though localStorage had a different saved value.
## Testing
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Fix: unprotect /api/decode from API key auth
Fixes#304
### Problem
PR #283 applied `requireAPIKey` to all POST endpoints including
`/api/decode`. But BYOP decode is a stateless read-only decoder — it
never writes to the database. Users see "write endpoints disabled" when
trying to decode packets.
### Fix
- Removed `requireAPIKey` wrapper from `/api/decode` in
`cmd/server/routes.go`
- Updated auth tests to use `/api/perf/reset` (actual write endpoint)
instead of `/api/decode`
- Added tests proving `/api/decode` works without API key, even when
apiKey is configured or empty
### Note
Decoder consolidation (`internal/decoder/` shared package) is tracked
separately and not included here to keep the PR clean.
### Tests
- `cd cmd/server && go test ./...` ✅
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- show relative build age next to the commit hash in the nav stats
version badge (e.g. `abc1234 (3h ago)`)
- use `stats.buildTime` from `/api/stats` and existing `timeAgo()`
formatting in `public/app.js`
- keep behavior unchanged when `buildTime` is missing/unknown
## What changed
- updated `formatVersionBadge()` signature to accept `buildTime`
- appended a `build-age` span after the commit link when `buildTime` is
valid
- passed `stats.buildTime` from `updateNavStats()`
- updated frontend helper tests for the new function signature
- added regression tests for build-age rendering/skip behavior
- bumped cache busters in `public/index.html`
## API check
- verified Go server already exposes `buildTime` on `/api/stats` and
`/api/health` via `cmd/server/routes.go`
- no backend API changes required
## Tests
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
All passed locally.
## Browser validation
- Not run in this environment (no browser session available).
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass region through channel message routes, apply DB/store filtering, normalize IATA at read and write boundaries, and add regression coverage for routes/server/ingestor.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds rule 13 to AGENTS.md: mandatory git worktree isolation for parallel
agents.
- Implementation agents must work in dedicated worktrees, never the main
checkout
- Review agents must read remote branches via `git show
origin/<branch>:`, never the working tree
- Prevents the stale-worktree bug that caused multiple incorrect reviews
tonight
## Summary
This PR performs a **one-time line-ending normalization** with **no
functional changes**.
- Normalized previously CRLF-indexed files to LF using `git add
--renormalize .`
- Added `.git-blame-ignore-revs` to preserve usable blame history for
this bulk formatting-only commit
- Added explicit LF enforcement for shell scripts in `.gitattributes`:
- `manage.sh text eol=lf`
- `*.sh text eol=lf`
## Why this is safe
- This is a text normalization pass only (CRLF → LF)
- No logic, behavior, APIs, or runtime paths were changed
- Review diff noise is expected due to line-ending-only rewrites
## File-count context
- There were **148 known CRLF-indexed files** targeted for normalization
- The renormalization pass touched **151 files total** in this
repository snapshot
## Blame preservation
GitHub reads `.git-blame-ignore-revs` natively, so blame views can skip
the normalization commit.
For local git blame setup:
```bash
git config blame.ignoreRevsFile .git-blame-ignore-revs
```
## Validation
Executed required no-regression checks:
```bash
node test-frontend-helpers.js && node test-packet-filter.js && node test-aging.js
```
All passed.
## Summary
- fix safe .env parser in manage.sh to expand a leading ~ before export
- ensure setup-time PROD_DATA_DIR read from .env also expands ~
- keep behavior unchanged for non-tilde values
## Why
xport "=" does not perform tilde expansion, so values like
PROD_DATA_DIR=~/meshcore-data stayed literal and broke path-based
operations in manage.sh.
## Validation
- ash -n manage.sh
- manual reasoning: PROD_DATA_DIR=~/meshcore-data now resolves to
$HOME/meshcore-data
## Notes
Docker Compose handling is unchanged (compose already expands ~); this
PR only fixes manage.sh runtime parsing.
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- add `DISABLE_MOSQUITTO` support in container startup by switching
supervisord config when disabled
- add a no-mosquitto supervisord config
(`docker/supervisord-go-no-mosquitto.conf`)
- fix Compose port mapping regression so host ports map to fixed
internal listener ports (`80`, `443`, `1883`)
- add compose variants without MQTT port publishing
(`docker-compose.no-mosquitto.yml`,
`docker-compose.staging.no-mosquitto.yml`)
- update `manage.sh` setup flow to ask `Use built-in MQTT broker?
[Y/n]`, skip MQTT port prompt when disabled, persist
`DISABLE_MOSQUITTO`, and use no-mosquitto compose files when
starting/stopping/restarting
- align `.env.example` staging keys with compose
(`STAGING_GO_HTTP_PORT`, `STAGING_GO_MQTT_PORT`)
- fix staging Caddyfile generation to use `STAGING_GO_HTTP_PORT`
- fix `.env.example` staging default comments to match actual values
(82/1885)
## Validation performed
- ✅ `bash -n manage.sh` passes.
- ✅ With `DISABLE_MOSQUITTO=true`, no-mosquitto compose overrides are
selected, Mosquitto is not started, and MQTT port is not published.
- ✅ With `DISABLE_MOSQUITTO=false`, standard compose files are used,
Mosquitto starts, and MQTT port mapping is present.
- ℹ️ Runtime Docker validation requires a running Docker host.
Fixes#267
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes frontend region crosstalk on Channels page by applying region
filtering to message fetches and live WS GRP_TXT handling.
## Changes
- Append `region` query param to channel message API calls in
`selectChannel` and `refreshMessages`.
- Add WS region guard in `public/channels.js` using observer→IATA map
with selected-region snapshot at handler entry.
- On region switch, reload channels and re-fetch selected channel
messages; if empty under selected region, clear pane and show `Channel
not available in selected region`.
- Bump cache busters in `public/index.html`.
- Add frontend helper tests for extracted WS region filter helper in
`test-frontend-helpers.js`.
## Validation
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
Refs #280
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- fix observer upsert write path in `cmd/ingestor` to persist identity
fields
- map status payload fields into observer metadata: `model`,
`firmware`/`firmware_version`, `client_version`/`clientVersion`, `radio`
- keep NULL-safe behavior when identity fields are missing
- add regression tests for identity persistence and missing-field
handling
## Root cause
The ingestor only wrote telemetry (`battery_mv`, `uptime_secs`,
`noise_floor`) and never included observer identity columns in the
upsert statement, leaving `model`, `firmware`, `client_version`, and
`radio` NULL on fresh DBs.
## Testing
- `cd cmd/ingestor && go test ./...`
- `cd cmd/server && go test ./...`
Fixes#295
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes BYOP modal stacking on the Packets page by preventing duplicate
global click handlers and enforcing a single BYOP overlay instance.
## Root cause
Packets page init could register document-level click handlers
repeatedly across SPA navigations. Clicking BYOP then spawned multiple
overlays, and each close action removed only one layer.
## Changes
- `public/packets.js`
- Added `bindDocumentHandler(...)` to de-duplicate document click
handlers.
- Applied it to packets action delegation, filter menu outside-click
close, and column menu close.
- Added `removeAllByopOverlays()` and call it before opening BYOP.
- Tagged BYOP overlay with `.byop-overlay` class.
- Updated close logic to remove all BYOP overlays in one click.
- Scoped BYOP result lookup to the active overlay
(`overlay.querySelector`).
- Added destroy cleanup for document handlers and stray BYOP overlays.
- `test-frontend-helpers.js`
- Added regression tests for:
- BYOP singleton overlay behavior
- one-click close removing all overlays
- document click handler de-dup logic
- `public/index.html`
- Bumped cache busters for JS/CSS assets.
## Validation
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
All passed locally.
Fixes#249
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Fix: Enforce LF line endings repo-wide
### Problem
Windows-based agents produce CRLF line endings, causing git diffs to
show every line as changed (1000+ line "rewrites" that are actually
20-line patches). This has hit us on `manage.sh`, `deploy.yml`, and
multiple PRs.
### Fix
Added `* text=auto eol=lf` to `.gitattributes`. Git will now:
- Store all text files as LF in the repo
- Convert CRLF to LF on commit (regardless of OS)
- Check out as LF on all platforms
Also marks common binary formats explicitly.
### Impact
Existing files with CRLF will be normalized on their next commit. No
functional changes.
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Implements issue #286 P1 frontend timestamp features on top of P0:
- Added global timestamp timezone toggle in Display tab (`Local time` /
`UTC`)
- Added absolute-mode timestamp format presets (`iso`, `iso-seconds`,
`locale`)
- Added optional custom format input (only when
`SITE_CONFIG.timestamps.allowCustomFormat === true`)
- Extended `formatTimestamp()` / `formatTimestampWithTooltip()` behavior
to honor timezone + format settings
- Preserved server defaults with localStorage override precedence
- Bumped `public/index.html` cache busters in same commit
## Details
### 1) Timezone toggle
- New Display tab control persisted to `meshcore-timestamp-timezone`
- Reads server default from `window.SITE_CONFIG.timestamps.timezone`
with fallback to `local`
- Formatting logic now supports both local and UTC absolute rendering
### 2) Format presets (absolute mode only)
- New Display tab preset dropdown (shown only when timestamp mode =
`absolute`)
- Presets implemented:
- `iso` → `YYYY-MM-DD HH:mm:ss`
- `iso-seconds` → `YYYY-MM-DD HH:mm:ss.SSS`
- `locale` → `toLocaleString()` (or UTC locale when timezone=utc)
- Persisted to `meshcore-timestamp-format`
- Reads server default from `window.SITE_CONFIG.timestamps.formatPreset`
(fallback `iso`)
### 3) Custom format string (guarded)
- Text input only renders when
`window.SITE_CONFIG.timestamps.allowCustomFormat` is `true`
- Persisted to `meshcore-timestamp-custom-format`
- If non-empty and enabled, custom format overrides preset
- Frontend intentionally does not hard-validate the format string;
unsupported patterns fall back to preset behavior
## Tests
Executed required test commands:
```bash
node test-frontend-helpers.js
node test-packet-filter.js
node test-aging.js
```
Added coverage in `test-frontend-helpers.js` for:
- UTC output behavior
- Local output behavior
- `iso-seconds` includes milliseconds
- `locale` format behavior
All passed locally.
Refs #286
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
## Fix: FAQ save no longer wipes other home page sections
Fixes#284
### Problem
Editing FAQ in the customizer and saving caused other home page sections
(steps, footer links, hero text) to disappear on reload. Colors could
also reset.
### Root cause
`initState()` in `customize.js` used `||` (OR) logic for the `home`
object — if localStorage had *any* `home.checklist`, it took that and
ignored the server config for other fields. Partial localStorage data
replaced the full server config instead of merging on top.
### Fix
Changed `initState()` to properly layer: `DEFAULTS → server config →
localStorage` for all sections. Each field merges independently — a
partial localStorage save (e.g., only checklist) no longer wipes steps,
footerLinks, or hero fields. Same merge pattern applied to all theme
sections for consistency.
### Files changed
- `public/customize.js` — `initState()` merge logic
- `public/index.html` — cache buster bump
- `test-frontend-helpers.js` — regression tests:
1. Partial localStorage (checklist only) preserves steps/footerLinks
2. Server config values survive partial local overrides
3. Full localStorage properly overrides server config
### Testing
- `node test-frontend-helpers.js` ✅
- `node test-packet-filter.js` ✅
- `node test-aging.js` ✅
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Add a new **Display** tab to the customizer tab bar (between Home Page
and Export / Save).
- Move timestamp-related **UI Settings** out of Branding into the new
Display tab.
- Keep Branding focused on site identity fields (name, tagline, logo,
favicon).
- Bump `public/index.html` cache busters so updated frontend assets load
immediately.
## Testing
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
Fixes#293
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
Repeaters with 2-byte adverts occasionally appear as 1-byte on the map
and in stats.
**Root cause:** `computeNodeHashSizeInfo()` sets `HashSize` by
overwriting on every packet (`ni.HashSize = hs`), so the last advert
processed wins — regardless of how many previous packets correctly
showed 2-byte.
When a node sends an ADVERT directly (no relay hops), the path byte
encodes `hashCount=0`. Some firmware sets the full path byte to `0x00`
in this case, which decodes as `hashSize=1` even if the node normally
uses 2-byte hashes. If this packet happens to be the last one iterated,
the node shows as 1-byte.
## Fix
Compute the **mode** (most frequent hash size) across all observed
adverts instead of using the last-seen value. On a tie, prefer the
larger value.
```go
counts := make(map[int]int, len(ni.AllSizes))
for _, hs := range ni.Seq {
counts[hs]++
}
best, bestCount := 1, 0
for hs, cnt := range counts {
if cnt > bestCount || (cnt == bestCount && hs > best) {
best = hs
bestCount = cnt
}
}
ni.HashSize = best
```
A node with 4× hashSize=2 and 1× hashSize=1 now correctly reports
`HashSize=2`.
## Test
`TestGetNodeHashSizeInfoDominant`: seeds 5 adverts (4× 2-byte, 1×
1-byte) and asserts `HashSize=2`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Adds `GeoFilter` struct to `Config` in `cmd/server/config.go` so
`geo_filter.polygon` and `bufferKm` from `config.json` are parsed by the
Go backend
- Adds `GET /api/config/geo-filter` endpoint in `cmd/server/routes.go`
returning the polygon + bufferKm to the frontend
- Restores the blue polygon overlay (solid inner + dashed buffer zone)
on the **Map** page (`public/map.js`)
- Restores the same overlay on the **Live** page (`public/live.js`),
toggled via the "Mesh live area" checkbox
## Test plan
- [x] `GET /api/config/geo-filter` returns `{ polygon: [...], bufferKm:
N }` when configured
- [x] `GET /api/config/geo-filter` returns `{ polygon: null, bufferKm: 0
}` when not configured
- [x] Map page shows blue polygon overlay when `geo_filter.polygon` is
set in config
- [x] Live page shows same overlay, checkbox state shared via
localStorage
- [x] Checkbox is hidden when no polygon is configured
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- added API key middleware for write routes in cmd/server/routes.go
- protected all current non-GET API routes (POST /api/packets, POST
/api/perf/reset, POST /api/decode)
- middleware enforces X-API-Key against cfg.APIKey and returns 401 JSON
error on missing/wrong key
- preserves backward compatibility: if piKey is empty, requests pass
through
- added startup warning log in cmd/server/main.go when no API key is
configured:
- [security] WARNING: no apiKey configured — write endpoints are
unprotected
- added route tests for missing/wrong/correct key and empty-apiKey
compatibility
## Validation
- cd cmd/server && go test ./... ✅
## Notes
- config.example.json already contains piKey, so no changes were
required.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 1/2/3-byte selector to Hash Issues analytics page
- 1-byte and 2-byte modes show 16×16 matrix with stat cards (nodes
tracked, using N-byte ID, prefix space used, prefix collisions)
- 3-byte mode shows summary stat cards instead of unrenderable grid
- Fix "Nodes tracked" to always show total node count across all modes
- Use CSS variable colours for matrix cells (light/dark mode compatible)
- Replace native title tooltips with custom styled popovers
- Hide collision risk card when 3-byte mode is selected
- Fix double-tooltip bug on mode switch via _matrixTipInit guard
- Fix tooltip persisting outside matrix grid on mouseleave
https://dev.ve7kod.ca/#/analytics
Hash Issues
---------
Co-authored-by: Jesse <your@email.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes#276
## Root cause
TRACE packets store hop IDs in the payload (bytes 9+) rather than in the
header path field. The header path field is overloaded in TRACE packets
to carry RSSI values instead of repeater IDs (as noted in the issue
comments). This meant `Path.Hops` was always empty for TRACE packets —
the raw bytes ended up as an opaque `PathData` hex string with no
structure.
The hashSize encoded in the header path byte (bits 6–7) is still valid
for TRACE and is used to split the payload path bytes into individual
hop prefixes.
## Fix
After decoding a TRACE payload, if `PathData` is non-empty, parse it
into individual hops using `path.HashSize`:
```go
if header.PayloadType == PayloadTRACE && payload.PathData != "" {
pathBytes, err := hex.DecodeString(payload.PathData)
if err == nil && path.HashSize > 0 {
for i := 0; i+path.HashSize <= len(pathBytes); i += path.HashSize {
path.Hops = append(path.Hops, ...)
}
}
}
```
Applied to both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go`.
## Verification
Packet from the issue: `260001807dca00000000007d547d`
| | Before | After |
|---|---|---|
| `Path.Hops` | `[]` | `["7D", "54", "7D"]` |
| `Path.HashCount` | `0` | `3` |
New test `TestDecodeTracePathParsing` covers this exact packet.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
The in-memory `PacketStore` had **no eviction or aging** — it grew
unbounded until OOM killed the process. At ~3K packets/hour and ~5KB per
packet (not the 450 bytes previously estimated), an 8GB VM would OOM in
a few days.
## Changes
### Time-based eviction
- Configurable via `config.json`: `"packetStore": { "retentionHours": 24
}`
- Packets older than the retention window are evicted from the head of
the sorted slice
### Memory-based cap
- Configurable via `"packetStore": { "maxMemoryMB": 1024 }`
- Hard ceiling — evicts oldest packets when estimated memory exceeds the
cap
### Index cleanup
When a `StoreTx` is evicted, ALL associated data is removed from:
- `byHash`, `byTxID`, `byObsID`, `byObserver`, `byNode`, `byPayloadType`
- `nodeHashes`, `distHops`, `distPaths`, `spIndex`
### Periodic execution
- Background ticker runs eviction every 60 seconds
- Analytics caches and hash size cache are invalidated after eviction
### Stats fixes
- `estimatedMB` now uses ~5KB/packet + ~500B/observation (was 430B +
200B)
- `evicted` counter reflects actual evictions (was hardcoded to 0)
- Removed fake `maxPackets: 2386092` and `maxMB: 1024` from stats
### Config example
```json
{
"packetStore": {
"retentionHours": 24,
"maxMemoryMB": 1024
}
}
```
Both values default to 0 (unlimited) for backward compatibility.
## Tests
- 7 new tests in `eviction_test.go` covering time-based, memory-based,
index cleanup, thread safety, config parsing, and no-op when disabled
- All existing tests pass unchanged
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Problem
The RF analytics `packetsPerHour` chart was counting **observations**
instead of **unique transmissions** per hour. With ~34 observations per
transmission on average, the chart showed ~5,645 packets/hr instead of
the correct ~163/hr.
**Evidence from prod API:**
- `packetsPerHour` total: 1,580,620 (sum of all hourly counts)
- `totalPackets`: 45,764
- That's a ~34× inflation — exactly the observations-per-transmission
ratio
## Root Cause
In `store.go`, the `hourBuckets[hr]++` counter was inside the
observations loop (both regional and non-regional paths). Other counters
like `packetSizes` and `typeBuckets` already deduplicate by hash —
`hourBuckets` was the only one that didn't.
## Fix
Added a `seenHourHash` map (keyed by `hash|hour`) to deduplicate. Each
unique transmission is counted once per hour bucket, matching how packet
sizes and payload types already work.
Both the regional observer path and the non-regional path are fixed. The
legacy path (transmissions without observations) was already correct
since it iterates per-transmission.
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
when bind-mounting a non-existent file. The entrypoint now detects
and removes directory config.json before falling back to example.
2. 'unable to open database file: out of memory (14)' — old container
(3GB) not fully exited when new one starts. Deploy now uses
'docker compose down' with timeout and waits for memory reclaim.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
startretries to 10 and startsecs to 2 for server and ingestor.
Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
Root cause: on the 8GB VM, both prod (~2.5GB) and staging (~2GB) containers
run simultaneously. During deploy, manage.sh would rm the old staging container
and immediately start a new one. The old container's memory wasn't reclaimed
yet, so the new one got 'unable to open database file: out of memory (14)'
from SQLite and both corescope-server and corescope-ingestor entered FATAL.
Fix:
- manage.sh restart staging: wait up to 15s for old container to fully exit,
plus 3s for OS memory reclamation before starting new container
- manage.sh restart staging: verify config.json exists before starting
- docker-compose.staging.yml: add deploy.resources.limits.memory=3g to
prevent staging from consuming unbounded memory
## Summary
Adds `distributionByRepeaters` to the `/api/analytics/hash-sizes`
endpoint in the **Go server**.
### Problem
PR #263 implemented this feature in the deprecated Node.js server
(server.js). All backend changes should go in the Go server at
`cmd/server/`.
### Solution
- For each hash size (1, 2, 3), count how many unique repeaters (nodes)
advertise packets with that hash size
- Uses the existing `byNode` map already computed in
`computeAnalyticsHashSizes()`
- Added to both the live response and the empty/fallback response in
routes.go
- Frontend changes from PR #263 (`public/analytics.js`) already render
this field — no frontend changes needed
### Response shape
```json
{
"distributionByRepeaters": { "1": 42, "2": 7, "3": 2 },
...existing fields...
}
```
### Testing
- All Go server tests pass
- Replaces PR #263 (which modified the wrong server)
Closes#263
---------
Co-authored-by: you <you@example.com>
down tears down the entire compose project including prod.
rm -sf stops and removes just the named service.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Version/Commit/BuildTime now populated from package.json, git, and
date. Exported as env vars so docker compose build picks them up.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.
### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build
### Pipeline (master merge):
Same chain + deploy to staging + badge publishing
### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks
### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db
-public public-instrumented`
---------
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to reduce wall-clock time:
1. Reduce safeClick timeout from 3000ms to 500ms
- Elements either exist immediately after navigation or don't exist at all
- ~75 safeClick calls; if ~30 miss, saves ~75s of dead wait time
2. Replace 18 page.goto() calls with SPA hash navigation
- After initial page load, the SPA shell is already in the DOM
- page.goto() reloads the entire page (network round-trip + parse)
- Hash navigation via location.hash triggers the SPA router instantly
- Only 3 page.goto() remain: initial load + 2 home page loads after localStorage.clear()
3. Remove redundant final route sweep
- All 10 routes were already visited during the page-specific sections
- The sweep just re-navigated to pages that had already been exercised
- Saves ~2s of redundant navigation
Also:
- Reduce inter-route wait from 200ms to 50ms (SPA router is synchronous)
- Merge utility function + packet filter exercises into single evaluate() call
- Use navHash() helper for consistent hash navigation with 150ms settle time
The test 'Node perf page should NOT show Go Runtime section' asserts
Node.js-specific behavior, but E2E tests now run against the Go server
(per this PR), so Go Runtime info is correctly present. Remove the
now-irrelevant assertion.
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.
Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory
The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
Change healthThresholds config from milliseconds to hours for readability.
Config keys: infraDegradedHours, infraSilentHours, nodeDegradedHours, nodeSilentHours.
Defaults: infra degraded 24h, silent 72h; node degraded 1h, silent 24h.
- Config stored in hours, converted to ms at comparison time
- /api/config/client sends ms to frontend (backward compatible)
- Frontend tooltips use dynamic thresholds instead of hardcoded strings
- Added healthThresholds section to config.example.json
- Updated Go and Node.js servers, tests
* docs: remove letsmesh.net reference from README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: remove paths-ignore from pull_request trigger
PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.
* ci: skip heavy CI jobs for docs-only PRs
Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.
This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.
Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.
* fix: update engine tests to match engine-badge HTML format
Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to the CI frontend test pipeline:
1. Run E2E tests and coverage collection concurrently
- Previously sequential (E2E ~1.5min, then coverage ~5.75min)
- Now both run in parallel against the same instrumented server
- Expected savings: ~5 min (coverage runs alongside E2E instead of after)
2. Replace networkidle with domcontentloaded in coverage collector
- SPA uses hash routing — networkidle waits 500ms for network silence
on every navigation, adding ~10-15s of dead time across 23 navigations
- domcontentloaded fires immediately once HTML is parsed; JS initializes
the route handler synchronously
- For in-page hash changes, use 200ms setTimeout instead of
waitForLoadState (which would never re-fire for same-document nav)
3. Extract coverage from E2E tests too
- E2E tests already exercise the app against the instrumented server
- Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
- nyc merges both coverage files for higher total coverage
Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace sleep 5 with health-check poll in quick E2E path
The poller's Start() calls GetMaxTransmissionID() to initialize its cursor.
When the test goroutine inserts data between go poller.Start() and the
actual GetMaxTransmissionID() call, the poller's cursor skips past the
test data and never broadcasts it, causing a timeout.
Adding a 100ms sleep after go poller.Start() ensures the poller has
initialized its cursors before the test inserts new data.
SQLite :memory: databases create separate databases per connection.
When the connection pool opens multiple connections (e.g. poller goroutine
vs main test goroutine), tables created on one connection are invisible
to others. Setting MaxOpenConns(1) ensures all queries use the same
in-memory database, fixing TestPollerBroadcastsMultipleObservations.
IngestNewFromDB now broadcasts one message per observation (not per
transmission). IngestNewObservations also broadcasts late arrivals.
Tests verify multi-observer packets produce multiple WS messages.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compared decoder.js against the MeshCore firmware source (Dispatcher.cpp,
Packet.h, Mesh.cpp, AdvertDataHelpers.h) and fixed all mismatches:
1. Field order: transport codes now parsed BEFORE path_length byte,
matching the spec: [header][transport_codes?][path_length][path][payload]
2. ACK payload: was incorrectly decoded as dest(1)+src(1)+ackHash(4).
Firmware shows ACK is just checksum(4) — no dest/src hashes.
3. TRACE payload: was incorrectly decoded as flags(1)+tag(4)+dest(6)+src(1).
Firmware shows tag(4)+authCode(4)+flags(1)+pathData.
4. ADVERT appdata: added missing feature1 (0x20 flag) and feature2
(0x40 flag) parsing — 2-byte fields between location and name.
5. Transport code field naming: renamed nextHop/lastHop to code1/code2
to match spec terminology (transport_code_1/transport_code_2).
6. Fixed incorrect field size labels in packets.js hex breakdown:
dest/src are 1 byte, MAC is 2 bytes (not 6B/6B/4B).
7. Fixed ANON_REQ/PATH comment typos (dest was listed as 6 bytes,
MAC as 4 bytes — both wrong, code was already correct).
All 329 tests pass (66 decoder + 263 spec/golden).
The Windows self-hosted runner picks up jobs and fails because bash
scripts run in PowerShell. Node.js tests need Chromium/Playwright
(Linux-only), and build/deploy/publish use Docker (Linux-only).
Changes:
- node-test: runs-on: [self-hosted, Linux]
- build: runs-on: [self-hosted, Linux]
- deploy: runs-on: [self-hosted, Linux]
- publish: runs-on: [self-hosted, Linux]
- go-test: unchanged (ubuntu-latest)
- perf.js: toFixed(1) on all ms/MB values in Go Runtime section
- style.css: white-space: nowrap on .nav-stats to prevent the · separator
from wrapping onto its own line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two cache bugs fixed:
1. Hit rate formula excluded stale hits — reported rate was artificially low
because stale-while-revalidate responses (which ARE cache hits from the
caller's perspective) were not counted. Changed formula from
hits/(hits+misses) to (hits+staleHits)/(hits+staleHits+misses).
2. Bulk-health cache invalidated on every advert packet — in a mesh with
dozens of nodes advertising every few seconds, this caused the expensive
bulk-health query to be recomputed on nearly every request, defeating
the cache entirely. Switched to 30s debounced invalidation via
debouncedInvalidateBulkHealth().
Added regression test for hit rate formula in test-server-routes.js.
Match the C++ firmware wire format (Packet::writeTo/readFrom):
1. Field order: transport codes are parsed BEFORE path_length byte,
matching firmware's header → transport_codes → path_len → path → payload
2. ACK payload: just 4-byte CRC checksum, not dest+src+ackHash.
Firmware createAck() writes only ack_crc (4 bytes).
3. TRACE payload: tag(4) + authCode(4) + flags(1) + pathData,
matching firmware createTrace() and onRecvPacket() TRACE handler.
4. ADVERT features: parse feat1 (0x20) and feat2 (0x40) optional
2-byte fields between location and name, matching AdvertDataBuilder
and AdvertDataParser in the firmware.
5. Transport code naming: code1/code2 instead of nextHop/lastHop,
matching firmware's transport_codes[0]/transport_codes[1] naming.
Fixes applied to both cmd/ingestor/decoder.go and cmd/server/decoder.go.
Tests updated to match new behavior.
When go-test or node-test fails, the workflow run is now cancelled
via the GitHub API so the sibling job doesn't sit queued/running.
Also fixed build job to need both go-test AND node-test (was only
waiting on go-test despite the pipeline comment saying both gate it).
The flat 'deploy' concurrency group caused ALL PRs to share one queue,
so pushing to any PR would cancel CI runs on other PRs.
Changed to deploy-${{ github.event.pull_request.number || github.ref }}
so each PR gets its own concurrency group while re-pushes to the same
PR still cancel the previous run.
- /api/stats: 10s server-side cache — was running 5 SQLite COUNT queries
on every call, taking ~1500ms with 28 concurrent WS clients polling every 15s
- GetNodeHashSizeInfo: 15s cache — was doing a full O(n) scan + JSON unmarshal
of all advert packets in memory on every /nodes request, taking ~1200ms
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Linux Docker doesn't resolve host.docker.internal by default.
Required when MQTT sources in config.json point to the host machine.
Harmless on Docker Desktop where it already works.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without build: directive, docker compose tries to pull corescope:latest
from Docker Hub instead of building locally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add preflight check for 'docker compose' in manage.sh (catches plugin missing)
- Document named Caddy volumes as cert storage, not user data
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all legacy docker run code paths. manage.sh is now a pure
docker compose wrapper with no dual-mode branching.
Removed:
- COMPOSE_MODE flag and all if/else branches
- get_docker_run_args(), get_data_mount_args(), recreate_container()
- get_required_ports(), get_current_ports(), check_port_match()
- CONTAINER_NAME, DATA_VOLUME, CADDY_VOLUME variables
- All direct docker run/stop/start/rm invocations
All commands now delegate to docker compose:
- start → docker compose up -d prod
- stop → docker compose down / docker compose stop
- restart → docker compose up -d --force-recreate
- update → docker compose build prod + up -d --force-recreate
- reset → docker compose down --rmi local
- backup/restore use bind mount path from .env (PROD_DATA_DIR)
- verify_health, mqtt-test, status all use corescope-prod
Net result: -248 lines, zero dual-mode logic, identical behavior
to running docker compose directly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents what existing users need to update when the rename
from MeshCore Analyzer to CoreScope lands:
- Git remote URL update
- Docker image/container name changes
- Config branding.siteName (if customized)
- CI/CD references (if applicable)
- Confirms data dirs, MQTT, browser state unchanged
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The glob trick COPY .git-commi[t] only works with BuildKit.
manage.sh uses legacy docker build. Just create a default via RUN.
Commit hash comes through --build-arg ldflags anyway.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-28 13:36:37 -07:00
355 changed files with 104051 additions and 55473 deletions
description: "A specialized agent for reviewing pull requests in the meshcore-analyzer repository. It focuses on SOLID, DRY, testing, Go best practices, frontend testability, observability, and performance to prevent regressions and maintain high code quality."
model: "gpt-5.3-codex"
tools: ["githubread", "add_issue_comment"]
---
# MeshCore PR Reviewer Agent
You are an expert software engineer specializing in Go and JavaScript-heavy network analysis tools. Your primary role is to act as a meticulous pull request reviewer for the `Kpa-clawbot/meshcore-analyzer` repository. You are deeply familiar with its architecture, as outlined in `AGENTS.md`, and you enforce its rules rigorously.
Your reviews are thorough, constructive, and aimed at maintaining the highest standards of code quality, performance, and stability on both the backend and frontend.
## Core Principles
1.**Context is King**: Before any review, consult the `AGENTS.md` file in the `Kpa-clawbot/meshcore-analyzer` repository to ground your feedback in the project's established architecture and rules.
2.**Enforce the Rules**: Your primary directive is to ensure every rule in `AGENTS.md` is followed. Call out any deviation.
3.**Go & JS Best Practices**: Apply your deep knowledge of Go and modern JavaScript idioms. Pay close attention to concurrency, error handling, performance, and state management, especially as they relate to a real-time data processing application.
4.**Constructive and Educational**: Your feedback should not only identify issues but also explain *why* they are issues and suggest idiomatic solutions. Your goal is to mentor and elevate the codebase and its contributors.
5.**Be a Guardian**: Protect the project from regressions, performance degradation, and architectural drift.
## Review Focus Areas
You will pay special attention to the following areas during your review:
- **SOLID & DRY**: Does the change adhere to SOLID principles? Is there duplicated logic that could be refactored? Does it respect the existing separation of concerns?
- **Project Architecture**: Does the PR respect the single Node.js server + static frontend architecture? Are changes in the right place?
### 2. Testing and Validation
- **No commit without tests**: Is the backend logic change covered by unit tests? Is `test-packet-filter.js` or `test-aging.js` updated if necessary?
- **Browser Validation**: Has the contributor confirmed the change works in a browser? Is there a screenshot for visual changes?
- **Cache Busters**: If any `public/` assets (`.js`, `.css`) were modified, has the cache buster in `public/index.html` been bumped in the *same commit*? This is critical.
### 3. Go-Specific Concerns
- **Concurrency**: Are goroutines used safely? Are there potential race conditions? Is synchronization used correctly?
- **Error Handling**: Is error handling explicit and clear? Are errors wrapped with context where appropriate?
- **Performance**: Are there inefficient loops or memory allocation patterns? Scrutinize any new data processing logic.
- **Go Idioms**: Does the code follow standard Go idioms and formatting (`gofmt`)?
### 4. Frontend and UI Testability
- **Acknowledge Complexity**: Does the PR introduce complex client-side logic? Recognize that browser-based functionality is difficult to unit test.
- **Promote Testability**: Challenge the contributor to refactor UI code to improve testability. Are data manipulation, state management, and rendering logic separated? Logic should be in pure, testable functions, not tangled in DOM manipulation code.
- **UI Logic Purity**: Scrutinize client-side JavaScript. Are there large, monolithic functions? Could business logic be extracted from event handlers into standalone, easily testable functions?
- **State Management**: How is client-side state managed? Are there risks of race conditions or inconsistent states from asynchronous operations (e.g., API calls)?
### 5. Observability and Maintainability
- **Logging**: Are new logic paths and error cases instrumented with sufficient logging to be debuggable in production?
- **Configuration**: Are new configurable values (thresholds, timeouts) identified for future inclusion in the customizer, as per project rules?
- **Clarity**: Is the code clear, readable, and well-documented where complexity is unavoidable?
### 6. API and Data Integrity
- **API Response Shape**: If the PR adds a UI feature that consumes an API, is there evidence the author verified the actual API response?
- **Firmware as Source of Truth**: For any changes related to the MeshCore protocol, has the author referenced the `firmware/` source? Challenge any "magic numbers" or assumptions about packet structure.
## Review Process
1.**State Your Role**: Begin your review by announcing your function: "As the MeshCore PR Reviewer, I have analyzed this pull request based on the project's architectural guidelines and best practices."
2.**Provide a Summary**: Give a high-level summary of your findings (e.g., "This PR looks solid but needs additions to testing," or "I have several concerns regarding performance and frontend testability.").
3.**Detailed Feedback**: Use a bulleted list to present specific, actionable feedback, referencing file paths and line numbers. For each point, cite the relevant principle or project rule (e.g., "Missing Test Coverage (Rule #1)", "UI Logic Purity (Focus Area #4)").
4.**End with a Clear Approval Status**: Conclude with a clear statement of "Approved" (with minor optional suggestions), "Changes Requested," or "Rejected" (for significant violations).
---
name: "MeshCore PR Reviewer"
description: "A specialized agent for reviewing pull requests in the meshcore-analyzer repository. It focuses on SOLID, DRY, testing, Go best practices, frontend testability, observability, and performance to prevent regressions and maintain high code quality."
model: "gpt-5.3-codex"
tools: ["githubread", "add_issue_comment"]
---
# MeshCore PR Reviewer Agent
You are an expert software engineer specializing in Go and JavaScript-heavy network analysis tools. Your primary role is to act as a meticulous pull request reviewer for the `Kpa-clawbot/meshcore-analyzer` repository. You are deeply familiar with its architecture, as outlined in `AGENTS.md`, and you enforce its rules rigorously.
Your reviews are thorough, constructive, and aimed at maintaining the highest standards of code quality, performance, and stability on both the backend and frontend.
## Core Principles
1.**Context is King**: Before any review, consult the `AGENTS.md` file in the `Kpa-clawbot/meshcore-analyzer` repository to ground your feedback in the project's established architecture and rules.
2.**Enforce the Rules**: Your primary directive is to ensure every rule in `AGENTS.md` is followed. Call out any deviation.
3.**Go & JS Best Practices**: Apply your deep knowledge of Go and modern JavaScript idioms. Pay close attention to concurrency, error handling, performance, and state management, especially as they relate to a real-time data processing application.
4.**Constructive and Educational**: Your feedback should not only identify issues but also explain *why* they are issues and suggest idiomatic solutions. Your goal is to mentor and elevate the codebase and its contributors.
5.**Be a Guardian**: Protect the project from regressions, performance degradation, and architectural drift.
## Review Focus Areas
You will pay special attention to the following areas during your review:
- **SOLID & DRY**: Does the change adhere to SOLID principles? Is there duplicated logic that could be refactored? Does it respect the existing separation of concerns?
- **Project Architecture**: Does the PR respect the single Node.js server + static frontend architecture? Are changes in the right place?
### 2. Testing and Validation
- **No commit without tests**: Is the backend logic change covered by unit tests? Is `test-packet-filter.js` or `test-aging.js` updated if necessary?
- **Browser Validation**: Has the contributor confirmed the change works in a browser? Is there a screenshot for visual changes?
- **Cache Busters**: If any `public/` assets (`.js`, `.css`) were modified, has the cache buster in `public/index.html` been bumped in the *same commit*? This is critical.
### 3. Go-Specific Concerns
- **Concurrency**: Are goroutines used safely? Are there potential race conditions? Is synchronization used correctly?
- **Error Handling**: Is error handling explicit and clear? Are errors wrapped with context where appropriate?
- **Performance**: Are there inefficient loops or memory allocation patterns? Scrutinize any new data processing logic.
- **Go Idioms**: Does the code follow standard Go idioms and formatting (`gofmt`)?
### 4. Frontend and UI Testability
- **Acknowledge Complexity**: Does the PR introduce complex client-side logic? Recognize that browser-based functionality is difficult to unit test.
- **Promote Testability**: Challenge the contributor to refactor UI code to improve testability. Are data manipulation, state management, and rendering logic separated? Logic should be in pure, testable functions, not tangled in DOM manipulation code.
- **UI Logic Purity**: Scrutinize client-side JavaScript. Are there large, monolithic functions? Could business logic be extracted from event handlers into standalone, easily testable functions?
- **State Management**: How is client-side state managed? Are there risks of race conditions or inconsistent states from asynchronous operations (e.g., API calls)?
### 5. Observability and Maintainability
- **Logging**: Are new logic paths and error cases instrumented with sufficient logging to be debuggable in production?
- **Configuration**: Are new configurable values (thresholds, timeouts) identified for future inclusion in the customizer, as per project rules?
- **Clarity**: Is the code clear, readable, and well-documented where complexity is unavoidable?
### 6. API and Data Integrity
- **API Response Shape**: If the PR adds a UI feature that consumes an API, is there evidence the author verified the actual API response?
- **Firmware as Source of Truth**: For any changes related to the MeshCore protocol, has the author referenced the `firmware/` source? Challenge any "magic numbers" or assumptions about packet structure.
## Review Process
1.**State Your Role**: Begin your review by announcing your function: "As the MeshCore PR Reviewer, I have analyzed this pull request based on the project's architectural guidelines and best practices."
2.**Provide a Summary**: Give a high-level summary of your findings (e.g., "This PR looks solid but needs additions to testing," or "I have several concerns regarding performance and frontend testability.").
3.**Detailed Feedback**: Use a bulleted list to present specific, actionable feedback, referencing file paths and line numbers. For each point, cite the relevant principle or project rule (e.g., "Missing Test Coverage (Rule #1)", "UI Logic Purity (Focus Area #4)").
4.**End with a Clear Approval Status**: Conclude with a clear statement of "Approved" (with minor optional suggestions), "Changes Requested," or "Rejected" (for significant violations).
CoreScope has 14 test files, 4,290 lines of test code. Backend coverage 85%+, frontend 42%+. Tests use Node.js native runner, Playwright for E2E, c8/nyc for coverage, supertest for API routes. vm.createContext pattern used for testing frontend helpers in Node.js.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- E2E run 2026-03-26: 12/16 passed, 4 failed. Results:
- ✅ Home page loads
- ✅ Nodes page loads with data
- ❌ Map page loads with markers — No markers found (empty DB, no geo data)
- ✅ Packets page loads with filter
- ✅ Node detail loads
- ✅ Theme customizer opens
- ✅ Dark mode toggle
- ✅ Analytics page loads
- ✅ Map heat checkbox persists in localStorage
- ✅ Map heat checkbox is clickable
- ✅ Live heat disabled when ghosts mode active
- ✅ Live heat checkbox persists in localStorage
- ✅ Heatmap opacity persists in localStorage
- ❌ Live heatmap opacity persists — browser closed before test ran (bug: browser.close() on line 274 is before tests 14-16)
- ❌ Customizer has separate map/live opacity sliders — same browser-closed bug
- ❌ Map re-renders on resize — same browser-closed bug
- BUG FOUND: test-e2e-playwright.js line 274 calls `await browser.close()` before tests 14, 15, 16 execute. Those 3 tests will always fail. The `browser.close()` must be moved after all tests.
- The "Map page loads with markers" failure is expected with an empty local DB — no nodes with coordinates exist to render markers.
- FIX APPLIED 2026-03-26: Moved `browser.close()` from between test 13 and test 14 to after test 16 (just before the summary). Tests 14 ("Live heatmap opacity persists") and 15 ("Customizer has separate map/live opacity sliders") now pass. Test 16 ("Map re-renders on resize") now runs but fails due to empty DB (no markers to count) — same root cause as test 3. Result: 14/16 pass, 2 fail (both map-marker tests, expected with empty DB).
- TESTS ADDED 2026-03-26: Issue #127 (copyToClipboard) — 8 unit tests in test-frontend-helpers.js using vm.createContext + DOM/clipboard mocks. Tests cover: fallback path (execCommand success/fail/throw), clipboard API path, null/undefined input, textarea lifecycle, no-callback usage. Pattern: `makeClipboardSandbox(opts)` helper builds sandbox with configurable navigator.clipboard and document.execCommand mocks. Total frontend helper tests: 47→55.
- TESTS ADDED 2026-03-26: Issue #125 (packet detail dismiss) — 1 E2E test in test-e2e-playwright.js. Tests: click row → pane opens (empty class removed) → click ✕ → pane closes (empty class restored). Skips gracefully when DB has no packets. Inserted before analytics group, before browser.close().
- E2E SPEED OPTIMIZATION 2026-03-26: Rewrote test-e2e-playwright.js for performance per Kobayashi's audit. Changes:
- Replaced ALL 19 `waitUntil: 'networkidle'` → `'domcontentloaded'` + targeted `waitForSelector`/`waitForFunction`. networkidle stalls ~500ms+ per navigation due to persistent WebSocket + Leaflet tiles.
- Eliminated 11 of 12 `waitForTimeout` sleeps → event-driven waits (waitForSelector, waitForFunction). Only 1 remains: 500ms for packet filter debounce (was 1500ms).
- Reordered tests into page groups to eliminate 7 redundant navigations (page.goto 14→7): Home(1,6,7), Nodes(2,5), Map(3,9,10,13,16), Packets(4), Analytics(8), Live(11,12), NoNav(14,15).
- Reduced default timeout from 15s to 10s.
- All 17 test names and assertions preserved unchanged.
- Verified: 17/17 tests pass against local server with generated test data.
- TOTAL: ~757s (~12.6 min locally). CI reports ~13 min (matches).
- ROOT CAUSE: collect-frontend-coverage.js is a 978-line script that launches a SECOND Playwright browser and exhaustively clicks every UI element on every page to maximize code coverage. It contains:
- 169 explicit `waitForTimeout()` calls totaling 104.1s (1.74 min) of hard sleep
- REGRESSION TESTS ADDED 2026-03-27: Memory optimization (observation deduplication). 8 new tests in test-packet-store.js under "=== Observation deduplication (transmission_id refs) ===" section. Tests verify: (1) observations don't duplicate raw_hex/decoded_json, (2) transmission fields accessible via store.byTxId.get(obs.transmission_id), (3) query() and all() still return transmission fields for backward compat, (4) multiple observations share one transmission_id, (5) getSiblings works after dedup, (6) queryGrouped returns transmission fields, (7) memory estimate reflects dedup savings. 4 tests fail pre-fix (expected — Hicks hasn't applied changes yet), 4 pass (backward compat). Pattern: use hasOwnProperty() to distinguish own vs inherited/absent fields.
- REVIEW 2026-03-27: Hicks RAM fix (observation dedup). REJECTED. Tests pass (42 packet-store + 204 route), but 5 server.js consumers access `.hash`, `.raw_hex`, `.decoded_json`, `.payload_type` on lean observations from `byObserver.get()` or `tx.observations` without enrichment. Broken endpoints: (1) `/api/nodes/bulk-health` line 1141 `o.hash` undefined, (2) `/api/nodes/network-status` line 1220 `o.hash` undefined, (3) `/api/analytics/signal` lines 1298+1306 `p.hash`/`p.raw_hex` undefined, (4) `/api/observers/:id/analytics` lines 2320+2329+2361 `p.payload_type`/`p.decoded_json` undefined + lean objects sent to client as recentPackets, (5) `/api/analytics/subpaths` line 2711 `o.hash` undefined. All are regional filtering or analytics code paths that use `byObserver` directly. Fix: either enrich at these call sites or store `hash` on observations (it's small). The enrichment pattern works for `getById()`, `getSiblings()`, and `/api/packets/:id` but was not applied to the 5 other consumers. Route tests pass because they don't assert on these specific field values in analytics responses.
- BATCH REVIEW 2026-03-27: Reviewed 6 issue fixes pushed without sign-off. Full suite: 971 tests, 0 failures across 11 test files. Cache busters uniform (v=1774625000). Verdicts:
-#133 (phantom nodes): ✅ APPROVED. 12 assertions on removePhantomNodes, real db.js code, edge cases (idempotency, real node preserved, stats filtering).
-#123 (channel hash): ⚠️ APPROVED WITH NOTES. 6 new decoder tests cover channelHashHex (zero-padding) and decryptionStatus (no_key ×3, decryption_failed). Missing: `decrypted` status untested (needs valid crypto key), frontend rendering of "Ch 0xXX (no key)" untested.
-#130 (disappearing nodes): ✅ APPROVED. 8 pruneStaleNodes tests cover dim/restore/remove for API vs WS nodes. Real live.js via vm.createContext.
-#131 (auto-updating nodes): ⚠️ APPROVED WITH NOTES. 8 solid isAdvertMessage tests (real code). BUT 5 WS handler tests are source-string-match checks (`src.includes('loadNodes(true)')`) — these verify code exists but not that it works at runtime. No runtime test for debounce batching behavior.
-#129 (observer comparison): ✅ APPROVED. 11 comprehensive tests for comparePacketSets — all edge cases, performance (10K hashes <500ms), mathematical invariant. Real compare.js via vm.createContext.
- NOTES FOR IMPROVEMENT: (1) #131 debounce behavior should get a runtime test via vm.createContext, not string checks. (2) #123 could benefit from a `decrypted` status test if crypto mocking is feasible. Neither is blocking.
- TEST GAP FIX 2026-03-27: Closed both noted gaps from batch review:
-#123 (channel hash decryption `decrypted` status): 3 new tests in test-decoder.js. Used require.cache mocking to swap ChannelCrypto module with mock that returns `{success:true, data:{...}}`. Tests cover: (1) decrypted status with sender+message (text formatted as "Sender: message"), (2) decrypted without sender (text is just message), (3) multiple keys tried, first match wins (verifies iteration order + call count). All verify channelHashHex, type='CHAN', channel name, sender, timestamp, flags. require.cache is restored in finally block.
-#131 (WS handler runtime tests): Rewrote 5 `src.includes()` string-match tests to use vm.createContext with runtime execution. Created `makeNodesWsSandbox()` helper that provides controllable setTimeout (timer queue), mock DOM, tracked api/invalidateApiCache calls, and real `debouncedOnWS` logic. Tests run actual nodes.js init() and verify: (1) ADVERT triggers refresh with 5s debounce, (2) non-ADVERT doesn't trigger refresh, (3) debounce collapses 3 ADVERTs into 1 API call, (4) _allNodes cache reset forces re-fetch, (5) scroll/selection preserved (panel innerHTML + scrollTop untouched by WS handler). Total: 87 frontend helper tests (same count — 5 replaced, not added), 61 decoder tests (+3).
- Technique learned: require.cache mocking is effective for testing code paths that depend on external modules (like ChannelCrypto). Store original, replace exports, restore in finally. Controllable setTimeout (capturing callbacks in array, firing manually) enables testing debounce logic without real timers.
# Bishop — History
## Project Context
CoreScope has 14 test files, 4,290 lines of test code. Backend coverage 85%+, frontend 42%+. Tests use Node.js native runner, Playwright for E2E, c8/nyc for coverage, supertest for API routes. vm.createContext pattern used for testing frontend helpers in Node.js.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- E2E run 2026-03-26: 12/16 passed, 4 failed. Results:
- ✅ Home page loads
- ✅ Nodes page loads with data
- ❌ Map page loads with markers — No markers found (empty DB, no geo data)
- ✅ Packets page loads with filter
- ✅ Node detail loads
- ✅ Theme customizer opens
- ✅ Dark mode toggle
- ✅ Analytics page loads
- ✅ Map heat checkbox persists in localStorage
- ✅ Map heat checkbox is clickable
- ✅ Live heat disabled when ghosts mode active
- ✅ Live heat checkbox persists in localStorage
- ✅ Heatmap opacity persists in localStorage
- ❌ Live heatmap opacity persists — browser closed before test ran (bug: browser.close() on line 274 is before tests 14-16)
- ❌ Customizer has separate map/live opacity sliders — same browser-closed bug
- ❌ Map re-renders on resize — same browser-closed bug
- BUG FOUND: test-e2e-playwright.js line 274 calls `await browser.close()` before tests 14, 15, 16 execute. Those 3 tests will always fail. The `browser.close()` must be moved after all tests.
- The "Map page loads with markers" failure is expected with an empty local DB — no nodes with coordinates exist to render markers.
- FIX APPLIED 2026-03-26: Moved `browser.close()` from between test 13 and test 14 to after test 16 (just before the summary). Tests 14 ("Live heatmap opacity persists") and 15 ("Customizer has separate map/live opacity sliders") now pass. Test 16 ("Map re-renders on resize") now runs but fails due to empty DB (no markers to count) — same root cause as test 3. Result: 14/16 pass, 2 fail (both map-marker tests, expected with empty DB).
- TESTS ADDED 2026-03-26: Issue #127 (copyToClipboard) — 8 unit tests in test-frontend-helpers.js using vm.createContext + DOM/clipboard mocks. Tests cover: fallback path (execCommand success/fail/throw), clipboard API path, null/undefined input, textarea lifecycle, no-callback usage. Pattern: `makeClipboardSandbox(opts)` helper builds sandbox with configurable navigator.clipboard and document.execCommand mocks. Total frontend helper tests: 47→55.
- TESTS ADDED 2026-03-26: Issue #125 (packet detail dismiss) — 1 E2E test in test-e2e-playwright.js. Tests: click row → pane opens (empty class removed) → click ✕ → pane closes (empty class restored). Skips gracefully when DB has no packets. Inserted before analytics group, before browser.close().
- E2E SPEED OPTIMIZATION 2026-03-26: Rewrote test-e2e-playwright.js for performance per Kobayashi's audit. Changes:
- Replaced ALL 19 `waitUntil: 'networkidle'` → `'domcontentloaded'` + targeted `waitForSelector`/`waitForFunction`. networkidle stalls ~500ms+ per navigation due to persistent WebSocket + Leaflet tiles.
- Eliminated 11 of 12 `waitForTimeout` sleeps → event-driven waits (waitForSelector, waitForFunction). Only 1 remains: 500ms for packet filter debounce (was 1500ms).
- Reordered tests into page groups to eliminate 7 redundant navigations (page.goto 14→7): Home(1,6,7), Nodes(2,5), Map(3,9,10,13,16), Packets(4), Analytics(8), Live(11,12), NoNav(14,15).
- Reduced default timeout from 15s to 10s.
- All 17 test names and assertions preserved unchanged.
- Verified: 17/17 tests pass against local server with generated test data.
- TOTAL: ~757s (~12.6 min locally). CI reports ~13 min (matches).
- ROOT CAUSE: collect-frontend-coverage.js is a 978-line script that launches a SECOND Playwright browser and exhaustively clicks every UI element on every page to maximize code coverage. It contains:
- 169 explicit `waitForTimeout()` calls totaling 104.1s (1.74 min) of hard sleep
- REGRESSION TESTS ADDED 2026-03-27: Memory optimization (observation deduplication). 8 new tests in test-packet-store.js under "=== Observation deduplication (transmission_id refs) ===" section. Tests verify: (1) observations don't duplicate raw_hex/decoded_json, (2) transmission fields accessible via store.byTxId.get(obs.transmission_id), (3) query() and all() still return transmission fields for backward compat, (4) multiple observations share one transmission_id, (5) getSiblings works after dedup, (6) queryGrouped returns transmission fields, (7) memory estimate reflects dedup savings. 4 tests fail pre-fix (expected — Hicks hasn't applied changes yet), 4 pass (backward compat). Pattern: use hasOwnProperty() to distinguish own vs inherited/absent fields.
- REVIEW 2026-03-27: Hicks RAM fix (observation dedup). REJECTED. Tests pass (42 packet-store + 204 route), but 5 server.js consumers access `.hash`, `.raw_hex`, `.decoded_json`, `.payload_type` on lean observations from `byObserver.get()` or `tx.observations` without enrichment. Broken endpoints: (1) `/api/nodes/bulk-health` line 1141 `o.hash` undefined, (2) `/api/nodes/network-status` line 1220 `o.hash` undefined, (3) `/api/analytics/signal` lines 1298+1306 `p.hash`/`p.raw_hex` undefined, (4) `/api/observers/:id/analytics` lines 2320+2329+2361 `p.payload_type`/`p.decoded_json` undefined + lean objects sent to client as recentPackets, (5) `/api/analytics/subpaths` line 2711 `o.hash` undefined. All are regional filtering or analytics code paths that use `byObserver` directly. Fix: either enrich at these call sites or store `hash` on observations (it's small). The enrichment pattern works for `getById()`, `getSiblings()`, and `/api/packets/:id` but was not applied to the 5 other consumers. Route tests pass because they don't assert on these specific field values in analytics responses.
- BATCH REVIEW 2026-03-27: Reviewed 6 issue fixes pushed without sign-off. Full suite: 971 tests, 0 failures across 11 test files. Cache busters uniform (v=1774625000). Verdicts:
-#133 (phantom nodes): ✅ APPROVED. 12 assertions on removePhantomNodes, real db.js code, edge cases (idempotency, real node preserved, stats filtering).
-#123 (channel hash): ⚠️ APPROVED WITH NOTES. 6 new decoder tests cover channelHashHex (zero-padding) and decryptionStatus (no_key ×3, decryption_failed). Missing: `decrypted` status untested (needs valid crypto key), frontend rendering of "Ch 0xXX (no key)" untested.
-#130 (disappearing nodes): ✅ APPROVED. 8 pruneStaleNodes tests cover dim/restore/remove for API vs WS nodes. Real live.js via vm.createContext.
-#131 (auto-updating nodes): ⚠️ APPROVED WITH NOTES. 8 solid isAdvertMessage tests (real code). BUT 5 WS handler tests are source-string-match checks (`src.includes('loadNodes(true)')`) — these verify code exists but not that it works at runtime. No runtime test for debounce batching behavior.
-#129 (observer comparison): ✅ APPROVED. 11 comprehensive tests for comparePacketSets — all edge cases, performance (10K hashes <500ms), mathematical invariant. Real compare.js via vm.createContext.
- NOTES FOR IMPROVEMENT: (1) #131 debounce behavior should get a runtime test via vm.createContext, not string checks. (2) #123 could benefit from a `decrypted` status test if crypto mocking is feasible. Neither is blocking.
- TEST GAP FIX 2026-03-27: Closed both noted gaps from batch review:
-#123 (channel hash decryption `decrypted` status): 3 new tests in test-decoder.js. Used require.cache mocking to swap ChannelCrypto module with mock that returns `{success:true, data:{...}}`. Tests cover: (1) decrypted status with sender+message (text formatted as "Sender: message"), (2) decrypted without sender (text is just message), (3) multiple keys tried, first match wins (verifies iteration order + call count). All verify channelHashHex, type='CHAN', channel name, sender, timestamp, flags. require.cache is restored in finally block.
-#131 (WS handler runtime tests): Rewrote 5 `src.includes()` string-match tests to use vm.createContext with runtime execution. Created `makeNodesWsSandbox()` helper that provides controllable setTimeout (timer queue), mock DOM, tracked api/invalidateApiCache calls, and real `debouncedOnWS` logic. Tests run actual nodes.js init() and verify: (1) ADVERT triggers refresh with 5s debounce, (2) non-ADVERT doesn't trigger refresh, (3) debounce collapses 3 ADVERTs into 1 API call, (4) _allNodes cache reset forces re-fetch, (5) scroll/selection preserved (panel innerHTML + scrollTop untouched by WS handler). Total: 87 frontend helper tests (same count — 5 replaced, not added), 61 decoder tests (+3).
- Technique learned: require.cache mocking is effective for testing code paths that depend on external modules (like ChannelCrypto). Store original, replace exports, restore in finally. Controllable setTimeout (capturing callbacks in array, firing manually) enables testing debounce logic without real timers.
- **Massive session 2026-03-27 (FULL DAY):** Reviewed and approved all 6 fixes, closed 2 test gaps, validated E2E:
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend. Custom decoder.js fixes path_length bug from upstream library. In-memory packet store provides O(1) lookups for 30K+ packets. TTL response cache achieves 7,000× speedup on bulk health endpoint.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- Split the monolithic "Frontend coverage (instrumented Playwright)" CI step into 5 discrete steps: Instrument frontend JS, Start test server (with health-check poll replacing sleep 5), Run Playwright E2E tests, Extract coverage + generate report, Stop test server. Cleanup/report steps use `if: always()` so server shutdown happens even on test failure. Server PID shared across steps via .server.pid file. "Frontend E2E only" fast-path left untouched.
- Fixed memory explosion in packet-store.js: observations no longer duplicate transmission fields (hash, raw_hex, decoded_json, payload_type, route_type). Instead, observations store only `transmission_id` as a reference. Added `_enrichObs()` to hydrate observations at API boundaries (getById, getSiblings, enrichObservations). Replaced `.all()` with `.iterate()` for streaming load. Updated `_transmissionsForObserver()` to use transmission_id instead of hash. For a 185MB DB with 50K transmissions × 23 observations avg, this eliminates ~1.17M copies of hex dumps and JSON — projected ~2GB RAM savings.
- Built standalone Go MQTT ingestor (`cmd/ingestor/`). Ported decoder.js → Go (header parsing, path extraction, all payload types, advert decoding with flags/lat/lon/name). Ported db.js v3 schema (transmissions + observations + nodes + observers). Ported computeContentHash (SHA-256 based, path-independent). Uses modernc.org/sqlite (pure Go, no CGO) and paho.mqtt.golang. 25 tests passing (decoder golden fixtures from production data + DB schema compatibility). Supports same config.json format as Node.js server. Handles Format 1 (raw packet) messages; companion bridge format deferred. System Go was 1.17 — installed Go 1.22.5 to support modern dependencies.
- Built standalone Go web server (`cmd/server/`) — READ side of the Go rewrite. 35+ REST API endpoints ported from server.js. All queries go directly to SQLite (no in-memory packet store). WebSocket broadcast via SQLite polling. Static file server with SPA fallback. Uses gorilla/mux for routing, gorilla/websocket for WS, modernc.org/sqlite for DB. 42 tests passing (20 DB query tests, 20+ route integration tests, 2 WebSocket tests). `go vet` clean. Binary compiles to single executable. Analytics endpoints that required Node.js in-memory store (topology, distance, hash-sizes, subpaths) return structural stubs — core data (RF stats, channels, node health, etc.) fully functional via SQL. System Go 1.17 → installed Go 1.22 for build. Each cmd/* module has its own go.mod (no root-level go.mod).
- Go server API parity fix: Rewrote QueryPackets from observation-centric (packets_v view) to transmission-centric (transmissions table + correlated subqueries). This fixes both performance (9s to sub-100ms for unfiltered queries on 1.2M rows) and response shape. Packets now return first_seen, timestamp (= first_seen), observation_count, and NOT created_at/payload_version/score. Node responses now include last_heard (= last_seen fallback), hash_size (null), hash_size_inconsistent (false). Added schema version detection (v2 vs v3 observations table). Fixed QueryGroupedPackets first_seen. Added GetRecentTransmissionsForNode. All tests pass, build clean with Go 1.22.
- Fixed #133 (node count keeps climbing): `db.getStats().totalNodes` used `SELECT COUNT(*) FROM nodes` which counts every node ever seen — 6800+ on a ~200-400 node mesh. Changed `totalNodes` to count only nodes with `last_seen` within 7 days. Added `totalNodesAllTime` for the full historical count. Also filtered role counts in `/api/stats` to the same 7-day window. Added `countActiveNodes` and `countActiveNodesByRole` prepared statements in db.js. 6 new tests (95 total in test-db.js). The existing `idx_nodes_last_seen` index covers the new queries.
- Go server FULL API parity: Rewrote QueryGroupedPackets from packets_v VIEW scan (8s on 1.2M rows) to transmission-centric query (<100ms). Fixed GetStats to use 7-day window for totalNodes + added totalNodesAllTime. Split GetRoleCounts into 7-day (for /api/stats) and all-time (for /api/nodes). Added packetsLastHour + node lat/lon/role to /api/observers via batch queries (GetObserverPacketCounts, GetNodeLocations). Added multi-node filter support (/api/packets?nodes=pk1,pk2). Fixed /api/packets/:id to return parsed path_json in path field. Populated bulk-health per-node stats from SQL. Updated test seed data to use dynamic timestamps for 7-day filter compatibility. All 42+ tests pass, go vet clean.
- Fixed #133 ROOT CAUSE (phantom nodes): `autoLearnHopNodes` in server.js was calling `db.upsertNode()` for every unresolved hop prefix, creating thousands of fake "repeater" nodes with short public_keys (just the 2-4 byte hop prefix). Removed the `upsertNode` call entirely — unresolved hops are now simply cached to skip repeat DB lookups, and display as raw hex prefixes via hop-resolver. Added `db.removePhantomNodes()` that deletes nodes with `LENGTH(public_key) <= 16` (real pubkeys are 64 hex chars). Called at server startup to purge existing phantoms. 14 new test assertions (109 total in test-db.js).
- Fixed #126 (offline node showing on map due to hash prefix collision): `updatePathSeenTimestamps()` and `autoLearnHopNodes()` used `LIKE prefix%` DB queries that non-deterministically picked the first match when multiple nodes shared a hash prefix (e.g. `1CC4` and `1C82` both start with `1C` under 1-byte hash_size). Extracted `resolveUniquePrefixMatch()` that checks for uniqueness — ambiguous prefixes (matching 2+ nodes) are skipped and cached in a negative-cache Set. This prevents dead nodes from getting `last_heard` updates from packets that actually belong to a different node. 3 new tests (207 total in test-server-routes.js).
- Fixed #123 (channel hash for undecrypted GRP_TXT): Added `channelHashHex` (zero-padded uppercase hex) and `decryptionStatus` ('decrypted'|'no_key'|'decryption_failed') fields to `decodeGrpTxt` in decoder.js. Distinguishes between "no channel keys configured" vs "keys tried but decryption failed." Frontend packets.js updated: list preview shows "🔒 Ch 0xXX (status)", detail pane hex breakdown and message area show channel hash with status label. 6 new tests (58 total in test-decoder.js).
- Ported in-memory packet store to Go (`cmd/server/store.go`). PacketStore loads all transmissions + observations from SQLite at startup via streaming query (no .all()), builds 5 indexes (byHash, byTxID, byObsID, byObserver, byNode), picks longest-path observation per transmission for display fields. QueryPackets and QueryGroupedPackets serve from memory with full filter support (type, route, observer, hash, since, until, region, node). Poller ingests new transmissions into store via IngestNewFromDB. Server/routes fall back to direct DB queries when store is nil (backward-compatible with tests). All 42+ existing tests pass, go vet clean, go build clean. System Go 1.17 requires using Go 1.22.5 at C:\go1.22\go\bin.
- Fixed 3 critically slow Go endpoints by switching from SQLite queries against packets_v VIEW (1.2M rows) to in-memory PacketStore queries. `/api/channels` 7.2s→37ms (195×), `/api/channels/:hash/messages` 8.2s→36ms (228×), `/api/analytics/rf` 4.2s→90ms avg (47×). Key optimizations: (1) byPayloadType index reduces channels scan from 52K to 17K packets, (2) struct-based JSON decode avoids map[string]interface{} allocations, (3) per-transmission work hoisted out of 1.2M observation loop for RF, (4) eliminated second-pass time.Parse over 1.2M observations (track min/max timestamps as strings instead), (5) pre-allocated slices with capacity hints, (6) 15-second TTL cache for RF analytics (separate mutex to avoid contention with store RWMutex). Cache invalidation is TTL-only because live mesh generates continuous ingest events. Also fixed `/api/analytics/channels` to use store. All handlers fall back to DB when store is nil (test compat).
- **#133 PHANTOM NODES (ROOT CAUSE):** Backend `autoLearnHopNodes()` removed upsertNode call. Added `db.removePhantomNodes()` (pubkey ≤16 chars). Called at startup. Cascadia: 7,308 → ~200-400 active nodes. 14 new tests, all passing.
- **#133 ACTIVE WINDOW:** `/api/stats``totalNodes` now 7-day window. Added `totalNodesAllTime` for historical. Role counts filtered to 7-day. Go server GetStats updated for parity.
- **#126 AMBIGUOUS PREFIXES:** `resolveUniquePrefixMatch()` requires unique prefix match. Ambiguous prefixes skipped, cached in negative-cache. Prevents dead nodes from wrong packet attribution.
- **Go API Parity:** QueryGroupedPackets transmission-centric 8s→<100ms. Response shapes match Node.js exactly. All 42+ Go tests passing.
- **Database merge:** Staging 185MB (50K tx + 1.2M obs) merged into prod 21MB. 0 data loss. Merged DB 51,723 tx + 1,237,186 obs. Deploy time 8,491ms, memory 860MiB RSS (v.s. 2.7GB pre-RAM-fix). Backups retained 7 days.
# Hicks — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend. Custom decoder.js fixes path_length bug from upstream library. In-memory packet store provides O(1) lookups for 30K+ packets. TTL response cache achieves 7,000× speedup on bulk health endpoint.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- Split the monolithic "Frontend coverage (instrumented Playwright)" CI step into 5 discrete steps: Instrument frontend JS, Start test server (with health-check poll replacing sleep 5), Run Playwright E2E tests, Extract coverage + generate report, Stop test server. Cleanup/report steps use `if: always()` so server shutdown happens even on test failure. Server PID shared across steps via .server.pid file. "Frontend E2E only" fast-path left untouched.
- Fixed memory explosion in packet-store.js: observations no longer duplicate transmission fields (hash, raw_hex, decoded_json, payload_type, route_type). Instead, observations store only `transmission_id` as a reference. Added `_enrichObs()` to hydrate observations at API boundaries (getById, getSiblings, enrichObservations). Replaced `.all()` with `.iterate()` for streaming load. Updated `_transmissionsForObserver()` to use transmission_id instead of hash. For a 185MB DB with 50K transmissions × 23 observations avg, this eliminates ~1.17M copies of hex dumps and JSON — projected ~2GB RAM savings.
- Built standalone Go MQTT ingestor (`cmd/ingestor/`). Ported decoder.js → Go (header parsing, path extraction, all payload types, advert decoding with flags/lat/lon/name). Ported db.js v3 schema (transmissions + observations + nodes + observers). Ported computeContentHash (SHA-256 based, path-independent). Uses modernc.org/sqlite (pure Go, no CGO) and paho.mqtt.golang. 25 tests passing (decoder golden fixtures from production data + DB schema compatibility). Supports same config.json format as Node.js server. Handles Format 1 (raw packet) messages; companion bridge format deferred. System Go was 1.17 — installed Go 1.22.5 to support modern dependencies.
- Built standalone Go web server (`cmd/server/`) — READ side of the Go rewrite. 35+ REST API endpoints ported from server.js. All queries go directly to SQLite (no in-memory packet store). WebSocket broadcast via SQLite polling. Static file server with SPA fallback. Uses gorilla/mux for routing, gorilla/websocket for WS, modernc.org/sqlite for DB. 42 tests passing (20 DB query tests, 20+ route integration tests, 2 WebSocket tests). `go vet` clean. Binary compiles to single executable. Analytics endpoints that required Node.js in-memory store (topology, distance, hash-sizes, subpaths) return structural stubs — core data (RF stats, channels, node health, etc.) fully functional via SQL. System Go 1.17 → installed Go 1.22 for build. Each cmd/* module has its own go.mod (no root-level go.mod).
- Go server API parity fix: Rewrote QueryPackets from observation-centric (packets_v view) to transmission-centric (transmissions table + correlated subqueries). This fixes both performance (9s to sub-100ms for unfiltered queries on 1.2M rows) and response shape. Packets now return first_seen, timestamp (= first_seen), observation_count, and NOT created_at/payload_version/score. Node responses now include last_heard (= last_seen fallback), hash_size (null), hash_size_inconsistent (false). Added schema version detection (v2 vs v3 observations table). Fixed QueryGroupedPackets first_seen. Added GetRecentTransmissionsForNode. All tests pass, build clean with Go 1.22.
- Fixed #133 (node count keeps climbing): `db.getStats().totalNodes` used `SELECT COUNT(*) FROM nodes` which counts every node ever seen — 6800+ on a ~200-400 node mesh. Changed `totalNodes` to count only nodes with `last_seen` within 7 days. Added `totalNodesAllTime` for the full historical count. Also filtered role counts in `/api/stats` to the same 7-day window. Added `countActiveNodes` and `countActiveNodesByRole` prepared statements in db.js. 6 new tests (95 total in test-db.js). The existing `idx_nodes_last_seen` index covers the new queries.
- Go server FULL API parity: Rewrote QueryGroupedPackets from packets_v VIEW scan (8s on 1.2M rows) to transmission-centric query (<100ms). Fixed GetStats to use 7-day window for totalNodes + added totalNodesAllTime. Split GetRoleCounts into 7-day (for /api/stats) and all-time (for /api/nodes). Added packetsLastHour + node lat/lon/role to /api/observers via batch queries (GetObserverPacketCounts, GetNodeLocations). Added multi-node filter support (/api/packets?nodes=pk1,pk2). Fixed /api/packets/:id to return parsed path_json in path field. Populated bulk-health per-node stats from SQL. Updated test seed data to use dynamic timestamps for 7-day filter compatibility. All 42+ tests pass, go vet clean.
- Fixed #133 ROOT CAUSE (phantom nodes): `autoLearnHopNodes` in server.js was calling `db.upsertNode()` for every unresolved hop prefix, creating thousands of fake "repeater" nodes with short public_keys (just the 2-4 byte hop prefix). Removed the `upsertNode` call entirely — unresolved hops are now simply cached to skip repeat DB lookups, and display as raw hex prefixes via hop-resolver. Added `db.removePhantomNodes()` that deletes nodes with `LENGTH(public_key) <= 16` (real pubkeys are 64 hex chars). Called at server startup to purge existing phantoms. 14 new test assertions (109 total in test-db.js).
- Fixed #126 (offline node showing on map due to hash prefix collision): `updatePathSeenTimestamps()` and `autoLearnHopNodes()` used `LIKE prefix%` DB queries that non-deterministically picked the first match when multiple nodes shared a hash prefix (e.g. `1CC4` and `1C82` both start with `1C` under 1-byte hash_size). Extracted `resolveUniquePrefixMatch()` that checks for uniqueness — ambiguous prefixes (matching 2+ nodes) are skipped and cached in a negative-cache Set. This prevents dead nodes from getting `last_heard` updates from packets that actually belong to a different node. 3 new tests (207 total in test-server-routes.js).
- Fixed #123 (channel hash for undecrypted GRP_TXT): Added `channelHashHex` (zero-padded uppercase hex) and `decryptionStatus` ('decrypted'|'no_key'|'decryption_failed') fields to `decodeGrpTxt` in decoder.js. Distinguishes between "no channel keys configured" vs "keys tried but decryption failed." Frontend packets.js updated: list preview shows "🔒 Ch 0xXX (status)", detail pane hex breakdown and message area show channel hash with status label. 6 new tests (58 total in test-decoder.js).
- Ported in-memory packet store to Go (`cmd/server/store.go`). PacketStore loads all transmissions + observations from SQLite at startup via streaming query (no .all()), builds 5 indexes (byHash, byTxID, byObsID, byObserver, byNode), picks longest-path observation per transmission for display fields. QueryPackets and QueryGroupedPackets serve from memory with full filter support (type, route, observer, hash, since, until, region, node). Poller ingests new transmissions into store via IngestNewFromDB. Server/routes fall back to direct DB queries when store is nil (backward-compatible with tests). All 42+ existing tests pass, go vet clean, go build clean. System Go 1.17 requires using Go 1.22.5 at C:\go1.22\go\bin.
- Fixed 3 critically slow Go endpoints by switching from SQLite queries against packets_v VIEW (1.2M rows) to in-memory PacketStore queries. `/api/channels` 7.2s→37ms (195×), `/api/channels/:hash/messages` 8.2s→36ms (228×), `/api/analytics/rf` 4.2s→90ms avg (47×). Key optimizations: (1) byPayloadType index reduces channels scan from 52K to 17K packets, (2) struct-based JSON decode avoids map[string]interface{} allocations, (3) per-transmission work hoisted out of 1.2M observation loop for RF, (4) eliminated second-pass time.Parse over 1.2M observations (track min/max timestamps as strings instead), (5) pre-allocated slices with capacity hints, (6) 15-second TTL cache for RF analytics (separate mutex to avoid contention with store RWMutex). Cache invalidation is TTL-only because live mesh generates continuous ingest events. Also fixed `/api/analytics/channels` to use store. All handlers fall back to DB when store is nil (test compat).
- **#133 PHANTOM NODES (ROOT CAUSE):** Backend `autoLearnHopNodes()` removed upsertNode call. Added `db.removePhantomNodes()` (pubkey ≤16 chars). Called at startup. Cascadia: 7,308 → ~200-400 active nodes. 14 new tests, all passing.
- **#133 ACTIVE WINDOW:** `/api/stats``totalNodes` now 7-day window. Added `totalNodesAllTime` for historical. Role counts filtered to 7-day. Go server GetStats updated for parity.
- **#126 AMBIGUOUS PREFIXES:** `resolveUniquePrefixMatch()` requires unique prefix match. Ambiguous prefixes skipped, cached in negative-cache. Prevents dead nodes from wrong packet attribution.
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend with Leaflet maps, WebSocket live feed, MQTT ingestion. Production at v2.6.0, ~18K lines, 85%+ backend test coverage.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **E2E Playwright performance audit (2026-03-26):** 16 tests, single browser/context/page (good). Key bottlenecks: (1) `waitUntil: 'networkidle'` used ~20 times — catastrophic for SPA with WebSocket + map tiles, (2) ~17s of hardcoded `waitForTimeout` sleeps, (3) redundant `page.goto()` to same routes across tests, (4) CI installs Playwright browser on every run with no caching, (5) coverage collection launches a second full browser session, (6) `sleep 5` server startup instead of health-check polling. Estimated 40-50% total runtime reduction achievable.
- **Issue triage session (2026-03-27):** Triaged 4 open issues, assigned to team:
- **#131** (Feature: Auto-update nodes tab) → Newt (⚛️). Requires WebSocket real-time updates in nodes.js, similar to existing packets feed.
- **#130** (Bug: Disappearing nodes on live map) → Newt (⚛️). High severity, multiple Cascadia Mesh community reports. Likely status calculation or map filter bug. Nodes visible in static list but vanishing from live map.
- **#129** (Feature: Packet comparison between observers) → Newt (⚛️). Feature request from letsmesh analyzer. Side-by-side packet filtering for two repeaters to diagnose repeater issues.
- **#123** (Feature: Show channel hash on decrypt failure) → Hicks (🔧). Core contributor (lincomatic) request. Decoder needs to track why decrypt failed (no key vs. corruption) and expose channel hash + reason in API response.
- **Massive session — 2026-03-27 (full day):**
- **#133 root cause (phantom nodes):** `autoLearnHopNodes()` creates stub nodes for unresolved hop prefixes (2-8 hex chars). Cascadia showed 7,308 nodes (6,638 repeaters) when real size ~200-400. With `hash_size=1`, collision rate high → infinite phantom generation.
- **DB merge decision:** Staging DB (185MB, 50K transmissions, 1.2M observations) is superset. Use as merge base. Transmissions dedup by hash (unique), observations all preserved (unique by observer), nodes/observers latest-wins + sum counts. 6-phase execution plan: pre-flight, backup, merge, deploy, validate, cleanup.
- **Outcome:** All 4 triaged issues fixed (#131, #130, #129, #123), #133 (phantom nodes) fully resolved, #126 (ambiguous hop prefixes) fixed as bonus, database merged successfully (0 data loss, 2 min downtime, 51,723 tx + 1.237M obs), Go rewrite (MQTT ingestor + web server) completed and ready for staging.
- **Team expanded:** Hudson joined for DevOps work, Ripley joined as Support Engineer.
- **Go staging bug triage (2026-03-28):** Filed 8 issues for Go staging bugs missed during API parity work. All found by actually loading the analytics page in a browser — none caught by endpoint-level parity checks.
- **Post-mortem:** Parity was verified by comparing individual endpoint response shapes in isolation. Nobody loaded the analytics page in a browser and looked at it. The agents tested API responses without browser validation of the full UI — exactly the failure mode AGENTS.md rule #2 exists to prevent.
# Kobayashi — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend with Leaflet maps, WebSocket live feed, MQTT ingestion. Production at v2.6.0, ~18K lines, 85%+ backend test coverage.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **E2E Playwright performance audit (2026-03-26):** 16 tests, single browser/context/page (good). Key bottlenecks: (1) `waitUntil: 'networkidle'` used ~20 times — catastrophic for SPA with WebSocket + map tiles, (2) ~17s of hardcoded `waitForTimeout` sleeps, (3) redundant `page.goto()` to same routes across tests, (4) CI installs Playwright browser on every run with no caching, (5) coverage collection launches a second full browser session, (6) `sleep 5` server startup instead of health-check polling. Estimated 40-50% total runtime reduction achievable.
- **Issue triage session (2026-03-27):** Triaged 4 open issues, assigned to team:
- **#131** (Feature: Auto-update nodes tab) → Newt (⚛️). Requires WebSocket real-time updates in nodes.js, similar to existing packets feed.
- **#130** (Bug: Disappearing nodes on live map) → Newt (⚛️). High severity, multiple Cascadia Mesh community reports. Likely status calculation or map filter bug. Nodes visible in static list but vanishing from live map.
- **#129** (Feature: Packet comparison between observers) → Newt (⚛️). Feature request from letsmesh analyzer. Side-by-side packet filtering for two repeaters to diagnose repeater issues.
- **#123** (Feature: Show channel hash on decrypt failure) → Hicks (🔧). Core contributor (lincomatic) request. Decoder needs to track why decrypt failed (no key vs. corruption) and expose channel hash + reason in API response.
- **Massive session — 2026-03-27 (full day):**
- **#133 root cause (phantom nodes):** `autoLearnHopNodes()` creates stub nodes for unresolved hop prefixes (2-8 hex chars). Cascadia showed 7,308 nodes (6,638 repeaters) when real size ~200-400. With `hash_size=1`, collision rate high → infinite phantom generation.
- **DB merge decision:** Staging DB (185MB, 50K transmissions, 1.2M observations) is superset. Use as merge base. Transmissions dedup by hash (unique), observations all preserved (unique by observer), nodes/observers latest-wins + sum counts. 6-phase execution plan: pre-flight, backup, merge, deploy, validate, cleanup.
- **Outcome:** All 4 triaged issues fixed (#131, #130, #129, #123), #133 (phantom nodes) fully resolved, #126 (ambiguous hop prefixes) fixed as bonus, database merged successfully (0 data loss, 2 min downtime, 51,723 tx + 1.237M obs), Go rewrite (MQTT ingestor + web server) completed and ready for staging.
- **Team expanded:** Hudson joined for DevOps work, Ripley joined as Support Engineer.
- **Go staging bug triage (2026-03-28):** Filed 8 issues for Go staging bugs missed during API parity work. All found by actually loading the analytics page in a browser — none caught by endpoint-level parity checks.
- **Post-mortem:** Parity was verified by comparing individual endpoint response shapes in isolation. Nobody loaded the analytics page in a browser and looked at it. The agents tested API responses without browser validation of the full UI — exactly the failure mode AGENTS.md rule #2 exists to prevent.
CoreScope is a real-time LoRa mesh packet analyzer with a vanilla JS SPA frontend. 22 frontend modules, Leaflet maps, WebSocket live feed, VCR playback, Canvas animations, theme customizer with CSS variables. No build step, no framework. ES5/6 for broad browser support.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **Issue #127 fix:** Firefox clipboard API fails silently when `navigator.clipboard.writeText()` is called outside a secure context or without proper user gesture handling. Added `window.copyToClipboard()` shared helper to `roles.js` that tries Clipboard API first, falls back to hidden textarea + `document.execCommand('copy')`. Updated all 3 clipboard call sites: `nodes.js` (Copy URL — the reported bug), `packets.js` (Copy Link — had ugly `prompt()` fallback), `customize.js` (Copy to Clipboard — already worked but now uses shared helper). Cache busters bumped. All tests pass (47 frontend, 62 packet-filter).
- **Issue #125 fix:** Added dismiss/close button (✕) to the packet detail pane on desktop. Extracted `closeDetailPanel()` shared helper and `PANEL_CLOSE_HTML` constant — DRY: Escape handler and click handler both call it. Close button uses event delegation on `#pktRight`, styled with CSS variables (`--text-muted`, `--text`, `--surface-1`) matching the mobile `.mobile-sheet-close` pattern. Hidden when panel is in `.empty` state. Clicking a different row still re-opens with new data. Files changed: `public/packets.js`, `public/style.css`. Cache busters NOT bumped (another agent editing index.html).
- **Issue #122 fix:** Node tooltip (line 45) and node detail panel (line 120) in `channels.js` used `last_seen` alone for "Last seen" display. Changed both to `last_heard || last_seen` per AGENTS.md pitfall. Pattern: always prefer `last_heard || last_seen` for any time-ago display. **Server note for Hicks:**`/api/nodes/search` and `/api/nodes/:pubkey` endpoints don't return `last_heard` — only the bulk `/api/nodes` list endpoint computes it from the in-memory packet store. These endpoints need the same `last_heard` enrichment for the frontend fix to fully take effect. Also, `/api/analytics/channels` has a separate bug: `lastActivity` is overwritten unconditionally (no `>=` check) so it shows the oldest packet's timestamp, not the newest.
- **Issue #130 fix:** Live map `pruneStaleNodes()` (added for #133) was completely removing stale nodes from the map, while the static map dims them with CSS. Root cause: API-loaded nodes and WS-only nodes were treated identically — both got deleted when stale. Fix: mark API-loaded nodes with `_fromAPI = true` in `loadNodes()`. `pruneStaleNodes()` now dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing them, and restores full opacity when they become active again. WS-only dynamic nodes are still removed to prevent memory leaks. Pattern: **live map should match static map behavior** — never remove database-loaded nodes, only change their visual state. 3 new tests added (63 total frontend tests passing).
- **Issue #129 fix:** Added observer packet comparison feature (`#/compare` page). Users select two observers from dropdowns, click Compare, and see which packets each observer saw in the last 24 hours. Data flow: fetches packets per observer via existing `/api/packets?observer=X&limit=10000&since=24h`, computes set intersection/difference client-side using `comparePacketSets()` (O(n) via Set lookups — no nested loops). UI: three summary cards (both/only-A/only-B with counts and percentages), horizontal stacked bar chart, packet type breakdown for shared packets, and tabbed detail tables (up to 200 rows each, clickable to packet detail). URL is shareable: `#/compare?a=ID1&b=ID2`. Added 🔍 compare button to observers page header. Pure function `comparePacketSets` exposed on `window` for testability. 11 new tests (87 total frontend tests). Files: `public/compare.js` (new), `public/style.css`, `public/observers.js`, `public/index.html`, `test-frontend-helpers.js`. Cache busters bumped.
- **Browser validation of 6 fixes (2026-03-27):** Validated against live prod at `https://analyzer.00id.net`. Results: ✅ #133 (phantom nodes) — API returns 50 nodes, reasonable count, no runaway growth. ✅ #123 (channel hash on undecrypted) — GRP_TXT packets with `decryption_failed` status show `channelHashHex` field; packet detail renders `🔒 Channel Hash: 0xE2 (decryption failed)` via `packets.js:1254-1259`. ⏭ #126 (offline node on map) — skipped, requires specific dead node. ✅ #130 (disappearing nodes on live map) — `pruneStaleNodes()` confirmed at `live.js:1474` dims API-loaded nodes (`fillOpacity:0.25`) instead of removing; `_fromAPI=true` flag set at `live.js:1279`. ✅ #131 (auto-updating node list) — `nodes.js:210-216` wires `debouncedOnWS` handler that triggers `loadNodes(true)` on ADVERT messages; `isAdvertMessage()` at `nodes.js:852` checks `payload_type===4`. ✅ #129 (observer comparison) — `compare.js` deployed with full UI: observer dropdowns, `comparePacketSets()` Set logic, summary cards, bar chart, type breakdown. 16 observers available in prod. Pattern: always verify deployed JS matches source — cache buster `v=1774625000` confirmed consistent across all script tags.
- **Packet detail pane fresh-load fix:** The `detail-collapsed` class added for issue #125's close button wasn't applied on initial render, so the empty right panel was visible on fresh page load. Fix: added `detail-collapsed` to the `split-layout` div in the initial `innerHTML` template (packets.js:183). Pattern: when adding a CSS toggle class, always consider the initial DOM state — if nothing is selected, the default state must match "nothing selected." 3 tests added (90 total frontend). Cache busters bumped.
- **#130 LIVE MAP STALE DIMMING:** `pruneStaleNodes()` distinguishes API-loaded (`_fromAPI`) from WS-only. Dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing. Matches static map behavior. 3 new tests, all passing.
- **#131 NODES TAB WS AUTO-UPDATE:** `loadNodes(refreshOnly)` pattern resets cache + invalidateApiCache + re-fetches. Preserves scroll/selection/listeners. WS handler now triggers on ADVERT messages (payload_type===4). All tests passing.
- **#129 OBSERVER COMPARISON PAGE:** New `#/compare` route with shareable params `?a=ID1&b=ID2`. `comparePacketSets()` pure function (O(n) Set operations). UI: summary cards, bar chart, type breakdown, detail tables. 🔍 compare button on observers header.
- **#133 LIVE PAGE NODE PRUNING:** Prune every 60s using `getNodeStatus()` from roles.js (per-role health thresholds: 24h companions/sensors, 72h infrastructure). `_liveSeen` timestamp set on insert, updated on re-observation. Bounded memory usage.
- **Database merge:** All frontend endpoints working with merged 1.237M observation DB. Load speed verified. All 4 fixes tested end-to-end in browser.
# Newt — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer with a vanilla JS SPA frontend. 22 frontend modules, Leaflet maps, WebSocket live feed, VCR playback, Canvas animations, theme customizer with CSS variables. No build step, no framework. ES5/6 for broad browser support.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **Issue #127 fix:** Firefox clipboard API fails silently when `navigator.clipboard.writeText()` is called outside a secure context or without proper user gesture handling. Added `window.copyToClipboard()` shared helper to `roles.js` that tries Clipboard API first, falls back to hidden textarea + `document.execCommand('copy')`. Updated all 3 clipboard call sites: `nodes.js` (Copy URL — the reported bug), `packets.js` (Copy Link — had ugly `prompt()` fallback), `customize.js` (Copy to Clipboard — already worked but now uses shared helper). Cache busters bumped. All tests pass (47 frontend, 62 packet-filter).
- **Issue #125 fix:** Added dismiss/close button (✕) to the packet detail pane on desktop. Extracted `closeDetailPanel()` shared helper and `PANEL_CLOSE_HTML` constant — DRY: Escape handler and click handler both call it. Close button uses event delegation on `#pktRight`, styled with CSS variables (`--text-muted`, `--text`, `--surface-1`) matching the mobile `.mobile-sheet-close` pattern. Hidden when panel is in `.empty` state. Clicking a different row still re-opens with new data. Files changed: `public/packets.js`, `public/style.css`. Cache busters NOT bumped (another agent editing index.html).
- **Issue #122 fix:** Node tooltip (line 45) and node detail panel (line 120) in `channels.js` used `last_seen` alone for "Last seen" display. Changed both to `last_heard || last_seen` per AGENTS.md pitfall. Pattern: always prefer `last_heard || last_seen` for any time-ago display. **Server note for Hicks:**`/api/nodes/search` and `/api/nodes/:pubkey` endpoints don't return `last_heard` — only the bulk `/api/nodes` list endpoint computes it from the in-memory packet store. These endpoints need the same `last_heard` enrichment for the frontend fix to fully take effect. Also, `/api/analytics/channels` has a separate bug: `lastActivity` is overwritten unconditionally (no `>=` check) so it shows the oldest packet's timestamp, not the newest.
- **Issue #130 fix:** Live map `pruneStaleNodes()` (added for #133) was completely removing stale nodes from the map, while the static map dims them with CSS. Root cause: API-loaded nodes and WS-only nodes were treated identically — both got deleted when stale. Fix: mark API-loaded nodes with `_fromAPI = true` in `loadNodes()`. `pruneStaleNodes()` now dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing them, and restores full opacity when they become active again. WS-only dynamic nodes are still removed to prevent memory leaks. Pattern: **live map should match static map behavior** — never remove database-loaded nodes, only change their visual state. 3 new tests added (63 total frontend tests passing).
- **Issue #129 fix:** Added observer packet comparison feature (`#/compare` page). Users select two observers from dropdowns, click Compare, and see which packets each observer saw in the last 24 hours. Data flow: fetches packets per observer via existing `/api/packets?observer=X&limit=10000&since=24h`, computes set intersection/difference client-side using `comparePacketSets()` (O(n) via Set lookups — no nested loops). UI: three summary cards (both/only-A/only-B with counts and percentages), horizontal stacked bar chart, packet type breakdown for shared packets, and tabbed detail tables (up to 200 rows each, clickable to packet detail). URL is shareable: `#/compare?a=ID1&b=ID2`. Added 🔍 compare button to observers page header. Pure function `comparePacketSets` exposed on `window` for testability. 11 new tests (87 total frontend tests). Files: `public/compare.js` (new), `public/style.css`, `public/observers.js`, `public/index.html`, `test-frontend-helpers.js`. Cache busters bumped.
- **Browser validation of 6 fixes (2026-03-27):** Validated against live prod at `https://analyzer.00id.net`. Results: ✅ #133 (phantom nodes) — API returns 50 nodes, reasonable count, no runaway growth. ✅ #123 (channel hash on undecrypted) — GRP_TXT packets with `decryption_failed` status show `channelHashHex` field; packet detail renders `🔒 Channel Hash: 0xE2 (decryption failed)` via `packets.js:1254-1259`. ⏭ #126 (offline node on map) — skipped, requires specific dead node. ✅ #130 (disappearing nodes on live map) — `pruneStaleNodes()` confirmed at `live.js:1474` dims API-loaded nodes (`fillOpacity:0.25`) instead of removing; `_fromAPI=true` flag set at `live.js:1279`. ✅ #131 (auto-updating node list) — `nodes.js:210-216` wires `debouncedOnWS` handler that triggers `loadNodes(true)` on ADVERT messages; `isAdvertMessage()` at `nodes.js:852` checks `payload_type===4`. ✅ #129 (observer comparison) — `compare.js` deployed with full UI: observer dropdowns, `comparePacketSets()` Set logic, summary cards, bar chart, type breakdown. 16 observers available in prod. Pattern: always verify deployed JS matches source — cache buster `v=1774625000` confirmed consistent across all script tags.
- **Packet detail pane fresh-load fix:** The `detail-collapsed` class added for issue #125's close button wasn't applied on initial render, so the empty right panel was visible on fresh page load. Fix: added `detail-collapsed` to the `split-layout` div in the initial `innerHTML` template (packets.js:183). Pattern: when adding a CSS toggle class, always consider the initial DOM state — if nothing is selected, the default state must match "nothing selected." 3 tests added (90 total frontend). Cache busters bumped.
- **#130 LIVE MAP STALE DIMMING:** `pruneStaleNodes()` distinguishes API-loaded (`_fromAPI`) from WS-only. Dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing. Matches static map behavior. 3 new tests, all passing.
- **#131 NODES TAB WS AUTO-UPDATE:** `loadNodes(refreshOnly)` pattern resets cache + invalidateApiCache + re-fetches. Preserves scroll/selection/listeners. WS handler now triggers on ADVERT messages (payload_type===4). All tests passing.
- **#129 OBSERVER COMPARISON PAGE:** New `#/compare` route with shareable params `?a=ID1&b=ID2`. `comparePacketSets()` pure function (O(n) Set operations). UI: summary cards, bar chart, type breakdown, detail tables. 🔍 compare button on observers header.
- **#133 LIVE PAGE NODE PRUNING:** Prune every 60s using `getNodeStatus()` from roles.js (per-role health thresholds: 24h companions/sensors, 72h infrastructure). `_liveSeen` timestamp set on insert, updated on re-observation. Bounded memory usage.
- **Database merge:** All frontend endpoints working with merged 1.237M observation DB. Load speed verified. All 4 fixes tested end-to-end in browser.
Deep knowledge of every frontend behavior, API response, and user-facing feature in CoreScope. Fields community questions, triages bug reports, and explains "why does X look like Y."
1. Read the relevant frontend code FIRST — don't guess
2. Check the live API data if applicable (analyzer.00id.net is public)
3. Explain in user-friendly terms, not code jargon
4. If it's a bug, route to the right squad member
5. If it's expected behavior, explain WHY
## Model
Preferred: auto
# Ripley — Support Engineer
Deep knowledge of every frontend behavior, API response, and user-facing feature in CoreScope. Fields community questions, triages bug reports, and explains "why does X look like Y."
**Decision:** CI pipeline should check if `docker compose` (v2 plugin) is installed on the self-hosted runner and install it if needed, as part of the deploy job itself.
**Rationale:** Self-healing CI is preferred over manual VM setup; the VM may not have docker compose v2 installed.
### 2026-03-27T04:39 — Staging DB: Use Old Problematic DB
**By:** User (via Copilot)
**Decision:** Staging environment's primary purpose is debugging the problematic DB that caused 100% CPU on prod. Use the old DB (`~/meshcore-data-old/` on the VM) for staging. Prod keeps its current (new) DB. Never put the problematic DB on prod.
**Rationale:** This is the reason the staging environment was built.
### 2026-03-27T06:09 — Plan Go Rewrite (MQTT Separation)
**By:** User (via Copilot)
**Decision:** Start planning a Go rewrite. First step: separate MQTT ingestion (writes to DB) from the web server (reads from DB + serves API/frontend). Two separate services.
**Rationale:** Node.js single-thread + V8 heap limitations cause fragility at scale (185MB DB → 2.7GB heap → OOM). Go eliminates heap cap problem and enables real concurrency.
### 2026-03-27T06:31 — NO PII in Git
**By:** User (via Copilot)
**Decision:** NEVER write real names, usernames, email addresses, or any PII to files committed to git. Use "User" for attribution and "deploy" for SSH/server references. This is a PUBLIC repo.
**Rationale:** PII was leaked to the public repo and required a full git history rewrite to remove.
### 2026-03-27T02:19 — Production/Infrastructure Touches: Hudson Only
**By:** User (via Copilot)
**Decision:** Production/infrastructure touches (SSH, DB ops, server restarts, Azure operations) should only be done by Hudson (DevOps). No other agents should touch prod directly.
**Rationale:** Separation of concerns — dev agents write code, DevOps deploys and manages prod.
1. No Docker named volumes — always bind mount from `~/meshcore-data` (host location, easy to access)
2. Staging container runs on plaintext port (e.g., port 81, no HTTPS)
3. Use Docker Compose to orchestrate prod + staging containers on the same VM
4.`manage.sh` supports launching prod only OR prod+staging with clear messaging
5. Ports must be configurable via `manage.sh` or environment, with sane defaults
### 2026-03-27T03:43 — Staging Refinements: Shared Data
**By:** User (via Copilot)
**Decision:**
1. Staging copies prod DB on launch (snapshot into staging data dir when started)
2. Staging connects to SAME MQTT broker as prod (not its own Mosquitto)
**Rationale:** Staging needs real data (prod-like conditions) to be useful for testing.
### 2026-03-27T17:13 — Scribe Auto-Run After Agent Batches
**By:** User (via Copilot)
**Decision:** Scribe must run after EVERY batch of agent work automatically. No manual triggers. No reminders needed. This is a process guarantee, not a suggestion.
**Rationale:** Coordinator has been forgetting to spawn Scribe after agent batches complete. This is a process failure. Scribe auto-spawn ends the forgetfulness.
---
## Decision: Technical Fixes
### Issue #126 — Skip Ambiguous Hop Prefixes
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
When resolving hop prefixes to full node pubkeys, require a **unique match**. If prefix matches 2+ nodes in DB, skip it and cache in `ambiguousHopPrefixes` (negative cache). Prevents hash prefix collisions (e.g., `1CC4` vs `1C82` sharing prefix `1C` under 1-byte hash_size) from attributing packets to wrong nodes.
**Impact:**
- Hopresixes that collide won't update `lastPathSeenMap` for any node (conservative, correct)
-`disambiguateHops()` still does geometric disambiguation for route visualization
-`autoLearnHopNodes()` no longer calls `db.upsertNode()` for unresolved hops
- Added `db.removePhantomNodes()` — deletes nodes where `LENGTH(public_key) <= 16` (real keys are 64 hex chars)
- Called at startup to purge existing phantoms from prior behavior
- Hop-resolver still handles unresolved prefixes gracefully
**Part 2: totalNodes now 7-day active window**
-`/api/stats``totalNodes` returns only nodes seen in last 7 days (was all-time)
- New field `totalNodesAllTime` for historical tracking
- Role counts (repeaters, rooms, companions, sensors) also filtered to 7-day window
- Frontend: no changes needed (same field name, smaller correct number)
**Impact:** Frontend `totalNodes` now reflects active mesh size. Go server should apply same 7-day filter when querying.
---
### Issue #123 — Channel Hash on Undecrypted Messages
**By:** Hicks
**Status:** Implemented
Fixed test coverage for decrypted status tracking on channel messages.
---
### Issue #130 — Live Map: Dim Stale Nodes, Don't Remove
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
`pruneStaleNodes()` in `live.js` now distinguishes API-loaded nodes (`_fromAPI`) from WS-only dynamic nodes. API nodes dimmed (reduced opacity) when stale instead of removed. WS-only nodes still pruned to prevent memory leaks.
**Rationale:** Static map shows stale nodes with faded markers; live map was deleting them, causing user-reported disappearing nodes. Parity expected.
**Pattern:** Database-loaded nodes never removed from map during session. Future live map features should respect `_fromAPI` flag.
---
### Issue #131 — Nodes Tab Auto-Update via WebSocket
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
WS-driven page updates must reset local caches: (1) set local cache to null, (2) call `invalidateApiCache()`, (3) re-fetch. New `loadNodes(refreshOnly)` pattern skips full DOM rebuild, only updates data rows. Preserves scroll, selection, listeners.
**Trap:** Two-layer caching (local variable + API cache) prevents re-fetches. All three reset steps required.
**Pattern:** Other pages doing WS-driven updates should follow same approach.
---
### Issue #129 — Observer Comparison Page
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Added `comparePacketSets(hashesA, hashesB)` as standalone pure function exposed on `window` for testability. Computes `{ onlyA, onlyB, both }` via Set operations (O(n)).
**Pattern:** Comparison logic decoupled from UI, reusable. Client-side diff avoids new server endpoint. 24-hour window keeps data size reasonable (~10K packets max).
---
### Issue #132 — Detail Pane Collapse
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Detail pane collapse uses CSS class on parent container. Add `detail-collapsed` class to `.split-layout`, which sets `.panel-right` to `display: none`. `.panel-left` with `flex: 1` fills 100% width naturally.
**Pattern:** CSS class toggling on parent cleaner than inline styles, easier to animate, keeps layout logic in CSS.
---
## Decision: Infrastructure & Deployment
### Database Merge — Prod + Staging
**By:** Kobayashi (Lead) / Hudson (DevOps)
**Date:** 2026-03-27
**Status:** ✅ Complete
Merged staging DB (185MB, 50K transmissions + 1.2M observations) into prod DB (21MB). Dedup strategy:
- **Transmissions:** `INSERT OR IGNORE` on `hash` (unique key)
- **Observations:** All unique by observer, all preserved
- **Nodes/Observers:** Latest `last_seen` wins, sum counts
- Backups: Retained at `/home/deploy/backups/pre-merge-20260327-071425/` until 2026-04-03
---
### Unified Docker Volume Paths
**By:** Hudson (DevOps)
**Date:** 2026-03-27
**Status:** Applied
Reconciled `manage.sh` and `docker-compose.yml` Docker volume names:
- Caddy volume: `caddy-data` everywhere (prod); `caddy-data-staging` for staging
- Data directory: Bind mount via `PROD_DATA_DIR` env var, default `~/meshcore-data`
- Config/Caddyfile: Mounted from repo checkout for prod, staging data dir for staging
- Removed deprecated `version` key from docker-compose.yml
**Consequence:**`./manage.sh start` and `docker compose up prod` now produce identical mounts. Anyone with data in old `caddy-data-prod` volume will need Caddy to re-provision TLS certs automatically.
Standalone Go web server replacing Node.js server's READ side (REST API + WebSocket). Two-component rewrite: ingestor (MQTT writes), server (REST/WS reads).
**Architecture Decisions:**
1.**Direct SQLite queries** — No in-memory packet store; all reads via `packets_v` view (v3 schema)
2.**Per-module go.mod** — Each `cmd/*` directory has own `go.mod`
3.**gorilla/mux for routing** — Handles 35+ parameterized routes cleanly
4.**SQLite polling for WebSocket** — Polls for new transmission IDs every 1s (decouples from MQTT)
**Future Work:** Full analytics via SQL, TTL response cache, shared `internal/db/` package, TLS, region-aware filtering.
---
### Go API Parity: Transmission-Centric Queries
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented, all 42+ tests pass
Go server rewrote packet list queries from VIEW-based (slow, wrong shape) to **transmission-centric** with correlated subqueries. Schema version detection (`isV3` flag) handles both v2 and v3 schemas.
-`/api/packets` — Multi-node filter support (`nodes` query param, comma-separated pubkeys)
---
### Go In-Memory Packet Store (cmd/server/store.go)
**By:** Hicks (Backend Dev)
**Date:** 2026-03-26
**Status:** Implemented
Port of `packet-store.js` with streaming load, 5 indexes, lean observation structs (only observation-specific fields). `QueryPackets` handles type, route, observer, hash, since, until, region, node. `IngestNewFromDB()` streams new transmissions from DB into memory.
- Startup: One-time load adds few seconds (acceptable)
- DB still used for: analytics, node/observer queries, role counts, region resolution
---
### Observation RAM Optimization
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
Observation objects in in-memory packet store now store only `transmission_id` reference instead of copying `hash`, `raw_hex`, `decoded_json`, `payload_type`, `route_type` from parent. API boundary methods (`getById`, `getSiblings`, `enrichObservations`) hydrate on demand. Load uses `.iterate()` instead of `.all()` to avoid materializing full JOIN.
**Impact:** Eliminates ~1.17M redundant string copies, avoids 1.17M-row array during startup. 2.7GB RAM → acceptable levels with 185MB database.
**Code Pattern:** Any code reading observation objects from `tx.observations` directly must use `pktStore.enrichObservations()` if it needs transmission fields. Internal iteration over observations for observer_id, snr, rssi, path_json works unchanged.
**Status:** Proposed — awaiting user sign-off before implementation
Playwright E2E tests (16 tests in `test-e2e-playwright.js`) are slow in CI. Analysis identified ~40-50% potential runtime reduction.
### Recommendations (prioritized)
#### HIGH impact (30%+ improvement)
1.**Replace `waitUntil: 'networkidle'` with `'domcontentloaded'` + targeted waits** — used ~20 times; `networkidle` worst-case for SPAs with persistent WebSocket + Leaflet tile loading. Each navigation pays 500ms+ penalty.
2.**Eliminate redundant navigations** — group tests by route; navigate once, run all assertions for that route.
3.**Cache Playwright browser install in CI** — `npx playwright install chromium --with-deps` runs every frontend push. Self-hosted runner should retain browser between runs.
#### MEDIUM impact (10-30%)
4.**Replace hardcoded `waitForTimeout` with event-driven waits** — ~17s scattered. Replace with `waitForSelector`, `waitForFunction`, or `page.waitForResponse`.
5.**Merge coverage collection into E2E run** — `collect-frontend-coverage.js` launches second browser. Extract `window.__coverage__` at E2E end instead.
6.**Replace `sleep 5` server startup with health-check polling** — Start tests as soon as `/api/stats` responsive (~1-2s savings).
#### LOW impact (<10%)
7.**Block unnecessary resources for non-visual tests** — use `page.route()` to abort map tiles, fonts.
8.**Reduce default timeout 15s → 10s** — sufficient for local CI.
### Implementation notes
- Items 1-2 are test-file-only (Bishop/Newt scope)
- Items 3, 5-6 are CI pipeline (Hicks scope)
- No architectural changes; all incremental
- All assertions remain identical — only wait strategies change
---
### 2026-03-27T20:56:00Z — Protobuf API Contract (Merged)
**By:** Kpa-clawbot (via Copilot)
**Decision:**
1. All frontend/backend interfaces get protobuf definitions as single source of truth
2. Go generates structs with JSON tags from protos; Node stays unchanged — protos derived from Node's current JSON shapes
3. Proto definitions MUST use inheritance and composition (no repeating field definitions)
4. Data flow: SQLite → proto struct → JSON; JSON blobs from DB deserialize against proto structs for validation
5. CI pipeline's proto fixture capture runs against prod (stable reference), not staging
**Rationale:** Eliminates parity bugs between Node and Go. Compiler-enforced contract. Prod is known-good baseline.
**Decision:** CI pipeline should check if `docker compose` (v2 plugin) is installed on the self-hosted runner and install it if needed, as part of the deploy job itself.
**Rationale:** Self-healing CI is preferred over manual VM setup; the VM may not have docker compose v2 installed.
### 2026-03-27T04:39 — Staging DB: Use Old Problematic DB
**By:** User (via Copilot)
**Decision:** Staging environment's primary purpose is debugging the problematic DB that caused 100% CPU on prod. Use the old DB (`~/meshcore-data-old/` on the VM) for staging. Prod keeps its current (new) DB. Never put the problematic DB on prod.
**Rationale:** This is the reason the staging environment was built.
### 2026-03-27T06:09 — Plan Go Rewrite (MQTT Separation)
**By:** User (via Copilot)
**Decision:** Start planning a Go rewrite. First step: separate MQTT ingestion (writes to DB) from the web server (reads from DB + serves API/frontend). Two separate services.
**Rationale:** Node.js single-thread + V8 heap limitations cause fragility at scale (185MB DB → 2.7GB heap → OOM). Go eliminates heap cap problem and enables real concurrency.
### 2026-03-27T06:31 — NO PII in Git
**By:** User (via Copilot)
**Decision:** NEVER write real names, usernames, email addresses, or any PII to files committed to git. Use "User" for attribution and "deploy" for SSH/server references. This is a PUBLIC repo.
**Rationale:** PII was leaked to the public repo and required a full git history rewrite to remove.
### 2026-03-27T02:19 — Production/Infrastructure Touches: Hudson Only
**By:** User (via Copilot)
**Decision:** Production/infrastructure touches (SSH, DB ops, server restarts, Azure operations) should only be done by Hudson (DevOps). No other agents should touch prod directly.
**Rationale:** Separation of concerns — dev agents write code, DevOps deploys and manages prod.
1. No Docker named volumes — always bind mount from `~/meshcore-data` (host location, easy to access)
2. Staging container runs on plaintext port (e.g., port 81, no HTTPS)
3. Use Docker Compose to orchestrate prod + staging containers on the same VM
4.`manage.sh` supports launching prod only OR prod+staging with clear messaging
5. Ports must be configurable via `manage.sh` or environment, with sane defaults
### 2026-03-27T03:43 — Staging Refinements: Shared Data
**By:** User (via Copilot)
**Decision:**
1. Staging copies prod DB on launch (snapshot into staging data dir when started)
2. Staging connects to SAME MQTT broker as prod (not its own Mosquitto)
**Rationale:** Staging needs real data (prod-like conditions) to be useful for testing.
### 2026-03-27T17:13 — Scribe Auto-Run After Agent Batches
**By:** User (via Copilot)
**Decision:** Scribe must run after EVERY batch of agent work automatically. No manual triggers. No reminders needed. This is a process guarantee, not a suggestion.
**Rationale:** Coordinator has been forgetting to spawn Scribe after agent batches complete. This is a process failure. Scribe auto-spawn ends the forgetfulness.
---
## Decision: Technical Fixes
### Issue #126 — Skip Ambiguous Hop Prefixes
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
When resolving hop prefixes to full node pubkeys, require a **unique match**. If prefix matches 2+ nodes in DB, skip it and cache in `ambiguousHopPrefixes` (negative cache). Prevents hash prefix collisions (e.g., `1CC4` vs `1C82` sharing prefix `1C` under 1-byte hash_size) from attributing packets to wrong nodes.
**Impact:**
- Hopresixes that collide won't update `lastPathSeenMap` for any node (conservative, correct)
-`disambiguateHops()` still does geometric disambiguation for route visualization
-`autoLearnHopNodes()` no longer calls `db.upsertNode()` for unresolved hops
- Added `db.removePhantomNodes()` — deletes nodes where `LENGTH(public_key) <= 16` (real keys are 64 hex chars)
- Called at startup to purge existing phantoms from prior behavior
- Hop-resolver still handles unresolved prefixes gracefully
**Part 2: totalNodes now 7-day active window**
-`/api/stats``totalNodes` returns only nodes seen in last 7 days (was all-time)
- New field `totalNodesAllTime` for historical tracking
- Role counts (repeaters, rooms, companions, sensors) also filtered to 7-day window
- Frontend: no changes needed (same field name, smaller correct number)
**Impact:** Frontend `totalNodes` now reflects active mesh size. Go server should apply same 7-day filter when querying.
---
### Issue #123 — Channel Hash on Undecrypted Messages
**By:** Hicks
**Status:** Implemented
Fixed test coverage for decrypted status tracking on channel messages.
---
### Issue #130 — Live Map: Dim Stale Nodes, Don't Remove
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
`pruneStaleNodes()` in `live.js` now distinguishes API-loaded nodes (`_fromAPI`) from WS-only dynamic nodes. API nodes dimmed (reduced opacity) when stale instead of removed. WS-only nodes still pruned to prevent memory leaks.
**Rationale:** Static map shows stale nodes with faded markers; live map was deleting them, causing user-reported disappearing nodes. Parity expected.
**Pattern:** Database-loaded nodes never removed from map during session. Future live map features should respect `_fromAPI` flag.
---
### Issue #131 — Nodes Tab Auto-Update via WebSocket
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
WS-driven page updates must reset local caches: (1) set local cache to null, (2) call `invalidateApiCache()`, (3) re-fetch. New `loadNodes(refreshOnly)` pattern skips full DOM rebuild, only updates data rows. Preserves scroll, selection, listeners.
**Trap:** Two-layer caching (local variable + API cache) prevents re-fetches. All three reset steps required.
**Pattern:** Other pages doing WS-driven updates should follow same approach.
---
### Issue #129 — Observer Comparison Page
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Added `comparePacketSets(hashesA, hashesB)` as standalone pure function exposed on `window` for testability. Computes `{ onlyA, onlyB, both }` via Set operations (O(n)).
**Pattern:** Comparison logic decoupled from UI, reusable. Client-side diff avoids new server endpoint. 24-hour window keeps data size reasonable (~10K packets max).
---
### Issue #132 — Detail Pane Collapse
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Detail pane collapse uses CSS class on parent container. Add `detail-collapsed` class to `.split-layout`, which sets `.panel-right` to `display: none`. `.panel-left` with `flex: 1` fills 100% width naturally.
**Pattern:** CSS class toggling on parent cleaner than inline styles, easier to animate, keeps layout logic in CSS.
---
## Decision: Infrastructure & Deployment
### Database Merge — Prod + Staging
**By:** Kobayashi (Lead) / Hudson (DevOps)
**Date:** 2026-03-27
**Status:** ✅ Complete
Merged staging DB (185MB, 50K transmissions + 1.2M observations) into prod DB (21MB). Dedup strategy:
- **Transmissions:** `INSERT OR IGNORE` on `hash` (unique key)
- **Observations:** All unique by observer, all preserved
- **Nodes/Observers:** Latest `last_seen` wins, sum counts
- Backups: Retained at `/home/deploy/backups/pre-merge-20260327-071425/` until 2026-04-03
---
### Unified Docker Volume Paths
**By:** Hudson (DevOps)
**Date:** 2026-03-27
**Status:** Applied
Reconciled `manage.sh` and `docker-compose.yml` Docker volume names:
- Caddy volume: `caddy-data` everywhere (prod); `caddy-data-staging` for staging
- Data directory: Bind mount via `PROD_DATA_DIR` env var, default `~/meshcore-data`
- Config/Caddyfile: Mounted from repo checkout for prod, staging data dir for staging
- Removed deprecated `version` key from docker-compose.yml
**Consequence:**`./manage.sh start` and `docker compose up prod` now produce identical mounts. Anyone with data in old `caddy-data-prod` volume will need Caddy to re-provision TLS certs automatically.
Standalone Go web server replacing Node.js server's READ side (REST API + WebSocket). Two-component rewrite: ingestor (MQTT writes), server (REST/WS reads).
**Architecture Decisions:**
1.**Direct SQLite queries** — No in-memory packet store; all reads via `packets_v` view (v3 schema)
2.**Per-module go.mod** — Each `cmd/*` directory has own `go.mod`
3.**gorilla/mux for routing** — Handles 35+ parameterized routes cleanly
4.**SQLite polling for WebSocket** — Polls for new transmission IDs every 1s (decouples from MQTT)
**Future Work:** Full analytics via SQL, TTL response cache, shared `internal/db/` package, TLS, region-aware filtering.
---
### Go API Parity: Transmission-Centric Queries
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented, all 42+ tests pass
Go server rewrote packet list queries from VIEW-based (slow, wrong shape) to **transmission-centric** with correlated subqueries. Schema version detection (`isV3` flag) handles both v2 and v3 schemas.
-`/api/packets` — Multi-node filter support (`nodes` query param, comma-separated pubkeys)
---
### Go In-Memory Packet Store (cmd/server/store.go)
**By:** Hicks (Backend Dev)
**Date:** 2026-03-26
**Status:** Implemented
Port of `packet-store.js` with streaming load, 5 indexes, lean observation structs (only observation-specific fields). `QueryPackets` handles type, route, observer, hash, since, until, region, node. `IngestNewFromDB()` streams new transmissions from DB into memory.
- Startup: One-time load adds few seconds (acceptable)
- DB still used for: analytics, node/observer queries, role counts, region resolution
---
### Observation RAM Optimization
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
Observation objects in in-memory packet store now store only `transmission_id` reference instead of copying `hash`, `raw_hex`, `decoded_json`, `payload_type`, `route_type` from parent. API boundary methods (`getById`, `getSiblings`, `enrichObservations`) hydrate on demand. Load uses `.iterate()` instead of `.all()` to avoid materializing full JOIN.
**Impact:** Eliminates ~1.17M redundant string copies, avoids 1.17M-row array during startup. 2.7GB RAM → acceptable levels with 185MB database.
**Code Pattern:** Any code reading observation objects from `tx.observations` directly must use `pktStore.enrichObservations()` if it needs transmission fields. Internal iteration over observations for observer_id, snr, rssi, path_json works unchanged.
**Status:** Proposed — awaiting user sign-off before implementation
Playwright E2E tests (16 tests in `test-e2e-playwright.js`) are slow in CI. Analysis identified ~40-50% potential runtime reduction.
### Recommendations (prioritized)
#### HIGH impact (30%+ improvement)
1.**Replace `waitUntil: 'networkidle'` with `'domcontentloaded'` + targeted waits** — used ~20 times; `networkidle` worst-case for SPAs with persistent WebSocket + Leaflet tile loading. Each navigation pays 500ms+ penalty.
2.**Eliminate redundant navigations** — group tests by route; navigate once, run all assertions for that route.
3.**Cache Playwright browser install in CI** — `npx playwright install chromium --with-deps` runs every frontend push. Self-hosted runner should retain browser between runs.
#### MEDIUM impact (10-30%)
4.**Replace hardcoded `waitForTimeout` with event-driven waits** — ~17s scattered. Replace with `waitForSelector`, `waitForFunction`, or `page.waitForResponse`.
5.**Merge coverage collection into E2E run** — `collect-frontend-coverage.js` launches second browser. Extract `window.__coverage__` at E2E end instead.
6.**Replace `sleep 5` server startup with health-check polling** — Start tests as soon as `/api/stats` responsive (~1-2s savings).
#### LOW impact (<10%)
7.**Block unnecessary resources for non-visual tests** — use `page.route()` to abort map tiles, fonts.
8.**Reduce default timeout 15s → 10s** — sufficient for local CI.
### Implementation notes
- Items 1-2 are test-file-only (Bishop/Newt scope)
- Items 3, 5-6 are CI pipeline (Hicks scope)
- No architectural changes; all incremental
- All assertions remain identical — only wait strategies change
---
### 2026-03-27T20:56:00Z — Protobuf API Contract (Merged)
**By:** Kpa-clawbot (via Copilot)
**Decision:**
1. All frontend/backend interfaces get protobuf definitions as single source of truth
2. Go generates structs with JSON tags from protos; Node stays unchanged — protos derived from Node's current JSON shapes
3. Proto definitions MUST use inheritance and composition (no repeating field definitions)
4. Data flow: SQLite → proto struct → JSON; JSON blobs from DB deserialize against proto structs for validation
5. CI pipeline's proto fixture capture runs against prod (stable reference), not staging
**Rationale:** Eliminates parity bugs between Node and Go. Compiler-enforced contract. Prod is known-good baseline.
-`kobayashi-2026-03-27.md` (27 lines) — Root cause analysis, DB merge plan, coordination
-`hudson-2026-03-27.md` (117 lines) — DB merge execution, Docker Compose migration, staging setup, CI pipeline
-`ripley-2026-03-27.md` (30 lines) — Support onboarding, health threshold documentation
**Entry Total:** 448 lines of orchestration logs covering 28 issues, 2 Go services, database merge, staging deployment, CI pipeline updates, 42 E2E tests, 19 backend fixes
---
## Decisions.md Review
Current decisions.md (342 lines) contains authoritative log of all technical + infrastructure + deployment decisions made during #151-160 session. No archival needed (well under 20KB threshold). Organized by:
1. User Directives (process decisions)
2. Technical Fixes (bug fixes with rationale)
3. Infrastructure & Deployment (ops decisions)
4. Go Rewrite — API & Storage (architecture decisions)
-`kobayashi-2026-03-27.md` (27 lines) — Root cause analysis, DB merge plan, coordination
-`hudson-2026-03-27.md` (117 lines) — DB merge execution, Docker Compose migration, staging setup, CI pipeline
-`ripley-2026-03-27.md` (30 lines) — Support onboarding, health threshold documentation
**Entry Total:** 448 lines of orchestration logs covering 28 issues, 2 Go services, database merge, staging deployment, CI pipeline updates, 42 E2E tests, 19 backend fixes
---
## Decisions.md Review
Current decisions.md (342 lines) contains authoritative log of all technical + infrastructure + deployment decisions made during #151-160 session. No archival needed (well under 20KB threshold). Organized by:
1. User Directives (process decisions)
2. Technical Fixes (bug fixes with rationale)
3. Infrastructure & Deployment (ops decisions)
4. Go Rewrite — API & Storage (architecture decisions)
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
## Rules
1.**Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2.**Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3.**Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4.**When two agents could handle it**, pick the one whose domain is the primary concern.
5.**"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6.**Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7.**Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
## Rules
1.**Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2.**Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3.**Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4.**When two agents could handle it**, pick the one whose domain is the primary concern.
5.**"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6.**Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7.**Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
> {One-line personality statement — what makes this person tick}
## Identity
- **Name:** {Name}
- **Role:** {Role title}
- **Expertise:** {2-3 specific skills relevant to the project}
- **Style:** {How they communicate — direct? thorough? opinionated?}
## What I Own
- {Area of responsibility 1}
- {Area of responsibility 2}
- {Area of responsibility 3}
## How I Work
- {Key approach or principle 1}
- {Key approach or principle 2}
- {Pattern or convention I follow}
## Boundaries
**I handle:** {types of work this agent does}
**I don't handle:** {types of work that belong to other team members}
**When I'm unsure:** I say so and suggest who might know.
**If I review others' work:** On rejection, I may require a different agent to revise (not the original author) or request a new specialist be spawned. The Coordinator enforces this.
## Model
- **Preferred:** auto
- **Rationale:** Coordinator selects the best model based on task type — cost first unless writing code
- **Fallback:** Standard chain — the coordinator handles fallback automatically
## Collaboration
Before starting work, run `git rev-parse --show-toplevel` to find the repo root, or use the `TEAM ROOT` provided in the spawn prompt. All `.squad/` paths must be resolved relative to this root — do not assume CWD is the repo root (you may be in a worktree or subdirectory).
Before starting work, read `.squad/decisions.md` for team decisions that affect me.
After making a decision others should know, write it to `.squad/decisions/inbox/{my-name}-{brief-slug}.md` — the Scribe will merge it.
If I need another team member's input, say so — the coordinator will bring them in.
## Voice
{1-2 sentences describing personality. Not generic — specific. This agent has OPINIONS.
They have preferences. They push back. They have a style that's distinctly theirs.
Example: "Opinionated about test coverage. Will push back if tests are skipped.
Prefers integration tests over mocks. Thinks 80% coverage is the floor, not the ceiling."}
# {Name} — {Role}
> {One-line personality statement — what makes this person tick}
## Identity
- **Name:** {Name}
- **Role:** {Role title}
- **Expertise:** {2-3 specific skills relevant to the project}
- **Style:** {How they communicate — direct? thorough? opinionated?}
## What I Own
- {Area of responsibility 1}
- {Area of responsibility 2}
- {Area of responsibility 3}
## How I Work
- {Key approach or principle 1}
- {Key approach or principle 2}
- {Pattern or convention I follow}
## Boundaries
**I handle:** {types of work this agent does}
**I don't handle:** {types of work that belong to other team members}
**When I'm unsure:** I say so and suggest who might know.
**If I review others' work:** On rejection, I may require a different agent to revise (not the original author) or request a new specialist be spawned. The Coordinator enforces this.
## Model
- **Preferred:** auto
- **Rationale:** Coordinator selects the best model based on task type — cost first unless writing code
- **Fallback:** Standard chain — the coordinator handles fallback automatically
## Collaboration
Before starting work, run `git rev-parse --show-toplevel` to find the repo root, or use the `TEAM ROOT` provided in the spawn prompt. All `.squad/` paths must be resolved relative to this root — do not assume CWD is the repo root (you may be in a worktree or subdirectory).
Before starting work, read `.squad/decisions.md` for team decisions that affect me.
After making a decision others should know, write it to `.squad/decisions/inbox/{my-name}-{brief-slug}.md` — the Scribe will merge it.
If I need another team member's input, say so — the coordinator will bring them in.
## Voice
{1-2 sentences describing personality. Not generic — specific. This agent has OPINIONS.
They have preferences. They push back. They have a style that's distinctly theirs.
Example: "Opinionated about test coverage. Will push back if tests are skipped.
Prefers integration tests over mocks. Thinks 80% coverage is the floor, not the ceiling."}
When the user or system imposes constraints (question limits, revision limits, time budgets), maintain a visible counter in your responses and in the artifact.
## Format
```
📊 Clarifying questions used: 2 / 3
```
## Rules
- Update the counter each time the constraint is consumed
- When a constraint is exhausted, state it: `📊 Question budget exhausted (3/3). Proceeding with current information.`
- If no constraints are active, do not display counters
- Include the final constraint status in multi-agent artifacts
## Example Session
```
Coordinator: Spawning agents to analyze requirements...
📊 Clarifying questions used: 0 / 3
Agent asks clarification: "Should we support OAuth?"
Coordinator: Checking with user...
📊 Clarifying questions used: 1 / 3
Agent asks clarification: "What's the rate limit?"
Coordinator: Checking with user...
📊 Clarifying questions used: 2 / 3
Agent asks clarification: "Do we need RBAC?"
Coordinator: Checking with user...
📊 Clarifying questions used: 3 / 3
Agent asks clarification: "Should we cache responses?"
Coordinator: 📊 Question budget exhausted (3/3). Proceeding without clarification.
```
# Constraint Budget Tracking
When the user or system imposes constraints (question limits, revision limits, time budgets), maintain a visible counter in your responses and in the artifact.
## Format
```
📊 Clarifying questions used: 2 / 3
```
## Rules
- Update the counter each time the constraint is consumed
- When a constraint is exhausted, state it: `📊 Question budget exhausted (3/3). Proceeding with current information.`
- If no constraints are active, do not display counters
- Include the final constraint status in multi-agent artifacts
## Example Session
```
Coordinator: Spawning agents to analyze requirements...
📊 Clarifying questions used: 0 / 3
Agent asks clarification: "Should we support OAuth?"
Coordinator: Checking with user...
📊 Clarifying questions used: 1 / 3
Agent asks clarification: "What's the rate limit?"
Coordinator: Checking with user...
📊 Clarifying questions used: 2 / 3
Agent asks clarification: "Do we need RBAC?"
Coordinator: Checking with user...
📊 Clarifying questions used: 3 / 3
Agent asks clarification: "Should we cache responses?"
Coordinator: 📊 Question budget exhausted (3/3). Proceeding without clarification.
# Cooperative Rate Limiting for Multi-Agent Deployments
> Coordinate API quota across multiple Ralph instances to prevent cascading failures.
## Problem
The [circuit breaker template](ralph-circuit-breaker.md) handles single-instance rate limiting well. But when multiple Ralphs run across machines (or pods on K8s), each instance independently hits API limits:
- **No coordination** — 5 Ralphs each think they have full API quota
- **Thundering herd** — All Ralphs retry simultaneously after rate limit resets
- **Priority inversion** — Low-priority work exhausts quota before critical work runs
- **Reactive only** — Circuit opens AFTER 429, wasting the failed request
## Solution: 6-Pattern Architecture
These patterns layer on top of the existing circuit breaker. Each is independent — adopt one or all.
4.**Kubernetes:** Add KEDA scaler for automatic pod scaling
## References
- [Circuit Breaker Template](ralph-circuit-breaker.md) — Foundation patterns
- [Squad on AKS](https://github.com/tamirdresher/squad-on-aks) — Production K8s deployment
- [KEDA Copilot Scaler](https://github.com/tamirdresher/keda-copilot-scaler) — Custom KEDA external scaler
# Cooperative Rate Limiting for Multi-Agent Deployments
> Coordinate API quota across multiple Ralph instances to prevent cascading failures.
## Problem
The [circuit breaker template](ralph-circuit-breaker.md) handles single-instance rate limiting well. But when multiple Ralphs run across machines (or pods on K8s), each instance independently hits API limits:
- **No coordination** — 5 Ralphs each think they have full API quota
- **Thundering herd** — All Ralphs retry simultaneously after rate limit resets
- **Priority inversion** — Low-priority work exhausts quota before critical work runs
- **Reactive only** — Circuit opens AFTER 429, wasting the failed request
## Solution: 6-Pattern Architecture
These patterns layer on top of the existing circuit breaker. Each is independent — adopt one or all.
You are working on a project that uses **Squad**, an AI team framework. When picking up issues autonomously, follow these guidelines.
## Team Context
Before starting work on any issue:
1. Read `.squad/team.md` for the team roster, member roles, and your capability profile.
2. Read `.squad/routing.md` for work routing rules.
3. If the issue has a `squad:{member}` label, read that member's charter at `.squad/agents/{member}/charter.md` to understand their domain expertise and coding style — work in their voice.
## Capability Self-Check
Before starting work, check your capability profile in `.squad/team.md` under the **Coding Agent → Capabilities** section.
- **🟢 Good fit** — proceed autonomously.
- **🟡 Needs review** — proceed, but note in the PR description that a squad member should review.
- **🔴 Not suitable** — do NOT start work. Instead, comment on the issue:
```
🤖 This issue doesn't match my capability profile (reason: {why}). Suggesting reassignment to a squad member.
```
## Branch Naming
Use the squad branch convention:
```
squad/{issue-number}-{kebab-case-slug}
```
Example: `squad/42-fix-login-validation`
## PR Guidelines
When opening a PR:
- Reference the issue: `Closes #{issue-number}`
- If the issue had a `squad:{member}` label, mention the member: `Working as {member} ({role})`
- If this is a 🟡 needs-review task, add to the PR description: `⚠️ This task was flagged as "needs review" — please have a squad member review before merging.`
- Follow any project conventions in `.squad/decisions.md`
## Decisions
If you make a decision that affects other team members, write it to:
```
.squad/decisions/inbox/copilot-{brief-slug}.md
```
The Scribe will merge it into the shared decisions file.
# Copilot Coding Agent — Squad Instructions
You are working on a project that uses **Squad**, an AI team framework. When picking up issues autonomously, follow these guidelines.
## Team Context
Before starting work on any issue:
1. Read `.squad/team.md` for the team roster, member roles, and your capability profile.
2. Read `.squad/routing.md` for work routing rules.
3. If the issue has a `squad:{member}` label, read that member's charter at `.squad/agents/{member}/charter.md` to understand their domain expertise and coding style — work in their voice.
## Capability Self-Check
Before starting work, check your capability profile in `.squad/team.md` under the **Coding Agent → Capabilities** section.
- **🟢 Good fit** — proceed autonomously.
- **🟡 Needs review** — proceed, but note in the PR description that a squad member should review.
- **🔴 Not suitable** — do NOT start work. Instead, comment on the issue:
```
🤖 This issue doesn't match my capability profile (reason: {why}). Suggesting reassignment to a squad member.
```
## Branch Naming
Use the squad branch convention:
```
squad/{issue-number}-{kebab-case-slug}
```
Example: `squad/42-fix-login-validation`
## PR Guidelines
When opening a PR:
- Reference the issue: `Closes #{issue-number}`
- If the issue had a `squad:{member}` label, mention the member: `Working as {member} ({role})`
- If this is a 🟡 needs-review task, add to the PR description: `⚠️ This task was flagged as "needs review" — please have a squad member review before merging.`
- Follow any project conventions in `.squad/decisions.md`
## Decisions
If you make a decision that affects other team members, write it to:
```
.squad/decisions/inbox/copilot-{brief-slug}.md
```
The Scribe will merge it into the shared decisions file.
# KEDA External Scaler for GitHub Issue-Driven Agent Autoscaling
> Scale agent pods to zero when idle, up when work arrives — driven by GitHub Issues.
## Overview
When running Squad on Kubernetes, agent pods sit idle when no work exists. [KEDA](https://keda.sh) (Kubernetes Event-Driven Autoscaler) solves this for queue-based workloads, but GitHub Issues isn't a native KEDA trigger.
The `keda-copilot-scaler` is a KEDA External Scaler (gRPC) that bridges this gap:
1. Polls GitHub API for issues matching specific labels (e.g., `squad:copilot`)
# KEDA External Scaler for GitHub Issue-Driven Agent Autoscaling
> Scale agent pods to zero when idle, up when work arrives — driven by GitHub Issues.
## Overview
When running Squad on Kubernetes, agent pods sit idle when no work exists. [KEDA](https://keda.sh) (Kubernetes Event-Driven Autoscaler) solves this for queue-based workloads, but GitHub Issues isn't a native KEDA trigger.
The `keda-copilot-scaler` is a KEDA External Scaler (gRPC) that bridges this gap:
1. Polls GitHub API for issues matching specific labels (e.g., `squad:copilot`)
> Enable Ralph to skip issues requiring capabilities the current machine lacks.
## Overview
When running Squad across multiple machines (laptops, DevBoxes, GPU servers, Kubernetes nodes), each machine has different tooling. The capability system lets you declare what each machine can do, and Ralph automatically routes work accordingly.
## Setup
### 1. Create a Capabilities Manifest
Create `~/.squad/machine-capabilities.json` (user-wide) or `.squad/machine-capabilities.json` (project-local):
> Enable Ralph to skip issues requiring capabilities the current machine lacks.
## Overview
When running Squad across multiple machines (laptops, DevBoxes, GPU servers, Kubernetes nodes), each machine has different tooling. The capability system lets you declare what each machine can do, and Ralph automatically routes work accordingly.
## Setup
### 1. Create a Capabilities Manifest
Create `~/.squad/machine-capabilities.json` (user-wide) or `.squad/machine-capabilities.json` (project-local):
1. Ralph loads `machine-capabilities.json` at startup
2. For each open issue, Ralph extracts `needs:*` labels
3. If any required capability is missing, the issue is skipped
4. Issues without `needs:*` labels are always processed (opt-in system)
## Kubernetes Integration
On Kubernetes, machine capabilities map to node labels:
```yaml
# Node labels (set by capability DaemonSet or manually)
node.squad.dev/gpu:"true"
node.squad.dev/browser:"true"
# Pod spec uses nodeSelector
spec:
nodeSelector:
node.squad.dev/gpu:"true"
```
A DaemonSet can run capability discovery on each node and maintain labels automatically. See the [squad-on-aks](https://github.com/tamirdresher/squad-on-aks) project for a complete Kubernetes deployment example.
MCP (Model Context Protocol) servers extend Squad with tools for external services — Trello, Aspire dashboards, Azure, Notion, and more. The user configures MCP servers in their environment; Squad discovers and uses them.
> **Full patterns:** Read `.squad/skills/mcp-tool-discovery/SKILL.md` for discovery patterns, domain-specific usage, and graceful degradation.
## Config File Locations
Users configure MCP servers at these locations (checked in priority order):
1.**Repository-level:**`.copilot/mcp-config.json` (team-shared, committed to repo)
- **GitHub MCP requires a separate token** from the `gh` CLI auth. Generate at https://github.com/settings/tokens
- **Trello requires API key + token** from https://trello.com/power-ups/admin
- **Azure requires service principal credentials** — see Azure docs for setup
- **Aspire uses the dashboard URL** — typically `http://localhost:18888` during local dev
Auth is a real blocker for some MCP servers. Users need separate tokens for GitHub MCP, Azure MCP, Trello MCP, etc. This is a documentation problem, not a code problem.
# MCP Integration — Configuration and Samples
MCP (Model Context Protocol) servers extend Squad with tools for external services — Trello, Aspire dashboards, Azure, Notion, and more. The user configures MCP servers in their environment; Squad discovers and uses them.
> **Full patterns:** Read `.squad/skills/mcp-tool-discovery/SKILL.md` for discovery patterns, domain-specific usage, and graceful degradation.
## Config File Locations
Users configure MCP servers at these locations (checked in priority order):
1.**Repository-level:**`.copilot/mcp-config.json` (team-shared, committed to repo)
- **GitHub MCP requires a separate token** from the `gh` CLI auth. Generate at https://github.com/settings/tokens
- **Trello requires API key + token** from https://trello.com/power-ups/admin
- **Azure requires service principal credentials** — see Azure docs for setup
- **Aspire uses the dashboard URL** — typically `http://localhost:18888` during local dev
Auth is a real blocker for some MCP servers. Users need separate tokens for GitHub MCP, Azure MCP, Trello MCP, etc. This is a documentation problem, not a code problem.
When multiple agents contribute to a final artifact (document, analysis, design), use this format. The assembled result must include:
- Termination condition
- Constraint budgets (if active)
- Reviewer verdicts (if any)
- Raw agent outputs appendix
## Assembly Structure
The assembled result goes at the top. Below it, include:
```
## APPENDIX: RAW AGENT OUTPUTS
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
```
## Appendix Rules
This appendix is for diagnostic integrity. Do not edit, summarize, or polish the raw outputs. The Coordinator may not rewrite raw agent outputs; it may only paste them verbatim and assemble the final artifact above.
See `.squad/templates/run-output.md` for the complete output format template.
# Multi-Agent Artifact Format
When multiple agents contribute to a final artifact (document, analysis, design), use this format. The assembled result must include:
- Termination condition
- Constraint budgets (if active)
- Reviewer verdicts (if any)
- Raw agent outputs appendix
## Assembly Structure
The assembled result goes at the top. Below it, include:
```
## APPENDIX: RAW AGENT OUTPUTS
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
```
## Appendix Rules
This appendix is for diagnostic integrity. Do not edit, summarize, or polish the raw outputs. The Coordinator may not rewrite raw agent outputs; it may only paste them verbatim and assemble the final artifact above.
See `.squad/templates/run-output.md` for the complete output format template.
Plugins are curated agent templates, skills, instructions, and prompts shared by the community via GitHub repositories (e.g., `github/awesome-copilot`, `anthropics/skills`). They provide ready-made expertise for common domains — cloud platforms, frameworks, testing strategies, etc.
## Marketplace State
Registered marketplace sources are stored in `.squad/plugins/marketplaces.json`:
```json
{
"marketplaces":[
{
"name":"awesome-copilot",
"source":"github/awesome-copilot",
"added_at":"2026-02-14T00:00:00Z"
}
]
}
```
## CLI Commands
Users manage marketplaces via the CLI:
-`squad plugin marketplace add {owner/repo}` — Register a GitHub repo as a marketplace source
-`squad plugin marketplace remove {name}` — Remove a registered marketplace
-`squad plugin marketplace list` — List registered marketplaces
-`squad plugin marketplace browse {name}` — List available plugins in a marketplace
## When to Browse
During the **Adding Team Members** flow, AFTER allocating a name but BEFORE generating the charter:
1. Read `.squad/plugins/marketplaces.json`. If the file doesn't exist or `marketplaces` is empty, skip silently.
2. For each registered marketplace, search for plugins whose name or description matches the new member's role or domain keywords.
3. Present matching plugins to the user: *"Found '{plugin-name}' in {marketplace} marketplace — want me to install it as a skill for {CastName}?"*
4. If the user accepts, install the plugin (see below). If they decline or skip, proceed without it.
## How to Install a Plugin
1. Read the plugin content from the marketplace repository (the plugin's `SKILL.md` or equivalent).
2. Copy it into the agent's skills directory: `.squad/skills/{plugin-name}/SKILL.md`
3. If the plugin includes charter-level instructions (role boundaries, tool preferences), merge those into the agent's `charter.md`.
4. Log the installation in the agent's `history.md`: *"📦 Plugin '{plugin-name}' installed from {marketplace}."*
## Graceful Degradation
- **No marketplaces configured:** Skip the marketplace check entirely. No warning, no prompt.
- **Marketplace unreachable:** Warn the user (*"⚠ Couldn't reach {marketplace} — continuing without it"*) and proceed with team member creation normally.
- **No matching plugins:** Inform the user (*"No matching plugins found in configured marketplaces"*) and proceed.
# Plugin Marketplace
Plugins are curated agent templates, skills, instructions, and prompts shared by the community via GitHub repositories (e.g., `github/awesome-copilot`, `anthropics/skills`). They provide ready-made expertise for common domains — cloud platforms, frameworks, testing strategies, etc.
## Marketplace State
Registered marketplace sources are stored in `.squad/plugins/marketplaces.json`:
```json
{
"marketplaces":[
{
"name":"awesome-copilot",
"source":"github/awesome-copilot",
"added_at":"2026-02-14T00:00:00Z"
}
]
}
```
## CLI Commands
Users manage marketplaces via the CLI:
-`squad plugin marketplace add {owner/repo}` — Register a GitHub repo as a marketplace source
-`squad plugin marketplace remove {name}` — Remove a registered marketplace
-`squad plugin marketplace list` — List registered marketplaces
-`squad plugin marketplace browse {name}` — List available plugins in a marketplace
## When to Browse
During the **Adding Team Members** flow, AFTER allocating a name but BEFORE generating the charter:
1. Read `.squad/plugins/marketplaces.json`. If the file doesn't exist or `marketplaces` is empty, skip silently.
2. For each registered marketplace, search for plugins whose name or description matches the new member's role or domain keywords.
3. Present matching plugins to the user: *"Found '{plugin-name}' in {marketplace} marketplace — want me to install it as a skill for {CastName}?"*
4. If the user accepts, install the plugin (see below). If they decline or skip, proceed without it.
## How to Install a Plugin
1. Read the plugin content from the marketplace repository (the plugin's `SKILL.md` or equivalent).
2. Copy it into the agent's skills directory: `.squad/skills/{plugin-name}/SKILL.md`
3. If the plugin includes charter-level instructions (role boundaries, tool preferences), merge those into the agent's `charter.md`.
4. Log the installation in the agent's `history.md`: *"📦 Plugin '{plugin-name}' installed from {marketplace}."*
## Graceful Degradation
- **No marketplaces configured:** Skip the marketplace check entirely. No warning, no prompt.
- **Marketplace unreachable:** Warn the user (*"⚠ Couldn't reach {marketplace} — continuing without it"*) and proceed with team member creation normally.
- **No matching plugins:** Inform the user (*"No matching plugins found in configured marketplaces"*) and proceed.
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
## Rules
1.**Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2.**Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3.**Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4.**When two agents could handle it**, pick the one whose domain is the primary concern.
5.**"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6.**Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7.**Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
## Rules
1.**Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2.**Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3.**Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4.**When two agents could handle it**, pick the one whose domain is the primary concern.
5.**"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6.**Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7.**Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
- **Style:** Silent. Never speaks to the user. Works in the background.
- **Mode:** Always spawned as `mode: "background"`. Never blocks the conversation.
## What I Own
-`.squad/log/` — session logs (what happened, who worked, what was decided)
-`.squad/decisions.md` — the shared decision log all agents read (canonical, merged)
-`.squad/decisions/inbox/` — decision drop-box (agents write here, I merge)
- Cross-agent context propagation — when one agent's decision affects another
## How I Work
**Worktree awareness:** Use the `TEAM ROOT` provided in the spawn prompt to resolve all `.squad/` paths. If no TEAM ROOT is given, run `git rev-parse --show-toplevel` as fallback. Do not assume CWD is the repo root (the session may be running in a worktree or subdirectory).
After every substantial work session:
1.**Log the session** to `.squad/log/{timestamp}-{topic}.md`:
- Who worked
- What was done
- Decisions made
- Key outcomes
- Brief. Facts only.
2.**Merge the decision inbox:**
- Read all files in `.squad/decisions/inbox/`
- APPEND each decision's contents to `.squad/decisions.md`
- Delete each inbox file after merging
3.**Deduplicate and consolidate decisions.md:**
- Parse the file into decision blocks (each block starts with `### `).
- **Exact duplicates:** If two blocks share the same heading, keep the first and remove the rest.
- **Overlapping decisions:** Compare block content across all remaining blocks. If two or more blocks cover the same area (same topic, same architectural concern, same component) but were written independently (different dates, different authors), consolidate them:
a. Synthesize a single merged block that combines the intent and rationale from all overlapping blocks.
b. Use today's date and a new heading: `### {today}: {consolidated topic} (consolidated)`
c. Credit all original authors: `**By:** {Name1}, {Name2}`
d. Under **What:**, combine the decisions. Note any differences or evolution.
e. Under **Why:**, merge the rationale, preserving unique reasoning from each.
f. Remove the original overlapping blocks.
- Write the updated file back. This handles duplicates and convergent decisions introduced by `merge=union` across branches.
4.**Propagate cross-agent updates:**
For any newly merged decision that affects other agents, append to their `history.md`:
```
📌 Team update ({timestamp}): {summary} — decided by {Name}
```
5. **Commit `.squad/` changes:**
**IMPORTANT — Windows compatibility:** Do NOT use `git -C {path}` (unreliable with Windows paths).
Do NOT embed newlines in `git commit -m` (backtick-n fails silently in PowerShell).
Instead:
- `cd` into the team root first.
- Stage all `.squad/` files: `git add .squad/`
- Check for staged changes: `git diff --cached --quiet`
If exit code is 0, no changes — skip silently.
- Write the commit message to a temp file, then commit with `-F`:
- **Style:** Silent. Never speaks to the user. Works in the background.
- **Mode:** Always spawned as `mode: "background"`. Never blocks the conversation.
## What I Own
-`.squad/log/` — session logs (what happened, who worked, what was decided)
-`.squad/decisions.md` — the shared decision log all agents read (canonical, merged)
-`.squad/decisions/inbox/` — decision drop-box (agents write here, I merge)
- Cross-agent context propagation — when one agent's decision affects another
## How I Work
**Worktree awareness:** Use the `TEAM ROOT` provided in the spawn prompt to resolve all `.squad/` paths. If no TEAM ROOT is given, run `git rev-parse --show-toplevel` as fallback. Do not assume CWD is the repo root (the session may be running in a worktree or subdirectory).
After every substantial work session:
1.**Log the session** to `.squad/log/{timestamp}-{topic}.md`:
- Who worked
- What was done
- Decisions made
- Key outcomes
- Brief. Facts only.
2.**Merge the decision inbox:**
- Read all files in `.squad/decisions/inbox/`
- APPEND each decision's contents to `.squad/decisions.md`
- Delete each inbox file after merging
3.**Deduplicate and consolidate decisions.md:**
- Parse the file into decision blocks (each block starts with `### `).
- **Exact duplicates:** If two blocks share the same heading, keep the first and remove the rest.
- **Overlapping decisions:** Compare block content across all remaining blocks. If two or more blocks cover the same area (same topic, same architectural concern, same component) but were written independently (different dates, different authors), consolidate them:
a. Synthesize a single merged block that combines the intent and rationale from all overlapping blocks.
b. Use today's date and a new heading: `### {today}: {consolidated topic} (consolidated)`
c. Credit all original authors: `**By:** {Name1}, {Name2}`
d. Under **What:**, combine the decisions. Note any differences or evolution.
e. Under **Why:**, merge the rationale, preserving unique reasoning from each.
f. Remove the original overlapping blocks.
- Write the updated file back. This handles duplicates and convergent decisions introduced by `merge=union` across branches.
4.**Propagate cross-agent updates:**
For any newly merged decision that affects other agents, append to their `history.md`:
```
📌 Team update ({timestamp}): {summary} — decided by {Name}
```
5. **Commit `.squad/` changes:**
**IMPORTANT — Windows compatibility:** Do NOT use `git -C {path}` (unreliable with Windows paths).
Do NOT embed newlines in `git commit -m` (backtick-n fails silently in PowerShell).
Instead:
- `cd` into the team root first.
- Stage all `.squad/` files: `git add .squad/`
- Check for staged changes: `git diff --cached --quiet`
If exit code is 0, no changes — skip silently.
- Write the commit message to a temp file, then commit with `-F`:
description: "Standard collaboration patterns for all squad agents — worktree awareness, decisions, cross-agent communication"
domain: "team-workflow"
confidence: "high"
source: "extracted from charter boilerplate — identical content in 18+ agent charters"
---
## Context
Every agent on the team follows identical collaboration patterns for worktree awareness, decision recording, and cross-agent communication. These were previously duplicated in every charter's Collaboration section (~300 bytes × 18 agents = ~5.4KB of redundant context). Now centralized here.
The coordinator's spawn prompt already instructs agents to read decisions.md and their history.md. This skill adds the patterns for WRITING decisions and requesting help.
## Patterns
### Worktree Awareness
Use the `TEAM ROOT` path provided in your spawn prompt. All `.squad/` paths are relative to this root. If TEAM ROOT is not provided (rare), run `git rev-parse --show-toplevel` as fallback. Never assume CWD is the repo root.
### Decision Recording
After making a decision that affects other team members, write it to:
If you need another team member's input, say so in your response. The coordinator will bring them in. Don't try to do work outside your domain.
### Reviewer Protocol
If you have reviewer authority and reject work: the original author is locked out from revising that artifact. A different agent must own the revision. State who should revise in your rejection response.
## Anti-Patterns
- Don't read all agent charters — you only need your own context + decisions.md
- Don't write directly to `.squad/decisions.md` — always use the inbox drop-box
- Don't assume CWD is the repo root — always use TEAM ROOT
---
name: "agent-collaboration"
description: "Standard collaboration patterns for all squad agents — worktree awareness, decisions, cross-agent communication"
domain: "team-workflow"
confidence: "high"
source: "extracted from charter boilerplate — identical content in 18+ agent charters"
---
## Context
Every agent on the team follows identical collaboration patterns for worktree awareness, decision recording, and cross-agent communication. These were previously duplicated in every charter's Collaboration section (~300 bytes × 18 agents = ~5.4KB of redundant context). Now centralized here.
The coordinator's spawn prompt already instructs agents to read decisions.md and their history.md. This skill adds the patterns for WRITING decisions and requesting help.
## Patterns
### Worktree Awareness
Use the `TEAM ROOT` path provided in your spawn prompt. All `.squad/` paths are relative to this root. If TEAM ROOT is not provided (rare), run `git rev-parse --show-toplevel` as fallback. Never assume CWD is the repo root.
### Decision Recording
After making a decision that affects other team members, write it to:
If you need another team member's input, say so in your response. The coordinator will bring them in. Don't try to do work outside your domain.
### Reviewer Protocol
If you have reviewer authority and reject work: the original author is locked out from revising that artifact. A different agent must own the revision. State who should revise in your rejection response.
## Anti-Patterns
- Don't read all agent charters — you only need your own context + decisions.md
- Don't write directly to `.squad/decisions.md` — always use the inbox drop-box
description: "Shared hard rules enforced across all squad agents"
domain: "team-governance"
confidence: "high"
source: "reskill extraction — Product Isolation Rule and Peer Quality Check appeared in all 20 agent charters"
---
## Context
Every squad agent must follow these two hard rules. They were previously duplicated in every charter. Now they live here as a shared skill, loaded once.
## Patterns
### Product Isolation Rule (hard rule)
Tests, CI workflows, and product code must NEVER depend on specific agent names from any particular squad. "Our squad" must not impact "the squad." No hardcoded references to agent names (Flight, EECOM, FIDO, etc.) in test assertions, CI configs, or product logic. Use generic/parameterized values. If a test needs agent names, use obviously-fake test fixtures (e.g., "test-agent-1", "TestBot").
### Peer Quality Check (hard rule)
Before finishing work, verify your changes don't break existing tests. Run the test suite for files you touched. If CI has been failing, check your changes aren't contributing to the problem. When you learn from mistakes, update your history.md.
## Anti-Patterns
- Don't hardcode dev team agent names in product code or tests
- Don't skip test verification before declaring work done
- Don't ignore pre-existing CI failures that your changes may worsen
---
name: "agent-conduct"
description: "Shared hard rules enforced across all squad agents"
domain: "team-governance"
confidence: "high"
source: "reskill extraction — Product Isolation Rule and Peer Quality Check appeared in all 20 agent charters"
---
## Context
Every squad agent must follow these two hard rules. They were previously duplicated in every charter. Now they live here as a shared skill, loaded once.
## Patterns
### Product Isolation Rule (hard rule)
Tests, CI workflows, and product code must NEVER depend on specific agent names from any particular squad. "Our squad" must not impact "the squad." No hardcoded references to agent names (Flight, EECOM, FIDO, etc.) in test assertions, CI configs, or product logic. Use generic/parameterized values. If a test needs agent names, use obviously-fake test fixtures (e.g., "test-agent-1", "TestBot").
### Peer Quality Check (hard rule)
Before finishing work, verify your changes don't break existing tests. Run the test suite for files you touched. If CI has been failing, check your changes aren't contributing to the problem. When you learn from mistakes, update your history.md.
## Anti-Patterns
- Don't hardcode dev team agent names in product code or tests
- Don't skip test verification before declaring work done
- Don't ignore pre-existing CI failures that your changes may worsen
source: "extracted from Drucker and Trejo charters — earned knowledge from v0.8.22 release incident"
---
## Context
CI workflows must be defensive. These patterns were learned from the v0.8.22 release disaster where invalid semver, wrong token types, missing retry logic, and draft releases caused a multi-hour outage. Both Drucker (CI/CD) and Trejo (Release Manager) carried this knowledge in their charters — now centralized here.
## Patterns
### Semver Validation Gate
Every publish workflow MUST validate version format before `npm publish`. 4-part versions (e.g., 0.8.21.4) are NOT valid semver — npm mangles them.
```yaml
- name:Validate semver
run:|
VERSION="${{ github.event.release.tag_name }}"
VERSION="${VERSION#v}"
if ! npx semver "$VERSION" > /dev/null 2>&1; then
echo "❌ Invalid semver: $VERSION"
echo "Only 3-part versions (X.Y.Z) or prerelease (X.Y.Z-tag.N) are valid."
exit 1
fi
echo "✅ Valid semver: $VERSION"
```
### NPM Token Type Verification
NPM_TOKEN MUST be an Automation token, not a User token with 2FA:
- User tokens require OTP — CI can't provide it → EOTP error
- If using workflow_dispatch: verify release is published via GitHub API before proceeding
### Build Script Protection
Set `SKIP_BUILD_BUMP=1` (or `$env:SKIP_BUILD_BUMP = "1"` on Windows) before ANY release build. bump-build.mjs is for dev builds ONLY — it silently mutates versions.
## Known Failure Modes (v0.8.22 Incident)
| # | What Happened | Root Cause | Prevention |
|---|---------------|-----------|------------|
| 1 | 4-part version published, npm mangled it | No semver validation gate | `npx semver` check before every publish |
| 2 | CI failed 5+ times with EOTP | User token with 2FA | Automation token only |
| 3 | Verify returned false 404 | No retry logic for propagation | 5 attempts, 15s intervals |
| 4 | Workflow never triggered | Draft release doesn't emit event | Never create draft releases |
| 5 | Version mutated during release | bump-build.mjs ran in release | SKIP_BUILD_BUMP=1 |
## Anti-Patterns
- ❌ Publishing without semver validation gate
- ❌ Single-shot verification without retry
- ❌ Hard-coded secrets in workflows
- ❌ Silent CI failures — every error needs actionable output with remediation
source: "extracted from Drucker and Trejo charters — earned knowledge from v0.8.22 release incident"
---
## Context
CI workflows must be defensive. These patterns were learned from the v0.8.22 release disaster where invalid semver, wrong token types, missing retry logic, and draft releases caused a multi-hour outage. Both Drucker (CI/CD) and Trejo (Release Manager) carried this knowledge in their charters — now centralized here.
## Patterns
### Semver Validation Gate
Every publish workflow MUST validate version format before `npm publish`. 4-part versions (e.g., 0.8.21.4) are NOT valid semver — npm mangles them.
```yaml
- name:Validate semver
run:|
VERSION="${{ github.event.release.tag_name }}"
VERSION="${VERSION#v}"
if ! npx semver "$VERSION" > /dev/null 2>&1; then
echo "❌ Invalid semver: $VERSION"
echo "Only 3-part versions (X.Y.Z) or prerelease (X.Y.Z-tag.N) are valid."
exit 1
fi
echo "✅ Valid semver: $VERSION"
```
### NPM Token Type Verification
NPM_TOKEN MUST be an Automation token, not a User token with 2FA:
- User tokens require OTP — CI can't provide it → EOTP error
- If using workflow_dispatch: verify release is published via GitHub API before proceeding
### Build Script Protection
Set `SKIP_BUILD_BUMP=1` (or `$env:SKIP_BUILD_BUMP = "1"` on Windows) before ANY release build. bump-build.mjs is for dev builds ONLY — it silently mutates versions.
## Known Failure Modes (v0.8.22 Incident)
| # | What Happened | Root Cause | Prevention |
|---|---------------|-----------|------------|
| 1 | 4-part version published, npm mangled it | No semver validation gate | `npx semver` check before every publish |
| 2 | CI failed 5+ times with EOTP | User token with 2FA | Automation token only |
| 3 | Verify returned false 404 | No retry logic for propagation | 5 attempts, 15s intervals |
| 4 | Workflow never triggered | Draft release doesn't emit event | Never create draft releases |
| 5 | Version mutated during release | bump-build.mjs ran in release | SKIP_BUILD_BUMP=1 |
## Anti-Patterns
- ❌ Publishing without semver validation gate
- ❌ Single-shot verification without retry
- ❌ Hard-coded secrets in workflows
- ❌ Silent CI failures — every error needs actionable output with remediation
description: "Platform detection and adaptive spawning for CLI vs VS Code vs other surfaces"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
Squad runs on multiple Copilot surfaces (CLI, VS Code, JetBrains, GitHub.com). The coordinator must detect its platform and adapt spawning behavior accordingly. Different tools are available on different platforms, requiring conditional logic for agent spawning, SQL usage, and response timing.
## Patterns
### Platform Detection
Before spawning agents, determine the platform by checking available tools:
1.**CLI mode** — `task` tool is available → full spawning control. Use `task` with `agent_type`, `mode`, `model`, `description`, `prompt` parameters. Collect results via `read_agent`.
2.**VS Code mode** — `runSubagent` or `agent` tool is available → conditional behavior. Use `runSubagent` with the task prompt. Drop `agent_type`, `mode`, and `model` parameters. Multiple subagents in one turn run concurrently (equivalent to background mode). Results return automatically — no `read_agent` needed.
3.**Fallback mode** — neither `task` nor `runSubagent`/`agent` available → work inline. Do not apologize or explain the limitation. Execute the task directly.
If both `task` and `runSubagent` are available, prefer `task` (richer parameter surface).
### VS Code Spawn Adaptations
When in VS Code mode, the coordinator changes behavior in these ways:
- **Spawning tool:** Use `runSubagent` instead of `task`. The prompt is the only required parameter — pass the full agent prompt (charter, identity, task, hygiene, response order) exactly as you would on CLI.
- **Parallelism:** Spawn ALL concurrent agents in a SINGLE turn. They run in parallel automatically. This replaces `mode: "background"` + `read_agent` polling.
- **Model selection:** Accept the session model. Do NOT attempt per-spawn model selection or fallback chains — they only work on CLI. In Phase 1, all subagents use whatever model the user selected in VS Code's model picker.
- **Scribe:** Cannot fire-and-forget. Batch Scribe as the LAST subagent in any parallel group. Scribe is light work (file ops only), so the blocking is tolerable.
- **Launch table:** Skip it. Results arrive with the response, not separately. By the time the coordinator speaks, the work is already done.
- **`read_agent`:** Skip entirely. Results return automatically when subagents complete.
- **`agent_type`:** Drop it. All VS Code subagents have full tool access by default. Subagents inherit the parent's tools.
- **`description`:** Drop it. The agent name is already in the prompt.
- **Prompt content:** Keep ALL prompt structure — charter, identity, task, hygiene, response order blocks are surface-independent.
### Feature Degradation Table
| Feature | CLI | VS Code | Degradation |
|---------|-----|---------|-------------|
| Parallel fan-out | `mode: "background"` + `read_agent` | Multiple subagents in one turn | None — equivalent concurrency |
| Model selection | Per-spawn `model` param (4-layer hierarchy) | Session model only (Phase 1) | Accept session model, log intent |
| Scribe fire-and-forget | Background, never read | Sync, must wait | Batch with last parallel group |
| Launch table UX | Show table → results later | Skip table → results with response | UX only — results are correct |
| SQL tool | Available | Not available | Avoid SQL in cross-platform code paths |
| Response order bug | Critical workaround | Possibly necessary (unverified) | Keep the block — harmless if unnecessary |
### SQL Tool Caveat
The `sql` tool is **CLI-only**. It does not exist on VS Code, JetBrains, or GitHub.com. Any coordinator logic or agent workflow that depends on SQL (todo tracking, batch processing, session state) will silently fail on non-CLI surfaces. Cross-platform code paths must not depend on SQL. Use filesystem-based state (`.squad/` files) for anything that must work everywhere.
## Examples
**Example 1: CLI parallel spawn**
```typescript
// Coordinator detects task tool available → CLI mode
// Coordinator detects runSubagent available → VS Code mode
runSubagent({prompt:"...Fenster charter + task..."})
runSubagent({prompt:"...Hockney charter + task..."})
runSubagent({prompt:"...Scribe charter + task..."})// Last in group
// Results return automatically, no read_agent
```
**Example 3: Fallback mode**
```typescript
// Neither task nor runSubagent available → work inline
// Coordinator executes the task directly without spawning
```
## Anti-Patterns
- ❌ Using SQL tool in cross-platform workflows (breaks on VS Code/JetBrains/GitHub.com)
- ❌ Attempting per-spawn model selection on VS Code (Phase 1 — only session model works)
- ❌ Fire-and-forget Scribe on VS Code (must batch as last subagent)
- ❌ Showing launch table on VS Code (results already inline)
- ❌ Apologizing or explaining platform limitations to the user
- ❌ Using `task` when only `runSubagent` is available
- ❌ Dropping prompt structure (charter/identity/task) on non-CLI platforms
---
name: "client-compatibility"
description: "Platform detection and adaptive spawning for CLI vs VS Code vs other surfaces"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
Squad runs on multiple Copilot surfaces (CLI, VS Code, JetBrains, GitHub.com). The coordinator must detect its platform and adapt spawning behavior accordingly. Different tools are available on different platforms, requiring conditional logic for agent spawning, SQL usage, and response timing.
## Patterns
### Platform Detection
Before spawning agents, determine the platform by checking available tools:
1.**CLI mode** — `task` tool is available → full spawning control. Use `task` with `agent_type`, `mode`, `model`, `description`, `prompt` parameters. Collect results via `read_agent`.
2.**VS Code mode** — `runSubagent` or `agent` tool is available → conditional behavior. Use `runSubagent` with the task prompt. Drop `agent_type`, `mode`, and `model` parameters. Multiple subagents in one turn run concurrently (equivalent to background mode). Results return automatically — no `read_agent` needed.
3.**Fallback mode** — neither `task` nor `runSubagent`/`agent` available → work inline. Do not apologize or explain the limitation. Execute the task directly.
If both `task` and `runSubagent` are available, prefer `task` (richer parameter surface).
### VS Code Spawn Adaptations
When in VS Code mode, the coordinator changes behavior in these ways:
- **Spawning tool:** Use `runSubagent` instead of `task`. The prompt is the only required parameter — pass the full agent prompt (charter, identity, task, hygiene, response order) exactly as you would on CLI.
- **Parallelism:** Spawn ALL concurrent agents in a SINGLE turn. They run in parallel automatically. This replaces `mode: "background"` + `read_agent` polling.
- **Model selection:** Accept the session model. Do NOT attempt per-spawn model selection or fallback chains — they only work on CLI. In Phase 1, all subagents use whatever model the user selected in VS Code's model picker.
- **Scribe:** Cannot fire-and-forget. Batch Scribe as the LAST subagent in any parallel group. Scribe is light work (file ops only), so the blocking is tolerable.
- **Launch table:** Skip it. Results arrive with the response, not separately. By the time the coordinator speaks, the work is already done.
- **`read_agent`:** Skip entirely. Results return automatically when subagents complete.
- **`agent_type`:** Drop it. All VS Code subagents have full tool access by default. Subagents inherit the parent's tools.
- **`description`:** Drop it. The agent name is already in the prompt.
- **Prompt content:** Keep ALL prompt structure — charter, identity, task, hygiene, response order blocks are surface-independent.
### Feature Degradation Table
| Feature | CLI | VS Code | Degradation |
|---------|-----|---------|-------------|
| Parallel fan-out | `mode: "background"` + `read_agent` | Multiple subagents in one turn | None — equivalent concurrency |
| Model selection | Per-spawn `model` param (4-layer hierarchy) | Session model only (Phase 1) | Accept session model, log intent |
| Scribe fire-and-forget | Background, never read | Sync, must wait | Batch with last parallel group |
| Launch table UX | Show table → results later | Skip table → results with response | UX only — results are correct |
| SQL tool | Available | Not available | Avoid SQL in cross-platform code paths |
| Response order bug | Critical workaround | Possibly necessary (unverified) | Keep the block — harmless if unnecessary |
### SQL Tool Caveat
The `sql` tool is **CLI-only**. It does not exist on VS Code, JetBrains, or GitHub.com. Any coordinator logic or agent workflow that depends on SQL (todo tracking, batch processing, session state) will silently fail on non-CLI surfaces. Cross-platform code paths must not depend on SQL. Use filesystem-based state (`.squad/` files) for anything that must work everywhere.
## Examples
**Example 1: CLI parallel spawn**
```typescript
// Coordinator detects task tool available → CLI mode
description: "Coordinating work across multiple Squad instances"
domain: "orchestration"
confidence: "medium"
source: "manual"
tools:
- name: "squad-discover"
description: "List known squads and their capabilities"
when: "When you need to find which squad can handle a task"
- name: "squad-delegate"
description: "Create work in another squad's repository"
when: "When a task belongs to another squad's domain"
---
## Context
When an organization runs multiple Squad instances (e.g., platform-squad, frontend-squad, data-squad), those squads need to discover each other, share context, and hand off work across repository boundaries. This skill teaches agents how to coordinate across squads without creating tight coupling.
Cross-squad orchestration applies when:
- A task requires capabilities owned by another squad
- An architectural decision affects multiple squads
- A feature spans multiple repositories with different squads
- A squad needs to request infrastructure, tooling, or support from another squad
## Patterns
### Discovery via Manifest
Each squad publishes a `.squad/manifest.json` declaring its name, capabilities, and contact information. Squads discover each other through:
1.**Well-known paths**: Check `.squad/manifest.json` in known org repos
2.**Upstream config**: Squads already listed in `.squad/upstream.json` are checked for manifests
3.**Explicit registry**: A central `squad-registry.json` can list all squads in an org
- **Direct file writes across repos** — Never modify another squad's `.squad/` directory. Use issues and PRs as the communication protocol.
- **Tight coupling** — Don't depend on another squad's internal structure. Use the manifest as the public API contract.
- **Unbounded delegation** — Always include acceptance criteria and a timeout. Don't create open-ended requests.
- **Skipping discovery** — Don't hardcode squad locations. Use manifests and the discovery protocol.
- **Sharing secrets** — Never include credentials, tokens, or internal URLs in cross-squad issues.
- **Circular delegation** — Track delegation chains. If squad A delegates to B which delegates back to A, something is wrong.
---
name: "cross-squad"
description: "Coordinating work across multiple Squad instances"
domain: "orchestration"
confidence: "medium"
source: "manual"
tools:
- name: "squad-discover"
description: "List known squads and their capabilities"
when: "When you need to find which squad can handle a task"
- name: "squad-delegate"
description: "Create work in another squad's repository"
when: "When a task belongs to another squad's domain"
---
## Context
When an organization runs multiple Squad instances (e.g., platform-squad, frontend-squad, data-squad), those squads need to discover each other, share context, and hand off work across repository boundaries. This skill teaches agents how to coordinate across squads without creating tight coupling.
Cross-squad orchestration applies when:
- A task requires capabilities owned by another squad
- An architectural decision affects multiple squads
- A feature spans multiple repositories with different squads
- A squad needs to request infrastructure, tooling, or support from another squad
## Patterns
### Discovery via Manifest
Each squad publishes a `.squad/manifest.json` declaring its name, capabilities, and contact information. Squads discover each other through:
1.**Well-known paths**: Check `.squad/manifest.json` in known org repos
2.**Upstream config**: Squads already listed in `.squad/upstream.json` are checked for manifests
3.**Explicit registry**: A central `squad-registry.json` can list all squads in an org
**✅ THIS SKILL PRODUCES (exactly these, nothing more):**
1.**`mesh.json`** — Generated from user answers about zones and squads (which squads participate, what zone each is in, paths/URLs for each), using `mesh.json.example` in this skill's directory as the schema template
2.**`sync-mesh.sh` and `sync-mesh.ps1`** — Copied from this skill's directory into the project root (these are bundled resources, NOT generated code)
3.**Zone 2 state repo initialization** (if applicable) — If the user specified a Zone 2 shared state repo, run `sync-mesh.sh --init` to scaffold the state repo structure
4.**A decision entry** in `.squad/decisions/inbox/` documenting the mesh configuration for team awareness
**❌ THIS SKILL DOES NOT PRODUCE:**
- **No application code** — No validators, libraries, or modules of any kind
- **No test files** — No test suites, test cases, or test scaffolding
- **No GENERATING sync scripts** — They are bundled with this skill as pre-built resources. COPY them, don't generate them.
- **No daemons or services** — No background processes, servers, or persistent runtimes
- **No modifications to existing squad files** beyond the decision entry (no changes to team.md, routing.md, agent charters, etc.)
**Your role:** Configure the mesh topology and install the bundled sync scripts. Nothing more.
## Context
When squads are on different machines (developer laptops, CI runners, cloud VMs, partner orgs), the local file-reading convention still works — but remote files need to arrive on your disk first. This skill teaches the pattern for distributed squad communication.
**When this applies:**
- Squads span multiple machines, VMs, or CI runners
- Squads span organizations or companies
- An agent needs context from a squad whose files aren't on the local filesystem
**When this does NOT apply:**
- All squads are on the same machine (just read the files directly)
## Patterns
### The Core Principle
> "The filesystem is the mesh, and git is how the mesh crosses machine boundaries."
The agent interface never changes. Agents always read local files. The distributed layer's only job is to make remote files appear locally before the agent reads them.
### Three Zones of Communication
**Zone 1 — Local:** Same filesystem. Read files directly. Zero transport.
**Zone 2 — Remote-Trusted:** Different host, same org, shared git auth. Transport: `git pull` from a shared repo. This collapses Zone 2 into Zone 1 — files materialize on disk, agent reads them normally.
**Zone 3 — Remote-Opaque:** Different org, no shared auth. Transport: `curl` to fetch published contracts (SUMMARY.md). One-way visibility — you see only what they publish.
2. READ: cat .mesh/**/state.md — all files are local now
3. WORK: do their assigned work (the agent's normal task, NOT mesh-building)
4. WRITE: update own billboard, log, drops
5. PUBLISH: git add + commit + push — share state with remote peers
```
Steps 2–4 are identical to local-only. Steps 1 and 5 are the entire distributed extension. **Note:** "WORK" means the agent performs its normal squad duties — it does NOT mean "build mesh infrastructure."
Three zone types, one file. Local squads need only a path. Remote-trusted need a git URL. Remote-opaque need an HTTP URL.
### Write Partitioning
Each squad writes only to its own directory (`boards/{self}.md`, `squads/{self}/*`, `drops/{date}-{self}-*.md`). No two squads write to the same file. Git push/pull never conflicts. If push fails ("branch is behind"), the fix is always `git pull --rebase && git push`.
### Trust Boundaries
Trust maps to git permissions:
- **Same repo access** = full mesh visibility
- **Read-only access** = can observe, can't write
- **No access** = invisible (correct behavior)
For selective visibility, use separate repos per audience (internal, partner, public). Git permissions ARE the trust negotiation.
### Phased Rollout
- **Phase 0:** Convention only — document zones, agree on mesh.json fields, manually run `git pull`/`git push`. Zero new code.
- **Phase 1:** Sync script (~30 lines bash or PowerShell) when manual sync gets tedious.
- **Phase 2:** Published contracts + curl fetch when a Zone 3 partner appears.
- **Phase 3:** Never. No MCP federation, A2A, service discovery, message queues.
**Important:** Phases are NOT auto-advanced. These are project-level decisions — you start at Phase 0 (manual sync) and only move forward when the team decides complexity is justified.
### Mesh State Repo
The shared mesh state repo is a plain git repository — NOT a Squad project. It holds:
- One directory per participating squad
- Each directory contains at minimum a SUMMARY.md with the squad's current state
- A root README explaining what the repo is and who participates
No `.squad/` folder, no agents, no automation. Write partitioning means each squad only pushes to its own directory. The repo is a rendezvous point, not an intelligent system.
If you want a squad that *observes* mesh health, that's a separate Squad project that lists the state repo as a Zone 2 remote in its `mesh.json` — it does NOT live inside the state repo.
## Examples
### Developer Laptop + CI Squad (Zone 2)
Auth-squad agent wakes up. `git pull` brings ci-squad's latest results. Agent reads: "3 test failures in auth module." Adjusts work. Pushes results when done. **Overhead: one `git pull`, one `git push`.**
### Two Orgs Collaborating (Zone 3)
Payment-squad fetches partner's published SUMMARY.md via curl. Reads: "Risk scoring v3 API deprecated April 15. New field `device_fingerprint` required." The consuming agent (in payment-squad's team) reads this information and uses it to inform its work — for example, updating payment integration code to include the new field. Partner can't see payment-squad's internals.
### Same Org, Shared Mesh Repo (Zone 2)
Three squads on different machines. One shared git repo holds the mesh. Each squad: `git pull` before work, `git push` after. Write partitioning ensures zero merge conflicts.
## AGENT WORKFLOW (Deterministic Setup)
When a user invokes this skill to set up a distributed mesh, follow these steps **exactly, in order:**
### Step 1: ASK the user for mesh topology
Ask these questions (adapt phrasing naturally, but get these answers):
1.**Which squads are participating?** (List of squad names)
2.**For each squad, which zone is it in?**
-`local` — same filesystem (just need a path)
-`remote-trusted` — different machine, same org, shared git access (need git URL + ref)
-`remote-opaque` — different org, no shared auth (need HTTPS URL to published contract)
3.**For each squad, what's the connection info?**
- Local: relative or absolute path to their `.mesh/` directory
- Remote-trusted: git URL (SSH or HTTPS), ref (branch/tag), and where to sync it to locally
- Remote-opaque: HTTPS URL to their SUMMARY.md, where to sync it, and auth type (none/bearer)
4.**Where should the shared state live?** (For Zone 2 squads: git repo URL for the mesh state, or confirm each squad syncs independently)
### Step 2: GENERATE `mesh.json`
Using the answers from Step 1, create a `mesh.json` file at the project root. Use `mesh.json.example` from THIS skill's directory (`.squad/skills/distributed-mesh/mesh.json.example`) as the schema template.
**✅ THIS SKILL PRODUCES (exactly these, nothing more):**
1.**`mesh.json`** — Generated from user answers about zones and squads (which squads participate, what zone each is in, paths/URLs for each), using `mesh.json.example` in this skill's directory as the schema template
2.**`sync-mesh.sh` and `sync-mesh.ps1`** — Copied from this skill's directory into the project root (these are bundled resources, NOT generated code)
3.**Zone 2 state repo initialization** (if applicable) — If the user specified a Zone 2 shared state repo, run `sync-mesh.sh --init` to scaffold the state repo structure
4.**A decision entry** in `.squad/decisions/inbox/` documenting the mesh configuration for team awareness
**❌ THIS SKILL DOES NOT PRODUCE:**
- **No application code** — No validators, libraries, or modules of any kind
- **No test files** — No test suites, test cases, or test scaffolding
- **No GENERATING sync scripts** — They are bundled with this skill as pre-built resources. COPY them, don't generate them.
- **No daemons or services** — No background processes, servers, or persistent runtimes
- **No modifications to existing squad files** beyond the decision entry (no changes to team.md, routing.md, agent charters, etc.)
**Your role:** Configure the mesh topology and install the bundled sync scripts. Nothing more.
## Context
When squads are on different machines (developer laptops, CI runners, cloud VMs, partner orgs), the local file-reading convention still works — but remote files need to arrive on your disk first. This skill teaches the pattern for distributed squad communication.
**When this applies:**
- Squads span multiple machines, VMs, or CI runners
- Squads span organizations or companies
- An agent needs context from a squad whose files aren't on the local filesystem
**When this does NOT apply:**
- All squads are on the same machine (just read the files directly)
## Patterns
### The Core Principle
> "The filesystem is the mesh, and git is how the mesh crosses machine boundaries."
The agent interface never changes. Agents always read local files. The distributed layer's only job is to make remote files appear locally before the agent reads them.
### Three Zones of Communication
**Zone 1 — Local:** Same filesystem. Read files directly. Zero transport.
**Zone 2 — Remote-Trusted:** Different host, same org, shared git auth. Transport: `git pull` from a shared repo. This collapses Zone 2 into Zone 1 — files materialize on disk, agent reads them normally.
**Zone 3 — Remote-Opaque:** Different org, no shared auth. Transport: `curl` to fetch published contracts (SUMMARY.md). One-way visibility — you see only what they publish.
2. READ: cat .mesh/**/state.md — all files are local now
3. WORK: do their assigned work (the agent's normal task, NOT mesh-building)
4. WRITE: update own billboard, log, drops
5. PUBLISH: git add + commit + push — share state with remote peers
```
Steps 2–4 are identical to local-only. Steps 1 and 5 are the entire distributed extension. **Note:** "WORK" means the agent performs its normal squad duties — it does NOT mean "build mesh infrastructure."
Three zone types, one file. Local squads need only a path. Remote-trusted need a git URL. Remote-opaque need an HTTP URL.
### Write Partitioning
Each squad writes only to its own directory (`boards/{self}.md`, `squads/{self}/*`, `drops/{date}-{self}-*.md`). No two squads write to the same file. Git push/pull never conflicts. If push fails ("branch is behind"), the fix is always `git pull --rebase && git push`.
### Trust Boundaries
Trust maps to git permissions:
- **Same repo access** = full mesh visibility
- **Read-only access** = can observe, can't write
- **No access** = invisible (correct behavior)
For selective visibility, use separate repos per audience (internal, partner, public). Git permissions ARE the trust negotiation.
### Phased Rollout
- **Phase 0:** Convention only — document zones, agree on mesh.json fields, manually run `git pull`/`git push`. Zero new code.
- **Phase 1:** Sync script (~30 lines bash or PowerShell) when manual sync gets tedious.
- **Phase 2:** Published contracts + curl fetch when a Zone 3 partner appears.
- **Phase 3:** Never. No MCP federation, A2A, service discovery, message queues.
**Important:** Phases are NOT auto-advanced. These are project-level decisions — you start at Phase 0 (manual sync) and only move forward when the team decides complexity is justified.
### Mesh State Repo
The shared mesh state repo is a plain git repository — NOT a Squad project. It holds:
- One directory per participating squad
- Each directory contains at minimum a SUMMARY.md with the squad's current state
- A root README explaining what the repo is and who participates
No `.squad/` folder, no agents, no automation. Write partitioning means each squad only pushes to its own directory. The repo is a rendezvous point, not an intelligent system.
If you want a squad that *observes* mesh health, that's a separate Squad project that lists the state repo as a Zone 2 remote in its `mesh.json` — it does NOT live inside the state repo.
## Examples
### Developer Laptop + CI Squad (Zone 2)
Auth-squad agent wakes up. `git pull` brings ci-squad's latest results. Agent reads: "3 test failures in auth module." Adjusts work. Pushes results when done. **Overhead: one `git pull`, one `git push`.**
### Two Orgs Collaborating (Zone 3)
Payment-squad fetches partner's published SUMMARY.md via curl. Reads: "Risk scoring v3 API deprecated April 15. New field `device_fingerprint` required." The consuming agent (in payment-squad's team) reads this information and uses it to inform its work — for example, updating payment integration code to include the new field. Partner can't see payment-squad's internals.
### Same Org, Shared Mesh Repo (Zone 2)
Three squads on different machines. One shared git repo holds the mesh. Each squad: `git pull` before work, `git push` after. Write partitioning ensures zero merge conflicts.
## AGENT WORKFLOW (Deterministic Setup)
When a user invokes this skill to set up a distributed mesh, follow these steps **exactly, in order:**
### Step 1: ASK the user for mesh topology
Ask these questions (adapt phrasing naturally, but get these answers):
1.**Which squads are participating?** (List of squad names)
2.**For each squad, which zone is it in?**
-`local` — same filesystem (just need a path)
-`remote-trusted` — different machine, same org, shared git access (need git URL + ref)
-`remote-opaque` — different org, no shared auth (need HTTPS URL to published contract)
3.**For each squad, what's the connection info?**
- Local: relative or absolute path to their `.mesh/` directory
- Remote-trusted: git URL (SSH or HTTPS), ref (branch/tag), and where to sync it to locally
- Remote-opaque: HTTPS URL to their SUMMARY.md, where to sync it, and auth type (none/bearer)
4.**Where should the shared state live?** (For Zone 2 squads: git repo URL for the mesh state, or confirm each squad syncs independently)
### Step 2: GENERATE `mesh.json`
Using the answers from Step 1, create a `mesh.json` file at the project root. Use `mesh.json.example` from THIS skill's directory (`.squad/skills/distributed-mesh/mesh.json.example`) as the schema template.
Squad documentation follows the Microsoft Style Guide with Squad-specific conventions. Consistency across docs builds trust and improves discoverability.
## Patterns
### Microsoft Style Guide Rules
- **Sentence-case headings:** "Getting started" not "Getting Started"
- **Active voice:** "Run the command" not "The command should be run"
- **Second person:** "You can configure..." not "Users can configure..."
- **Present tense:** "The system routes..." not "The system will route..."
- **No ampersands in prose:** "and" not "&" (except in code, brand names, or UI elements)
### Squad Formatting Patterns
- **Scannability first:** Paragraphs for narrative (3-4 sentences max), bullets for scannable lists, tables for structured data
- **"Try this" prompts at top:** Start feature/scenario pages with practical prompts users can copy
- **Experimental warnings:** Features in preview get callout at top
- **Cross-references at bottom:** Related pages linked after main content
- **Always update test assertions:** When adding docs pages to `features/`, `scenarios/`, `guides/`, update corresponding `EXPECTED_*` arrays in `test/docs-build.test.ts` in the same commit
## Examples
✓ **Correct:**
```markdown
# Getting started with Squad
> ⚠️ **Experimental:** This feature is in preview.
Try this:
\`\`\`bash
squad init
\`\`\`
Squad helps you build AI teams...
---
## Install Squad
Run the following command...
```
✗ **Incorrect:**
```markdown
# Getting Started With Squad // Title case
Squad is a tool which will help users... // Third person, future tense
You can install Squad with npm & configure it... // Ampersand in prose
```
## Anti-Patterns
- Title-casing headings because "it looks nicer"
- Writing in passive voice or third person
- Long paragraphs of dense text (breaks scannability)
- Adding doc pages without updating test assertions
Squad documentation follows the Microsoft Style Guide with Squad-specific conventions. Consistency across docs builds trust and improves discoverability.
## Patterns
### Microsoft Style Guide Rules
- **Sentence-case headings:** "Getting started" not "Getting Started"
- **Active voice:** "Run the command" not "The command should be run"
- **Second person:** "You can configure..." not "Users can configure..."
- **Present tense:** "The system routes..." not "The system will route..."
- **No ampersands in prose:** "and" not "&" (except in code, brand names, or UI elements)
### Squad Formatting Patterns
- **Scannability first:** Paragraphs for narrative (3-4 sentences max), bullets for scannable lists, tables for structured data
- **"Try this" prompts at top:** Start feature/scenario pages with practical prompts users can copy
- **Experimental warnings:** Features in preview get callout at top
- **Cross-references at bottom:** Related pages linked after main content
- **Always update test assertions:** When adding docs pages to `features/`, `scenarios/`, `guides/`, update corresponding `EXPECTED_*` arrays in `test/docs-build.test.ts` in the same commit
## Examples
✓ **Correct:**
```markdown
# Getting started with Squad
> ⚠️ **Experimental:** This feature is in preview.
Try this:
\`\`\`bash
squad init
\`\`\`
Squad helps you build AI teams...
---
## Install Squad
Run the following command...
```
✗ **Incorrect:**
```markdown
# Getting Started With Squad // Title case
Squad is a tool which will help users... // Third person, future tense
You can install Squad with npm & configure it... // Ampersand in prose
```
## Anti-Patterns
- Title-casing headings because "it looks nicer"
- Writing in passive voice or third person
- Long paragraphs of dense text (breaks scannability)
- Adding doc pages without updating test assertions
description: "Shifts Layer 3 model selection to cost-optimized alternatives when economy mode is active."
domain: "model-selection"
confidence: "low"
source: "manual"
---
## SCOPE
✅ THIS SKILL PRODUCES:
- A modified Layer 3 model selection table applied when economy mode is active
-`economyMode: true` written to `.squad/config.json` when activated persistently
- Spawn acknowledgments with `💰` indicator when economy mode is active
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Cost reports or billing artifacts
- Changes to Layer 0, Layer 1, or Layer 2 resolution (user intent always wins)
## Context
Economy mode shifts Layer 3 (Task-Aware Auto-Selection) to lower-cost alternatives. It does NOT override persistent config (`defaultModel`, `agentModelOverrides`) or per-agent charter preferences — those represent explicit user intent and always take priority.
Use this skill when the user wants to reduce costs across an entire session or permanently, without manually specifying models for each agent.
**Prefer `gpt-4.1` over `gpt-5-mini`** when the task involves structured output or agentic tool use. Prefer `gpt-5-mini` for pure text generation tasks where latency matters.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `economyMode: true` — if present, activate economy mode for the session
2. ACKNOWLEDGE: `✅ Economy mode active — using cost-optimized models this session. (Layer 0 and Layer 2 preferences still apply)`
**Persistent:** "always use economy mode", "save economy mode"
1. WRITE `economyMode: true` to `.squad/config.json` (merge, don't overwrite other fields)
2. ACKNOWLEDGE: `✅ Economy mode saved — cost-optimized models will be used until disabled.`
### On Every Agent Spawn (Economy Mode Active)
1. CHECK Layer 0a/0b first (agentModelOverrides, defaultModel) — if set, use that. Economy mode does NOT override Layer 0.
2. CHECK Layer 1 (session directive for a specific model) — if set, use that. Economy mode does NOT override explicit session directives.
3. CHECK Layer 2 (charter preference) — if set, use that. Economy mode does NOT override charter preferences.
4. APPLY economy table at Layer 3 instead of normal table.
5. INCLUDE `💰` in spawn acknowledgment: `🔧 {Name} ({model} · 💰 economy) — {task}`
### On Deactivation
**Trigger phrases:** "turn off economy mode", "disable economy mode", "use normal models"
1. REMOVE `economyMode` from `.squad/config.json` (if it was persisted)
2. CLEAR session economy mode state
3. ACKNOWLEDGE: `✅ Economy mode disabled — returning to standard model selection.`
### STOP
After updating economy mode state and including the `💰` indicator in spawn acknowledgments, this skill is done. Do NOT:
- Change Layer 0, Layer 1, or Layer 2 model choices
- Override charter-specified models
- Generate cost reports or comparisons
- Fall back to premium models via economy mode (economy mode never bumps UP)
## Config Schema
`.squad/config.json` economy-related fields:
```json
{
"version":1,
"economyMode":true
}
```
-`economyMode` — when `true`, Layer 3 uses the economy table. Optional; absent = economy mode off.
- Combines with `defaultModel` and `agentModelOverrides` — Layer 0 always wins.
## Anti-Patterns
- **Don't override Layer 0 in economy mode.** If the user set `defaultModel: "claude-opus-4.6"`, they want quality. Economy mode only affects Layer 3 auto-selection.
- **Don't silently apply economy mode.** Always acknowledge when activated or deactivated.
- **Don't treat economy mode as permanent by default.** Session phrases activate session-only; only "always" or `config.json` persist it.
- **Don't bump premium tasks down too far.** Architecture and security reviews shift from opus to sonnet in economy mode — they do NOT go to fast/cheap models.
---
name: "economy-mode"
description: "Shifts Layer 3 model selection to cost-optimized alternatives when economy mode is active."
domain: "model-selection"
confidence: "low"
source: "manual"
---
## SCOPE
✅ THIS SKILL PRODUCES:
- A modified Layer 3 model selection table applied when economy mode is active
-`economyMode: true` written to `.squad/config.json` when activated persistently
- Spawn acknowledgments with `💰` indicator when economy mode is active
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Cost reports or billing artifacts
- Changes to Layer 0, Layer 1, or Layer 2 resolution (user intent always wins)
## Context
Economy mode shifts Layer 3 (Task-Aware Auto-Selection) to lower-cost alternatives. It does NOT override persistent config (`defaultModel`, `agentModelOverrides`) or per-agent charter preferences — those represent explicit user intent and always take priority.
Use this skill when the user wants to reduce costs across an entire session or permanently, without manually specifying models for each agent.
**Prefer `gpt-4.1` over `gpt-5-mini`** when the task involves structured output or agentic tool use. Prefer `gpt-5-mini` for pure text generation tasks where latency matters.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `economyMode: true` — if present, activate economy mode for the session
2. ACKNOWLEDGE: `✅ Economy mode active — using cost-optimized models this session. (Layer 0 and Layer 2 preferences still apply)`
**Persistent:** "always use economy mode", "save economy mode"
1. WRITE `economyMode: true` to `.squad/config.json` (merge, don't overwrite other fields)
2. ACKNOWLEDGE: `✅ Economy mode saved — cost-optimized models will be used until disabled.`
### On Every Agent Spawn (Economy Mode Active)
1. CHECK Layer 0a/0b first (agentModelOverrides, defaultModel) — if set, use that. Economy mode does NOT override Layer 0.
2. CHECK Layer 1 (session directive for a specific model) — if set, use that. Economy mode does NOT override explicit session directives.
3. CHECK Layer 2 (charter preference) — if set, use that. Economy mode does NOT override charter preferences.
4. APPLY economy table at Layer 3 instead of normal table.
5. INCLUDE `💰` in spawn acknowledgment: `🔧 {Name} ({model} · 💰 economy) — {task}`
### On Deactivation
**Trigger phrases:** "turn off economy mode", "disable economy mode", "use normal models"
1. REMOVE `economyMode` from `.squad/config.json` (if it was persisted)
2. CLEAR session economy mode state
3. ACKNOWLEDGE: `✅ Economy mode disabled — returning to standard model selection.`
### STOP
After updating economy mode state and including the `💰` indicator in spawn acknowledgments, this skill is done. Do NOT:
- Change Layer 0, Layer 1, or Layer 2 model choices
- Override charter-specified models
- Generate cost reports or comparisons
- Fall back to premium models via economy mode (economy mode never bumps UP)
## Config Schema
`.squad/config.json` economy-related fields:
```json
{
"version":1,
"economyMode":true
}
```
-`economyMode` — when `true`, Layer 3 uses the economy table. Optional; absent = economy mode off.
- Combines with `defaultModel` and `agentModelOverrides` — Layer 0 always wins.
## Anti-Patterns
- **Don't override Layer 0 in economy mode.** If the user set `defaultModel: "claude-opus-4.6"`, they want quality. Economy mode only affects Layer 3 auto-selection.
- **Don't silently apply economy mode.** Always acknowledge when activated or deactivated.
- **Don't treat economy mode as permanent by default.** Session phrases activate session-only; only "always" or `config.json` persist it.
- **Don't bump premium tasks down too far.** Architecture and security reviews shift from opus to sonnet in economy mode — they do NOT go to fast/cheap models.
| Author disagrees with a decision or design | Empathetic Disagreement | T8 |
| Need more reproduction info or context | Information Request | T9 |
Use exactly one template as the base draft. Replace placeholders with issue-specific details, then apply the humanizer patterns. If the thread spans multiple signals, choose the highest-risk template and capture the nuance in the thread summary.
### Confidence Classification
| Confidence | Criteria | Example |
|-----------|----------|---------|
| 🟢 High | Answer exists in Squad docs or FAQ, similar question answered before, no technical ambiguity | "How do I install Squad?" |
| 🟡 Medium | Technical answer is sound but involves judgment calls, OR docs exist but don't perfectly match the question, OR tone is tricky | "Can Squad work with Azure DevOps?" (yes, but setup is nuanced) |
| 🔴 Needs Review | Technical uncertainty, policy/roadmap question, potential reputational risk, author is frustrated/angry, question about unreleased features | "When will Squad support Claude?" |
**Auto-escalation rules:**
- Any mention of competitors → 🔴
- Any mention of pricing/licensing → 🔴
- Author has >3 follow-up comments without resolution → 🔴
- Question references a closed-wontfix issue → 🔴
### 3. Draft
Use the humanizer skill for every draft.
- Complete **Thread-Read Verification** before writing.
- Read the **full thread**, including all comments, before writing.
- Select the matching template from the **Template Selection Guide** and record the template ID in the review notes.
- Treat templates as reusable drafting assets: keep the structure, replace placeholders, and only improvise when the thread truly requires it.
- Validate the draft against the humanizer anti-patterns.
- Flag long threads (`>10` comments) with `⚠️`.
### Thread-Read Verification
Before drafting, PAO MUST verify complete thread coverage:
1.**Count verification:** Compare API comment count with actually-read comments. If mismatch, abort draft.
2.**Deleted comment check:** Use `gh api` timeline to detect deleted comments. If found, flag as ⚠️ in review table.
3.**Thread summary:** Include in every draft: "Thread: {N} comments, last activity {date}, {summary of key points}"
4.**Long thread flag:** If >10 comments, add ⚠️ to review table and include condensed thread summary
5.**Evidence line in review table:** Each draft row includes "Read: {N}/{total} comments" column
| Author disagrees with a decision or design | Empathetic Disagreement | T8 |
| Need more reproduction info or context | Information Request | T9 |
Use exactly one template as the base draft. Replace placeholders with issue-specific details, then apply the humanizer patterns. If the thread spans multiple signals, choose the highest-risk template and capture the nuance in the thread summary.
### Confidence Classification
| Confidence | Criteria | Example |
|-----------|----------|---------|
| 🟢 High | Answer exists in Squad docs or FAQ, similar question answered before, no technical ambiguity | "How do I install Squad?" |
| 🟡 Medium | Technical answer is sound but involves judgment calls, OR docs exist but don't perfectly match the question, OR tone is tricky | "Can Squad work with Azure DevOps?" (yes, but setup is nuanced) |
| 🔴 Needs Review | Technical uncertainty, policy/roadmap question, potential reputational risk, author is frustrated/angry, question about unreleased features | "When will Squad support Claude?" |
**Auto-escalation rules:**
- Any mention of competitors → 🔴
- Any mention of pricing/licensing → 🔴
- Author has >3 follow-up comments without resolution → 🔴
- Question references a closed-wontfix issue → 🔴
### 3. Draft
Use the humanizer skill for every draft.
- Complete **Thread-Read Verification** before writing.
- Read the **full thread**, including all comments, before writing.
- Select the matching template from the **Template Selection Guide** and record the template ID in the review notes.
- Treat templates as reusable drafting assets: keep the structure, replace placeholders, and only improvise when the thread truly requires it.
- Validate the draft against the humanizer anti-patterns.
- Flag long threads (`>10` comments) with `⚠️`.
### Thread-Read Verification
Before drafting, PAO MUST verify complete thread coverage:
1.**Count verification:** Compare API comment count with actually-read comments. If mismatch, abort draft.
2.**Deleted comment check:** Use `gh api` timeline to detect deleted comments. If found, flag as ⚠️ in review table.
3.**Thread summary:** Include in every draft: "Thread: {N} comments, last activity {date}, {summary of key points}"
4.**Long thread flag:** If >10 comments, add ⚠️ to review table and include condensed thread summary
5.**Evidence line in review table:** Each draft row includes "Read: {N}/{total} comments" column
Many developers use GitHub through an Enterprise Managed User (EMU) account at work while maintaining a personal GitHub account for open-source contributions. AI agents spawned by Squad inherit the shell's default `gh` authentication — which is usually the EMU account. This causes failures when agents try to push to personal repos, create PRs on forks, or interact with resources outside the enterprise org.
This skill teaches agents how to detect the active identity, switch contexts safely, and avoid mixing credentials across operations.
## Patterns
### Detect Current Identity
Before any GitHub operation, check which account is active:
```bash
gh auth status
```
Look for:
-`Logged in to github.com as USERNAME` — the active account
-`Token scopes: ...` — what permissions are available
- Multiple accounts will show separate entries
### Extract a Specific Account's Token
When you need to operate as a specific user (not the default):
```bash
# Get the personal account token (by username)
gh auth token --user personaluser
# Get the EMU account token
gh auth token --user corpalias_enterprise
```
**Use case:** Push to a personal fork while the default `gh` auth is the EMU account.
### Push to Personal Repos from EMU Shell
The most common scenario: your shell defaults to the EMU account, but you need to push to a personal GitHub repo.
**Why this works:**`gh auth token --user` reads from `gh`'s credential store without switching the active account. The token is used inline for a single operation and never persisted.
### Create PRs on Personal Forks
When the default `gh` context is EMU but you need to create a PR from a personal fork:
```bash
# Option 1: Use --repo flag (works if token has access)
Many developers use GitHub through an Enterprise Managed User (EMU) account at work while maintaining a personal GitHub account for open-source contributions. AI agents spawned by Squad inherit the shell's default `gh` authentication — which is usually the EMU account. This causes failures when agents try to push to personal repos, create PRs on forks, or interact with resources outside the enterprise org.
This skill teaches agents how to detect the active identity, switch contexts safely, and avoid mixing credentials across operations.
## Patterns
### Detect Current Identity
Before any GitHub operation, check which account is active:
```bash
gh auth status
```
Look for:
-`Logged in to github.com as USERNAME` — the active account
-`Token scopes: ...` — what permissions are available
- Multiple accounts will show separate entries
### Extract a Specific Account's Token
When you need to operate as a specific user (not the default):
```bash
# Get the personal account token (by username)
gh auth token --user personaluser
# Get the EMU account token
gh auth token --user corpalias_enterprise
```
**Use case:** Push to a personal fork while the default `gh` auth is the EMU account.
### Push to Personal Repos from EMU Shell
The most common scenario: your shell defaults to the EMU account, but you need to push to a personal GitHub repo.
**Why this works:**`gh auth token --user` reads from `gh`'s credential store without switching the active account. The token is used inline for a single operation and never persisted.
### Create PRs on Personal Forks
When the default `gh` context is EMU but you need to create a PR from a personal fork:
```bash
# Option 1: Use --repo flag (works if token has access)
When the coordinator routes multiple issues simultaneously (e.g., "fix bugs X, Y, and Z"), use `git worktree` to give each agent an isolated working directory. No filesystem collisions, no branch-switching overhead.
### When to Use Worktrees vs Sequential
| Scenario | Strategy |
|----------|----------|
| Single issue | Standard workflow above — no worktree needed |
| 2+ simultaneous issues in same repo | Worktrees — one per issue |
| Work spanning multiple repos | Separate clones as siblings (see Multi-Repo below) |
### Setup
From the main clone (must be on dev or any branch):
```bash
# Ensure dev is current
git fetch origin dev
# Create a worktree per issue — siblings to the main clone
All PRs target `dev` independently. Agents never interfere with each other's filesystem.
### .squad/ State in Worktrees
The `.squad/` directory exists in each worktree as a copy. This is safe because:
- `.gitattributes` declares `merge=union` on append-only files (history.md, decisions.md, logs)
- Each agent appends to its own section; union merge reconciles on PR merge to dev
- **Rule:** Never rewrite or reorder `.squad/` files in a worktree — append only
### Cleanup After Merge
After a worktree's PR is merged to dev:
```bash
# From the main clone
git worktree remove ../squad-195
git worktree prune # clean stale metadata
git branch -d squad/195-fix-stamp-bug
git push origin --delete squad/195-fix-stamp-bug
```
If a worktree was deleted manually (rm -rf), `git worktree prune` recovers the state.
---
## Multi-Repo Downstream Scenarios
When work spans multiple repositories (e.g., squad-cli changes need squad-sdk changes, or a user's app depends on squad):
### Setup
Clone downstream repos as siblings to the main repo:
```
~/work/
squad-pr/ # main repo
squad-sdk/ # downstream dependency
user-app/ # consumer project
```
Each repo gets its own issue branch following its own naming convention. If the downstream repo also uses Squad conventions, use `squad/{issue-number}-{slug}`.
### Coordinated PRs
- Create PRs in each repo independently
- Link them in PR descriptions:
```
Closes #42
**Depends on:** squad-sdk PR #17 (squad-sdk changes required for this feature)
```
- Merge order: dependencies first (e.g., squad-sdk), then dependents (e.g., squad-cli)
### Local Linking for Testing
Before pushing, verify cross-repo changes work together:
When the coordinator routes multiple issues simultaneously (e.g., "fix bugs X, Y, and Z"), use `git worktree` to give each agent an isolated working directory. No filesystem collisions, no branch-switching overhead.
### When to Use Worktrees vs Sequential
| Scenario | Strategy |
|----------|----------|
| Single issue | Standard workflow above — no worktree needed |
| 2+ simultaneous issues in same repo | Worktrees — one per issue |
| Work spanning multiple repos | Separate clones as siblings (see Multi-Repo below) |
### Setup
From the main clone (must be on dev or any branch):
```bash
# Ensure dev is current
git fetch origin dev
# Create a worktree per issue — siblings to the main clone
All PRs target `dev` independently. Agents never interfere with each other's filesystem.
### .squad/ State in Worktrees
The `.squad/` directory exists in each worktree as a copy. This is safe because:
- `.gitattributes` declares `merge=union` on append-only files (history.md, decisions.md, logs)
- Each agent appends to its own section; union merge reconciles on PR merge to dev
- **Rule:** Never rewrite or reorder `.squad/` files in a worktree — append only
### Cleanup After Merge
After a worktree's PR is merged to dev:
```bash
# From the main clone
git worktree remove ../squad-195
git worktree prune # clean stale metadata
git branch -d squad/195-fix-stamp-bug
git push origin --delete squad/195-fix-stamp-bug
```
If a worktree was deleted manually (rm -rf), `git worktree prune` recovers the state.
---
## Multi-Repo Downstream Scenarios
When work spans multiple repositories (e.g., squad-cli changes need squad-sdk changes, or a user's app depends on squad):
### Setup
Clone downstream repos as siblings to the main repo:
```
~/work/
squad-pr/ # main repo
squad-sdk/ # downstream dependency
user-app/ # consumer project
```
Each repo gets its own issue branch following its own naming convention. If the downstream repo also uses Squad conventions, use `squad/{issue-number}-{slug}`.
### Coordinated PRs
- Create PRs in each repo independently
- Link them in PR descriptions:
```
Closes #42
**Depends on:** squad-sdk PR #17 (squad-sdk changes required for this feature)
```
- Merge order: dependencies first (e.g., squad-sdk), then dependents (e.g., squad-cli)
### Local Linking for Testing
Before pushing, verify cross-repo changes work together:
description: Detect and set up account-locked gh aliases for multi-account GitHub. The AI reads this skill, detects accounts, asks the user which is personal/work, and runs the setup automatically.
description: Detect and set up account-locked gh aliases for multi-account GitHub. The AI reads this skill, detects accounts, asks the user which is personal/work, and runs the setup automatically.
description: Record final outcomes to history.md, not intermediate requests or reversed decisions
domain: documentation, team-collaboration
confidence: high
source: earned (Kobayashi v0.6.0 incident, team intervention)
---
## Context
History files (.md files tracking decisions, spawns, outcomes) are read cold by future agents. Stale or incorrect entries poison decision-making downstream. The Kobayashi incident proved this: history said "Brady decided v0.6.0" when Brady had reversed that to v0.8.17. Future spawns read the wrong truth and repeated the mistake.
## Patterns
- **Record the final outcome**, not the initial request.
- **Wait for confirmation** before writing to history — don't log intermediate states.
- **If a decision reverses**, update the entry immediately — don't leave stale data.
- **One read = one truth.** A future agent should never need to cross-reference other files to understand what actually happened.
## Examples
✓ **Correct:**
- "Migration target: v0.8.17 (initially discussed as v0.6.0, corrected by Brady)"
- "Reverted to Node 18 per Brady's explicit request on 2024-01-15"
✗ **Incorrect:**
- "Brady directed v0.6.0" (when later reversed)
- Recording what was *requested* instead of what *actually happened*
- Logging entries before outcome is confirmed
## Anti-Patterns
- Writing intermediate or "for now" states to disk
- Attributing decisions without confirming final direction
- Treating history like a draft — history is the source of truth
- Assuming readers will cross-reference or verify; they won't
---
name: history-hygiene
description: Record final outcomes to history.md, not intermediate requests or reversed decisions
domain: documentation, team-collaboration
confidence: high
source: earned (Kobayashi v0.6.0 incident, team intervention)
---
## Context
History files (.md files tracking decisions, spawns, outcomes) are read cold by future agents. Stale or incorrect entries poison decision-making downstream. The Kobayashi incident proved this: history said "Brady decided v0.6.0" when Brady had reversed that to v0.8.17. Future spawns read the wrong truth and repeated the mistake.
## Patterns
- **Record the final outcome**, not the initial request.
- **Wait for confirmation** before writing to history — don't log intermediate states.
- **If a decision reverses**, update the entry immediately — don't leave stale data.
- **One read = one truth.** A future agent should never need to cross-reference other files to understand what actually happened.
## Examples
✓ **Correct:**
- "Migration target: v0.8.17 (initially discussed as v0.6.0, corrected by Brady)"
- "Reverted to Node 18 per Brady's explicit request on 2024-01-15"
✗ **Incorrect:**
- "Brady directed v0.6.0" (when later reversed)
- Recording what was *requested* instead of what *actually happened*
- Logging entries before outcome is confirmed
## Anti-Patterns
- Writing intermediate or "for now" states to disk
- Attributing decisions without confirming final direction
- Treating history like a draft — history is the source of truth
- Assuming readers will cross-reference or verify; they won't
description: "Tone enforcement patterns for external-facing community responses"
domain: "communication, tone, community"
confidence: "low"
source: "manual (RFC #426 — PAO External Communications)"
---
## Context
Use this skill whenever PAO drafts external-facing responses for issues or discussions.
- Tone must be warm, helpful, and human-sounding — never robotic or corporate.
- Brady's constraint applies everywhere: **Humanized tone is mandatory**.
- This applies to **all external-facing content** drafted by PAO in Phase 1 issues/discussions workflows.
## Patterns
1.**Warm opening** — Start with acknowledgment ("Thanks for reporting this", "Great question!")
2.**Active voice** — "We're looking into this" not "This is being investigated"
3.**Second person** — Address the person directly ("you" not "the user")
4.**Conversational connectors** — "That said...", "Here's what we found...", "Quick note:"
5.**Specific, not vague** — "This affects the casting module in v0.8.x" not "We are aware of issues"
6.**Empathy markers** — "I can see how that would be frustrating", "Good catch!"
7.**Action-oriented closes** — "Let us know if that helps!" not "Please advise if further assistance is required"
8.**Uncertainty is OK** — "We're not 100% sure yet, but here's what we think is happening..." is better than false confidence
9.**Profanity filter** — Never include profanity, slurs, or aggressive language, even when quoting
10.**Baseline comparison** — Responses should align with tone of 5-10 "gold standard" responses (>80% similarity threshold)
11.**Empathetic disagreement** — "We hear you. That's a fair concern." before explaining the reasoning
12.**Information request** — Ask for specific details, not open-ended "can you provide more info?"
13.**No link-dumping** — Don't just paste URLs. Provide context: "Check out the [getting started guide](url) — specifically the section on routing" not just a bare link
## Examples
### 1. Welcome
```text
Hey {author}! Welcome to Squad 👋 Thanks for opening this.
{substantive response}
Let us know if you have questions — happy to help!
```
### 2. Troubleshooting
```text
Thanks for the detailed report, {author}!
Here's what we think is happening: {explanation}
{steps or workaround}
Let us know if that helps, or if you're seeing something different.
```
### 3. Feature guidance
```text
Great question! {context on current state}
{guidance or workaround}
We've noted this as a potential improvement — {tracking info if applicable}.
```
### 4. Redirect
```text
Thanks for reaching out! This one is actually better suited for {correct location}.
{brief explanation of why}
Feel free to open it there — they'll be able to help!
```
### 5. Acknowledgment
```text
Good catch, {author}. We've confirmed this is a real issue.
{what we know so far}
We'll update this thread when we have a fix. Thanks for flagging it!
```
### 6. Closing
```text
This should be resolved in {version/PR}! 🎉
{brief summary of what changed}
Thanks for reporting this, {author} — it made Squad better.
```
### 7. Technical uncertainty
```text
Interesting find, {author}. We're not 100% sure what's causing this yet.
Here's what we've ruled out: {list}
We'd love more context if you have it — {specific ask}.
We'll dig deeper and update this thread.
```
## Anti-Patterns
- ❌ Corporate speak: "We appreciate your patience as we investigate this matter"
- ❌ Marketing hype: "Squad is the BEST way to..." or "This amazing feature..."
- ❌ Passive voice: "It has been determined that..." or "The issue is being tracked"
- ❌ Dismissive: "This works as designed" without empathy
- ❌ Over-promising: "We'll ship this next week" without commitment from the team
- ❌ Empty acknowledgment: "Thanks for your feedback" with no substance
- ❌ Robot signatures: "Best regards, PAO" or "Sincerely, The Squad Team"
- ❌ Excessive emoji: More than 1-2 emoji per response
- ❌ Quoting profanity: Even when the original issue contains it, paraphrase instead
- ❌ Link-dumping: Pasting URLs without context ("See: https://...")
- ❌ Open-ended info requests: "Can you provide more information?" without specifying what information
---
name: "humanizer"
description: "Tone enforcement patterns for external-facing community responses"
domain: "communication, tone, community"
confidence: "low"
source: "manual (RFC #426 — PAO External Communications)"
---
## Context
Use this skill whenever PAO drafts external-facing responses for issues or discussions.
- Tone must be warm, helpful, and human-sounding — never robotic or corporate.
- Brady's constraint applies everywhere: **Humanized tone is mandatory**.
- This applies to **all external-facing content** drafted by PAO in Phase 1 issues/discussions workflows.
## Patterns
1.**Warm opening** — Start with acknowledgment ("Thanks for reporting this", "Great question!")
2.**Active voice** — "We're looking into this" not "This is being investigated"
3.**Second person** — Address the person directly ("you" not "the user")
4.**Conversational connectors** — "That said...", "Here's what we found...", "Quick note:"
5.**Specific, not vague** — "This affects the casting module in v0.8.x" not "We are aware of issues"
6.**Empathy markers** — "I can see how that would be frustrating", "Good catch!"
7.**Action-oriented closes** — "Let us know if that helps!" not "Please advise if further assistance is required"
8.**Uncertainty is OK** — "We're not 100% sure yet, but here's what we think is happening..." is better than false confidence
9.**Profanity filter** — Never include profanity, slurs, or aggressive language, even when quoting
10.**Baseline comparison** — Responses should align with tone of 5-10 "gold standard" responses (>80% similarity threshold)
11.**Empathetic disagreement** — "We hear you. That's a fair concern." before explaining the reasoning
12.**Information request** — Ask for specific details, not open-ended "can you provide more info?"
13.**No link-dumping** — Don't just paste URLs. Provide context: "Check out the [getting started guide](url) — specifically the section on routing" not just a bare link
## Examples
### 1. Welcome
```text
Hey {author}! Welcome to Squad 👋 Thanks for opening this.
{substantive response}
Let us know if you have questions — happy to help!
```
### 2. Troubleshooting
```text
Thanks for the detailed report, {author}!
Here's what we think is happening: {explanation}
{steps or workaround}
Let us know if that helps, or if you're seeing something different.
```
### 3. Feature guidance
```text
Great question! {context on current state}
{guidance or workaround}
We've noted this as a potential improvement — {tracking info if applicable}.
```
### 4. Redirect
```text
Thanks for reaching out! This one is actually better suited for {correct location}.
{brief explanation of why}
Feel free to open it there — they'll be able to help!
```
### 5. Acknowledgment
```text
Good catch, {author}. We've confirmed this is a real issue.
{what we know so far}
We'll update this thread when we have a fix. Thanks for flagging it!
```
### 6. Closing
```text
This should be resolved in {version/PR}! 🎉
{brief summary of what changed}
Thanks for reporting this, {author} — it made Squad better.
```
### 7. Technical uncertainty
```text
Interesting find, {author}. We're not 100% sure what's causing this yet.
Here's what we've ruled out: {list}
We'd love more context if you have it — {specific ask}.
We'll dig deeper and update this thread.
```
## Anti-Patterns
- ❌ Corporate speak: "We appreciate your patience as we investigate this matter"
- ❌ Marketing hype: "Squad is the BEST way to..." or "This amazing feature..."
- ❌ Passive voice: "It has been determined that..." or "The issue is being tracked"
- ❌ Dismissive: "This works as designed" without empathy
- ❌ Over-promising: "We'll ship this next week" without commitment from the team
- ❌ Empty acknowledgment: "Thanks for your feedback" with no substance
- ❌ Robot signatures: "Best regards, PAO" or "Sincerely, The Squad Team"
- ❌ Excessive emoji: More than 1-2 emoji per response
- ❌ Quoting profanity: Even when the original issue contains it, paraphrase instead
- ❌ Link-dumping: Pasting URLs without context ("See: https://...")
- ❌ Open-ended info requests: "Can you provide more information?" without specifying what information
description: "Confirm team roster with selectable menu"
when: "Phase 1 proposal — requires explicit user confirmation"
---
## Context
Init Mode activates when `.squad/team.md` does not exist, or exists but has zero roster entries under `## Members`. The coordinator proposes a team (Phase 1), waits for user confirmation, then creates the team structure (Phase 2).
## Patterns
### Phase 1: Propose the Team
No team exists yet. Propose one — but **DO NOT create any files until the user confirms.**
1.**Identify the user.** Run `git config user.name` to learn who you're working with. Use their name in conversation (e.g., *"Hey Brady, what are you building?"*). Store their name (NOT email) in `team.md` under Project Context. **Never read or store `git config user.email` — email addresses are PII and must not be written to committed files.**
2. Ask: *"What are you building? (language, stack, what it does)"*
3.**Cast the team.** Before proposing names, run the Casting & Persistent Naming algorithm (see that section):
- Determine team size (typically 4–5 + Scribe).
- Determine assignment shape from the user's project description.
- Derive resonance signals from the session and repo context.
- Select a universe. If the universe is custom, allocate character names from that universe based on the related list found in the `.squad/templates/casting/` directory. Prefer custom universes when available.
- Scribe is always "Scribe" — exempt from casting.
- Ralph is always "Ralph" — exempt from casting.
4. Propose the team with their cast names. Example (names will vary per cast):
```
🏗️ {CastName1} — Lead Scope, decisions, code review
⚛️ {CastName2} — Frontend Dev React, UI, components
🔧 {CastName3} — Backend Dev APIs, database, services
🔄 Ralph — (monitor) Work queue, backlog, keep-alive
```
5. Use the `ask_user` tool to confirm the roster. Provide choices so the user sees a selectable menu:
- **question:** *"Look right?"*
- **choices:** `["Yes, hire this team", "Add someone", "Change a role"]`
**⚠️ STOP. Your response ENDS here. Do NOT proceed to Phase 2. Do NOT create any files or directories. Wait for the user's reply.**
### Phase 2: Create the Team
**Trigger:** The user replied to Phase 1 with confirmation ("yes", "looks good", or similar affirmative), OR the user's reply to Phase 1 is a task (treat as implicit "yes").
> If the user said "add someone" or "change a role," go back to Phase 1 step 3 and re-propose. Do NOT enter Phase 2 until the user confirms.
6. Create the `.squad/` directory structure (see `.squad/templates/` for format guides or use the standard structure: team.md, routing.md, ceremonies.md, decisions.md, decisions/inbox/, casting/, agents/, orchestration-log/, skills/, log/).
**Casting state initialization:** Copy `.squad/templates/casting-policy.json` to `.squad/casting/policy.json` (or create from defaults). Create `registry.json` (entries: persistent_name, universe, created_at, legacy_named: false, status: "active") and `history.json` (first assignment snapshot with unique assignment_id).
**Seeding:** Each agent's `history.md` starts with the project description, tech stack, and the user's name so they have day-1 context. Agent folder names are the cast name in lowercase (e.g., `.squad/agents/ripley/`). The Scribe's charter includes maintaining `decisions.md` and cross-agent context sharing.
**Team.md structure:**`team.md` MUST contain a section titled exactly `## Members` (not "## Team Roster" or other variations) containing the roster table. This header is hard-coded in GitHub workflows (`squad-heartbeat.yml`, `squad-issue-assign.yml`, `squad-triage.yml`, `sync-squad-labels.yml`) for label automation. If the header is missing or titled differently, label routing breaks.
**Merge driver for append-only files:** Create or update `.gitattributes` at the repo root to enable conflict-free merging of `.squad/` state across branches:
```
.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union
```
The `union` merge driver keeps all lines from both sides, which is correct for append-only files. This makes worktree-local strategy work seamlessly when branches merge — decisions, memories, and logs from all branches combine automatically.
7. Say: *"✅ Team hired. Try: '{FirstCastName}, set up the project structure'"*
8.**Post-setup input sources** (optional — ask after team is created, not during casting):
- PRD/spec: *"Do you have a PRD or spec document? (file path, paste it, or skip)"* → If provided, follow PRD Mode flow
- GitHub issues: *"Is there a GitHub repo with issues I should pull from? (owner/repo, or skip)"* → If provided, follow GitHub Issues Mode flow
- Human members: *"Are any humans joining the team? (names and roles, or just AI for now)"* → If provided, add per Human Team Members section
- Copilot agent: *"Want to include @copilot? It can pick up issues autonomously. (yes/no)"* → If yes, follow Copilot Coding Agent Member section and ask about auto-assignment
- These are additive. Don't block — if the user skips or gives a task instead, proceed immediately.
## Examples
**Example flow:**
1. Coordinator detects no team.md → Init Mode
2. Runs `git config user.name` → "Brady"
3. Asks: *"Hey Brady, what are you building?"*
4. User: *"TypeScript CLI tool with GitHub API integration"*
description: "Confirm team roster with selectable menu"
when: "Phase 1 proposal — requires explicit user confirmation"
---
## Context
Init Mode activates when `.squad/team.md` does not exist, or exists but has zero roster entries under `## Members`. The coordinator proposes a team (Phase 1), waits for user confirmation, then creates the team structure (Phase 2).
## Patterns
### Phase 1: Propose the Team
No team exists yet. Propose one — but **DO NOT create any files until the user confirms.**
1.**Identify the user.** Run `git config user.name` to learn who you're working with. Use their name in conversation (e.g., *"Hey Brady, what are you building?"*). Store their name (NOT email) in `team.md` under Project Context. **Never read or store `git config user.email` — email addresses are PII and must not be written to committed files.**
2. Ask: *"What are you building? (language, stack, what it does)"*
3.**Cast the team.** Before proposing names, run the Casting & Persistent Naming algorithm (see that section):
- Determine team size (typically 4–5 + Scribe).
- Determine assignment shape from the user's project description.
- Derive resonance signals from the session and repo context.
- Select a universe. If the universe is custom, allocate character names from that universe based on the related list found in the `.squad/templates/casting/` directory. Prefer custom universes when available.
- Scribe is always "Scribe" — exempt from casting.
- Ralph is always "Ralph" — exempt from casting.
4. Propose the team with their cast names. Example (names will vary per cast):
```
🏗️ {CastName1} — Lead Scope, decisions, code review
⚛️ {CastName2} — Frontend Dev React, UI, components
🔧 {CastName3} — Backend Dev APIs, database, services
🔄 Ralph — (monitor) Work queue, backlog, keep-alive
```
5. Use the `ask_user` tool to confirm the roster. Provide choices so the user sees a selectable menu:
- **question:** *"Look right?"*
- **choices:** `["Yes, hire this team", "Add someone", "Change a role"]`
**⚠️ STOP. Your response ENDS here. Do NOT proceed to Phase 2. Do NOT create any files or directories. Wait for the user's reply.**
### Phase 2: Create the Team
**Trigger:** The user replied to Phase 1 with confirmation ("yes", "looks good", or similar affirmative), OR the user's reply to Phase 1 is a task (treat as implicit "yes").
> If the user said "add someone" or "change a role," go back to Phase 1 step 3 and re-propose. Do NOT enter Phase 2 until the user confirms.
6. Create the `.squad/` directory structure (see `.squad/templates/` for format guides or use the standard structure: team.md, routing.md, ceremonies.md, decisions.md, decisions/inbox/, casting/, agents/, orchestration-log/, skills/, log/).
**Casting state initialization:** Copy `.squad/templates/casting-policy.json` to `.squad/casting/policy.json` (or create from defaults). Create `registry.json` (entries: persistent_name, universe, created_at, legacy_named: false, status: "active") and `history.json` (first assignment snapshot with unique assignment_id).
**Seeding:** Each agent's `history.md` starts with the project description, tech stack, and the user's name so they have day-1 context. Agent folder names are the cast name in lowercase (e.g., `.squad/agents/ripley/`). The Scribe's charter includes maintaining `decisions.md` and cross-agent context sharing.
**Team.md structure:**`team.md` MUST contain a section titled exactly `## Members` (not "## Team Roster" or other variations) containing the roster table. This header is hard-coded in GitHub workflows (`squad-heartbeat.yml`, `squad-issue-assign.yml`, `squad-triage.yml`, `sync-squad-labels.yml`) for label automation. If the header is missing or titled differently, label routing breaks.
**Merge driver for append-only files:** Create or update `.gitattributes` at the repo root to enable conflict-free merging of `.squad/` state across branches:
```
.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union
```
The `union` merge driver keeps all lines from both sides, which is correct for append-only files. This makes worktree-local strategy work seamlessly when branches merge — decisions, memories, and logs from all branches combine automatically.
7. Say: *"✅ Team hired. Try: '{FirstCastName}, set up the project structure'"*
8.**Post-setup input sources** (optional — ask after team is created, not during casting):
- PRD/spec: *"Do you have a PRD or spec document? (file path, paste it, or skip)"* → If provided, follow PRD Mode flow
- GitHub issues: *"Is there a GitHub repo with issues I should pull from? (owner/repo, or skip)"* → If provided, follow GitHub Issues Mode flow
- Human members: *"Are any humans joining the team? (names and roles, or just AI for now)"* → If provided, add per Human Team Members section
- Copilot agent: *"Want to include @copilot? It can pick up issues autonomously. (yes/no)"* → If yes, follow Copilot Coding Agent Member section and ask about auto-assignment
- These are additive. Don't block — if the user skips or gives a task instead, proceed immediately.
## Examples
**Example flow:**
1. Coordinator detects no team.md → Init Mode
2. Runs `git config user.name` → "Brady"
3. Asks: *"Hey Brady, what are you building?"*
4. User: *"TypeScript CLI tool with GitHub API integration"*
> Determines which LLM model to use for each agent spawn.
## SCOPE
✅ THIS SKILL PRODUCES:
- A resolved `model` parameter for every `task` tool call
- Persistent model preferences in `.squad/config.json`
- Spawn acknowledgments that include the resolved model
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Model performance benchmarks
- Cost reports or billing artifacts
## Context
Squad supports 18+ models across three tiers (premium, standard, fast). The coordinator must select the right model for each agent spawn. Users can set persistent preferences that survive across sessions.
## 5-Layer Model Resolution Hierarchy
Resolution is **first-match-wins** — the highest layer with a value wins.
**Key principle:** Layer 0 (persistent config) beats everything. If the user said "always use opus" and it was saved to config.json, every agent gets opus regardless of role or task type. This is intentional — the user explicitly chose quality over cost.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `defaultModel` field — if present, this is the Layer 0 override for all spawns
3. CHECK for `agentModelOverrides` field — if present, these are per-agent Layer 0a overrides
4. STORE both values in session context for the duration
### On Every Agent Spawn
1. CHECK Layer 0a: Is there an `agentModelOverrides.{agentName}` in config.json? → Use it.
2. CHECK Layer 0b: Is there a `defaultModel` in config.json? → Use it.
3. CHECK Layer 1: Did the user give a session directive? → Use it.
4. CHECK Layer 2: Does the agent's charter have a `## Model` section? → Use it.
**Never fall UP in tier.** A fast task won't land on a premium model via fallback.
# Model Selection
> Determines which LLM model to use for each agent spawn.
## SCOPE
✅ THIS SKILL PRODUCES:
- A resolved `model` parameter for every `task` tool call
- Persistent model preferences in `.squad/config.json`
- Spawn acknowledgments that include the resolved model
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Model performance benchmarks
- Cost reports or billing artifacts
## Context
Squad supports 18+ models across three tiers (premium, standard, fast). The coordinator must select the right model for each agent spawn. Users can set persistent preferences that survive across sessions.
## 5-Layer Model Resolution Hierarchy
Resolution is **first-match-wins** — the highest layer with a value wins.
**Key principle:** Layer 0 (persistent config) beats everything. If the user said "always use opus" and it was saved to config.json, every agent gets opus regardless of role or task type. This is intentional — the user explicitly chose quality over cost.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `defaultModel` field — if present, this is the Layer 0 override for all spawns
3. CHECK for `agentModelOverrides` field — if present, these are per-agent Layer 0a overrides
4. STORE both values in session context for the duration
### On Every Agent Spawn
1. CHECK Layer 0a: Is there an `agentModelOverrides.{agentName}` in config.json? → Use it.
2. CHECK Layer 0b: Is there a `defaultModel` in config.json? → Use it.
3. CHECK Layer 1: Did the user give a session directive? → Use it.
4. CHECK Layer 2: Does the agent's charter have a `## Model` section? → Use it.
A personal squad is a user-level collection of AI agents that travel with you across projects. Unlike project agents (defined in a project's `.squad/` directory), personal agents live in your global config directory and are automatically discovered when you start a squad session.
## Directory Structure
```
~/.config/squad/personal-squad/ # Linux/macOS
%APPDATA%/squad/personal-squad/ # Windows
├── agents/
│ ├── {agent-name}/
│ │ ├── charter.md
│ │ └── history.md
│ └── ...
└── config.json # Optional: personal squad config
```
## How It Works
1.**Ambient Discovery:** When Squad starts a session, it checks for a personal squad directory
2.**Merge:** Personal agents are merged into the session cast alongside project agents
3.**Ghost Protocol:** Personal agents can read project state but not write to it
4.**Kill Switch:** Set `SQUAD_NO_PERSONAL=1` to disable ambient discovery
## Commands
-`squad personal init` — Bootstrap a personal squad directory
-`squad personal list` — List your personal agents
-`squad personal add {name} --role {role}` — Add a personal agent
-`squad personal remove {name}` — Remove a personal agent
-`squad cast` — Show the current session cast (project + personal)
## Ghost Protocol
See `templates/ghost-protocol.md` for the full rules. Key points:
- Personal agents advise; project agents execute
- No writes to project `.squad/` state
- Transparent origin tagging in logs
- Project agents take precedence on conflicts
## Configuration
Optional `config.json` in the personal squad directory:
```json
{
"defaultModel":"auto",
"ghostProtocol":true,
"agents":{}
}
```
## Environment Variables
-`SQUAD_NO_PERSONAL` — Set to any value to disable personal squad discovery
-`SQUAD_PERSONAL_DIR` — Override the default personal squad directory path
# Personal Squad — Skill Document
## What is a Personal Squad?
A personal squad is a user-level collection of AI agents that travel with you across projects. Unlike project agents (defined in a project's `.squad/` directory), personal agents live in your global config directory and are automatically discovered when you start a squad session.
## Directory Structure
```
~/.config/squad/personal-squad/ # Linux/macOS
%APPDATA%/squad/personal-squad/ # Windows
├── agents/
│ ├── {agent-name}/
│ │ ├── charter.md
│ │ └── history.md
│ └── ...
└── config.json # Optional: personal squad config
```
## How It Works
1.**Ambient Discovery:** When Squad starts a session, it checks for a personal squad directory
2.**Merge:** Personal agents are merged into the session cast alongside project agents
3.**Ghost Protocol:** Personal agents can read project state but not write to it
4.**Kill Switch:** Set `SQUAD_NO_PERSONAL=1` to disable ambient discovery
## Commands
-`squad personal init` — Bootstrap a personal squad directory
-`squad personal list` — List your personal agents
-`squad personal add {name} --role {role}` — Add a personal agent
-`squad personal remove {name}` — Remove a personal agent
-`squad cast` — Show the current session cast (project + personal)
## Ghost Protocol
See `templates/ghost-protocol.md` for the full rules. Key points:
- Personal agents advise; project agents execute
- No writes to project `.squad/` state
- Transparent origin tagging in logs
- Project agents take precedence on conflicts
## Configuration
Optional `config.json` in the personal squad directory:
```json
{
"defaultModel":"auto",
"ghostProtocol":true,
"agents":{}
}
```
## Environment Variables
-`SQUAD_NO_PERSONAL` — Set to any value to disable personal squad discovery
-`SQUAD_PERSONAL_DIR` — Override the default personal squad directory path
description: "Core conventions and patterns for this codebase"
domain: "project-conventions"
confidence: "medium"
source: "template"
---
## Context
> **This is a starter template.** Replace the placeholder patterns below with your actual project conventions. Skills train agents on codebase-specific practices — accurate documentation here improves agent output quality.
## Patterns
### [Pattern Name]
Describe a key convention or practice used in this codebase. Be specific about what to do and why.
### Error Handling
<!-- Example: How does your project handle errors? -->
<!-- - Use try/catch with specific error types? -->
<!-- - Log to a specific service? -->
<!-- - Return error objects vs throwing? -->
### Testing
<!-- Example: What test framework? Where do tests live? How to run them? -->
<!-- - Test framework: Jest/Vitest/node:test/etc. -->
<!-- - Test location: test/, __tests__/, *.test.ts, etc. -->
// Add code examples that demonstrate your conventions
```
## Anti-Patterns
<!-- List things to avoid in this codebase -->
- **[Anti-pattern]** — Explanation of what not to do and why.
---
name: "project-conventions"
description: "Core conventions and patterns for this codebase"
domain: "project-conventions"
confidence: "medium"
source: "template"
---
## Context
> **This is a starter template.** Replace the placeholder patterns below with your actual project conventions. Skills train agents on codebase-specific practices — accurate documentation here improves agent output quality.
## Patterns
### [Pattern Name]
Describe a key convention or practice used in this codebase. Be specific about what to do and why.
### Error Handling
<!-- Example: How does your project handle errors? -->
<!-- - Use try/catch with specific error types? -->
<!-- - Log to a specific service? -->
<!-- - Return error objects vs throwing? -->
### Testing
<!-- Example: What test framework? Where do tests live? How to run them? -->
<!-- - Test framework: Jest/Vitest/node:test/etc. -->
<!-- - Test location: test/, __tests__/, *.test.ts, etc. -->
description: "Step-by-step release checklist for Squad — prevents v0.8.22-style disasters"
domain: "release-management"
confidence: "high"
source: "team-decision"
---
## Context
This is the **definitive release runbook** for Squad. Born from the v0.8.22 release disaster (4-part semver mangled by npm, draft release never triggered publish, wrong NPM_TOKEN type, 6+ hours of broken `latest` dist-tag).
**Rule:** No agent releases Squad without following this checklist. No exceptions. No improvisation.
---
## Pre-Release Validation
Before starting ANY release work, validate the following:
### 1. Version Number Validation
**Rule:** Only 3-part semver (major.minor.patch) or prerelease (major.minor.patch-tag.N) are valid. 4-part versions (0.8.21.4) are NOT valid semver and npm will mangle them.
- [ ] dev branch has next preview version: `git show dev:package.json | grep version` shows next preview
---
## Post-Mortem Reference
This skill was created after the v0.8.22 release disaster. Full retrospective: `.squad/decisions/inbox/keaton-v0822-retrospective.md`
**Key learnings:**
1. No release without a runbook = improvisation = disaster
2. Semver validation is mandatory — 4-part versions break npm
3. NPM_TOKEN type matters — User tokens with 2FA fail in CI
4. Draft releases are a footgun — they don't trigger automation
5. Retry logic is essential — npm propagation takes time
**Never again.**
---
name: "release-process"
description: "Step-by-step release checklist for Squad — prevents v0.8.22-style disasters"
domain: "release-management"
confidence: "high"
source: "team-decision"
---
## Context
This is the **definitive release runbook** for Squad. Born from the v0.8.22 release disaster (4-part semver mangled by npm, draft release never triggered publish, wrong NPM_TOKEN type, 6+ hours of broken `latest` dist-tag).
**Rule:** No agent releases Squad without following this checklist. No exceptions. No improvisation.
---
## Pre-Release Validation
Before starting ANY release work, validate the following:
### 1. Version Number Validation
**Rule:** Only 3-part semver (major.minor.patch) or prerelease (major.minor.patch-tag.N) are valid. 4-part versions (0.8.21.4) are NOT valid semver and npm will mangle them.
description: "Team-wide charter and history optimization through skill extraction"
domain: "team-optimization"
confidence: "high"
source: "manual — Brady directive to reduce per-agent context overhead"
---
## Context
When the coordinator hears "team, reskill" (or similar: "optimize context", "slim down charters"), trigger a team-wide optimization pass. The goal: reduce per-agent context consumption by extracting shared patterns from charters and histories into reusable skills.
This is a periodic maintenance activity. Run whenever charter/history bloat is suspected.
## Process
### Step 1: Audit
Read all agent charters and histories. Measure byte sizes. Identify:
- **Boilerplate** — sections repeated across ≥3 charters with <10% variation (collaboration, model, boundaries template)
### Minimal Charter Template (target format after reskill)
```
# {Name} — {Role}
> {Tagline — one sentence capturing voice and philosophy}
## Identity
- **Name:** {Name}
- **Role:** {Role}
- **Expertise:** {comma-separated list}
## What I Own
- {bullet list of owned artifacts/domains}
## How I Work
- {unique patterns and principles — NOT boilerplate}
## Boundaries
**I handle:** {domain list}
**I don't handle:** {explicit exclusions}
## Model
Preferred: {model}
```
### Skill Extraction Threshold
- **1 charter** → leave in charter (unique to that agent)
- **2 charters** → consider extracting if >500 bytes of overlap
- **3+ charters** → always extract to a shared skill
## Anti-Patterns
- Don't delete unique per-agent identity or domain-specific knowledge
- Don't create skills for content only one agent uses
- Don't merge unrelated patterns into a single mega-skill
- Don't remove Model preference line (coordinator needs it for model selection)
- Don't touch `.squad/decisions.md` during reskill
- Don't remove the tagline blockquote — it's the charter's soul in one line
---
name: "reskill"
description: "Team-wide charter and history optimization through skill extraction"
domain: "team-optimization"
confidence: "high"
source: "manual — Brady directive to reduce per-agent context overhead"
---
## Context
When the coordinator hears "team, reskill" (or similar: "optimize context", "slim down charters"), trigger a team-wide optimization pass. The goal: reduce per-agent context consumption by extracting shared patterns from charters and histories into reusable skills.
This is a periodic maintenance activity. Run whenever charter/history bloat is suspected.
## Process
### Step 1: Audit
Read all agent charters and histories. Measure byte sizes. Identify:
- **Boilerplate** — sections repeated across ≥3 charters with <10% variation (collaboration, model, boundaries template)
description: "Reviewer rejection workflow and strict lockout semantics"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
When a team member has a **Reviewer** role (e.g., Tester, Code Reviewer, Lead), they may approve or reject work from other agents. On rejection, the coordinator enforces strict lockout rules to ensure the original author does NOT self-revise. This prevents defensive feedback loops and ensures independent review.
## Patterns
### Reviewer Rejection Protocol
When a team member has a **Reviewer** role:
- Reviewers may **approve** or **reject** work from other agents.
- On **rejection**, the Reviewer may choose ONE of:
1.**Reassign:** Require a *different* agent to do the revision (not the original author).
2.**Escalate:** Require a *new* agent be spawned with specific expertise.
- The Coordinator MUST enforce this. If the Reviewer says "someone else should fix this," the original agent does NOT get to self-revise.
- If the Reviewer approves, work proceeds normally.
### Strict Lockout Semantics
When an artifact is **rejected** by a Reviewer:
1.**The original author is locked out.** They may NOT produce the next version of that artifact. No exceptions.
2.**A different agent MUST own the revision.** The Coordinator selects the revision author based on the Reviewer's recommendation (reassign or escalate).
3.**The Coordinator enforces this mechanically.** Before spawning a revision agent, the Coordinator MUST verify that the selected agent is NOT the original author. If the Reviewer names the original author as the fix agent, the Coordinator MUST refuse and ask the Reviewer to name a different agent.
4.**The locked-out author may NOT contribute to the revision** in any form — not as a co-author, advisor, or pair. The revision must be independently produced.
5.**Lockout scope:** The lockout applies to the specific artifact that was rejected. The original author may still work on other unrelated artifacts.
6.**Lockout duration:** The lockout persists for that revision cycle. If the revision is also rejected, the same rule applies again — the revision author is now also locked out, and a third agent must revise.
7.**Deadlock handling:** If all eligible agents have been locked out of an artifact, the Coordinator MUST escalate to the user rather than re-admitting a locked-out author.
## Examples
**Example 1: Reassign after rejection**
1. Fenster writes authentication module
2. Hockney (Tester) reviews → rejects: "Error handling is missing. Verbal should fix this."
3. Coordinator: Fenster is now locked out of this artifact
4. Coordinator spawns Verbal to revise the authentication module
4. Coordinator spawns new agent (or existing TS expert) to revise
5. New agent produces v2
6. Keaton reviews v2
**Example 3: Deadlock handling**
1. Fenster writes module → rejected
2. Verbal revises → rejected
3. Hockney revises → rejected
4. All 3 eligible agents are now locked out
5. Coordinator: "All eligible agents have been locked out. Escalating to user: [artifact details]"
**Example 4: Reviewer accidentally names original author**
1. Fenster writes module → rejected
2. Hockney says: "Fenster should fix the error handling"
3. Coordinator: "Fenster is locked out as the original author. Please name a different agent."
4. Hockney: "Verbal, then"
5. Coordinator spawns Verbal
## Anti-Patterns
- ❌ Allowing the original author to self-revise after rejection
- ❌ Treating the locked-out author as an "advisor" or "co-author" on the revision
- ❌ Re-admitting a locked-out author when deadlock occurs (must escalate to user)
- ❌ Applying lockout across unrelated artifacts (scope is per-artifact)
- ❌ Accepting the Reviewer's assignment when they name the original author (must refuse and ask for a different agent)
- ❌ Clearing lockout before the revision is approved (lockout persists through revision cycle)
- ❌ Skipping verification that the revision agent is not the original author
---
name: "reviewer-protocol"
description: "Reviewer rejection workflow and strict lockout semantics"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
When a team member has a **Reviewer** role (e.g., Tester, Code Reviewer, Lead), they may approve or reject work from other agents. On rejection, the coordinator enforces strict lockout rules to ensure the original author does NOT self-revise. This prevents defensive feedback loops and ensures independent review.
## Patterns
### Reviewer Rejection Protocol
When a team member has a **Reviewer** role:
- Reviewers may **approve** or **reject** work from other agents.
- On **rejection**, the Reviewer may choose ONE of:
1.**Reassign:** Require a *different* agent to do the revision (not the original author).
2.**Escalate:** Require a *new* agent be spawned with specific expertise.
- The Coordinator MUST enforce this. If the Reviewer says "someone else should fix this," the original agent does NOT get to self-revise.
- If the Reviewer approves, work proceeds normally.
### Strict Lockout Semantics
When an artifact is **rejected** by a Reviewer:
1.**The original author is locked out.** They may NOT produce the next version of that artifact. No exceptions.
2.**A different agent MUST own the revision.** The Coordinator selects the revision author based on the Reviewer's recommendation (reassign or escalate).
3.**The Coordinator enforces this mechanically.** Before spawning a revision agent, the Coordinator MUST verify that the selected agent is NOT the original author. If the Reviewer names the original author as the fix agent, the Coordinator MUST refuse and ask the Reviewer to name a different agent.
4.**The locked-out author may NOT contribute to the revision** in any form — not as a co-author, advisor, or pair. The revision must be independently produced.
5.**Lockout scope:** The lockout applies to the specific artifact that was rejected. The original author may still work on other unrelated artifacts.
6.**Lockout duration:** The lockout persists for that revision cycle. If the revision is also rejected, the same rule applies again — the revision author is now also locked out, and a third agent must revise.
7.**Deadlock handling:** If all eligible agents have been locked out of an artifact, the Coordinator MUST escalate to the user rather than re-admitting a locked-out author.
## Examples
**Example 1: Reassign after rejection**
1. Fenster writes authentication module
2. Hockney (Tester) reviews → rejects: "Error handling is missing. Verbal should fix this."
3. Coordinator: Fenster is now locked out of this artifact
4. Coordinator spawns Verbal to revise the authentication module
Spawned agents have read access to the entire repository, including `.env` files containing live credentials. If an agent reads secrets and writes them to `.squad/` files (decisions, logs, history), Scribe auto-commits them to git, exposing them in remote history. This skill codifies absolute prohibitions and safe alternatives.
## Patterns
### Prohibited File Reads
**NEVER read these files:**
-`.env` (production secrets)
-`.env.local` (local dev secrets)
-`.env.production` (production environment)
-`.env.development` (development environment)
-`.env.staging` (staging environment)
-`.env.test` (test environment with real credentials)
- Any file matching `.env.*` UNLESS explicitly allowed (see below)
**Allowed alternatives:**
-`.env.example` (safe — contains placeholder values, no real secrets)
Spawned agents have read access to the entire repository, including `.env` files containing live credentials. If an agent reads secrets and writes them to `.squad/` files (decisions, logs, history), Scribe auto-commits them to git, exposing them in remote history. This skill codifies absolute prohibitions and safe alternatives.
## Patterns
### Prohibited File Reads
**NEVER read these files:**
-`.env` (production secrets)
-`.env.local` (local dev secrets)
-`.env.production` (production environment)
-`.env.development` (development environment)
-`.env.staging` (staging environment)
-`.env.test` (test environment with real credentials)
- Any file matching `.env.*` UNLESS explicitly allowed (see below)
**Allowed alternatives:**
-`.env.example` (safe — contains placeholder values, no real secrets)
description: "Find and resume interrupted Copilot CLI sessions using session_store queries"
domain: "workflow-recovery"
confidence: "high"
source: "earned"
tools:
- name: "sql"
description: "Query session_store database for past session history"
when: "Always — session_store is the source of truth for session history"
---
## Context
Squad agents run in Copilot CLI sessions that can be interrupted — terminal crashes, network drops, machine restarts, or accidental window closes. When this happens, in-progress work may be left in a partially-completed state: branches with uncommitted changes, issues marked in-progress with no active agent, or checkpoints that were never finalized.
Copilot CLI stores session history in a SQLite database called `session_store` (read-only, accessed via the `sql` tool with `database: "session_store"`). This skill teaches agents how to query that store to detect interrupted sessions and resume work.
## Patterns
### 1. Find Recent Sessions
Query the `sessions` table filtered by time window. Include the last checkpoint to understand where the session stopped:
Cross-reference with `gh issue list --label "status:in-progress"` to find issues that are marked in-progress but have no active session.
### 7. Resume a Session
Once you have the session ID:
```bash
# Resume directly
copilot --resume SESSION_ID
```
## Examples
**Recovering from a crash during PR creation:**
1. Query recent sessions filtered by branch name
2. Find the session that was working on the PR
3. Check its last checkpoint — was the code committed? Was the PR created?
4. Resume or manually complete the remaining steps
**Finding yesterday's work on a feature:**
1. Use FTS5 search with feature keywords
2. Filter to the relevant working directory
3. Review checkpoint progress to see how far the session got
4. Resume if work remains, or start fresh with the context
## Anti-Patterns
- ❌ Searching by partial session IDs — always use full UUIDs
- ❌ Resuming sessions that completed successfully — they have no pending work
- ❌ Using `MATCH` with special characters without escaping — wrap paths in double quotes
- ❌ Skipping the automated-session filter — high-volume automated sessions will flood results
- ❌ Assuming FTS5 is semantic search — it's keyword-based; always expand queries with synonyms
- ❌ Ignoring checkpoint data — checkpoints show exactly where the session stopped
---
name: "session-recovery"
description: "Find and resume interrupted Copilot CLI sessions using session_store queries"
domain: "workflow-recovery"
confidence: "high"
source: "earned"
tools:
- name: "sql"
description: "Query session_store database for past session history"
when: "Always — session_store is the source of truth for session history"
---
## Context
Squad agents run in Copilot CLI sessions that can be interrupted — terminal crashes, network drops, machine restarts, or accidental window closes. When this happens, in-progress work may be left in a partially-completed state: branches with uncommitted changes, issues marked in-progress with no active agent, or checkpoints that were never finalized.
Copilot CLI stores session history in a SQLite database called `session_store` (read-only, accessed via the `sql` tool with `database: "session_store"`). This skill teaches agents how to query that store to detect interrupted sessions and resume work.
## Patterns
### 1. Find Recent Sessions
Query the `sessions` table filtered by time window. Include the last checkpoint to understand where the session stopped:
description: "Core conventions and patterns used in the Squad codebase"
domain: "project-conventions"
confidence: "high"
source: "manual"
---
## Context
These conventions apply to all work on the Squad CLI tool (`create-squad`). Squad is a zero-dependency Node.js package that adds AI agent teams to any project. Understanding these patterns is essential before modifying any Squad source code.
## Patterns
### Zero Dependencies
Squad has zero runtime dependencies. Everything uses Node.js built-ins (`fs`, `path`, `os`, `child_process`). Do not add packages to `dependencies` in `package.json`. This is a hard constraint, not a preference.
### Node.js Built-in Test Runner
Tests use `node:test` and `node:assert/strict` — no test frameworks. Run with `npm test`. Test files live in `test/`. The test command is `node --test test/`.
### Error Handling — `fatal()` Pattern
All user-facing errors use the `fatal(msg)` function which prints a red `✗` prefix and exits with code 1. Never throw unhandled exceptions or print raw stack traces. The global `uncaughtException` handler calls `fatal()` as a safety net.
### ANSI Color Constants
Colors are defined as constants at the top of `index.js`: `GREEN`, `RED`, `DIM`, `BOLD`, `RESET`. Use these constants — do not inline ANSI escape codes.
### File Structure
-`.squad/` — Team state (user-owned, never overwritten by upgrades)
-`.squad/templates/` — Template files copied from `templates/` (Squad-owned, overwritten on upgrade)
-`.github/agents/squad.agent.md` — Coordinator prompt (Squad-owned, overwritten on upgrade)
-`templates/` — Source templates shipped with the npm package
-`.squad/skills/` — Team skills in SKILL.md format (user-owned)
-`.squad/decisions/inbox/` — Drop-box for parallel decision writes
### Windows Compatibility
Always use `path.join()` for file paths — never hardcode `/` or `\` separators. Squad must work on Windows, macOS, and Linux. All tests must pass on all platforms.
### Init Idempotency
The init flow uses a skip-if-exists pattern: if a file or directory already exists, skip it and report "already exists." Never overwrite user state during init. The upgrade flow overwrites only Squad-owned files.
### Copy Pattern
`copyRecursive(src, target)` handles both files and directories. It creates parent directories with `{ recursive: true }` and uses `fs.copyFileSync` for files.
- **Adding npm dependencies** — Squad is zero-dep. Use Node.js built-ins only.
- **Hardcoded path separators** — Never use `/` or `\` directly. Always `path.join()`.
- **Overwriting user state on init** — Init skips existing files. Only upgrade overwrites Squad-owned files.
- **Raw stack traces** — All errors go through `fatal()`. Users see clean messages, not stack traces.
- **Inline ANSI codes** — Use the color constants (`GREEN`, `RED`, `DIM`, `BOLD`, `RESET`).
---
name: "squad-conventions"
description: "Core conventions and patterns used in the Squad codebase"
domain: "project-conventions"
confidence: "high"
source: "manual"
---
## Context
These conventions apply to all work on the Squad CLI tool (`create-squad`). Squad is a zero-dependency Node.js package that adds AI agent teams to any project. Understanding these patterns is essential before modifying any Squad source code.
## Patterns
### Zero Dependencies
Squad has zero runtime dependencies. Everything uses Node.js built-ins (`fs`, `path`, `os`, `child_process`). Do not add packages to `dependencies` in `package.json`. This is a hard constraint, not a preference.
### Node.js Built-in Test Runner
Tests use `node:test` and `node:assert/strict` — no test frameworks. Run with `npm test`. Test files live in `test/`. The test command is `node --test test/`.
### Error Handling — `fatal()` Pattern
All user-facing errors use the `fatal(msg)` function which prints a red `✗` prefix and exits with code 1. Never throw unhandled exceptions or print raw stack traces. The global `uncaughtException` handler calls `fatal()` as a safety net.
### ANSI Color Constants
Colors are defined as constants at the top of `index.js`: `GREEN`, `RED`, `DIM`, `BOLD`, `RESET`. Use these constants — do not inline ANSI escape codes.
### File Structure
-`.squad/` — Team state (user-owned, never overwritten by upgrades)
-`.squad/templates/` — Template files copied from `templates/` (Squad-owned, overwritten on upgrade)
-`.github/agents/squad.agent.md` — Coordinator prompt (Squad-owned, overwritten on upgrade)
-`templates/` — Source templates shipped with the npm package
-`.squad/skills/` — Team skills in SKILL.md format (user-owned)
-`.squad/decisions/inbox/` — Drop-box for parallel decision writes
### Windows Compatibility
Always use `path.join()` for file paths — never hardcode `/` or `\` separators. Squad must work on Windows, macOS, and Linux. All tests must pass on all platforms.
### Init Idempotency
The init flow uses a skip-if-exists pattern: if a file or directory already exists, skip it and report "already exists." Never overwrite user state during init. The upgrade flow overwrites only Squad-owned files.
### Copy Pattern
`copyRecursive(src, target)` handles both files and directories. It creates parent directories with `{ recursive: true }` and uses `fs.copyFileSync` for files.
description: "Update tests when changing APIs — no exceptions"
domain: "quality"
confidence: "high"
source: "earned (Fenster/Hockney incident, test assertion sync violations)"
---
## Context
When APIs or public interfaces change, tests must be updated in the same commit. When test assertions reference file counts or expected arrays, they must be kept in sync with disk reality. Stale tests block CI for other contributors.
## Patterns
- **API changes → test updates (same commit):** If you change a function signature, public interface, or exported API, update the corresponding tests before committing
- **Test assertions → disk reality:** When test files contain expected counts (e.g., `EXPECTED_FEATURES`, `EXPECTED_SCENARIOS`), they must match the actual files on disk
- **Add files → update assertions:** When adding docs pages, features, or any counted resource, update the test assertion array in the same commit
- **CI failures → check assertions first:** Before debugging complex failures, verify test assertion arrays match filesystem state
## Examples
✓ **Correct:**
- Changed auth API signature → updated auth.test.ts in same commit
- Added `distributed-mesh.md` to features/ → added `'distributed-mesh'` to EXPECTED_FEATURES array
- Deleted two scenario files → removed entries from EXPECTED_SCENARIOS
✗ **Incorrect:**
- Changed spawn parameters → committed without updating casting.test.ts (CI breaks for next person)
- Added `built-in-roles.md` → left EXPECTED_FEATURES at old count (PR blocked)
- Test says "expected 7 files" but disk has 25 (assertion staleness)
## Anti-Patterns
- Committing API changes without test updates ("I'll fix tests later")
- Treating test assertion arrays as static (they evolve with content)
- Assuming CI passing means coverage is correct (stale assertions can pass while being wrong)
- Leaving gaps for other agents to discover
---
name: "test-discipline"
description: "Update tests when changing APIs — no exceptions"
domain: "quality"
confidence: "high"
source: "earned (Fenster/Hockney incident, test assertion sync violations)"
---
## Context
When APIs or public interfaces change, tests must be updated in the same commit. When test assertions reference file counts or expected arrays, they must be kept in sync with disk reality. Stale tests block CI for other contributors.
## Patterns
- **API changes → test updates (same commit):** If you change a function signature, public interface, or exported API, update the corresponding tests before committing
- **Test assertions → disk reality:** When test files contain expected counts (e.g., `EXPECTED_FEATURES`, `EXPECTED_SCENARIOS`), they must match the actual files on disk
- **Add files → update assertions:** When adding docs pages, features, or any counted resource, update the test assertion array in the same commit
- **CI failures → check assertions first:** Before debugging complex failures, verify test assertion arrays match filesystem state
## Examples
✓ **Correct:**
- Changed auth API signature → updated auth.test.ts in same commit
- Added `distributed-mesh.md` to features/ → added `'distributed-mesh'` to EXPECTED_FEATURES array
- Deleted two scenario files → removed entries from EXPECTED_SCENARIOS
✗ **Incorrect:**
- Changed spawn parameters → committed without updating casting.test.ts (CI breaks for next person)
- Added `built-in-roles.md` → left EXPECTED_FEATURES at old count (PR blocked)
- Test says "expected 7 files" but disk has 25 (assertion staleness)
## Anti-Patterns
- Committing API changes without test updates ("I'll fix tests later")
- Treating test assertion arrays as static (they evolve with content)
- Assuming CI passing means coverage is correct (stale assertions can pass while being wrong)
- Leaving gaps for other agents to discover
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.