- Extract signature validation to internal/sigvalidate/ shared package
- Change function signature to accept []byte instead of hex strings,
eliminating unnecessary hex encode/decode round-trip
- Add signature validation tests to cmd/server/decoder_test.go
- Both cmd/server and cmd/ingestor now import from the shared package
Addresses all review feedback on PR #686.
Test validateAdvertSignature with valid signatures, tampered data,
wrong timestamps, malformed inputs, and wrong-length keys/signatures.
Also test decodeAdvert with validation enabled vs disabled.
The signature validation UI uses badge-success and badge-danger classes
that were not defined in style.css, resulting in unstyled badges.
Add them using the same color patterns as existing badge-hash variants.
- Release notes for 95 commits since v3.4.1
- OpenAPI/Swagger docs: /api/spec and /api/docs called out everywhere
- Deployment guide: new API Documentation section
- README: API docs link added
- FAQ: 'Where is the API documentation?' entry
- Test plans for v3.4.2 validation
## Problem
All table sorting on the Nodes page was broken — clicking column headers
did nothing. Affected:
- Nodes list table
- Node detail → Neighbors table
- Node detail → Observers table
## Root Cause
**Not a race condition** — the actual bug was a **data attribute
mismatch**.
`TableSort.init()` (in `table-sort.js`) queries for `th[data-sort-key]`
to find sortable columns. But all table headers in `nodes.js` used
`data-sort="..."` instead of `data-sort-key="..."`. The selector never
matched any headers, so no click handlers were attached and sorting
silently failed.
Additionally, `data-type="number"` was used but TableSort's built-in
comparator is named `numeric`, causing numeric columns to fall back to
text comparison.
The packets table (`packets.js`) was unaffected because it already used
the correct `data-sort-key` and `data-type="numeric"` attributes.
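The text-comparison fallback can be illustrated with a minimal sketch. It assumes TableSort resolves `data-type` through a lookup table and silently falls back to text comparison for unknown names; the `comparators` table and `pickComparator` helper are hypothetical and are not taken from `table-sort.js`.

```javascript
// Hypothetical comparator table; only the names `text` and `numeric` come from the PR.
const comparators = {
  text: (a, b) => String(a).localeCompare(String(b)),
  numeric: (a, b) => Number(a) - Number(b),
};

// Assumed lookup: an unknown type name silently falls back to text comparison.
function pickComparator(type) {
  return comparators[type] || comparators.text;
}

const sampleValues = ['10', '9', '100'];
// "number" is not a known comparator name, so these sort as text.
const sortedAsNumber = [...sampleValues].sort(pickComparator('number'));
// "numeric" is the intended comparator and sorts by value.
const sortedAsNumeric = [...sampleValues].sort(pickComparator('numeric'));
```

Under this assumption the misnamed type produces lexicographic order, which matches the symptom described for numeric columns.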
## Fix
1. **`public/nodes.js`**: Changed all `data-sort="..."` to
`data-sort-key="..."` on `<th>` elements (nodes list, neighbors table,
observers table)
2. **`public/nodes.js`**: Changed `data-type="number"` to
`data-type="numeric"` to match TableSort's comparator names
3. **`public/packets.js`**: Added timestamp tiebreaker to packet sort
for stable ordering when primary column values are equal
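A timestamp tiebreaker of the kind described in item 3 might look like the sketch below. The field names (`size`, `timestamp`), the helper name, and the newest-first tiebreak direction are assumptions for illustration, not the code in `packets.js`.

```javascript
// Sort a copy of `packets` by `key`, breaking ties by timestamp.
// Assumption: ties are ordered newest first, regardless of sort direction.
function sortWithTimestampTiebreak(packets, key, direction) {
  const sign = direction === 'desc' ? -1 : 1;
  return [...packets].sort((a, b) => {
    if (a[key] < b[key]) return -1 * sign;
    if (a[key] > b[key]) return 1 * sign;
    // Primary values are equal: fall back to timestamp so the order is deterministic.
    return b.timestamp - a.timestamp;
  });
}
```

Without the final comparison, rows with equal primary values keep whatever order they arrived in, which can shift as new packets are inserted.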
## Testing
- All existing tests pass (`npm test`)
- No changes to test infrastructure needed — this was a pure HTML
attribute fix
Fixes #679
---------
Co-authored-by: you <you@example.com>
## Fix: Channel Color Picker — Data Shape Mismatch + Redesign (#674)
### Problem
The channel color picker was completely non-functional — dead code.
Three locations in `live.js` attempted to read
`decoded.header.payloadTypeName` and `decoded.payload.channelName`, but:
1. The decoded payload structure is flat
(`decoded.payload.channelHash`), not nested with separate
`header`/`payload` objects within the payload
2. The field is `channelHash` (an integer), not `channelName`
3. `_ccChannel` was **never set** on any DOM element, so all picker
handlers exited early
Additionally, the picker had zero discoverability — hidden behind
right-click/long-press with no visual affordance.
### Changes
**M1 — Fix the data shape bug:**
- Fixed `_ccChannel` assignment in 3 locations in `live.js` to use
`decoded.payload.channelHash` (converted to string)
- Fixed `_getChannelStyle()` to use the same flat structure
- Channel colors now key on the hash string (e.g. `"5"`) matching the
channels API
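The M1 fix can be sketched as a small key-derivation helper. This is an illustration only: `channelKeyFor` is a hypothetical name, and `typeName` is passed in explicitly because the description does not say where the payload type is read from.

```javascript
// Returns the color-picker key for a decoded packet, or null when none applies.
// `typeName` as a separate argument is an assumption made for this sketch.
function channelKeyFor(typeName, decoded) {
  if (typeName !== 'GRP_TXT') return null;
  const hash = decoded && decoded.payload ? decoded.payload.channelHash : undefined;
  // Compare against null/undefined rather than truthiness so channelHash 0 is kept.
  if (hash === null || hash === undefined) return null;
  // Keys are strings (e.g. "5") so they line up with the channels API.
  return String(hash);
}
```

The explicit null check matters because a truthiness test would drop the valid hash value `0`.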
**M2 — Redesign for discoverability:**
- Reduced palette from 10 to **8 maximally-distinct colors** (removed
teal/rose — too close to cyan/red)
- Removed `<input type="color">` custom picker, "Apply" button, title
bar, close button
- Popover is now just 8 circle swatches + "Clear color" — click outside
to dismiss
- Added **12px clickable color dots** next to channel names on the
channels page (primary configuration surface)
- Unassigned channels show a dashed-border empty circle; assigned show
filled
- Channel list items get `border-left: 3px solid` when colored
- **Removed long-press handler entirely** — dots handle mobile
interaction
- Mobile: bottom-sheet with 36px touch targets via `@media (pointer:
coarse)`
**M3 — Visual encoding:**
- Left border only (3px) — no background tint (per Tufte spec: minimum
effective dose)
- Consistent encoding across live feed items, channel list, packets
table
### Tests
17 new tests in `test-channel-color-picker.js`:
- `_ccChannel` correctly set for GRP_TXT with various `channelHash`
values (including 0)
- `_ccChannel` not set for non-GRP_TXT packets
- `getRowStyle` returns `border-left:3px` only (no background)
- Palette is exactly 8 colors, no teal/rose
- All existing tests pass (62 + 29 + 490)
Fixes #674
---------
Co-authored-by: you <you@example.com>
Removes linux/arm64 from multi-platform build and drops QEMU setup.
All infra (prod + staging) is x86. QEMU emulation was adding ~12min
to every CI run for an unused architecture.
The buildFieldTable test expected hash_size=4 for path byte 0xC0 with
hash_count=0. After #653, zero hash_count shows 'hash_count=0 (direct
advert)' instead. Updated test and added new test verifying hash_size
IS shown when hash_count > 0.
## Noise Floor: Line Chart → Color-Coded Column Chart
Implements M3a from the [RF Health Dashboard
spec](https://github.com/Kpa-clawbot/CoreScope/issues/600#issuecomment-2784399622)
— replacing the noise floor line chart with discrete color-coded
columns.
### What changed
**`public/analytics.js`** — replaced `rfNFLineChart()` with
`rfNFColumnChart()`:
- **Color-coded bars by threshold**: green (`< -100 dBm`), yellow (`-100
to -85 dBm`), red (`≥ -85 dBm`)
- **Instant hover tooltips**: exact dBm value + UTC timestamp via native
SVG `<title>` — no delay
- **Column highlighting on hover**: CSS `:hover` with opacity change +
border stroke
- **Inline legend**: green/yellow/red threshold key in chart header
- **Removed reference lines**: the `-100 warning` and `-85 critical`
dashed lines are eliminated — threshold info is now encoded directly in
bar color (data-ink ratio improvement)
- **No gap detection**: column charts render discrete bars — each data
point is an independent observation, so line-chart-style gap detection
doesn't apply. Every sample gets a bar.
- **Reboot markers**: vertical dashed lines with "reboot" labels at
reboot timestamps (shared `rfRebootMarkers` helper, same as other RF
charts)
- **Division-by-zero guard**: constant values or single data points use
a ±5 dBm window so bars render with visible height
- **Sparklines unchanged**: fleet overview sparklines remain as
polylines (correct at 140×24px scale)
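The threshold coloring and the division-by-zero guard can be sketched as follows. The helper names and the plain color strings are placeholders; the actual `rfNFColumnChart()` code may structure this differently.

```javascript
// Map a noise floor reading to its threshold bucket (thresholds from the list above).
function noiseFloorColor(dbm) {
  if (dbm < -100) return 'green';
  if (dbm < -85) return 'yellow';
  return 'red';
}

// Compute the value range for scaling. `samples` must be non-empty.
// A flat series (or a single point) is widened by ±5 dBm so bars get visible height.
function valueRange(samples) {
  let min = Math.min(...samples);
  let max = Math.max(...samples);
  if (min === max) {
    min -= 5;
    max += 5;
  }
  return { min, max };
}

// Scale one reading into pixels; safe because valueRange never returns min === max.
function barHeight(dbm, range, chartHeight) {
  return ((dbm - range.min) / (range.max - range.min)) * chartHeight;
}
```

With the widened window, a constant series renders every bar at half the chart height instead of dividing by zero.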
### Why columns instead of lines
A polyline connecting discrete 5-minute noise floor samples creates
false visual continuity — it implies interpolation between measurements
that doesn't exist. When readings jump between -115 and -95 irregularly,
the line becomes a jagged mess. Column bars encode each sample as a
discrete, independent observation: one bar = one measurement.
### Testing
- 12 unit tests in `test-frontend-helpers.js` covering: SVG output,
threshold color coding, tooltips, empty/single/constant data, legend
rendering, reboot markers, shared time axis
- All existing tests pass (packet-filter: 62, aging: 29,
frontend-helpers: 490)
### No backend changes
Pure frontend change — ~150 lines in `analytics.js`.
Fixes #600
---------
Co-authored-by: you <you@example.com>
## Problem
The "Paths Through This Node" API endpoint (`/api/nodes/{pubkey}/paths`)
returns unrelated packets when two nodes share a hex prefix. For
example, querying paths for "Kpa Roof Solar" (`c0dedad4...`) returns 316
packets that actually belong to "C0ffee SF" (`C0FFEEC7...`) because both
share the `c0` prefix in the `byPathHop` index.
Fixes #655
## Root Cause
`handleNodePaths()` in `routes.go` collects candidates from the
`byPathHop` index using 2-char and 4-char hex prefixes for speed, but
never verifies that the target node actually appears in each candidate's
resolved path. The broad index lookup is intentional, but the
**post-filter was missing**.
## Fix
Added `nodeInResolvedPath()` helper in `store.go` that checks whether a
transmission's `resolved_path` (from the neighbor affinity graph via
`resolveWithContext`) contains the target node's full pubkey. The
filter:
- **Includes** packets where `resolved_path` contains the target node's
full pubkey
- **Excludes** packets where `resolved_path` resolved to a different
node (prefix collision)
- **Excludes** packets where `resolved_path` is nil/empty (ambiguous —
avoids false positives)
The check examines both the best observation's resolved_path
(`tx.ResolvedPath`) and all individual observations, so packets are
included if *any* observation resolved the target.
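The post-filter logic is sketched below in JavaScript for illustration; the real `nodeInResolvedPath()` helper is Go code in `store.go`. The object shape (`resolvedPath`, `observations`) and the case-insensitive comparison are assumptions of this sketch.

```javascript
// True if `resolvedPath` (array of pubkey strings, possibly containing nulls)
// includes the target's full pubkey. Nil or empty paths never match.
function pathContains(resolvedPath, targetPubkey) {
  if (!Array.isArray(resolvedPath)) return false;
  const target = targetPubkey.toLowerCase();
  return resolvedPath.some((hop) => typeof hop === 'string' && hop.toLowerCase() === target);
}

// A transmission matches if its best observation or any individual observation
// resolved the target node.
function nodeInResolvedPath(tx, targetPubkey) {
  if (pathContains(tx.resolvedPath, targetPubkey)) return true;
  return (tx.observations || []).some((obs) => pathContains(obs.resolvedPath, targetPubkey));
}
```

A candidate whose path resolved to a different node with the same prefix fails the full-pubkey comparison and is dropped.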
## Tests
- `TestNodeInResolvedPath` — unit test for the helper with 5 cases
(match, different node, nil, all-nil elements, match in observation
only)
- `TestNodePathsPrefixCollisionFilter` — integration test: two nodes
sharing `aa` prefix, verifies the collision packet is excluded from one
and included for the other
- Updated test DB schema to include `resolved_path` column and seed data
with resolved pubkeys
- All existing tests pass (165 additions, 8 modifications)
## Performance
No impact on hot paths. The filter runs once per API call on the
already-collected candidate set (typically small). `nodeInResolvedPath`
is O(observations × hops) per candidate — negligible since observations
per transmission are typically 1–5.
---------
Co-authored-by: you <you@example.com>
## Summary
TRACE packets on the live map previously animated the **full intended
route** regardless of how far the trace actually reached. This made it
impossible to distinguish a completed route from a failed one —
undermining the primary diagnostic purpose of trace packets.
## Changes
### Backend — `cmd/server/decoder.go`
- Added `HopsCompleted *int` field to the `Path` struct
- For TRACE packets, the header path contains SNR bytes (one per hop
that actually forwarded). Before overwriting `path.Hops` with the full
intended route from the payload, we now capture the header path's
`HashCount` as `hopsCompleted`
- This field is included in API responses and WebSocket broadcasts via
the existing JSON serialization
### Frontend — `public/live.js`
- For TRACE packets with `hopsCompleted < totalHops`:
- Animate only the **completed** portion (solid line + pulse)
- Draw the **unreached** remainder as a dashed/ghosted line (25%
opacity, `6,8` dash pattern) with ghost markers
- Dashed lines and ghost markers auto-remove after 10 seconds
- When `hopsCompleted` is absent or equals total hops, behavior is
unchanged
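The split between the animated and ghosted portions can be sketched as a pure function. `splitTraceRoute` is a hypothetical helper name; how `live.js` actually partitions the hops is not shown in this description.

```javascript
// Split the intended route into the part the trace reached and the part it did not.
// When hopsCompleted is absent or covers the whole route, everything is "completed".
function splitTraceRoute(hops, hopsCompleted) {
  const total = hops.length;
  if (hopsCompleted === undefined || hopsCompleted === null || hopsCompleted >= total) {
    return { completed: hops.slice(), unreached: [] };
  }
  const done = Math.max(0, hopsCompleted);
  return { completed: hops.slice(0, done), unreached: hops.slice(done) };
}
```

The `completed` list would drive the solid line and pulse, and the `unreached` list the dashed line and ghost markers.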
### Tests — `cmd/server/decoder_test.go`
- `TestDecodePacket_TraceHopsCompleted` — partial completion (2 of 4
hops)
- `TestDecodePacket_TraceNoSNR` — zero completion (trace not forwarded
yet)
- `TestDecodePacket_TraceFullyCompleted` — all hops completed
## How it works
The MeshCore firmware appends an SNR byte to `pkt->path[]` at each hop
that forwards a TRACE packet. The count of these SNR bytes (`path_len`)
indicates how far the trace actually got. CoreScope's decoder already
parsed the header path, but the TRACE-specific code overwrote it with
the payload hops (full intended route) without preserving the progress
information. Now we save that count first.
Fixes #651
---------
Co-authored-by: you <you@example.com>
## Summary
The "By Repeaters" section on the Hash Stats analytics page was counting
**all** node types (companions, room servers, sensors, etc.) instead of
only repeaters. This made the "By Repeaters" distribution identical to
"Multi-Byte Hash Adopters", defeating the purpose of the breakdown.
Fixes #652
## Root Cause
`computeAnalyticsHashSizes()` in `cmd/server/store.go` built its
`byNode` map from advert packet data without cross-referencing node
roles from the node store. Both `distributionByRepeaters` and
`multiByteNodes` consumed this unfiltered map.
## Changes
### `cmd/server/store.go`
- Build a `nodeRoleByPK` lookup map from `getCachedNodesAndPM()` at the
start of the function
- Store `role` in each `byNode` entry when processing advert packets
- **`distributionByRepeaters`**: filter to only count nodes whose role
contains "repeater"
- **`multiByteNodes`**: include `role` field in output so the frontend
can filter/group by node type
### `cmd/server/coverage_test.go`
- Add `TestHashSizesDistributionByRepeatersFiltersRole`: verifies that
companion nodes are excluded from `distributionByRepeaters` but included
in `multiByteNodes` with correct role
### `cmd/server/routes_test.go`
- Fix `TestHashAnalyticsZeroHopAdvert`: invalidate node cache after DB
insert so role lookup works
- Fix `TestAnalyticsHashSizeSameNameDifferentPubkey`: insert node
records as repeaters + invalidate cache
## Testing
All `cmd/server` tests pass (68 insertions, 3 deletions across 3 files).
Co-authored-by: you <you@example.com>
## Fix: Zero-hop DIRECT packets report bogus hash_size
Closes #649
### Problem
When a DIRECT packet has zero hops (pathByte lower 6 bits = 0), the
generic `hash_size = (pathByte >> 6) + 1` formula produces a bogus value
(1-4) instead of 0/unknown. This causes incorrect hash size displays and
analytics for zero-hop direct adverts.
### Solution
**Frontend (JS):**
- `packets.js` and `nodes.js` now check `(pathByte & 0x3F) === 0` to
detect zero-hop packets and suppress bogus hash_size display.
**Backend (Go):**
- Both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go` reset
`HashSize=0` for DIRECT packets where `pathByte & 0x3F == 0` (hash_count
is zero).
- TRACE packets are excluded since they use hashSize to parse hop data
from the payload.
- The condition uses `pathByte & 0x3F == 0` (not `pathByte == 0x00`) to
correctly handle the case where hash_size bits are non-zero but
hash_count is zero — matching the JS frontend approach.
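The rule can be written as one small function. This is a sketch: the `flags` object with `isDirect` and `isTrace` is a hypothetical way of passing packet type information and is not the decoder's actual signature.

```javascript
// Derive hash_size from the path byte: upper 2 bits = size - 1, lower 6 bits = hash_count.
function hashSizeFor(pathByte, flags) {
  // Parentheses are required in JavaScript: `===` binds tighter than `&`.
  const zeroHop = (pathByte & 0x3f) === 0;
  // Zero-hop DIRECT packets carry no hashes, so the size bits are meaningless.
  // TRACE packets are excluded because they still need hash_size to parse the payload.
  if (flags.isDirect && !flags.isTrace && zeroHop) return 0;
  return (pathByte >> 6) + 1;
}
```

The four backend test cases listed under Testing map directly onto calls of this function.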
### Testing
**Backend:**
- Added 4 tests each in `cmd/server/decoder_test.go` and
`cmd/ingestor/decoder_test.go`:
- DIRECT + pathByte 0x00 → HashSize=0 ✅
- DIRECT + pathByte 0x40 (hash_size bits set, hash_count=0) → HashSize=0
✅
- Non-DIRECT + pathByte 0x00 → HashSize=1 (unchanged) ✅
- DIRECT + pathByte 0x01 (1 hop) → HashSize=1 (unchanged) ✅
- All existing tests pass (`go test ./...` in both cmd/server and
cmd/ingestor)
**Frontend:**
- Verified hash size display is suppressed for zero-hop direct adverts
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- **`app.js`**: `getDistanceUnit()`, `formatDistance(km)`,
`formatDistanceRound(km)` helpers. Auto mode uses `navigator.language` —
miles for `en-US`, `en-GB`, `my`, `lr`; km everywhere else.
- **`customize-v2.js`**: Distance Unit preference (km / mi / auto) in
Display Settings panel. Stored in
`localStorage['meshcore-distance-unit']` via the existing apply
pipeline. Override dot and reset work. Display tab badge counts it.
- **`nodes.js`**: Neighbor table distance cell uses `formatDistance()`.
- **`analytics.js`**: All rendered km values use `formatDistance()` or
`formatDistanceRound()`. Column headers (`km`/`mi`) respond to the
active unit. Collision classification thresholds (Local < 50 km /
Regional 50–200 km / Distant > 200 km) also adapt.
Default is `auto` — no change for existing users unless their locale
maps to miles.
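A minimal sketch of the unit resolution and formatting follows. Unlike the real `formatDistance(km)` helpers, these versions take the unit as an explicit argument so they stay pure; the exact-match locale comparison and the one-decimal precision are assumptions.

```javascript
const KM_PER_MILE = 1.609344;
// Locale tags that map to miles in auto mode, as listed in the summary above.
const MILE_LOCALES = ['en-us', 'en-gb', 'my', 'lr'];

// Resolve the effective unit from the stored preference and the browser language.
function resolveDistanceUnit(preference, language) {
  if (preference === 'km' || preference === 'mi') return preference;
  const tag = String(language || '').toLowerCase();
  return MILE_LOCALES.includes(tag) ? 'mi' : 'km';
}

// One decimal place, for table cells.
function formatDistance(km, unit) {
  return unit === 'mi' ? `${(km / KM_PER_MILE).toFixed(1)} mi` : `${km.toFixed(1)} km`;
}

// Whole numbers, for thresholds and stat cards.
function formatDistanceRound(km, unit) {
  return unit === 'mi' ? `${Math.round(km / KM_PER_MILE)} mi` : `${Math.round(km)} km`;
}
```

In the browser the language argument would come from `navigator.language` and the preference from `localStorage['meshcore-distance-unit']`.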
## Test plan
- [x] `node test-frontend-helpers.js` — 456 passed, 0 failed (10 new
formatDistance tests)
- [ ] Set unit to **mi** in customize → Neighbors table shows `7.6 mi`
instead of `12.3 km`
- [ ] Analytics → Distance tab → stat cards, leaderboard, and column
headers all show miles
- [ ] Collision tool → Local/Regional/Distant thresholds show `31 mi` /
`124 mi`
- [ ] Route patterns popup shows miles per hop and total
- [ ] Reset override dot → unit returns to auto
Closes #621
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `originLat` was declared with `const` inside two block-scoped
`if`/`else` branches in `resolveHopPositions` (lines 1914 and 1921) but
referenced at line 1945 outside both blocks → `ReferenceError: originLat
is not defined` thrown on every packet render on the live page.
- Fix: introduce `senderLat` derived directly from
`payload.lat`/`payload.lon` at the point of use, using the same
null/zero guard as the existing declarations.
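The scoping problem and the shape of the fix can be reproduced in isolation. Both functions below are illustrative reductions, not the code in `resolveHopPositions`, and the exact form of the null/zero guard is an assumption.

```javascript
// Buggy shape: each `const originLat` is scoped to its own block,
// so the name does not exist where it is finally read.
function buggyOrigin(payload) {
  if (payload.lat != null && payload.lat !== 0) {
    const originLat = payload.lat;
  } else {
    const originLat = null;
  }
  return originLat; // throws ReferenceError: originLat is not defined
}

// Fixed shape: derive the value at the point of use, in the scope that reads it.
function senderPosition(payload) {
  const hasGps = payload.lat != null && payload.lon != null &&
    !(payload.lat === 0 && payload.lon === 0);
  const senderLat = hasGps ? payload.lat : null;
  const senderLon = hasGps ? payload.lon : null;
  return { senderLat, senderLon };
}
```

Because the error is only raised when the line executes, the buggy version loads without complaint and fails on every call.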
## Test plan
- [x] Live page no longer shows `ReferenceError: originLat is not
defined` in the console
- [x] Packet path animations still render correctly for packets with GPS
coords
- [x] Packets without GPS coords still handled (senderLat === null,
anchor not added)
Closes #647
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes remaining text inconsistencies in the Prefix Tool after #643 added
the repeater filter.
The Torvalds review on #643 flagged:
1. **Must-fix (already addressed in #643):** "About these numbers" text
— fixed
2. **Out-of-scope:** Empty state says "No nodes"; it should say "No
repeaters"
This PR fixes ALL remaining "nodes" references in the Prefix Tool to say
"repeaters":
- Empty state: "No nodes in the network yet" → "No repeaters in the
network yet"
- Stat card label: "Total nodes" → "Total repeaters"
- Region note link: "Check all nodes →" → "Check all repeaters →"
- Recommendation text: "With N nodes" → "With N repeaters"
Verified: zero occurrences of stale "all nodes", "Total nodes", or "No
nodes" remain in the Prefix Tool section.
Closes #642
Co-authored-by: you <you@example.com>
## Summary
- **nodes.js**: `#/nodes?tab=repeater` and `#/nodes?search=foo` — role
tab and search query are now URL-addressable; state resets to defaults
on re-navigation
- **packets.js**: `#/packets?timeWindow=60` and
`#/packets?region=US-SFO` — time window and region filter survive
refresh and are shareable
- **channels.js**: `#/channels/{hash}?node=Name` — node detail panel is
URL-addressable; auto-opens on load, URL updates on open/close
- **region-filter.js**: adds `RegionFilter.setSelected(codesArray)` to
public API (needed for URL-driven init)
All changes use `history.replaceState` (not `pushState`) to avoid
polluting browser history. URL params override localStorage on load;
localStorage remains fallback.
## Implementation notes
- Router strips query string before computing `routeParam`, so all pages
read URL params directly from `location.hash`
- `buildNodesQuery(tab, searchStr)` and `buildPacketsUrl(timeWindowMin,
regionParam)` are pure functions exposed on `window` for testability
- Region URL param is applied after `RegionFilter.init()` via a
`_pendingUrlRegion` module-level var to keep ordering explicit
- `showNodeDetail` captures `selectedHash` before the async `lookupNode`
call to avoid stale URL construction
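The query-building and hash-parsing approach can be sketched with `URLSearchParams`. The bodies below are assumptions: only the `buildNodesQuery(tab, searchStr)` signature comes from this PR, `'all'` as the default tab is a guess, and `readHashParams` and `initialTab` are hypothetical helpers.

```javascript
// Build the query suffix for the nodes page; default values are omitted from the URL.
function buildNodesQuery(tab, searchStr) {
  const params = new URLSearchParams();
  if (tab && tab !== 'all') params.set('tab', tab);
  if (searchStr) params.set('search', searchStr);
  const qs = params.toString();
  return qs ? `?${qs}` : '';
}

// Read params from a hash like "#/nodes?tab=repeater" (the router strips the query itself).
function readHashParams(hash) {
  const idx = hash.indexOf('?');
  return new URLSearchParams(idx === -1 ? '' : hash.slice(idx + 1));
}

// URL param wins over the stored value; the stored value is the fallback.
function initialTab(hash, storedTab) {
  return readHashParams(hash).get('tab') || storedTab || 'all';
}
```

In the page code the result would be written back with `history.replaceState`, so tab changes do not add history entries.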
## Test plan
- [x] `node test-frontend-helpers.js` — 459 passed, 0 failed (includes 6
`buildNodesQuery` + 5 `buildPacketsUrl` unit tests)
- [x] Navigate to `#/nodes?tab=repeater` — Repeaters tab active on load
- [x] Click a tab, verify URL updates to `#/nodes?tab=room`
- [x] Navigate to `#/packets?timeWindow=60` — time window dropdown shows
60 min
- [x] Change time window, verify URL updates
- [x] Navigate to `#/channels/{hash}` and click a sender name — URL
updates to `?node=Name`
- [x] Reload that URL — node panel re-opens
Closes #536
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Hash Issues and Prefix Tool tabs showed different collision counts
because the Prefix Tool was including all node types (companions, rooms,
sensors) while Hash Issues correctly filtered to repeaters only.
**Only repeaters matter for prefix collisions** — they're the nodes that
relay packets using hash-based addressing. Non-repeater collisions are
harmless noise.
## Changes
1. **Filtered Prefix Tool to repeaters only** — matches Hash Issues'
scope
2. **Updated explanatory text** — both tabs now clearly state they cover
repeaters
3. **Added cross-reference links** between the two tabs
4. **Added hash_size badges** in Prefix Tool results
Both tabs should now agree on collision counts for each byte size.
## Review Status
- ✅ Self-review
- ✅ Torvalds review — caught stale 'regardless of role' text, fixed
- ✅ All tests pass
Fixes #642
---------
Co-authored-by: you <you@example.com>
## Summary
Replace source-grep virtual scroll tests with behavioral tests that
exercise actual logic. Fixes #405, Fixes #409.
## What changed
### packets.js
- **Extracted `_calcVisibleRange()`** — pure function containing the
binary-search range calculation logic previously inline in
`renderVisibleRows()`. Takes offsets, scroll position, viewport
dimensions, row height, thead offset, and buffer as parameters. Returns
`{ startIdx, endIdx, firstEntry, lastEntry }`.
- `renderVisibleRows()` now calls `_calcVisibleRange()` instead of
inline math — no behavioral change.
- Exported via `_packetsTestAPI` for direct testing.
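A simplified sketch of a range calculation in this style is shown below. It is not the extracted `_calcVisibleRange()`: the parameter order, the meaning of `offsets` (cumulative rendered-row counts, one extra trailing element for the total), and the half-open result ranges are all assumptions.

```javascript
// offsets[i] = number of rendered rows before entry i; offsets[n] = total rows.
function calcVisibleRange(offsets, scrollTop, viewportHeight, rowHeight, theadHeight, buffer) {
  const entryCount = offsets.length - 1;
  if (entryCount <= 0) return { startIdx: 0, endIdx: 0, firstEntry: 0, lastEntry: 0 };
  const totalRows = offsets[entryCount];

  // Convert the scroll position into a row window, padded by `buffer` rows and clamped.
  const firstRow = Math.floor(Math.max(0, scrollTop - theadHeight) / rowHeight);
  const visibleRows = Math.ceil(viewportHeight / rowHeight);
  const startIdx = Math.max(0, Math.min(totalRows, firstRow - buffer));
  const endIdx = Math.min(totalRows, firstRow + visibleRows + buffer);

  // Binary search: index of the last entry whose first row is <= `row`.
  const entryAt = (row) => {
    let lo = 0;
    let hi = entryCount - 1;
    while (lo < hi) {
      const mid = (lo + hi + 1) >> 1;
      if (offsets[mid] <= row) lo = mid;
      else hi = mid - 1;
    }
    return lo;
  };

  const firstEntry = entryAt(startIdx);
  const lastEntry = endIdx > startIdx ? entryAt(endIdx - 1) + 1 : firstEntry;
  return { startIdx, endIdx, firstEntry, lastEntry };
}
```

Keeping the function free of DOM access is what makes it testable with plain arrays, including expanded groups where one entry spans several rows.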
### test-frontend-helpers.js
- **Removed 8 source-grep tests** that used
`packetsSource.includes(...)` to check strings exist in source code (not
behavior):
- "renderVisibleRows uses cumulative offsets not flat entry count"
- "renderVisibleRows skips DOM rebuild when range unchanged"
- "lazy row generation — HTML built only for visible slice"
- "observer filter Set is hoisted, not recreated per-packet"
- "packets.js display filter checks _children for observer match"
- "packets.js WS filter checks _children for observer match"
- "buildFlatRowHtml has null-safe decoded_json"
- "pathHops null guard in buildFlatRowHtml / detail pane"
- "destroy cleans up virtual scroll state"
- **Added 11 behavioral tests for `_calcVisibleRange()`** loaded from
the actual packets.js via sandbox:
- Top of list (scroll = 0)
- Middle of list (scroll to row 50)
- Bottom of list (scroll past end)
- Empty array (0 entries)
- Single item
- Exact row boundary
- Large dataset (30K items)
- Various row heights (24px instead of 36px)
- Thead offset shifting visible range
- Expanded groups with variable row counts
- Buffer clamped at boundaries
- **Kept all existing behavioral tests**: `cumulativeRowOffsets`,
`getRowCount`, observer filter logic (#537).
## Test count
- Removed: 8 source-grep tests
- Added: 11 behavioral tests
- Net: +3 tests (446 total, 0 failures)
## Why
Source-grep tests (`packetsSource.includes('...')`) are brittle — they
break on refactors even when behavior is preserved, and they pass even
when the tested code is buggy. Behavioral tests exercise real
inputs/outputs and catch actual regressions.
Co-authored-by: you <you@example.com>
## Summary
Implements M2 of the table sorting spec (#620): sortable nodes list +
neighbor/observer tables.
### Changes
**Shared utility (`public/table-sort.js`)**
- IIFE pattern, no dependencies, no build step
- DOM-reorder sorting (no innerHTML rebuild) — preserves event listeners
- `data-value` attributes for raw sortable values, `data-type` on `<th>`
for type detection
- Built-in comparators: text (`localeCompare`), number, date, dBm
- `aria-sort` attributes, keyboard support (Enter/Space), sort arrows
- localStorage persistence with `storageKey` option
- `onSort` callback for custom re-render triggers
**Nodes list table**
- Wired via `TableSort.init` with `onSort` callback that triggers
`renderRows()`
- Keeps JS-array-level sorting for claimed/favorites pinning (TableSort
can't handle pinned rows)
- Replaces old `sortState`, `toggleSort()`, `sortArrow()` with TableSort
controller
- Test hooks preserved for backward compatibility (fallback state for
non-DOM tests)
**Neighbor table**
- Added `data-sort` and `data-value` attributes to all columns (name,
role, score, count, last_seen, distance)
- Default sort: count descending
- `TableSort.init` called after neighbor data renders
**Observer table (full detail page)**
- Converted from plain `<table>` to sortable table with data attributes
- Sortable columns: observer, region, packets, avg SNR, avg RSSI
- Default sort: packets descending
### Testing
- 18 new unit tests for `table-sort.js` (custom DOM mock, no jsdom
dependency)
- All 445 existing frontend tests pass unchanged
- All packet-filter (62) and aging (29) tests pass
### Note
This branch includes `table-sort.js` since M1 hasn't merged yet. The
utility code is identical to the M1 spec.
---------
Co-authored-by: you <you@example.com>
## Consolidate CI Pipeline — Build + Publish to GHCR + Deploy Staging
### What
Merges the separate `publish.yml` workflow into `deploy.yml`, creating a
single CI/CD pipeline:
**`go-test → e2e-test → build-and-publish → deploy → publish-badges`**
### Why
- Two workflows doing overlapping builds was wasteful and error-prone
- `publish.yml` had a bug: `BUILD_TIME=$(date ...)` sat in a `with:` block,
where values are not shell-evaluated, so it was passed as a literal string
- The old build job had duplicate/conflicting `APP_VERSION` assignments
### Changes
- **`build-and-publish` job** replaces old `build` job — builds locally
for staging, then does multi-arch GHCR push (gated to push events only,
PRs skip)
- **Build metadata** computed in a dedicated step, passed via
`GITHUB_OUTPUT` — no more shell expansion bugs
- **`APP_VERSION`** is `v1.2.3` on tag push, `edge` on master push
- **Deploy** now pulls the `edge` image from GHCR and tags for compose
compatibility, with fallback to local build
- **`publish.yml` deleted** — no duplicate workflow
- **Top-level `permissions`** block with `packages:write` for GHCR auth
- **Triggers** now include `tags: ['v*']` for release publishing
### Status
- ✅ Rebased onto master
- ✅ Self-reviewed (all checklist items pass)
- ✅ Ready for merge
Co-authored-by: you <you@example.com>
## Summary
Implements M1 of the table sorting spec (#620): a shared `TableSort`
utility module and integration with the packets table.
### What's included
**1. `public/table-sort.js` — Shared sort utility (IIFE, no
dependencies)**
- `TableSort.init(tableEl, options)` — attaches click-to-sort on `<th
data-sort-key="...">` elements
- Built-in comparators: text (localeCompare), numeric, date (ISO), dBm
(strips suffix)
- NaN/null values sort last consistently
- Visual: ▲/▼ `<span class="sort-arrow">` appended to active column
header
- Accessibility: `aria-sort="ascending|descending|none"`, keyboard
support (Enter/Space)
- DOM reorder via `appendChild` loop (no innerHTML rebuild)
- `domReorder: false` option for virtual scroll tables (packets)
- `storageKey` option for localStorage persistence
- Custom comparator override per column
- `onSort(column, direction)` callback
- `destroy()` for clean teardown
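A comparator that keeps NaN/null values last in both directions might look like the sketch below. The function names are hypothetical, and using `parseFloat` to strip a `dBm` suffix is an assumption about how the built-in comparator could work.

```javascript
// Compare two cell values numerically; unparseable values always sort last.
function compareMissingLast(a, b, direction) {
  const x = parseFloat(a); // parseFloat('-95 dBm') is -95; parseFloat(null) is NaN
  const y = parseFloat(b);
  const xMissing = Number.isNaN(x);
  const yMissing = Number.isNaN(y);
  if (xMissing && yMissing) return 0;
  // Missing values are placed after real ones before the direction is applied.
  if (xMissing) return 1;
  if (yMissing) return -1;
  return direction === 'desc' ? y - x : x - y;
}

function sortValues(values, direction) {
  return [...values].sort((a, b) => compareMissingLast(a, b, direction));
}
```

Handling missing values before applying the direction is what keeps them at the bottom when the sort is flipped.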
**2. Packets table integration**
- All columns sortable: region, time, hash, size, HB, type, observer,
path, rpt
- Default sort: time descending (matches existing behavior)
- Uses `domReorder: false` + `onSort` callback to sort the data array,
then re-render via virtual scroll
- Works with both grouped and ungrouped views
- WebSocket updates respect active sort column
- Sort preference persisted in localStorage (`meshcore-packets-sort`)
**3. Tests — 22 unit tests (`test-table-sort.js`)**
- All 4 built-in comparators (text, numeric, date, dBm)
- NaN/null edge cases
- Direction toggle on click
- aria-sort attribute correctness
- Visual indicator (▲/▼) presence and updates
- onSort callback
- domReorder: false behavior
- destroy() cleanup
- Custom comparator override
### Performance
Packets table sorting works at the data array level (single `Array.sort`
call), not DOM level. Virtual scroll then renders only visible rows. No
new DOM nodes are created during sort — it's purely a data reorder +
re-render of the existing visible window. Expected sort time for 30K
packets: ~50-100ms (array sort) + existing virtual scroll render time.
Closes #620 (M1)
Co-authored-by: you <you@example.com>
## Summary
Implements the `DISABLE_CADDY` environment variable in the Docker
entrypoint, fixing #629.
## Problem
The `DISABLE_CADDY` env var was documented but had no effect — the
entrypoint only handled `DISABLE_MOSQUITTO`.
## Changes
### New supervisord configs
- **`supervisord-go-no-caddy.conf`** — mosquitto + ingestor + server (no
Caddy)
- **`supervisord-go-no-mosquitto-no-caddy.conf`** — ingestor + server
only
### Updated entrypoint (`docker/entrypoint-go.sh`)
Handles all 4 combinations:
| DISABLE_MOSQUITTO | DISABLE_CADDY | Config used |
|---|---|---|
| false | false | `supervisord.conf` (default) |
| true | false | `supervisord-no-mosquitto.conf` |
| false | true | `supervisord-no-caddy.conf` |
| true | true | `supervisord-no-mosquitto-no-caddy.conf` |
### Dockerfiles
Added COPY lines for the new configs in both `Dockerfile` and
`Dockerfile.go`.
## Testing
```bash
# Verify correct config selection
docker run -e DISABLE_CADDY=true corescope
# Should log: [config] Caddy reverse proxy disabled (DISABLE_CADDY=true)
docker run -e DISABLE_CADDY=true -e DISABLE_MOSQUITTO=true corescope
# Should log both disabled messages
```
Fixes #629
Co-authored-by: you <you@example.com>
## Summary
Fixes critical and major mobile accessibility items from #630, focused
on small phone viewports (320px–375px).
### Critical fixes
1. **Touch targets ≥ 44px** — All interactive elements (filter buttons,
tab buttons, search inputs, nav buttons, region pills, dropdowns) get
`min-height: 44px; min-width: 44px` via `@media (pointer: coarse)` —
desktop/mouse users are unaffected.
2. **ARIA live regions** — Added `aria-live="polite"` to: packet list
(`#pktLeft`), node list (`#nodesLeft`), analytics content
(`#analyticsContent`), live feed (`#liveFeed` with `role="log"`). Screen
readers now announce dynamic content updates.
3. **Color-only status indicators** — Status dots in live view marked
`aria-hidden="true"` (text labels like "Online"/"Degraded"/"Offline"
already present alongside).
4. **Detail panel on mobile** — Side panel (`panel-right`) renders as a
full-screen fixed overlay on ≤640px. Close button (✕) added to nodes
detail panel. Escape key closes both nodes and packets detail panels.
### Major fixes
5. **Analytics tabs overflow** — Tabs switch to `flex-wrap: nowrap;
overflow-x: auto` on ≤640px, preventing overflow on 320px screens.
6. **Table horizontal scroll** — Added `.table-scroll-wrap` class and
`min-width: 480px` on `.data-table` at ≤640px for horizontal scrolling
when columns don't fit.
7. **SPA focus management** — On every page navigation, focus moves to
first heading (`h1`/`h2`/`h3`) or falls back to `#app`. Uses
`requestAnimationFrame` for correct DOM timing.
### Bonus
- Analytics tabs get `role="tablist"` + `aria-label` for screen reader
semantics.
### Known follow-ups (not blocking)
- Individual tab buttons should get `role="tab"` + `aria-selected` +
`aria-controls` for complete ARIA tab pattern.
- `sr-status-label` and `table-scroll-wrap` CSS classes are defined but
not yet used in JS — ready for future use when status text labels and
table wrappers are wired up.
Closes #630
Co-authored-by: you <you@example.com>
## Summary
Auto-generated OpenAPI 3.0.3 spec endpoint (`/api/spec`) and Swagger UI
(`/api/docs`) for the CoreScope API.
## What
- **`cmd/server/openapi.go`** — Route metadata map
(`routeDescriptions()`) + spec builder that walks the mux router to
generate a complete OpenAPI 3.0.3 spec at runtime. Includes:
- All 47 API endpoints grouped by tag (admin, analytics, channels,
config, nodes, observers, packets)
- Query parameter documentation for key endpoints (packets, nodes,
search, resolve-hops)
- Path parameter extraction from mux `{name}` patterns
- `ApiKeyAuth` security scheme for API-key-protected endpoints
- Swagger UI served as a self-contained HTML page using unpkg CDN
- **`cmd/server/openapi_test.go`** — Tests for spec endpoint (validates
JSON structure, required fields, path count, security schemes,
self-exclusion of `/api/spec` and `/api/docs`), Swagger UI endpoint, and
`extractPathParams` helper.
- **`cmd/server/routes.go`** — Stores router reference on `Server`
struct for spec generation; registers `/api/spec` and `/api/docs`
routes.
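The path parameter extraction mentioned above can be sketched roughly as follows. This is an illustrative guess at what a helper like `extractPathParams` might do for mux-style `{name}` and `{name:regex}` templates; the signature and the handling of regex constraints are assumptions, not the code in `openapi.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// extractPathParams returns the parameter names found in a mux-style path
// template such as "/api/observers/{id}/metrics". Hypothetical sketch: it
// does not handle regex constraints that themselves contain braces.
func extractPathParams(path string) []string {
	var params []string
	for {
		open := strings.IndexByte(path, '{')
		if open < 0 {
			return params
		}
		end := strings.IndexByte(path[open:], '}')
		if end < 0 {
			return params // unbalanced brace: stop rather than guess
		}
		name := path[open+1 : open+end]
		// Drop an optional ":regex" constraint after the name.
		if colon := strings.IndexByte(name, ':'); colon >= 0 {
			name = name[:colon]
		}
		if name != "" {
			params = append(params, name)
		}
		path = path[open+end+1:]
	}
}

func main() {
	fmt.Println(extractPathParams("/api/observers/{id}/metrics"))
}
```

Each extracted name would then become a required `in: path` parameter in the generated spec.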
## Design Decisions
- **Runtime spec generation** vs static YAML: The spec walks the actual
router, so it can never drift from registered routes. Route metadata
(summaries, descriptions, tags, auth flags) is maintained in a parallel
map — the test enforces minimum path count to catch drift.
- **No external dependencies**: Uses only stdlib + existing gorilla/mux.
Swagger UI loaded from unpkg CDN (no vendored assets).
- **Security tagging**: Auth-protected endpoints (those behind
`requireAPIKey` middleware) are tagged with `security: [{ApiKeyAuth:
[]}]` in the spec, matching the actual middleware configuration.
## Testing
- `go test -run TestOpenAPI` — validates spec structure, field presence,
path count ≥ 20, security schemes
- `go test -run TestSwagger` — validates HTML response with swagger-ui
references
- `go test -run TestExtractPathParams` — unit tests for path parameter
extraction
---------
Co-authored-by: you <you@example.com>
## Zero-Config Defaults + Deployment Docs
Make CoreScope start with zero configuration — no `config.json`
required. The ingestor falls back to sensible defaults (local MQTT
broker, standard topics, default DB path) when no config file exists.
### What changed
**`cmd/ingestor/config.go`** — `LoadConfig` no longer errors on missing
config file. Instead it logs a message and uses defaults. If no MQTT
sources are configured (from file or env), defaults to
`mqtt://localhost:1883` with `meshcore/#` topic.
**`cmd/ingestor/main.go`** — Removed redundant "no MQTT sources" fatal
(now handled in config layer). Improved the "no connections established"
fatal with actionable hints.
**`README.md`** — Replaced "Docker (Recommended)" section with a
one-command quickstart using the pre-built image. No build step, no
config file, just `docker run`.
**`docs/deployment.md`** — New comprehensive deployment guide covering
Docker, Compose, config reference, MQTT setup, TLS/HTTPS, monitoring,
backup, and troubleshooting.
### Zero-config flow
```
docker run -d -p 80:80 -v corescope-data:/app/data ghcr.io/kpa-clawbot/corescope:latest
```
1. No config.json found → defaults used, log message printed
2. No MQTT sources → defaults to `mqtt://localhost:1883`
3. Internal Mosquitto broker already running in container → connection
succeeds
4. Dashboard shows empty, ready for packets
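A minimal sketch of the fallback in steps 1 and 2, assuming a JSON config file. The `Config` and `MQTTSource` types and their field names are hypothetical stand-ins for whatever `cmd/ingestor/config.go` defines, and environment-variable sources are left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
)

// MQTTSource and Config are illustrative types, not the real ingestor structs.
type MQTTSource struct {
	Broker string   `json:"broker"`
	Topics []string `json:"topics"`
}

type Config struct {
	MQTTSources []MQTTSource `json:"mqttSources"`
}

// LoadConfig treats a missing file as "use defaults" but still reports
// unreadable or malformed files as errors.
func LoadConfig(path string) (*Config, error) {
	cfg := &Config{}
	data, err := os.ReadFile(path)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		log.Printf("no config file at %s, using defaults", path)
	case err != nil:
		return nil, fmt.Errorf("read config: %w", err)
	default:
		if err := json.Unmarshal(data, cfg); err != nil {
			return nil, fmt.Errorf("parse config: %w", err)
		}
	}
	// No sources configured: fall back to the local broker.
	if len(cfg.MQTTSources) == 0 {
		cfg.MQTTSources = []MQTTSource{{
			Broker: "mqtt://localhost:1883",
			Topics: []string{"meshcore/#"},
		}}
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig("config.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.MQTTSources[0].Broker)
}
```

Only the "file does not exist" case is treated as benign, so a typo in an existing config still fails loudly.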
### Review fixes (commit 13b89bb)
- Removed `DISABLE_CADDY` references from all docs — this env var was
never implemented in the entrypoint
- Fixed `/api/stats` example in deployment guide — showed nonexistent
fields (`mqttConnected`, `uptimeSeconds`, `activeNodes`)
- Improved MQTT connection failure message with actionable
troubleshooting hints
Closes#610
---------
Co-authored-by: you <you@example.com>
## Summary
Mobile UX fixes for the channel color picker (addresses #619).
## Changes
### Commit 1: Mobile UX improvements
- **Bottom-sheet pattern on mobile**: Color picker renders as a fixed
bottom sheet on touch devices (`@media (pointer: coarse)`) with
`env(safe-area-inset-bottom)` for notched phones
- **40px touch targets**: Swatches enlarged from default to 40×40px on
mobile
- **Native color picker hidden on touch**: `<input type="color">` is
hidden on mobile — preset swatches only
- **Scroll lock**: `document.body.style.overflow = 'hidden'` while
popover is open, restored on close
- **CSS context menu suppression**: `-webkit-touch-callout: none` and
`user-select: none` on `.live-feed-item`
- **Long-press with `passive: true`**: touchstart listener is passive to
avoid scroll jank
### Commit 2: Remove preventDefault on touchstart
- Removed `e.preventDefault()` from the touchstart handler — it was
blocking scroll initiation on feed items
- Context menu suppression handled entirely via CSS (see above)
## Desktop behavior
Unchanged. All mobile-specific styles scoped under `@media (pointer:
coarse)`. Desktop positioning logic unchanged.
## Review Status
- ✅ Rebased onto master (no conflicts)
- ✅ Self-review complete — all checklist items verified
- ✅ Tufte analysis posted as comment
---------
Co-authored-by: you <you@example.com>
## Summary
Addresses user feedback on #600 — two improvements to RF Health detail
panel charts:
### 1. Auto-scale airtime Y-axis
Previously fixed 0-100% which made low-activity nodes unreadable (e.g.
0.1% TX barely visible). Now auto-scales to the actual data range with
20% headroom (minimum 1%), matching how the noise floor chart already
works.
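The scaling rule can be written down as a small function. The chart code itself is frontend JavaScript; the Go version below only illustrates the arithmetic, and the name `airtimeYMax` is hypothetical.

```go
package main

import "fmt"

// airtimeYMax returns the upper bound of the Y-axis for a series of airtime
// percentages: the observed peak plus 20% headroom, never below 1%.
func airtimeYMax(pcts []float64) float64 {
	const headroom = 1.2 // 20% above the peak
	const minYMax = 1.0  // keep near-idle nodes readable
	peak := 0.0
	for _, v := range pcts {
		if v > peak {
			peak = v
		}
	}
	yMax := peak * headroom
	if yMax < minYMax {
		yMax = minYMax
	}
	return yMax
}

func main() {
	fmt.Println(airtimeYMax([]float64{0.05, 0.1, 0.08})) // low-activity node
	fmt.Println(airtimeYMax([]float64{2.1, 8.3}))        // busier node
}
```

With this rule a node peaking at 0.1% TX is drawn against a 1% axis instead of a 100% one.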
### 2. Hover tooltips on all chart data points
Invisible SVG `<circle>` elements with native `<title>` tooltips on
every data point across all 4 charts:
- **Noise floor**: `NF: -112.3 dBm` + UTC timestamp
- **Airtime**: `TX: 2.1%` or `RX: 8.3%` + UTC timestamp
- **Error rate**: `Err: 0.05%` + UTC timestamp
- **Battery**: `Batt: 3.85V` + UTC timestamp
Uses native browser SVG tooltips — zero dependencies, accessible, no JS
event handlers.
### Design rationale (Tufte)
- Auto-scaling increases data-ink ratio by eliminating wasted vertical
space
- Tooltips provide detail-on-demand without cluttering the chart with
labels on every point
### Spec update
Added M2 feedback improvements section to
`docs/specs/rf-health-dashboard.md`.
---------
Co-authored-by: you <you@example.com>
## Summary
Documents the lock ordering for all five mutexes in `PacketStore`
(`store.go`) to prevent future deadlocks.
## What changed
Added a comment block above the `PacketStore` struct documenting:
- All 5 mutexes (`mu`, `cacheMu`, `channelsCacheMu`, `groupedCacheMu`,
`regionObsMu`)
- What each mutex guards
- The required acquisition order (numbered 1–5)
- The nesting relationships that exist today (`cacheMu →
channelsCacheMu` in `invalidateCachesFor` and `rebuildAnalyticsCaches`)
- Confirmation that no reverse ordering exists (no deadlock risk)
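As a rough illustration of the documented nesting, the sketch below acquires `cacheMu` before `channelsCacheMu`. It assumes the order in which the mutexes are listed above is the acquisition order, uses plain `sync.Mutex` for all five, and invents the cache fields and the `invalidateCaches` method for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// PacketStore is a stripped-down stand-in for the real struct.
//
// Lock order (assumed; acquire top to bottom, never the reverse):
//  1. mu
//  2. cacheMu
//  3. channelsCacheMu
//  4. groupedCacheMu
//  5. regionObsMu
type PacketStore struct {
	mu              sync.Mutex
	cacheMu         sync.Mutex
	channelsCacheMu sync.Mutex
	groupedCacheMu  sync.Mutex
	regionObsMu     sync.Mutex

	rfCache       map[string]int // guarded by cacheMu (illustrative)
	channelsCache map[string]int // guarded by channelsCacheMu (illustrative)
}

// invalidateCaches shows the one nesting described above:
// cacheMu is held while channelsCacheMu is taken, never the other way round.
func (s *PacketStore) invalidateCaches() {
	s.cacheMu.Lock()
	defer s.cacheMu.Unlock()
	s.rfCache = map[string]int{}

	s.channelsCacheMu.Lock()
	s.channelsCache = map[string]int{}
	s.channelsCacheMu.Unlock()
}

func main() {
	s := &PacketStore{
		rfCache:       map[string]int{"a": 1},
		channelsCache: map[string]int{"b": 2},
	}
	s.invalidateCaches()
	fmt.Println(len(s.rfCache), len(s.channelsCache))
}
```

Any code path that needed `channelsCacheMu` first and `cacheMu` second would violate the order and could deadlock against this one.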
## Verification
- Grepped all lock acquisition sites to confirm no reverse nesting
exists
- `go build ./...` passes — documentation-only change
Fixes #413
---------
Co-authored-by: you <you@example.com>
## Summary
Replaces hardcoded `VSCROLL_ROW_HEIGHT = 36` and `theadHeight = 40` in
the virtual scroll logic with dynamic DOM measurement, so the values
stay correct if CSS changes.
## Changes
- `VSCROLL_ROW_HEIGHT`: measured once from the first rendered data row's
`offsetHeight` after the initial full rebuild. Falls back to 36px until
measurement occurs.
- `theadHeight`: measured from the actual `<thead>` element's
`offsetHeight` on every `renderVisibleRows` call. Falls back to 40px if
no thead is found.
- Both variables are now `let` instead of `const` to allow runtime
updates.
## Performance
Negligible performance impact: both measurements are single
`offsetHeight` reads. A read immediately after a DOM write can force a
synchronous layout, but the row height measurement runs only once
(guarded by the `_vscrollRowHeightMeasured` flag), and the thead
measurement is a single property read per scroll event.
Fixes #407
Co-authored-by: you <you@example.com>
## Summary
Fixes #420 by wiring `cacheTTL` config values to server-side cache
durations that were previously hardcoded.
## Problem
`collisionCacheTTL` was hardcoded at 60s in `store.go`. The config has
`cacheTTL.analyticsHashSizes: 3600` (1 hour) but it was never read — the
`/api/config/cache` endpoint just passed the raw map to the client
without applying values server-side.
## Changes
- **`store.go`**: Add `cacheTTLSec()` helper to safely extract duration
values from the `cacheTTL` config map. `NewPacketStore` now accepts an
optional `cacheTTL` map (variadic, backward-compatible) and wires:
- `cacheTTL.analyticsHashSizes` → `collisionCacheTTL`
- `cacheTTL.analyticsRF` → `rfCacheTTL`
- **Default changed**: `collisionCacheTTL` default raised from 60s →
3600s (1 hour). Hash collision computation is expensive and data changes
rarely — 60s was causing unnecessary recomputation.
- **`main.go`**: Pass `cfg.CacheTTL` to `NewPacketStore`.
- **Tests**: Added `TestCacheTTLFromConfig` and `TestCacheTTLDefaults`
in eviction_test.go. Updated existing `TestHashCollisionsCacheTTL` for
the new default.
## Audit of other cacheTTL values
The remaining `cacheTTL` keys (`stats`, `nodeDetail`, `nodeHealth`,
`nodeList`, `bulkHealth`, `networkStatus`, `observers`, `channels`,
`channelMessages`, `analyticsTopology`, `analyticsChannels`,
`analyticsSubpaths`, `analyticsSubpathDetail`, `nodeAnalytics`,
`nodeSearch`, `invalidationDebounce`) are **client-side only** — served
via `/api/config/cache` and consumed by the frontend. They don't have
corresponding server-side caches to wire to. The only server-side caches
(`rfCache`, `topoCache`, `hashCache`, `chanCache`, `distCache`,
`subpathCache`, `collisionCache`) all use either `rfCacheTTL` or
`collisionCacheTTL`, both now configurable.
## Complexity
O(1) config lookup at store init time. No hot-path impact.
Co-authored-by: you <you@example.com>
Closes #616
## What
Adds a **Distance** column to the neighbor table on the node detail
page.
When both the viewed node and a neighbor have GPS coordinates recorded,
the table shows the haversine distance between them (e.g. `3.2 km`).
When either node lacks GPS, the cell shows `—`.
## Changes
**Backend** (`cmd/server/neighbor_api.go`):
- Added `distance_km *float64` (omitempty) to `NeighborEntry`
- In `handleNodeNeighbors`: look up source node coords from `nodeMap`,
then for each resolved (non-ambiguous) neighbor with GPS, compute
`haversineKm` and set the field
**Frontend** (`public/nodes.js`):
- Added `Distance` column header between Last Seen and Conf
- Cell renders `X.X km` or `—` (muted) when unavailable
**Tests** (`cmd/server/neighbor_api_test.go`):
- `TestNeighborAPI_DistanceKm_WithGPS`: two nodes with real coords →
`distance_km` is positive
- `TestNeighborAPI_DistanceKm_NoGPS`: two nodes at 0,0 → `distance_km`
is nil
## Verification
Test at **https://staging.on8ar.eu** — navigate to any node detail page
and scroll to the Neighbors section. Nodes with GPS coordinates show a
distance; those without show `—`.
## Summary
Adds two config knobs for controlling backfill scope and neighbor graph
data retention, plus removes the dead synchronous backfill function.
## Changes
### Config knobs
#### `resolvedPath.backfillHours` (default: 24)
Controls how far back (in hours) the async backfill scans for
observations with NULL `resolved_path`. Transmissions with `first_seen`
older than this window are skipped, reducing startup time for instances
with large historical datasets.
#### `neighborGraph.maxAgeDays` (default: 30)
Controls the maximum age of `neighbor_edges` entries. Edges with
`last_seen` older than this are pruned from both SQLite and the
in-memory graph. Pruning runs on startup (after a 4-minute stagger) and
every 24 hours thereafter.
### Dead code removal
- Removed the synchronous `backfillResolvedPaths` function that was
replaced by the async version.
### Implementation details
- `backfillResolvedPathsAsync` now accepts a `backfillHours` parameter
and filters by `tx.FirstSeen`
- `NeighborGraph.PruneOlderThan(cutoff)` removes stale edges from the
in-memory graph
- `PruneNeighborEdges(conn, graph, maxAgeDays)` prunes both DB and
in-memory graph
- Periodic pruning ticker follows the same pattern as metrics pruning
(24h interval, staggered start)
- Graceful shutdown stops the edge prune ticker
### Config example
Both knobs added to `config.example.json` with `_comment` fields.
## Tests
- Config default/override tests for both knobs
- `TestGraphPruneOlderThan` — in-memory edge pruning
- `TestPruneNeighborEdgesDB` — SQLite + in-memory pruning together
- `TestBackfillRespectsHourWindow` — verifies old transmissions are
excluded by backfill window
---------
Co-authored-by: you <you@example.com>
## Summary
Implements M2 of channel color highlighting (#271): a right-click
context menu popover for quick-assigning colors to hash channels.
Builds on M1 (PR #607) which provides `ChannelColors.set/get/remove`
storage primitives.
## What's new
### Color picker popover (`channel-color-picker.js`)
- **Right-click** any GRP_TXT/CHAN row in the **live feed** or **packets
table** → opens a color picker popover at the click point
- **Long-press** (500ms) on mobile triggers the same popover
- **10 preset swatches** — maximally distinct, ColorBrewer-inspired
palette
- **Custom hex** — native `<input type="color">` with Apply button
- **Clear button** — removes color assignment (hidden when no color
assigned)
- **Popover positioning** — auto-adjusts to avoid viewport overflow
- **Dismiss** — click outside or Escape key
### Immediate feedback
- Assigning a color instantly re-styles all visible live feed items with
that channel
- Packets table triggers `renderVisibleRows()` via exposed
`window._packetsRenderVisible`
### Wiring
- Feed items store `_ccPkt` packet reference for channel extraction
- Picker installed via `registerPage` init hooks in both `live.js` and
`packets.js`
- Single shared popover DOM element, repositioned on each open
### Styling
- Dark card with border, matching existing CoreScope dropdown patterns
- CSS in `style.css` under `.cc-picker-*` classes
- Uses CSS variables (`--surface-1`, `--border`, `--accent`, etc.) for
theme compatibility
## Files changed
| File | Change |
|------|--------|
| `public/channel-color-picker.js` | New — popover component (IIFE, no
dependencies except `ChannelColors`) |
| `public/index.html` | Script tag for picker |
| `public/live.js` | Store `_ccPkt` on feed items, install picker on
init |
| `public/packets.js` | Install picker on init, expose
`_packetsRenderVisible` |
| `public/style.css` | Popover CSS |
| `test-channel-colors.js` | 2 new tests for picker loading and graceful
degradation |
## Testing
- All 21 channel-colors tests pass (19 M1 + 2 M2)
- All 445 frontend-helpers tests pass
- All 62 packet-filter tests pass
## Performance
No hot-path impact. The popover is a single shared DOM element created
lazily on first use. Context menu handlers use event delegation on the
feed/table containers (one listener each, not per-row). The
`refreshVisibleRows` function only iterates currently-visible DOM
elements.
Closes milestone M2 of #271.
---------
Co-authored-by: you <you@example.com>
## Summary
Implements M1 of the [channel color highlighting
spec](docs/specs/channel-color-highlighting.md) for issue #271.
Allows users to assign custom highlight colors to specific hash
channels. When a `GRP_TXT` packet arrives with an assigned channel
color, the feed row and packets table row get:
- **4px colored left border** in the assigned color
- **Subtle background tint** (color at 10% opacity)
## What's included
### `public/channel-colors.js` — Storage model
- `ChannelColors.get(channel)` → hex color or null
- `ChannelColors.set(channel, color)` — assign a color
- `ChannelColors.remove(channel)` — clear assignment
- `ChannelColors.getAll()` → all assignments
- `ChannelColors.getRowStyle(typeName, channel)` → inline CSS string for
row highlighting
- Uses `localStorage` key `live-channel-colors`
- Gracefully handles corrupt/missing localStorage data
### Feed row highlighting (`public/live.js`)
- Both `addFeedItem` (live WS) and `addFeedItemDOM` (replay/DB load)
apply channel color styles
- Reads `decoded.payload.channelName` from the packet
### Packets table highlighting (`public/packets.js`)
- `buildFlatRowHtml` and `buildGroupRowHtml` apply channel color styles
to `<tr>` elements
- Reads channel from `getParsedDecoded(p).channel`
### Tests (`test-channel-colors.js`)
- 16 unit tests covering storage CRUD, edge cases (null, empty, corrupt
data), and style generation
- Tests verify only GRP_TXT/CHAN types get coloring, other types are
unaffected
## Design decisions
- **Only GRP_TXT/CHAN packets** — other types retain default
`TYPE_COLORS` styling
- **Channel color takes priority** over default type colors for row
highlighting
- **No UI for assigning colors yet** — that's M2 (right-click context
menu + color picker)
- **Storage key abstracted** behind functions to ease future migration
if customizer rework (#288) lands
- **10% opacity tint** (`#hexcolor` + `1a` suffix) ensures readability
in both dark/light modes
## Performance
- `getRowStyle()` is O(1) — single localStorage read + JSON parse per
call
- No per-packet API calls; all data is client-side
- No impact on hot rendering paths beyond one localStorage read per row
render
Closes #271 (M1 only; further milestones in separate PRs)
---------
Co-authored-by: you <you@example.com>
## Summary
Adds collapsible/minimizable UI panels on the live map page so overlay
panels don't block map content on medium-sized screens.
Fixes #279
## Changes
### Collapsible Legend Panel (all screen sizes)
- The legend toggle button (🎨/✕) is now visible at **all** screen sizes,
not just mobile
- Clicking it smoothly collapses/expands the legend with a CSS
transition
- Collapsed state persists in `localStorage` (`live-legend-hidden`)
- Feed panel already had hide/show with localStorage — no changes needed
there
### Medium Breakpoint (768px)
New `@media (max-width: 768px)` rules for tablet/small laptop screens:
- Feed panel: 360px → 280px wide, max-height 340px → 200px
- Node detail panel: 320px → 260px wide
- Legend: smaller font (10px) and tighter padding
- Header: reduced gap and padding
- Stats/toggles: smaller font sizes
### What's NOT changed
- Mobile (≤640px): existing behavior preserved (feed/legend hidden
entirely)
- Desktop (>768px): no changes — panels render at full size as before
## Testing
- `test-packet-filter.js`: 62 passed
- `test-aging.js`: 29 passed
- `test-frontend-helpers.js`: 445 passed
---------
Co-authored-by: you <you@example.com>
The button click handler used document.getElementById() which fails on
/packet/[ID] pages because renderDetail() runs before the container is
appended to the DOM. Changed to panel.querySelector() which searches
within the detached element tree.
Fixes #601
## M2: Airtime + Channel Quality + Battery Charts
Implements M2 of #600 — server-side delta computation and three new
charts in the RF Health detail view.
### Backend Changes
**Delta computation** for cumulative counters (`tx_air_secs`,
`rx_air_secs`, `recv_errors`):
- Computes per-interval deltas between consecutive samples
- **Reboot handling:** detects counter reset (current < previous), skips
that delta, records reboot timestamp
- **Gap handling:** if time between samples > 2× interval, inserts null
(no interpolation)
- Returns `tx_airtime_pct` and `rx_airtime_pct` as percentages
(delta_secs / interval_secs × 100)
- Returns `recv_error_rate` as delta_errors / (delta_recv +
delta_errors) × 100
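The delta rules above can be sketched for a single counter as follows. Type and field names are hypothetical, and the percentage here divides by the elapsed time between the two samples, which is an assumption; the actual implementation may divide by the nominal interval instead.

```go
package main

import "fmt"

// sample is one stored row for a cumulative counter (illustrative type).
type sample struct {
	ts        int64   // unix seconds
	txAirSecs float64 // cumulative transmit airtime
}

// point is one computed output value; txAirtimePct is nil when no delta
// can be computed (gap or reboot).
type point struct {
	ts           int64
	txAirtimePct *float64
	isReboot     bool
}

func computeDeltas(samples []sample, intervalSecs int64) []point {
	out := make([]point, 0, len(samples))
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		p := point{ts: cur.ts}
		elapsed := cur.ts - prev.ts
		switch {
		case elapsed <= 0 || elapsed > 2*intervalSecs:
			// Gap: leave the value nil, no interpolation.
		case cur.txAirSecs < prev.txAirSecs:
			// Counter went backwards: treat as reboot and skip the delta.
			p.isReboot = true
		default:
			pct := (cur.txAirSecs - prev.txAirSecs) * 100 / float64(elapsed)
			p.txAirtimePct = &pct
		}
		out = append(out, p)
	}
	return out
}

func main() {
	pts := computeDeltas([]sample{
		{0, 0}, {300, 30}, {600, 10}, {1500, 40}, {1800, 70},
	}, 300)
	for _, p := range pts {
		if p.txAirtimePct != nil {
			fmt.Printf("ts=%d tx=%.1f%%\n", p.ts, *p.txAirtimePct)
		} else {
			fmt.Printf("ts=%d tx=null reboot=%v\n", p.ts, p.isReboot)
		}
	}
}
```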
**`resolution` query param** on `/api/observers/{id}/metrics`:
- `5m` (default) — raw samples
- `1h` — hourly aggregates (GROUP BY hour with AVG/MAX)
- `1d` — daily aggregates
**Schema additions:**
- `packets_sent` and `packets_recv` columns added to `observer_metrics`
(migration)
- Ingestor parses these fields from MQTT stats messages
**API response** now includes:
- `tx_airtime_pct`, `rx_airtime_pct`, `recv_error_rate` (computed
deltas)
- `reboots` array with timestamps of detected reboots
- `is_reboot_sample` flag on affected samples
### Frontend Changes
Three new charts in the RF Health detail view, stacked vertically below
noise floor:
1. **Airtime chart** — TX (red) + RX (blue) as separate SVG lines,
Y-axis 0-100%, direct labels at endpoints
2. **Error Rate chart** — `recv_error_rate` line, shown only when data
exists
3. **Battery chart** — voltage line with 3.3V low reference, shown only
when battery_mv > 0
All charts:
- Share X-axis and time range (aligned vertically)
- Reboot markers as vertical hairlines spanning all charts
- Direct labels on data (no legends)
- Resolution auto-selected: `1h` for 7d/30d ranges
- Charts hidden when no data exists
### Tests
- `TestComputeDeltas`: normal deltas, reboot detection, gap detection
- `TestGetObserverMetricsResolution`: 5m/1h/1d downsampling verification
- Updated `TestGetObserverMetrics` for new API signature
---------
Co-authored-by: you <you@example.com>
- Change RF Health detail view from bottom-of-page to a right-sliding side panel
- Grid stays visible and stable when detail is open (no layout shift)
- Click another observer updates panel in place; close button (×) dismisses
- On mobile (<640px): panel stacks below grid at full width
- Filter out observers with insufficient data (<2 sparkline points) from grid entirely
- Follows the same split-layout pattern used by the nodes page
## RF Health Dashboard — M1: Observer Metrics Storage, API & Small
Multiples Grid
Implements M1 of #600.
### What this does
Adds a complete RF health monitoring pipeline: MQTT stats ingestion →
SQLite storage → REST API → interactive dashboard with small multiples
grid.
### Backend Changes
**Ingestor (`cmd/ingestor/`)**
- New `observer_metrics` table via migration system (`_migrations`
pattern)
- Parse `tx_air_secs`, `rx_air_secs`, `recv_errors` from MQTT status
messages (same pattern as existing `noise_floor` and `battery_mv`)
- `INSERT OR REPLACE` with timestamps rounded to nearest 5-min interval
boundary (using ingestor wall clock, not observer timestamps)
- Missing fields stored as NULLs — partial data is always better than no
data
- Configurable retention pruning: `retention.metricsDays` (default 30),
runs on startup + every 24h
**Server (`cmd/server/`)**
- `GET /api/observers/{id}/metrics?since=...&until=...` — per-observer
time-series data
- `GET /api/observers/metrics/summary?window=24h` — fleet summary with
current NF, avg/max NF, sample count
- `parseWindowDuration()` supports `1h`, `24h`, `3d`, `7d`, `30d` etc.
- Server-side metrics retention pruning (same config, staggered 2min
after packet prune)
### Frontend Changes
**RF Health tab (`public/analytics.js`, `public/style.css`)**
- Small multiples grid showing all observers simultaneously — anomalies
pop out visually
- Per-observer cell: name, current NF value, battery voltage, sparkline,
avg/max stats
- NF status coloring: warning (amber) at ≥-100 dBm, critical (red) at
≥-85 dBm — text color only, no background fills
- Click any cell → expanded detail view with full noise floor line chart
- Reference lines with direct text labels (`-100 warning`, `-85
critical`) — not color bands
- Min/max points labeled directly on the chart
- Time range selector: preset buttons (1h/3h/6h/12h/24h/3d/7d/30d) +
custom from/to datetime picker
- Deep linking: `#/analytics?tab=rf-health&observer=...&range=...`
- All charts use SVG, matching existing analytics.js patterns
- Responsive: 3-4 columns on desktop, 1 on mobile
### Design Decisions (from spec)
- Labels directly on data, not in legends
- Reference lines with text labels, not color bands
- Small multiples grid, not card+accordion (Tufte: instant visual fleet
comparison)
- Ingestor wall clock for all timestamps (observer clocks may drift)
### Tests Added
**Ingestor tests:**
- `TestRoundToInterval` — 5 cases for rounding to 5-min boundaries
- `TestInsertMetrics` — basic insertion with all fields
- `TestInsertMetricsIdempotent` — INSERT OR REPLACE deduplication
- `TestInsertMetricsNullFields` — partial data with NULLs
- `TestPruneOldMetrics` — retention pruning
- `TestExtractObserverMetaNewFields` — parsing tx_air_secs, rx_air_secs,
recv_errors
**Server tests:**
- `TestGetObserverMetrics` — time-series query with since/until filters,
NULL handling
- `TestGetMetricsSummary` — fleet summary aggregation
- `TestObserverMetricsAPIEndpoints` — DB query verification
- `TestMetricsAPIEndpoints` — HTTP endpoint response shape
- `TestParseWindowDuration` — duration parsing for h/d formats
### Test Results
```
cd cmd/ingestor && go test ./... → PASS (26s)
cd cmd/server && go test ./... → PASS (5s)
```
### What's NOT in this PR (deferred to M2+)
- Server-side delta computation for cumulative counters
- Airtime charts (TX/RX percentage lines)
- Channel quality chart (recv_error_rate)
- Battery voltage chart
- Reboot detection and chart annotations
- Resolution downsampling (1h, 1d aggregates)
- Pattern detection / automated diagnosis
---------
Co-authored-by: you <you@example.com>
## Summary
- Adds a new **Prefix Tool** tab to the Analytics page (alongside Hash
Stats / Hash Issues)
- **Network Overview**: per-tier collision stats (1/2/3-byte) and a
network-size-based recommendation — collapsible, folded by default
- **Prefix Checker**: accepts a 1/2/3-byte hex prefix or full public
key; shows colliding nodes at each tier with severity badges (✅ / ⚠️ /
🔴); clicking a node navigates to its detail page
- **Prefix Generator**: picks a random collision-free prefix at the
chosen hash size; links to
[meshcore-web-keygen](https://agessaman.github.io/meshcore-web-keygen/)
with the prefix pre-filled
- **Hash Issues tab**: adds a "🔎 Check a prefix →" shortcut in the nav
- **Deep-link support**: `#/analytics?tab=prefix-tool&prefix=A3F1`
pre-fills and runs the checker; `?generate=2` pre-selects and runs the
generator
- **No new API endpoints** — 100% client-side using the existing
`/nodes` list
## Verification
Live on staging:
**https://staging.on8ar.eu/#/analytics?tab=prefix-tool**
## Test plan
- [x] Network Overview card is collapsed by default; expands on click;
stats are correct
- [x] Prefix Checker: 2-char input shows 1-byte results; 4-char shows
2-byte; 6-char shows 3-byte; 64-char pubkey shows all three tiers
- [x] Prefix Checker: invalid hex shows error; odd-length input shows
error
- [x] Prefix Generator: Generate picks an unused prefix; "Try another"
cycles; keygen link opens with prefix pre-filled
- [x] Deep link `?prefix=A3F1` pre-fills checker and scrolls to it
- [x] Deep link `?generate=2` pre-selects 2-byte and runs generator
- [x] Hash Issues tab shows "🔎 Check a prefix →" in the nav
- [x] FAQ link at bottom of generator opens correct MeshCore docs anchor
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
The error-state `<tbody>` row (shown when packet loading fails)
hardcoded `colspan="10"`, while the virtual scroll spacers and the
empty-state row both use `_getColCount()` (which reads from the actual
`<thead>` and falls back to 11). One-line fix: replace the hardcoded
value with `_getColCount()`.
Fixes #406
## Test plan
- [x] Trigger the error state (e.g. kill the backend mid-load) — error
row should span all columns with no gap on the right
- [x] `node test-packets.js` — 72 passed, 0 failed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Replace full `tbody` teardown+rebuild on every scroll frame with a
range-diff that only adds/removes the delta rows at the edges of the
visible window
- `buildFlatRowHtml` / `buildGroupRowHtml` now accept an
`entryIdx` parameter and emit `data-entry-idx` on every `<tr>` so
the diff can target rows precisely (including expanded group children)
- Full rebuild is retained for initial render and large scroll jumps
past the buffer (no range overlap)
- Also loads `packet-helpers.js` in the test sandbox, fixing 7
pre-existing test failures for the builder functions; adds 4 new tests
covering `data-entry-idx` output
Fixes #414
## Test plan
- [x] Open packets page with 500+ packets, scroll rapidly: DOM
inspector should show incremental `<tr>` adds/removes rather than full
`tbody` teardown
- [x] Expand a grouped packet, scroll away and back: expanded children
re-render correctly
- [x] Large scroll jump (jump to bottom via scrollbar): full rebuild
fires, no visual glitch
- [x] `node test-packets.js`: 72 passed, 0 failed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
Weak passphrases with no KDF stretching are the #1 practical threat.
Timestamp in plaintext block 0 serves as known-plaintext oracle for
instant key verification from a single captured packet.
Key findings:
- decode_base64() output used directly as AES key, no KDF
- Short passphrases produce <16 byte keys (reduced key space)
- No salt means global precomputed attacks work
- 3-word passphrase crackable in ~2 min on commodity GPU
Reviewed by djb and Dijkstra personas. Corrections applied:
- GPU throughput upgraded from 10^9 to 10^10 AES/sec baseline
- Oracle strengthened: bytes 4+ (type byte, sender name) also predictable
- Dictionary size assumptions made explicit
- Zipf's law caveat added (humans don't choose uniformly)
- base64 short-passphrase key truncation issue documented
Formal analysis of MeshCore's ECB encryption for channel and direct messages.
Reviewed by djb and Dijkstra expert personas through 3 revisions.
Key findings:
- Block 0 has accidental nonce (4-byte timestamp) preventing repetition
- Blocks 1+ are pure deterministic ECB with no nonce — vulnerable to
frequency analysis for repeated message content
- Partial final block attack: zero-padding reduces search space
- HMAC key reuse: AES key is first 16 bytes of HMAC key (same material)
- Recommended fix: switch to AES-128-CTR mode
## Summary
`txToMap()` previously always allocated observation sub-maps for every
packet, even though the `/api/packets` handler immediately stripped them
via `delete(p, "observations")` unless `expand=observations` was
requested. A typical page of 50 packets with ~5 observations each caused
300+ unnecessary map allocations per request.
## Changes
- **`txToMap`**: Add variadic `includeObservations bool` parameter.
Observations are only built when `true` is passed, eliminating
allocations when they'd just be discarded.
- **`PacketQuery`**: Add `ExpandObservations bool` field to thread the
caller's intent through the query pipeline.
- **`routes.go`**: Set `ExpandObservations` based on
`expand=observations` query param. Removed the post-hoc `delete(p,
"observations")` loop — observations are simply never created when not
requested.
- **Single-packet lookups** (`GetPacketByID`, `GetPacketByHash`): Always
pass `true` since detail views need observations.
- **Multi-node/analytics queries**: Default (no flag) = no observations,
matching prior behavior.
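The variadic flag keeps existing call sites compiling while letting new callers opt in. A reduced sketch, with hypothetical `StoreTx` and `Observation` fields and made-up map keys:

```go
package main

import "fmt"

// Observation and StoreTx are trimmed-down illustrative types.
type Observation struct {
	ObserverID string
	SNR        float64
}

type StoreTx struct {
	Hash         string
	Observations []Observation
}

// txToMap builds the response map. Observation sub-maps are allocated only
// when the caller passes true; omitting the argument behaves like false.
func txToMap(tx *StoreTx, includeObservations ...bool) map[string]interface{} {
	m := map[string]interface{}{
		"hash":              tx.Hash,
		"observation_count": len(tx.Observations),
	}
	if len(includeObservations) > 0 && includeObservations[0] {
		obs := make([]map[string]interface{}, 0, len(tx.Observations))
		for _, o := range tx.Observations {
			obs = append(obs, map[string]interface{}{
				"observer_id": o.ObserverID,
				"snr":         o.SNR,
			})
		}
		m["observations"] = obs
	}
	return m
}

func main() {
	tx := &StoreTx{Hash: "abc", Observations: []Observation{{"obs1", 7.5}}}
	fmt.Println(len(txToMap(tx)), len(txToMap(tx, true)))
}
```

A list handler would pass its `ExpandObservations` value through, while single-packet lookups pass `true` unconditionally.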
## Testing
- Added `TestTxToMapLazyObservations` covering all three cases: no flag,
`false`, and `true`.
- All existing tests pass (`go test ./...`).
## Perf Impact
Eliminates ~250 observation map allocations per /api/packets request (at
default page size of 50 with ~5 observations each). This is a
constant-factor improvement per request — no algorithmic complexity
change.
Fixes #374
Co-authored-by: you <you@example.com>
## Summary
Optimizes `QueryGroupedPackets()` in `store.go` to eliminate two major
inefficiencies on every grouped packet list request:
### Changes
1. **Cache `UniqueObserverCount` on `StoreTx`** — Instead of iterating
all observations to count unique observers on every query
(O(total_observations) per request), we now track unique observers at
ingest time via an `observerSet` map and pre-computed
`UniqueObserverCount` field. This is updated incrementally as
observations arrive.
2. **Defer map construction until after pagination** — Previously,
`map[string]interface{}` was built for ALL 30K+ filtered results before
sorting and paginating. Now the grouped cache stores sorted `[]*StoreTx`
pointers (lightweight), and `groupedTxsToPage()` builds maps only for
the requested page (typically 50 items). This eliminates ~30K map
allocations per cache miss.
3. **Lighter cache footprint** — The grouped cache now stores
`[]*StoreTx` instead of `*PacketResult` with pre-built maps, reducing
memory pressure and GC work.
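Changes 1 and 2 can be illustrated together. The sketch below keeps an incremental unique-observer count on each transmission and builds response maps only for the requested page; the field names, map keys, and the exact signature of `groupedTxsToPage` are assumptions.

```go
package main

import "fmt"

// StoreTx is a reduced stand-in for the real struct.
type StoreTx struct {
	Hash                string
	observerSet         map[string]struct{}
	UniqueObserverCount int
}

// addObserver updates the cached count at ingest time, so queries never
// have to walk the observation list again.
func (tx *StoreTx) addObserver(id string) {
	if tx.observerSet == nil {
		tx.observerSet = make(map[string]struct{})
	}
	if _, seen := tx.observerSet[id]; !seen {
		tx.observerSet[id] = struct{}{}
		tx.UniqueObserverCount++
	}
}

// groupedTxsToPage allocates maps only for the slice of sorted results that
// falls inside the requested page.
func groupedTxsToPage(sorted []*StoreTx, offset, limit int) []map[string]interface{} {
	if offset < 0 {
		offset = 0
	}
	if offset > len(sorted) {
		offset = len(sorted)
	}
	end := offset + limit
	if limit < 0 || end > len(sorted) {
		end = len(sorted)
	}
	page := make([]map[string]interface{}, 0, end-offset)
	for _, tx := range sorted[offset:end] {
		page = append(page, map[string]interface{}{
			"hash":           tx.Hash,
			"observer_count": tx.UniqueObserverCount,
		})
	}
	return page
}

func main() {
	tx := &StoreTx{Hash: "abc"}
	for _, id := range []string{"o1", "o2", "o1"} {
		tx.addObserver(id)
	}
	fmt.Println(tx.UniqueObserverCount, len(groupedTxsToPage([]*StoreTx{tx}, 0, 50)))
}
```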
### Complexity
- Observer counting: O(1) per query (was O(total_observations))
- Map construction: O(page_size) per query (was O(n) where n = all
filtered results)
- Sort remains O(n log n) on cache miss, but the cache (3s TTL) absorbs
repeated requests
### Testing
- `cd cmd/server && go test ./...` — all tests pass
- `cd cmd/ingestor && go build ./...` — builds clean
Fixes #370
---------
Co-authored-by: you <you@example.com>
## Summary
Replace `time.Tick()` with `time.NewTicker()` in the auto-prune
goroutine so it stops cleanly during graceful shutdown.
## Problem
`time.Tick` returns only the ticker's channel, so the caller has no
handle with which to stop it (and before Go 1.23 the underlying ticker
was never garbage collected). As a result, the prune goroutine cannot be
told to exit during graceful shutdown and keeps running past the
shutdown sequence.
## Fix
- Create a `time.NewTicker` and a done channel
- Use `select` to listen on both the ticker and done channel
- Stop the ticker and close the done channel in the shutdown path (after
`poller.Stop()`)
- Pattern matches the existing `StartEvictionTicker()` approach
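A minimal sketch of the ticker-plus-done-channel pattern described above. The `startPruner` helper and its signature are illustrative assumptions, not the actual server code; the extra `exited` channel is an addition here so the caller can wait for the goroutine to finish.

```go
package main

import (
	"fmt"
	"time"
)

// startPruner (hypothetical name) runs prune on every tick until the
// returned stop function is called. stop must be called at most once.
func startPruner(interval time.Duration, prune func()) (stop func()) {
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		for {
			select {
			case <-ticker.C:
				prune()
			case <-done:
				return
			}
		}
	}()
	return func() {
		ticker.Stop() // release the ticker's resources
		close(done)   // tell the goroutine to exit
		<-exited      // wait until it has actually returned
	}
}

// demo runs the pruner briefly and reports the run count at stop time
// and again a little later; the two should match.
func demo() (atStop, later int) {
	runs := 0
	stop := startPruner(5*time.Millisecond, func() { runs++ })
	time.Sleep(60 * time.Millisecond)
	stop()
	atStop = runs
	time.Sleep(30 * time.Millisecond)
	later = runs
	return atStop, later
}

func main() {
	atStop, later := demo()
	fmt.Println("ran:", atStop > 0, "stopped cleanly:", atStop == later)
}
```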
## Testing
- `go build ./...` — compiles cleanly
- `go test ./...` — all tests pass
Fixes #377
Co-authored-by: you <you@example.com>
## Summary
Combines the chained `filterTxSlice` calls in `filterPackets()` into a
single pass over the packet slice.
## Problem
When multiple filter parameters are specified (e.g.,
`type=4&route=1&since=...&until=...`), each filter created a new
intermediate `[]*StoreTx` slice. With N filters, this meant N separate
scans and N-1 unnecessary allocations.
## Fix
All filter predicates (type, route, observer, hash, since, until,
region, node) are pre-computed before the loop, then evaluated in a
single `filterTxSlice` call. This eliminates all intermediate
allocations.
**Preserved behavior:**
- Fast-path index lookups for hash-only and observer-only queries remain
unchanged
- Node-only fast-path via `byNode` index preserved
- All existing filter semantics maintained (same comparison operators,
same null checks)
**Complexity:** Single `O(n)` pass regardless of how many filters are
active, vs previous `O(n * k)` where k = number of active filters (each
pass is O(n) but allocates).
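As a rough sketch of the single-pass idea, assuming a simplified `packet` type and a hypothetical `filterOpts` struct (the real predicates, field names, and comparison operators in `filterPackets()` may differ):

```go
package main

import "fmt"

// packet is a simplified stand-in for StoreTx with only the filtered fields.
type packet struct {
	PayloadType int
	RouteType   int
	// Assumed: UTC timestamps in one fixed format, so string order
	// matches time order.
	FirstSeen string
}

// filterOpts holds optional filters; nil or "" means "not set".
type filterOpts struct {
	Type  *int
	Route *int
	Since string
	Until string
}

// filterOnce evaluates every active predicate in one pass and
// allocates a single result slice.
func filterOnce(all []*packet, f filterOpts) []*packet {
	var out []*packet
	for _, p := range all {
		if f.Type != nil && p.PayloadType != *f.Type {
			continue
		}
		if f.Route != nil && p.RouteType != *f.Route {
			continue
		}
		if f.Since != "" && p.FirstSeen <= f.Since {
			continue
		}
		if f.Until != "" && p.FirstSeen >= f.Until {
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	advert := 4
	all := []*packet{
		{PayloadType: 4, RouteType: 1, FirstSeen: "2024-05-02T10:00:00Z"},
		{PayloadType: 2, RouteType: 1, FirstSeen: "2024-05-02T11:00:00Z"},
		{PayloadType: 4, RouteType: 1, FirstSeen: "2024-04-01T09:00:00Z"},
	}
	got := filterOnce(all, filterOpts{Type: &advert, Since: "2024-05-01T00:00:00Z"})
	fmt.Println(len(got))
}
```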
## Testing
All existing tests pass (`cd cmd/server && go test ./...`).
Fixes #373
Co-authored-by: you <you@example.com>
## Summary
Sort `snrVals` and `rssiVals` once upfront in `computeAnalyticsRF()` and
read min/max/median directly from the sorted slices, instead of copying
and sorting per stat call.
## Changes
- Sort both slices once before computing stats (2 sorts total instead of
4+ copy+sorts)
- Read `min` from `sorted[0]`, `max` from `sorted[len-1]`, `median` from
`sorted[len/2]`
- Remove the now-unused `sortedF64` and `medianF64` helper closures
## Performance impact
With 100K+ observations, this eliminates multiple O(n log n) copy+sort
operations. Previously each call to `medianF64` did a full copy + sort,
and `minF64`/`maxF64` did O(n) scans on the unsorted array. Now: 2
in-place sorts total, O(1) lookups for min/max/median.
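The sort-once approach can be illustrated with a small helper. `sortedStats` and the `stats` struct are hypothetical names for this sketch; like the description above, it takes the median as the element at index `len/2`.

```go
package main

import (
	"fmt"
	"sort"
)

type stats struct{ Min, Max, Median float64 }

// sortedStats sorts vals in place once and reads all three statistics
// from the sorted slice. It reports false for empty input.
func sortedStats(vals []float64) (stats, bool) {
	if len(vals) == 0 {
		return stats{}, false
	}
	sort.Float64s(vals)
	return stats{
		Min:    vals[0],
		Max:    vals[len(vals)-1],
		Median: vals[len(vals)/2],
	}, true
}

func main() {
	snr := []float64{7.5, -3, 12.25, 0.5, 9}
	s, ok := sortedStats(snr)
	fmt.Println(s.Min, s.Max, s.Median, ok)
}
```

Note that sorting in place reorders the caller's slice; that is only acceptable if nothing later depends on the original order.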
Fixes #366
Co-authored-by: you <you@example.com>
## Summary
`EvictStale()` was doing O(n) linear scans per evicted item to remove
from secondary indexes (`byObserver`, `byPayloadType`, `byNode`).
Evicting 1000 packets from an observer with 50K observations meant 1000
× 50K = 50M comparisons — all under a write lock.
## Fix
Replace per-item removal with batch single-pass filtering:
1. **Collect phase**: Walk evicted packets once, building sets of
evicted tx IDs, observation IDs, and affected index keys
2. **Filter phase**: For each affected index slice, do a single pass
keeping only non-evicted entries
**Before**: O(evicted_count × index_slice_size) per index — quadratic in
practice
**After**: O(evicted_count + index_slice_size) per affected key — linear
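A sketch of the collect-then-filter pattern for a single secondary index. The `packet` type and `removeEvicted` function are illustrative assumptions; the real `EvictStale()` handles several indexes and observation IDs as well.

```go
package main

import "fmt"

// packet is a simplified stand-in with one index key (its observer).
type packet struct {
	ID       int
	Observer string
}

// removeEvicted drops evicted packets from the byObserver index using
// one pass per affected key instead of one scan per evicted packet.
func removeEvicted(byObserver map[string][]*packet, evicted []*packet) {
	// Collect phase: evicted IDs and the index keys they touch.
	evictedIDs := make(map[int]struct{}, len(evicted))
	affected := make(map[string]struct{})
	for _, p := range evicted {
		evictedIDs[p.ID] = struct{}{}
		affected[p.Observer] = struct{}{}
	}
	// Filter phase: rewrite each affected slice in place.
	for key := range affected {
		old := byObserver[key]
		kept := old[:0]
		for _, p := range old {
			if _, gone := evictedIDs[p.ID]; !gone {
				kept = append(kept, p)
			}
		}
		// Clear the tail so evicted packets can be garbage collected.
		for i := len(kept); i < len(old); i++ {
			old[i] = nil
		}
		if len(kept) == 0 {
			delete(byObserver, key)
		} else {
			byObserver[key] = kept
		}
	}
}

func main() {
	a, b, c := &packet{1, "obs-1"}, &packet{2, "obs-1"}, &packet{3, "obs-2"}
	index := map[string][]*packet{"obs-1": {a, b}, "obs-2": {c}}
	removeEvicted(index, []*packet{a, c})
	fmt.Println(len(index["obs-1"]), len(index["obs-2"]))
}
```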
## Changes
- `cmd/server/store.go`: Restructured `EvictStale()` eviction loop into
collect + batch-filter pattern
## Testing
- All existing tests pass (`cd cmd/server && go test ./...`)
Fixes #368
Co-authored-by: you <you@example.com>
## Summary
`QueryMultiNodePackets()` was scanning ALL packets with
`strings.Contains` on JSON blobs — O(packets × pubkeys × json_length).
With 30K+ packets and multiple pubkeys, this caused noticeable latency
on `/api/packets?nodes=...`.
## Fix
Replace the full scan with lookups into the existing `byNode` index,
which already maps pubkeys to their transmissions. Merge results with
hash-based deduplication, then apply time filters.
**Before:** O(N × P × J) where N=all packets, P=pubkeys, J=avg JSON
length
**After:** O(M × P) where M=packets per pubkey (typically small), plus
O(R log R) sort for pagination correctness
Results are sorted by `FirstSeen` after merging to maintain the
oldest-first ordering expected by the pagination logic.
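The index-merge approach might look roughly like this. The `packet` type, the `multiNode` function, and the exact time comparison are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// packet is a simplified stand-in for a stored transmission.
type packet struct {
	Hash      string
	FirstSeen string // assumed: fixed-format UTC timestamps
}

// multiNode merges the per-pubkey index slices, dropping duplicates by
// hash, applies an optional "since" filter, and sorts oldest first.
func multiNode(byNode map[string][]*packet, pubkeys []string, since string) []*packet {
	seen := make(map[string]struct{})
	var out []*packet
	for _, pk := range pubkeys {
		for _, p := range byNode[pk] {
			if _, dup := seen[p.Hash]; dup {
				continue
			}
			seen[p.Hash] = struct{}{}
			if since != "" && p.FirstSeen <= since {
				continue
			}
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].FirstSeen < out[j].FirstSeen
	})
	return out
}

func main() {
	shared := &packet{Hash: "h2", FirstSeen: "2024-05-02T00:00:00Z"}
	byNode := map[string][]*packet{
		"nodeA": {{Hash: "h3", FirstSeen: "2024-05-03T00:00:00Z"}, shared},
		"nodeB": {shared, {Hash: "h1", FirstSeen: "2024-05-01T00:00:00Z"}},
	}
	for _, p := range multiNode(byNode, []string{"nodeA", "nodeB"}, "") {
		fmt.Println(p.Hash, p.FirstSeen)
	}
}
```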
Fixes #357
Co-authored-by: you <you@example.com>
## Problem
`GetNodeAnalytics()` in `store.go` scans ALL 30K+ packets doing
`strings.Contains` on every JSON blob when the node has a name, then
filters by time range *after* the full scan. This is `O(packets ×
json_length)` on every `/api/nodes/{pubkey}/analytics` request.
## Fix
Move the `fromISO` time check inside the scan loop so old packets are
skipped **before** the expensive `strings.Contains` matching. For the
non-name path (indexed-only), the time filter is also applied inline,
eliminating the separate `allPkts` intermediate slice.
### Before
1. Scan all packets → collect matches (including old ones) → `allPkts`
2. Filter `allPkts` by time → `packets`
### After
1. Scan packets, skip `tx.FirstSeen <= fromISO` immediately → `packets`
This avoids `strings.Contains` calls on packets outside the requested
time window (typically 7 days out of months of data).
## Complexity
- **Before:** `O(total_packets × avg_json_length)` for name matching
- **After:** `O(recent_packets × avg_json_length)` — only packets within
the time window are string-matched
## Testing
- `cd cmd/server && go test ./...` — all tests pass
Fixes #367
Co-authored-by: you <you@example.com>
## Summary
Consolidates the 4 parallel `/api/analytics/subpaths` calls in the Route
Patterns tab into a single `/api/analytics/subpaths-bulk` endpoint,
eliminating 3 redundant server-side scans of the subpath index on cache
miss.
## Changes
### Backend (`cmd/server/routes.go`, `cmd/server/store.go`)
- New `GET
/api/analytics/subpaths-bulk?groups=2-2:50,3-3:30,4-4:20,5-8:15`
endpoint
- Groups format: `minLen-maxLen:limit` comma-separated
- `GetAnalyticsSubpathsBulk()` iterates `spIndex` once, bucketing
entries into per-group accumulators by hop length
- Hop name resolution is done once per raw hop and shared across groups
- Results are cached per-group for compatibility with existing
single-key cache lookups
- Region-filtered queries fall back to individual
`GetAnalyticsSubpaths()` calls (region filtering requires
per-transmission observer checks)
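For reference, parsing the `minLen-maxLen:limit` groups format could be sketched as below. `parseGroups` and `subpathGroup` are hypothetical names, and the validation rules (positive values, `minLen <= maxLen`) are assumptions rather than the handler's documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// subpathGroup is one "minLen-maxLen:limit" entry.
type subpathGroup struct{ MinLen, MaxLen, Limit int }

// parseGroups parses a comma-separated list such as
// "2-2:50,3-3:30,4-4:20,5-8:15".
func parseGroups(raw string) ([]subpathGroup, error) {
	if raw == "" {
		return nil, fmt.Errorf("groups parameter is required")
	}
	var groups []subpathGroup
	for _, part := range strings.Split(raw, ",") {
		rangePart, limitPart, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("invalid group %q: missing ':'", part)
		}
		minPart, maxPart, ok := strings.Cut(rangePart, "-")
		if !ok {
			return nil, fmt.Errorf("invalid group %q: missing '-'", part)
		}
		minLen, err1 := strconv.Atoi(minPart)
		maxLen, err2 := strconv.Atoi(maxPart)
		limit, err3 := strconv.Atoi(limitPart)
		if err1 != nil || err2 != nil || err3 != nil ||
			minLen < 1 || maxLen < minLen || limit < 1 {
			return nil, fmt.Errorf("invalid group %q", part)
		}
		groups = append(groups, subpathGroup{minLen, maxLen, limit})
	}
	return groups, nil
}

func main() {
	groups, err := parseGroups("2-2:50,3-3:30,4-4:20,5-8:15")
	fmt.Println(groups, err)
}
```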
### Frontend (`public/analytics.js`)
- `renderSubpaths()` now makes 1 API call instead of 4
- Response shape: `{ results: [{ subpaths, totalPaths }, ...] }` —
destructured into the same `[d2, d3, d4, d5]` variables
### Tests (`cmd/server/routes_test.go`)
- `TestAnalyticsSubpathsBulk`: validates 3-group response shape, missing
params error, invalid format error
## Performance
- **Before:** 4 API calls → 4 scans of `spIndex` + 4× hop resolution on
cache miss
- **After:** 1 API call → 1 scan of `spIndex` + 1× hop resolution
(shared cache)
- Cache miss cost reduced by ~75% for this tab
- No change on cache hit (individual group caching still works)
Fixes #398
Co-authored-by: you <you@example.com>
## Summary
Fixes the N+1 API call pattern when changing observation sort mode on
the packets page. Previously, switching sort to Path or Time fired
individual `/api/packets/{hash}` requests for **every**
multi-observation group without cached children — potentially 100+
concurrent requests.
## Changes
### Backend: Batch observations endpoint
- **New endpoint:** `POST /api/packets/observations` accepts `{"hashes":
["h1", "h2", ...]}` and returns all observations keyed by hash in a
single response
- Capped at 200 hashes per request to prevent abuse
- 4 test cases covering empty input, invalid JSON, too-many-hashes, and
valid requests
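A sketch of how the request body for the batch endpoint might be validated. `parseBatch`, `validateHashes`, and the choice to reject an empty list are assumptions; only the `{"hashes": [...]}` shape and the 200-hash cap come from the description above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// maxHashes mirrors the 200-hash cap described above.
const maxHashes = 200

type batchRequest struct {
	Hashes []string `json:"hashes"`
}

// validateHashes enforces the size limits. Rejecting an empty list is
// an assumption of this sketch.
func validateHashes(hashes []string) error {
	if len(hashes) == 0 {
		return errors.New("hashes must not be empty")
	}
	if len(hashes) > maxHashes {
		return fmt.Errorf("too many hashes: %d (max %d)", len(hashes), maxHashes)
	}
	return nil
}

// parseBatch decodes the JSON body and validates it.
func parseBatch(body []byte) ([]string, error) {
	var req batchRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if err := validateHashes(req.Hashes); err != nil {
		return nil, err
	}
	return req.Hashes, nil
}

func main() {
	hashes, err := parseBatch([]byte(`{"hashes": ["h1", "h2"]}`))
	fmt.Println(hashes, err)
}
```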
### Frontend: Use batch endpoint
- `packets.js` sort change handler now collects all hashes needing
observation data and sends a single POST request instead of N individual
GETs
- Same behavior, single round-trip
## Performance
- **Before:** Changing sort with 100 visible groups → 100 concurrent API
requests, browser connection queueing (6 per host), several seconds of
lag
- **After:** Single POST request regardless of group count, response
time proportional to store lookup (sub-millisecond per hash in memory)
Fixes #389
---------
Co-authored-by: you <you@example.com>
## Summary
Coalesce WS-triggered `renderTableRows()` calls using
`requestAnimationFrame` instead of `setTimeout` debouncing.
Fixes #396
## Problem
During high WebSocket throughput, multiple WS batches could each trigger
a `renderTableRows()` call via `setTimeout(..., 200)`. With rapid
batches, this caused the 50K-row table to be fully rebuilt every few
hundred milliseconds, causing UI jank.
## Solution
Replace the `setTimeout`-based debounce with a `requestAnimationFrame`
coalescing pattern:
1. **`scheduleWSRender()`** — sets a dirty flag and schedules a single
rAF callback
2. **Dirty flag** — multiple WS batches within the same frame just set
the flag; only one render fires
3. **Cleanup** — `destroy()` cancels any pending rAF and resets the
dirty flag
This ensures at most **one `renderTableRows()` per animation frame**
(~16ms), regardless of how many WS batches arrive.
## Performance justification
- **Before:** Each WS batch → `setTimeout(renderTableRows, 200)` — N
batches in <200ms = N renders
- **After:** N batches in one frame → 1 render on next rAF (~16ms)
- Worst case goes from one render per WS batch to at most one render per
animation frame (about 60 per second on a 60 Hz display)
## Changes
- `public/packets.js`: Add `scheduleWSRender()` with rAF + dirty flag;
replace setTimeout in WS handler; clean up in `destroy()`
- `test-frontend-helpers.js`: Update tests to verify rAF coalescing
pattern instead of setTimeout debounce
## Testing
- All existing tests pass (`npm test` — 0 failures)
- Updated 2 test cases to verify new rAF coalescing behavior
Co-authored-by: you <you@example.com>
## Summary
Compress `public/og-image.png` from **1,159,050 bytes (1.1MB)** to
**234,899 bytes (235KB)** — an **80% reduction**.
## What Changed
- Applied lossy PNG quantization via `pngquant` (quality 45-65, speed 1)
- Image dimensions unchanged: 1200×630px (standard OG image size)
- Visual quality remains suitable for social media previews
## Why
A 1.1MB OpenGraph image is excessive. Typical OG images are 50-200KB.
This reduces deployment size and Git repo bloat without affecting
functionality (browsers don't preload OG images).
## Testing
- Unit tests pass (`npm run test:unit`)
- No code changes — image-only commit
- `index.html` reference unchanged (`<meta property="og:image"
content="/og-image.png">`)
Fixes #397
Co-authored-by: you <you@example.com>
## Summary
Reduces the analytics nodes tab from 3 parallel API calls to 2 by
computing network status (active/degraded/silent counts) client-side
instead of fetching from `/nodes/network-status`.
## What Changed
**`public/analytics.js` — `renderNodesTab()`:**
- Removed the `/nodes/network-status` API call from the `Promise.all`
batch
- Added client-side computation of active/degraded/silent counts using
the shared `getHealthThresholds()` function from `roles.js`
- Uses `nodesResp.total` and `nodesResp.counts` (already returned by
`/nodes` endpoint) for total node count and role breakdown
## Why This Works
The `/nodes` response already includes:
- `total` — count of all matching nodes (server-computed across full DB)
- `counts` — role counts across all nodes (from `GetAllRoleCounts()`)
- Per-node `last_seen`/`last_heard` timestamps
The `getHealthThresholds()` function in `roles.js` provides the same
degraded/silent thresholds used server-side, so client-side status
computation produces equivalent results for the loaded node set.
## Performance
- **Before:** 3 parallel API calls (`/nodes`, `/nodes/bulk-health`,
`/nodes/network-status`)
- **After:** 2 parallel API calls (`/nodes`, `/nodes/bulk-health`)
- Network status computation is O(n) over the 200 loaded nodes —
negligible client-side cost
- The `/nodes/network-status` endpoint scanned ALL nodes in the DB on
every call; this eliminates that server-side work entirely
## Testing
- All frontend helper tests pass (445/445)
- All packet filter tests pass (62/62)
- All aging tests pass (29/29)
- All Go backend tests pass
Fixes #392
---------
Co-authored-by: you <you@example.com>
## Summary
Eliminates visible marker flicker on zoom/resize events in the map page
when displaying 500+ nodes.
## Problem
`renderMarkers()` was called on every `zoomend` and `resize` event,
which did `markerLayer.clearLayers()` followed by a full rebuild of all
markers. With many nodes, this caused a visible flash where all markers
disappeared briefly before being re-added.
## Solution
Instead of rebuilding all markers from scratch on zoom/resize:
1. **Store Leaflet layer references** on marker data objects
(`_leafletMarker`, `_leafletLine`, `_leafletDot`) during the initial
full render
2. **Add `_repositionMarkers()`** — re-runs `deconflictLabels()` at the
new zoom level and updates existing marker positions via
`setLatLng()`/`setLatLngs()` without clearing the layer group
3. **Debounce zoom/resize handlers** (150ms) to coalesce rapid events
during animated zooms
4. **Dynamically manage offset indicators** — adds/removes deconfliction
offset lines and dots as positions change at different zoom levels
Full `renderMarkers()` is still called for filter changes, data updates,
and theme changes — only zoom/resize uses the lightweight repositioning
path.
## Complexity
- `_repositionMarkers()`: O(n) — single pass over stored marker data
- `deconflictLabels()`: O(n × k) where k is max spiral offsets (48) —
unchanged
- No new API calls, no DOM rebuilds
Fixes #393
---------
Co-authored-by: you <you@example.com>
## Summary
`replayRecent()` in `live.js` fetched observation details for 8 packet
groups **sequentially** — each `await fetch()` waited for the previous
to complete before starting the next.
## Change
Replaced the sequential `for` loop with `Promise.all()` to fetch all 8
detail API calls **concurrently**. The mapping from results to live
packets is unchanged.
**Before:** 8 sequential fetches (total time ≈ sum of all request
durations)
**After:** 8 parallel fetches (total time ≈ max of all request
durations)
## Notes
- `replayRecent()` is currently disabled (commented out at line 856), so
this is dormant code — no runtime risk
- No behavioral change: same data mapping, same rendering, same VCR
buffer population
- All existing tests pass
Fixes #394
---------
Co-authored-by: you <you@example.com>
## Summary
Eliminates the N+1 API call storm when toggling off "Group by Hash" in
the packets table.
## Problem
When ungrouped mode was active, `loadPackets()` fired individual
`/api/packets/{hash}` requests for every multi-observation packet. With
200+ multi-obs packets, this created 200+ parallel HTTP requests —
overwhelming both browser connection limits and the server.
## Fix
The server already supports `expand=observations` on the `/api/packets`
endpoint, which returns observations inline. Instead of:
1. Always fetching grouped (`groupByHash=true`)
2. Then N+1 fetching each packet's children individually
We now:
1. Fetch grouped when grouped mode is active (`groupByHash=true`)
2. Fetch with `expand=observations` when ungrouped — **single API call**
3. Flatten observations client-side
**Result: 200+ API calls → 1 API call.**
## Changes
- `public/packets.js`: Replaced N+1 observation fetching loop with
single `expand=observations` query parameter, flatten inline
observations client-side.
## Testing
- All frontend tests pass (packet-filter: 62/62, frontend-helpers:
445/445)
- All Go backend tests pass
Fixes #382
Co-authored-by: you <you@example.com>
## Summary
`handleObservers()` in `routes.go` was calling `GetNodeLocations()`
which fetches ALL nodes from the DB just to match ~10 observer IDs
against node public keys. With 500+ nodes this is wasteful.
## Changes
- **`db.go`**: Added `GetNodeLocationsByKeys(keys []string)` — queries
only the rows matching the given public keys using a parameterized
`WHERE LOWER(public_key) IN (?, ?, ...)` clause.
- **`routes.go`**: `handleObservers` now collects observer IDs and calls
the targeted method instead of the full-table scan.
- **`coverage_test.go`**: Added `TestGetNodeLocationsByKeys` covering
known key, empty keys, and unknown key cases.
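Building the parameterized `IN (?, ?, ...)` clause could look roughly like this. `buildKeysQuery` is a hypothetical helper, and the selected column names (`lat`, `lon`) are assumptions; only the `WHERE LOWER(public_key) IN (...)` shape comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildKeysQuery returns a query with one "?" placeholder per key plus
// the matching argument list. Keys are lowercased to match the
// LOWER(public_key) comparison. It returns "" for an empty key list so
// the caller can skip the query entirely.
func buildKeysQuery(keys []string) (string, []interface{}) {
	if len(keys) == 0 {
		return "", nil
	}
	placeholders := make([]string, len(keys))
	args := make([]interface{}, len(keys))
	for i, k := range keys {
		placeholders[i] = "?"
		args[i] = strings.ToLower(k)
	}
	// Column names other than public_key are assumed for illustration.
	query := "SELECT public_key, lat, lon FROM nodes WHERE LOWER(public_key) IN (" +
		strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	query, args := buildKeysQuery([]string{"AB12", "cd34"})
	fmt.Println(query)
	fmt.Println(args...)
}
```

Because values are passed as arguments rather than concatenated into the SQL text, the observer IDs cannot alter the query structure.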
## Performance
With ~10 observers and 500+ nodes, the query goes from scanning all 500
rows to fetching only ~10. The original `GetNodeLocations()` is
preserved for any other callers.
Fixes #378
Co-authored-by: you <you@example.com>
## Summary
Skip `updateTimeline()` canvas redraws in `bufferPacket()` when the
browser tab is hidden (`_tabHidden === true`). Instead, batch-update the
timeline once when the tab becomes visible again via the
`visibilitychange` handler.
Fixes #385
## What Changed
**`public/live.js`** — two surgical edits:
1. **`bufferPacket()`**: Removed `updateTimeline()` call from the
`_tabHidden` early-return path. When the tab is backgrounded, packets
are still buffered (for VCR) but no canvas work is done.
2. **`visibilitychange` handler**: Added `updateTimeline()` call when
the tab is restored, so the timeline catches up in a single repaint
instead of N repaints (one per buffered packet).
## Performance Impact
At 5+ packets/sec with a backgrounded tab, this eliminates continuous
canvas redraws (`updateTimeline()` calls `ctx.clearRect` + full canvas
redraw + `updateTimelinePlayhead()`) that are invisible to the user. CPU
usage drops to near-zero for timeline rendering while backgrounded.
## Tests
All existing tests pass:
- `test-packet-filter.js` — 62 passed
- `test-aging.js` — 29 passed
- `test-frontend-helpers.js` — 445 passed
Co-authored-by: you <you@example.com>
## Summary
Replace N+1 per-hop DB queries in `handleResolveHops` with O(1) lookups
against the in-memory prefix map that already exists in the packet
store.
## Problem
Each hop in the `resolve-hops` API triggered a separate `SELECT ... LIKE
?` query against the nodes table. With 10 hops, that's 10 DB round-trips
— unnecessary when `getCachedNodesAndPM()` already maintains an
in-memory prefix map that can resolve hops instantly.
## Changes
- **routes.go**: Replace the per-hop DB query loop with `pm.m[hopLower]`
lookups from the prefix map. Convert `nodeInfo` → `HopCandidate` inline.
Remove unused `rows`/`sql.Scan` code.
- **store.go**: Add `InvalidateNodeCache()` method to force prefix map
rebuild (needed by tests that insert nodes after store initialization).
- **routes_test.go**: Give `TestResolveHopsAmbiguous` a proper store so
hops resolve via the prefix map.
- **resolve_context_test.go**: Call `InvalidateNodeCache()` after
inserting test nodes. Fix confidence assertion — with GPS candidates and
no affinity context, `resolveWithContext` correctly returns
`gps_preference` (previously masked because the prefix map didn't have
the test nodes).
## Complexity
O(1) per hop lookup via hash map vs O(n) DB scan per hop. No hot-path
impact — this endpoint is called on-demand, not in a render loop.
Fixes #369
---------
Co-authored-by: you <you@example.com>
## Summary
Replace full `buildDistanceIndex()` rebuild with incremental
`removeTxFromDistanceIndex`/`addTxToDistanceIndex` for only the
transmissions whose paths actually changed during
`IngestNewObservations`.
## Problem
When any transmission's best path changed during observation ingestion,
the **entire distance index was rebuilt** — iterating all 30K+ packets,
resolving all hops, and computing haversine distances. This
`O(total_packets × avg_hops)` operation ran under a write lock, blocking
all API readers.
A 30-second debounce (`distRebuildInterval`) was added in #557 to
mitigate this, but it only delayed the pain — the full rebuild still
happened, just less frequently.
## Fix
- Added `removeTxFromDistanceIndex(tx)` — filters out all
`distHopRecord` and `distPathRecord` entries for a specific transmission
- Added `addTxToDistanceIndex(tx)` — computes and appends new distance
records for a single transmission
- In `IngestNewObservations`, changed path-change handling to call
remove+add for each affected tx instead of marking dirty and waiting for
a full rebuild
- Removed `distDirty`, `distLast`, and `distRebuildInterval` since
incremental updates are cheap enough to apply immediately
## Complexity
- **Before:** `O(total_packets × avg_hops)` per rebuild (30K+ packets)
- **After:** `O(changed_txs × avg_hops + total_dist_records)` — the
remove is a linear scan of the distance slices, but only for affected
txs; the add is `O(hops)` per changed tx
The remove scan over `distHops`/`distPaths` slices is linear in slice
length, but this is still far cheaper than the full rebuild which also
does JSON parsing, hop resolution, and haversine math for every packet.
## Tests
- Updated `TestDistanceRebuildDebounce` →
`TestDistanceIncrementalUpdate` to verify incremental behavior and check
for duplicate path records
- All existing tests pass (`go test ./...` in both `cmd/server` and
`cmd/ingestor`)
Fixes #365
---------
Co-authored-by: you <you@example.com>
## Summary
Cache `resolveRegionObservers()` results with a 30-second TTL to
eliminate repeated database queries for region→observer ID mappings.
## Problem
`resolveRegionObservers()` queried the database on every call despite
the observers table changing infrequently (~20 rows). It's called from
10+ hot paths including `filterPackets()`, `GetChannels()`, and multiple
analytics compute functions. When analytics caches are cold, parallel
requests each hit the DB independently.
## Solution
- Added a dedicated `regionObsMu` mutex + `regionObsCache` map with 30s
TTL
- Uses a separate mutex (not `s.mu`) to avoid deadlocks — callers
already hold `s.mu.RLock()`
- Cache is lazily populated per-region and fully invalidated after TTL
expires
- Follows the same pattern as `getCachedNodesAndPM()` (30s TTL,
on-demand rebuild)
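A minimal sketch of a TTL cache guarded by its own mutex, in the spirit of the solution above. The `regionCache` type and its fields are hypothetical; the real code keeps these fields on the store and queries the database in place of the `fetch` callback.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// regionCache caches region -> observer IDs behind a dedicated mutex,
// so callers that already hold another lock cannot deadlock on it.
type regionCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	filled  time.Time
	entries map[string][]string
	fetch   func(region string) []string // stands in for the DB query
}

func newRegionCache(ttl time.Duration, fetch func(string) []string) *regionCache {
	return &regionCache{ttl: ttl, fetch: fetch}
}

// get returns the cached IDs for a region, fetching on first use. The
// whole cache is dropped once the TTL has elapsed. Unknown regions are
// cached as nil so they do not hit the database repeatedly.
func (c *regionCache) get(region string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil || time.Since(c.filled) > c.ttl {
		c.entries = make(map[string][]string)
		c.filled = time.Now()
	}
	if ids, ok := c.entries[region]; ok {
		return ids
	}
	ids := c.fetch(region)
	c.entries[region] = ids
	return ids
}

func main() {
	queries := 0
	cache := newRegionCache(30*time.Second, func(region string) []string {
		queries++
		if region == "SJC" {
			return []string{"obs-1", "obs-2"}
		}
		return nil
	})
	cache.get("SJC")
	cache.get("SJC")
	cache.get("nowhere")
	cache.get("nowhere")
	fmt.Println("database queries:", queries)
}
```

Holding the mutex during `fetch` serializes concurrent misses, which keeps the sketch simple; with roughly 20 rows in the table that trade-off seems reasonable.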
## Changes
- **`cmd/server/store.go`**: Added `regionObsMu`, `regionObsCache`,
`regionObsCacheTime` fields; rewrote `resolveRegionObservers()` to check
cache first; added `fetchAndCacheRegionObs()` helper
- **`cmd/server/coverage_test.go`**: Added
`TestResolveRegionObserversCaching` — verifies cache population, cache
hits, and nil handling for unknown regions
## Testing
- All existing Go tests pass (`go test ./...`)
- New test verifies caching behavior (population, hits, nil for unknown
regions)
Fixes #362
---------
Co-authored-by: you <you@example.com>
## Summary
`GetStoreStats()` ran 5 sequential DB queries on every call. This
combines them into **2 concurrent queries**:
1. **Node/observer counts** — single query using subqueries: `SELECT
(SELECT COUNT(*) FROM nodes WHERE ...), (SELECT COUNT(*) FROM nodes),
(SELECT COUNT(*) FROM observers)`
2. **Observation counts** — single query using conditional aggregation:
`SUM(CASE WHEN timestamp > ? THEN 1 ELSE 0 END)` scoped to the 24h
window, avoiding a full table scan for the 1h count
Both queries run concurrently via goroutines + `sync.WaitGroup`.
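The concurrency pattern can be sketched with two stand-in query functions. `loadStats`, the count structs, and `errQueryFailed` are hypothetical; in the real code each callback would be a combined SQL query.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type nodeCounts struct{ Active, Total, Observers int }
type obsCounts struct{ LastHour, LastDay int }

type storeStats struct {
	Nodes nodeCounts
	Obs   obsCounts
}

// errQueryFailed is a sentinel used only by this sketch.
var errQueryFailed = errors.New("query failed")

// loadStats runs both queries concurrently and propagates the first
// error instead of silently ignoring it.
func loadStats(
	queryNodes func() (nodeCounts, error),
	queryObs func() (obsCounts, error),
) (storeStats, error) {
	var (
		wg              sync.WaitGroup
		s               storeStats
		nodeErr, obsErr error
	)
	wg.Add(2)
	go func() {
		defer wg.Done()
		s.Nodes, nodeErr = queryNodes()
	}()
	go func() {
		defer wg.Done()
		s.Obs, obsErr = queryObs()
	}()
	wg.Wait() // each goroutine wrote to its own fields, so no race
	if nodeErr != nil {
		return storeStats{}, nodeErr
	}
	if obsErr != nil {
		return storeStats{}, obsErr
	}
	return s, nil
}

func main() {
	s, err := loadStats(
		func() (nodeCounts, error) { return nodeCounts{40, 500, 10}, nil },
		func() (obsCounts, error) { return obsCounts{120, 2400}, nil },
	)
	fmt.Println(s, err)
	_, err = loadStats(
		func() (nodeCounts, error) { return nodeCounts{}, errQueryFailed },
		func() (obsCounts, error) { return obsCounts{}, nil },
	)
	fmt.Println(err)
}
```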
## What changed
- `cmd/server/store.go`: Rewrote `GetStoreStats()` — 5 sequential
`QueryRow` calls → 2 concurrent combined queries
- Error handling now propagates query errors instead of silently
ignoring them
## Performance justification
- **Before:** 5 sequential round-trips to SQLite, with 2 potentially
expensive `COUNT(*)` scans on the `observations` table
- **After:** 2 concurrent round-trips; the observation query scans the
24h window once instead of separately scanning for 1h and 24h
- The 10s cache (`statsTTL`) remains, so this fires at most once per 10s
— but when it does fire, it's ~2.5x fewer round-trips and the
observation scan is halved
## Tests
- `go test ./...` passes for both `cmd/server` and `cmd/ingestor`
Fixes #363
---------
Co-authored-by: you <you@example.com>
## Summary
Extracts a shared `fetchNodeDetail(pubkey)` helper in `nodes.js` that
fetches both `/nodes/{pubkey}` and `/nodes/{pubkey}/health` in parallel.
Both `selectNode()` (side panel) and `loadFullNode()` (full-screen view)
now call this single function instead of duplicating the fetch logic.
## What Changed
- **New:** `fetchNodeDetail(pubkey)` — shared async function that
returns node data with `.healthData` attached
- **Modified:** `loadFullNode()` — uses `fetchNodeDetail()` instead of
inline `Promise.all`
- **Modified:** `selectNode()` — uses `fetchNodeDetail()` instead of
inline `Promise.all`
## Why
The duplicate `api()` calls weren't a major perf issue (TTL caching
mitigates most cases), but the duplicated logic was unnecessary tech
debt. On mobile, `selectNode()` redirects to `loadFullNode()` via hash
change, so the two code paths could fire sequentially with expired
cache.
## Testing
- All frontend helper tests pass (445/445)
- All packet filter tests pass (62/62)
- All aging tests pass (29/29)
- No behavioral change — only code structure improvement
Fixes #391
Co-authored-by: you <you@example.com>
## Summary
`GetSubpathDetail()` iterated ALL packets to find those containing a
specific subpath — `O(packets × hops × subpath_length)`. With 30K+
packets this caused user-visible latency on every subpath detail click.
## Changes
### `cmd/server/store.go`
- Added `spTxIndex map[string][]*StoreTx` alongside existing `spIndex` —
tracks which transmissions contain each subpath key
- Extended `addTxToSubpathIndexFull()` and
`removeTxFromSubpathIndexFull()` to maintain both indexes simultaneously
- Original `addTxToSubpathIndex()`/`removeTxFromSubpathIndex()` wrappers
preserved for backward compatibility
- `buildSubpathIndex()` now populates both `spIndex` and `spTxIndex`
during `Load()`
- All incremental update sites (ingest, path change, eviction) use the
`Full` variants
- `GetSubpathDetail()` rewritten: direct `O(1)` map lookup on
`spTxIndex[key]` instead of scanning all packets
### `cmd/server/coverage_test.go`
- Added `TestSubpathTxIndexPopulated`: verifies `spTxIndex` is
populated, counts match `spIndex`, and `GetSubpathDetail` returns
correct results for both existing and non-existent subpaths
## Complexity
- **Before:** `O(total_packets × avg_hops × subpath_length)` per request
- **After:** `O(matched_txs)` per request (direct map lookup)
## Tests
All tests pass: `cmd/server` (4.6s), `cmd/ingestor` (25.6s)
Fixes #358
---------
Co-authored-by: you <you@example.com>
## Summary
`buildPrefixMap()` was generating map entries for every prefix length
from 2 to `len(pubkey)` (up to 64 chars), creating ~31 entries per node.
With 500 nodes that's ~15K map entries; with 1K+ nodes it balloons to
31K+.
## Changes
**`cmd/server/store.go`:**
- Added `maxPrefixLen = 8` constant — MeshCore path hops use 2–6 char
prefixes, 8 gives headroom
- Capped the prefix generation loop at `maxPrefixLen` instead of
`len(pk)`
- Added full pubkey as a separate map entry when key is longer than
`maxPrefixLen`, ensuring exact-match lookups (used by
`resolveWithContext`) still work
**`cmd/server/coverage_test.go`:**
- Added `TestPrefixMapCap` with subtests for:
- Short prefix resolution still works
- Full pubkey exact-match resolution still works
- Intermediate prefixes beyond the cap correctly return nil
- Short keys (≤8 chars) have all prefix entries
- Map size is bounded
## Impact
- Map entries per node: ~31 → ~8 (one per prefix length 2–8, plus one
full-key entry)
- Total map size for 500 nodes: ~15K entries → ~4K entries (~75%
reduction)
- No behavioral change for path hop resolution (2–6 char prefixes)
- No behavioral change for exact pubkey lookups
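A sketch of the capped prefix map, assuming one entry per prefix length from 2 up to the cap (which matches the "~8 entries per node" figure above). The map's value type and the function signature are simplified assumptions; the real map stores node info rather than plain strings.

```go
package main

import (
	"fmt"
	"strings"
)

// maxPrefixLen caps how long a generated prefix key may be.
const maxPrefixLen = 8

// buildPrefixMap maps each prefix (length 2..maxPrefixLen) and each
// full pubkey to the pubkeys that match it.
func buildPrefixMap(pubkeys []string) map[string][]string {
	m := make(map[string][]string)
	for _, pk := range pubkeys {
		pk = strings.ToLower(pk)
		limit := len(pk)
		if limit > maxPrefixLen {
			limit = maxPrefixLen
		}
		for n := 2; n <= limit; n++ {
			m[pk[:n]] = append(m[pk[:n]], pk)
		}
		// Keep exact-match lookups working for keys longer than the cap.
		if len(pk) > maxPrefixLen {
			m[pk] = append(m[pk], pk)
		}
	}
	return m
}

func main() {
	m := buildPrefixMap([]string{"AABBCCDDEEFF0011", "aabb"})
	fmt.Println(len(m), m["aabb"], m["aabbccddee"])
}
```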
## Tests
All existing tests pass:
- `cmd/server`: ✅
- `cmd/ingestor`: ✅

Fixes #364
---------
Co-authored-by: you <you@example.com>
## Summary
Index node path lookups in `handleNodePaths()` instead of scanning all
packets on every request.
## Problem
`handleNodePaths()` iterated ALL packets in the store (`O(total_packets
× avg_hops)`) with prefix string matching on every hop. This caused
user-facing latency on every node detail page load with 30K+ packets.
## Fix
Added a `byPathHop` index (`map[string][]*StoreTx`) that maps lowercase
hop prefixes and resolved full pubkeys to their transmissions. The
handler now does direct map lookups instead of a full scan.
### Index lifecycle
- **Built** during `Load()` via `buildPathHopIndex()`
- **Incrementally updated** during `IngestNewFromDB()` (new packets) and
`IngestNewObservations()` (path changes)
- **Cleaned up** during `EvictStale()` (packet removal)
### Query strategy
The handler looks up candidates from the index using:
1. Full pubkey (matches resolved hops from `resolved_path`)
2. 2-char prefix (matches short raw hops)
3. 4-char prefix (matches medium raw hops)
4. Any longer raw hops starting with the 4-char prefix
This reduces complexity from `O(total_packets × avg_hops)` to
`O(matching_txs + unique_hop_keys)`.
## Tests
- `TestNodePathsEndpointUsesIndex` — verifies the endpoint returns
correct results using the index
- `TestPathHopIndexIncrementalUpdate` — verifies add/remove operations
on the index
All existing tests pass.
Fixes #359
Co-authored-by: you <you@example.com>
## Summary
Fixes #566. The "Inconsistent Hash Sizes" list on the Analytics page
included all node types and had no time window, causing false positives.
## Changes
### 1. Role filter on inconsistent nodes (`cmd/server/store.go`)
Added role filter to the `inconsistentNodes` loop in
`computeHashCollisions()` so only repeaters and room servers are
included. Companions are excluded since they were never affected by the
firmware bug. This matches the existing role filter on collision
bucketing from #441.
```go
// Before:
if cn.HashSizeInconsistent {
// After:
if cn.HashSizeInconsistent && (cn.Role == "repeater" || cn.Role == "room_server") {
```
### 2. 7-day time window on hash size computation
(`cmd/server/store.go`)
Added a 7-day recency cutoff to `computeNodeHashSizeInfo()`. Adverts
older than 7 days are now skipped, preventing legitimate historical
config changes (e.g., testing different byte sizes) from creating
permanent false positives.
### 3. Frontend description text (`public/analytics.js`)
Updated the description to reflect the filtered scope: now says
"Repeaters and room servers" instead of "Nodes", mentions the 7-day
window, and notes that companions are excluded.
## Tests
- `TestInconsistentNodesExcludesCompanions` — verifies companions are
excluded while repeaters and room servers are included
- `TestHashSizeInfoTimeWindow` — verifies adverts older than 7 days are
excluded from hash size computation
- Updated existing hash size tests to use recent timestamps (compatible
with the new time window)
- All existing tests pass: `cmd/server` ✅, `cmd/ingestor` ✅
## Perf justification
The time window filter adds a single string comparison per advert in the
scan loop — O(n) with a tiny constant. No impact on hot paths.
---------
Co-authored-by: you <you@example.com>
## Summary
Replace O(n) map iteration in `MaxTransmissionID()` and
`MaxObservationID()` with O(1) field lookups.
## What Changed
- Added `maxTxID` and `maxObsID` fields to `PacketStore`
- Updated `Load()`, `IngestNewFromDB()`, and `IngestNewObservations()`
to track max IDs incrementally as entries are added
- `MaxTransmissionID()` and `MaxObservationID()` now return the tracked
field directly instead of iterating the entire map
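The incremental tracking amounts to a high-water mark updated on insert. The `store` type below is a hypothetical reduction of `PacketStore`, with a brute-force method kept only for comparison, as in the new test.

```go
package main

import "fmt"

// store is a simplified stand-in for PacketStore.
type store struct {
	byTxID  map[int]string
	maxTxID int
}

func newStore() *store {
	return &store{byTxID: make(map[int]string)}
}

// add inserts a transmission and updates the tracked maximum ID.
func (s *store) add(id int, hash string) {
	s.byTxID[id] = hash
	if id > s.maxTxID {
		s.maxTxID = id
	}
}

// MaxTransmissionID returns the tracked field in O(1).
func (s *store) MaxTransmissionID() int { return s.maxTxID }

// bruteForceMax is the old O(n) approach, kept here for comparison.
func (s *store) bruteForceMax() int {
	highest := 0
	for id := range s.byTxID {
		if id > highest {
			highest = id
		}
	}
	return highest
}

func main() {
	s := newStore()
	for _, id := range []int{7, 3, 12, 9} {
		s.add(id, fmt.Sprintf("hash-%d", id))
	}
	fmt.Println(s.MaxTransmissionID(), s.bruteForceMax())
}
```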
## Performance
Before: O(n) iteration over 30K+ map entries under a read lock
After: O(1) field return
## Tests
- Added `TestMaxTransmissionIDIncremental` verifying the incremental
field matches brute-force iteration over the maps
- All existing tests pass (`cmd/server` and `cmd/ingestor`)
Fixes #356
Co-authored-by: you <you@example.com>
## Summary
Adds a byte-size filter to the map page, allowing users to filter
repeater markers by their hash prefix size (1-byte, 2-byte, or 3-byte).
## What changed
**`public/map.js`** — single file change:
1. **New filter state**: Added `byteSize` to the `filters` object
(default: `'all'`), persisted in `localStorage`
2. **New UI section**: Added a "Byte Size" fieldset with button group
(`All | 1-byte | 2-byte | 3-byte`) in the map controls panel, between
"Node Types" and "Display"
3. **Filter logic**: In `_renderMarkersInner`, when `byteSize !==
'all'`, repeater nodes are filtered by their `hash_size` field.
Non-repeater nodes (companions, rooms, sensors) are unaffected — they
pass through regardless of the byte-size filter setting
4. **Event binding**: Button click handlers update the filter, persist
to localStorage, and re-render markers
## Design decisions
- **Client-side only** — no backend changes needed. The `hash_size`
field is already included in the `/api/nodes` response
- **Repeaters only** — byte size is a repeater configuration concept;
other node roles don't have configurable path prefix sizes
- **Matches existing pattern** — uses the same button-group UI as the
Status filter (All/Active/Stale)
- **`hash_size` defaults to 1** — consistent with how the rest of the
codebase treats missing `hash_size` (`node.hash_size || 1`)
## Performance
No new API calls. Filter is a simple string comparison inside the
existing `nodes.filter()` loop in `_renderMarkersInner` — O(1) per node,
negligible overhead.
Fixes #565
Co-authored-by: you <you@example.com>
## Problem
Closes #563. Addresses the *Packet store estimated memory* item in #559.
`estimatedMemoryMB()` used a hardcoded formula:
```go
return float64(len(s.packets)*5120+s.totalObs*500) / 1048576.0
```
This ignored three data structures that grow continuously with every
ingest cycle:
| Structure | Production size | Heap not counted |
|---|---|---|
| `distHops []distHopRecord` | 1,556,833 records | ~300 MB |
| `distPaths []distPathRecord` | 93,090 records | ~25 MB |
| `spIndex map[string]int` | 4,113,234 entries | ~400 MB |
Result: formula reported ~1.2 GB while actual heap was ~5 GB. With
`maxMemoryMB: 1024`, eviction calculated it only needed to shed ~200 MB,
removed a handful of packets, and stopped. Memory kept growing until the
OOM killer fired.
## Fix
Replace `estimatedMemoryMB()` with `runtime.ReadMemStats` so all data
structures are automatically counted:
```go
func (s *PacketStore) estimatedMemoryMB() float64 {
if s.memoryEstimator != nil {
return s.memoryEstimator()
}
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
return float64(ms.HeapAlloc) / 1048576.0
}
```
Replace the eviction simulation loop (which re-used the same wrong
formula) with a proportional calculation: if heap is N× over budget,
evict enough packets to keep `(1/N) × 0.9` of the current count. The 0.9
factor adds a 10% buffer so the next ingest cycle doesn't immediately
re-trigger. All major data structures (distHops, distPaths, spIndex)
scale with packet count, so removing a fraction of packets frees roughly
the same fraction of total heap.
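The proportional calculation can be sketched as follows. This is an illustration under stated assumptions, not the code in `store.go`: the function name `packetsToEvict` and its parameters are hypothetical, and only the `(1/N) × 0.9` rule is taken from the description above.

```go
package main

import "fmt"

// packetsToEvict is a hypothetical helper. If the heap is N times over the
// limit, it keeps (1/N) * 0.9 of the current packets and evicts the rest.
func packetsToEvict(count int, heapMB, limitMB float64) int {
	if count == 0 || heapMB <= limitMB {
		return 0 // under budget: nothing to evict
	}
	keepFraction := (limitMB / heapMB) * 0.9 // 1/N, minus a 10% buffer
	keep := int(float64(count) * keepFraction)
	return count - keep
}

func main() {
	// Heap at 5x the limit: about 18% of packets are kept.
	fmt.Println(packetsToEvict(1000, 5120, 1024))
	// Heap under the limit: no eviction.
	fmt.Println(packetsToEvict(1000, 512, 1024))
}
```

With the heap at 5× the limit, the kept fraction is 0.2 × 0.9 = 0.18, so roughly 82% of packets are evicted.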
## Testing
- Updated `TestEvictStale_MemoryBasedEviction` to inject a deterministic
estimator via the new `memoryEstimator` field.
- Added `TestEvictStale_MemoryBasedEviction_UnderestimatedHeap`:
verifies that when actual heap is 5× over limit (the production failure
scenario), eviction correctly removes ~80%+ of packets.
```
=== RUN TestEvictStale_MemoryBasedEviction
[store] Evicted 538 packets (1076 obs)
--- PASS
=== RUN TestEvictStale_MemoryBasedEviction_UnderestimatedHeap
[store] Evicted 820 packets (1640 obs)
--- PASS
```
Full suite: `go test ./...` — ok (10.3s)
## Perf note
`runtime.ReadMemStats` runs once per eviction tick (every 60 s) and once
per `/api/perf/store` call. Cost is negligible.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Fixes critical performance issue in neighbor graph computation that
consumed 65% of CPU (30+ seconds) on a 325K packet dataset.
## Changes
### Fix 1: Cache strings.ToLower results
- Added cachedToLower() helper that caches lowercased strings in a local
map
- Pubkeys repeat across hundreds of thousands of observations
- Pre-computes fromLower once per transaction instead of once per
observation
- **Impact:** Eliminates ~8.4s (25.3% CPU)
### Fix 2: Cache parsed DecodedJSON via StoreTx.ParsedDecoded()
- Added ParsedDecoded() method on StoreTx using sync.Once for
thread-safe lazy caching
- json.Unmarshal on decoded_json now runs at most once per packet
lifetime
- Result reused by extractFromNode, indexByNode, trackAdvertPubkey
- **Impact:** Eliminates ~8.8s (26.3% CPU)
### Fix 3: Extend neighbor graph TTL from 60s to 5 minutes
- The graph depends on traffic patterns, not individual packets
- Reduces rebuild frequency 5x
- **Impact:** ~80% reduction in sustained CPU from graph rebuilds
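The lazy caching in Fix 2 might look roughly like the sketch below. It assumes a simplified `storeTx` type with a `DecodedJSON` string field; the `parseCalls` counter exists only to make the single parse observable in this example and is not part of the real struct.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// storeTx is a simplified, hypothetical stand-in for StoreTx.
type storeTx struct {
	DecodedJSON string

	parseOnce  sync.Once
	parsed     map[string]interface{}
	parseCalls int // instrumentation for this example only
}

// ParsedDecoded unmarshals DecodedJSON on first use and returns the cached
// result afterwards. sync.Once makes concurrent first calls safe.
func (tx *storeTx) ParsedDecoded() map[string]interface{} {
	tx.parseOnce.Do(func() {
		tx.parseCalls++
		var m map[string]interface{}
		if err := json.Unmarshal([]byte(tx.DecodedJSON), &m); err == nil {
			tx.parsed = m
		}
	})
	return tx.parsed
}

func main() {
	tx := &storeTx{DecodedJSON: `{"pubKey":"ab12"}`}
	for i := 0; i < 3; i++ {
		fmt.Println(tx.ParsedDecoded()["pubKey"])
	}
	fmt.Println("parses:", tx.parseCalls)
}
```

In this sketch a failed parse is also cached (as a nil map), so malformed JSON is not retried on every call.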
## Tests
- 7 new tests added, all 26+ existing neighbor graph tests pass
- BenchmarkBuildFromStore: 727us/op, 237KB/op, 6030 allocs/op
Related: #559
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
detectSchema() runs at DB open time before ensureResolvedPathColumn()
adds the column during Load(). On first run (or any run where the column
was just added), hasResolvedPath stayed false, causing Load() to skip
reading resolved_path from SQLite. This forced a full backfill of all
observations on every restart, burning CPU for minutes on large DBs.
Fix: set hasResolvedPath = true after ensureResolvedPathColumn succeeds.
## Summary
Implements server-side hop prefix resolution at ingest time with a
persisted neighbor graph. Hop prefixes in `path_json` are now resolved
to full 64-char pubkeys at ingest and stored as `resolved_path` on each
observation, eliminating the need for client-side resolution via
`HopResolver`.
Fixes #555
## What changed
### New file: `cmd/server/neighbor_persist.go`
SQLite persistence layer for the neighbor graph and resolved paths:
- `neighbor_edges` table creation and management
- Load/build/persist neighbor edges from/to SQLite
- `resolved_path` column migration on observations
- `resolvePathForObs()` — resolves hop prefixes using
`resolveWithContext` with 4-tier priority (affinity → geo → GPS → first
match)
- Cold startup backfill for observations missing `resolved_path`
- Async persistence of edges and resolved paths during ingest
(non-blocking)
### Modified: `cmd/server/store.go`
- `StoreObs` gains `ResolvedPath []*string` field
- `StoreTx` gains `ResolvedPath []*string` (cached from best
observation)
- `Load()` dynamically includes `resolved_path` in SQL query when column
exists
- `IngestNewFromDB()` resolves paths at ingest time and persists
asynchronously
- `pickBestObservation()` propagates `ResolvedPath` to transmission
- `txToMap()` and `enrichObs()` include `resolved_path` in API responses
- All 7 `pm.resolve()` call sites migrated to `pm.resolveWithContext()`
with the persisted graph
- Broadcast maps include `resolved_path` per observation
### Modified: `cmd/server/db.go`
- `DB` struct gains `hasResolvedPath bool` flag
- `detectSchema()` checks for `resolved_path` column existence
- Graceful degradation when column is absent (test DBs, old schemas)
### Modified: `cmd/server/main.go`
- Startup sequence: ensure tables → load/build graph → backfill resolved
paths → re-pick best observations
### Modified: `cmd/server/routes.go`
- `mapSliceToTransmissions()` and `mapSliceToObservations()` propagate
`resolved_path`
- Node paths handler uses `resolveWithContext` with graph
### Modified: `cmd/server/types.go`
- `TransmissionResp` and `ObservationResp` gain `ResolvedPath []*string`
with `omitempty`
### New file: `cmd/server/neighbor_persist_test.go`
16 tests covering:
- Path resolution (unambiguous, empty, unresolvable prefixes)
- Marshal/unmarshal of resolved_path JSON
- SQLite table creation and column migration (idempotent)
- Edge persistence and loading
- Schema detection
- Full Load() with resolved_path
- API response serialization (present when set, omitted when nil)
## Design decisions
1. **Async persistence** — resolved paths and neighbor edges are written
to SQLite in a goroutine to avoid blocking the ingest loop. The
in-memory state is authoritative.
2. **Schema compatibility** — `DB.hasResolvedPath` flag allows the
server to work with databases that don't yet have the `resolved_path`
column. SQL queries dynamically include/exclude the column.
3. **`pm.resolve()` retained** — Not removed as dead code because
existing tests use it directly. All production call sites now use
`resolveWithContext` with the persisted graph.
4. **Edge persistence is conservative** — Only unambiguous edges (single
candidate) are persisted to `neighbor_edges`. Ambiguous prefixes are
handled by the in-memory `NeighborGraph` via Jaccard disambiguation.
5. **`null` = unresolved** — Ambiguous prefixes store `null` in the
resolved_path array. Frontend falls back to prefix display.
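To illustrate decision 5 together with the `omitempty` behavior noted for `types.go`, here is a small sketch of how a `[]*string` field serializes with `encoding/json`. The `observationResp` type and its `path` field are hypothetical; only the `resolved_path` field name and its `[]*string` type come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// observationResp is a hypothetical, trimmed-down response type.
type observationResp struct {
	Path         []string  `json:"path"`
	ResolvedPath []*string `json:"resolved_path,omitempty"`
}

// mustEncode marshals r and panics on error (acceptable for a demo).
func mustEncode(r observationResp) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	pk := "aabbcc" // stand-in for a full 64-char pubkey
	// A nil pointer inside the slice encodes as JSON null (unresolved hop).
	fmt.Println(mustEncode(observationResp{
		Path:         []string{"aa", "dd"},
		ResolvedPath: []*string{&pk, nil},
	}))
	// A nil slice is dropped entirely because of omitempty.
	fmt.Println(mustEncode(observationResp{Path: []string{"aa", "dd"}}))
}
```

A nil element marks one unresolved hop, while a nil slice means the whole field is absent, which is the case old packets fall into.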
## Performance
- `resolveWithContext` per hop: ~1-5μs (map lookups, no DB queries)
- Typical packet has 0-5 hops → <25μs total resolution overhead per
packet
- Edge/path persistence is async → zero impact on ingest latency
- Backfill is one-time on first startup with the new column
## Test results
```
cd cmd/server && go test ./... -count=1 → ok (4.4s)
cd cmd/ingestor && go test ./... -count=1 → ok (25.5s)
```
---------
Co-authored-by: you <you@example.com>
## Summary
Implements **M4 (frontend consumers)** from the [resolved-path
spec](https://github.com/Kpa-clawbot/CoreScope/blob/resolved-path-spec/docs/specs/resolved-path.md)
for #555.
The server (PR #556, M1-M3) now returns `resolved_path` on all
packet/observation API responses and WebSocket broadcasts. This PR
updates all frontend consumers to **prefer `resolved_path`** over
client-side HopResolver, with full fallback for old packets.
## What changed
### `hop-resolver.js`
- Added `resolveFromServer(hops, resolvedPath)` — takes the short hex
prefixes and aligned array of full pubkeys from `resolved_path`, looks
up node names from the existing nodesList. Returns the same `{ [hop]: {
name, pubkey, ... } }` format as `resolve()`.
### `packet-helpers.js`
- Added `getResolvedPath(p)` — cached JSON parser for the new
`resolved_path` field (mirrors `getParsedPath`).
- Updated `clearParsedCache()` to also clear `_parsedResolvedPath`.
### `packets.js`
- **Bulk load** (`loadPackets`): calls `cacheResolvedPaths(packets)`
before the existing `resolveHops` fallback.
- **WebSocket updates**: pre-populates `hopNameCache` from
`resolved_path` on incoming packets before falling back to HopResolver
for any remaining unknown hops.
- **Group expansion** (`pktToggleGroup`): caches resolved paths from
child observations.
- **Packet detail** (`selectPacket`): prefers `resolveFromServer` when
`resolved_path` is available.
- **Show Route button**: uses `resolved_path` pubkeys directly instead
of client-side disambiguation.
- **Observation spreading**: carries `resolved_path` field when
constructing observation packets.
### `live.js`
- `resolveHopPositions` accepts optional `resolvedPath` parameter;
prefers server-resolved pubkeys, falls back to HopResolver for null
entries.
- Normalized WS packet objects now carry `resolved_path`.
### Files NOT changed (no resolution changes needed)
- **`analytics.js`** — only uses `HopResolver.haversineKm` (a utility
function). Topology, subpath, and hop distance data comes pre-resolved
from the server API (handled by M2/M3).
- **`nodes.js`** — gets pre-resolved path data from
`/nodes/:pubkey/paths` API; no client-side hop resolution.
- **`map.js`** — `drawPacketRoute` already handles full 64-char pubkeys
via exact match. The updated `packets.js` now passes full pubkeys from
`resolved_path` to the map.
## Fallback pattern
```javascript
// In hop-resolver.js
function resolveFromServer(hops, resolvedPath) {
// Returns resolved entries for non-null pubkeys
// Skips null entries (unresolved) — caller falls back to HopResolver
}
// In packets.js — bulk load
await cacheResolvedPaths(packets); // server-side first
await resolveHops([...allHops]); // client-side fallback for remaining
```
Old packets without `resolved_path` continue to work exactly as before
via the existing HopResolver. `hop-resolver.js` is NOT removed — it
remains the fallback.
## Tests
- 10 new tests for `resolveFromServer()` and `getResolvedPath()`
- All 445 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Closes #555 (M4 milestone)
---------
Co-authored-by: you <you@example.com>
## Problem
On busy meshes (325K+ transmissions, 50 observers), the distance index
rebuild runs on **every ingest poll** (~1s interval), computing
haversine distances for 1M+ hop records. Each rebuild takes 2-3 seconds
but new observations arrive faster than it can finish, creating a CPU
hot loop that starves the HTTP server.
Discovered on the Cascadia Mesh instance where `corescope-server` was
consuming 15 minutes of CPU time in 10 minutes of uptime, the API was
completely unresponsive, and health checks were timing out.
### Server logs showing the hot loop:
```
[store] Built distance index: 1797778 hop records, 207072 path records
[store] Built distance index: 1797806 hop records, 207075 path records
[store] Built distance index: 1797811 hop records, 207075 path records
[store] Built distance index: 1797820 hop records, 207075 path records
```
Every 2 seconds, nonstop.
## Root Cause
`IngestNewObservations` calls `buildDistanceIndex()` synchronously
whenever `pickBestObservation` selects a longer path. With 50 observers
sending observations every second, paths change on nearly every poll
cycle, triggering a full rebuild each time.
## Fix
- Mark distance index dirty on path changes instead of rebuilding inline
- Rebuild at most every **30 seconds** (configurable via `distLast`
timer)
- Set `distLast` after initial `Load()` to prevent immediate re-rebuild
on first ingest
- Distance data is at most 30s stale — acceptable for an analytics view
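The dirty-flag approach can be sketched as below. This is an assumption-laden illustration rather than the actual store code: the `distIndex` type, `markDirty`, `maybeRebuild`, and `simulate` are hypothetical names, and a counter stands in for the expensive `buildDistanceIndex()` call.

```go
package main

import (
	"fmt"
	"time"
)

// distIndex is a hypothetical wrapper around the rebuild bookkeeping.
type distIndex struct {
	dirty       bool
	lastBuild   time.Time
	minInterval time.Duration
	rebuilds    int // counts rebuilds; stands in for the real work
}

// markDirty records that paths changed without rebuilding inline.
func (d *distIndex) markDirty() { d.dirty = true }

// maybeRebuild rebuilds only if the index is dirty and the minimum
// interval has elapsed since the previous build.
func (d *distIndex) maybeRebuild(now time.Time) bool {
	if !d.dirty || now.Sub(d.lastBuild) < d.minInterval {
		return false
	}
	d.rebuilds++
	d.dirty = false
	d.lastBuild = now
	return true
}

// simulate runs one ingest poll per second, each changing a path, and
// returns how many rebuilds happened. lastBuild starts at the initial load.
func simulate(seconds, minIntervalSec int) int {
	start := time.Unix(0, 0)
	d := &distIndex{
		lastBuild:   start,
		minInterval: time.Duration(minIntervalSec) * time.Second,
	}
	for i := 1; i <= seconds; i++ {
		d.markDirty()
		d.maybeRebuild(start.Add(time.Duration(i) * time.Second))
	}
	return d.rebuilds
}

func main() {
	fmt.Println("rebuilds in 60s:", simulate(60, 30))
}
```

Initializing `lastBuild` to the load time mirrors the point above about setting `distLast` after `Load()`, so the first ingest does not trigger an immediate rebuild.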
## Testing
- `go build`, `go vet`, `go test` all pass
- No behavioral change for the initial load or the analytics API
response shape
- Distance data freshness goes from real-time to 30s max staleness
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes #388 — expanded groups were fetched sequentially with O(n)
`packets.find()` lookups.
## Changes
1. **Parallel fetch**: Replaced sequential `for...of + await` loop in
`loadPackets()` with `Promise.all()` so all expanded group children are
fetched concurrently.
2. **O(1) Map lookup**: Replaced 3 instances of `packets.find(p =>
p.hash === hash)` with `hashIndex.get(hash)`:
- `loadPackets()` expanded group restore (~line 553)
- `select-observation` click handler (~line 1015)
- `pktToggleGroup()` (~line 2012)
## Perf justification
- **Before**: N expanded groups → N sequential API calls + N ×
O(packets.length) array scans
- **After**: N parallel API calls + N × O(1) Map lookups
- Typical N is 1-3 (minor severity as noted in issue), but the fix is
trivial and correct
## Tests
All existing tests pass: `test-packet-filter.js` (62), `test-aging.js`
(29), `test-frontend-helpers.js` (433).
Co-authored-by: you <you@example.com>
## Summary
Show the transport badge ("T") in the live packet feed, matching the
packets table (#337).
## Changes
- Add `transportBadge(pkt.route_type)` to all 4 feed rendering paths in
`live.js`:
- Grouped feed items (initial history load)
- `addFeedItemDOM()` (VCR replay)
- Dedup new feed items (live WebSocket updates)
- Node detail panel recent packets list
- Uses existing `transportBadge()` from `app.js` and `.badge-transport`
CSS from `style.css`
## Testing
- 2 new source-level assertions in `test-live.js` verifying
`transportBadge()` calls exist
- All existing tests pass (67 passed in test-live.js, no new failures)
Fixes #338
Co-authored-by: you <you@example.com>
## Summary
`nodeActivity` (an object tracking per-node packet counts for heatmap
intensity) grows without bound — entries are added on every packet flash
but never removed, even when stale nodes are pruned.
## Changes
- **Delete `nodeActivity[key]`** alongside `nodeMarkers[key]` and
`nodeData[key]` when removing stale WS-only nodes in `pruneStaleNodes()`
- **Prune orphaned entries** — after the main prune loop, sweep
`nodeActivity` and delete any key that has no corresponding `nodeData`
entry (catches edge cases where nodes were removed by other code paths)
- Both run every 60s via the existing `pruneStaleNodes` interval timer
## Testing
- Added 2 regression tests in `test-frontend-helpers.js` verifying stale
node cleanup and orphan removal
- All 435 frontend helper tests pass, plus packet-filter (62) and aging
(29)
Fixes #390
---------
Co-authored-by: you <you@example.com>
## Summary
Augments the shared `HopResolver` with neighbor-graph affinity data so
that when multiple nodes match a hop prefix, the resolver prefers
candidates that are known neighbors of the adjacent hop — instead of
relying solely on geo-distance.
Fixes #528
## Changes
### `public/hop-resolver.js`
- Added `affinityMap` — stores bidirectional neighbor adjacency with
scores
- Added `setAffinity(graph)` — ingests `/api/analytics/neighbor-graph`
edge data into O(1) Map lookups
- Added `getAffinity(pubkeyA, pubkeyB)` — returns affinity score between
two nodes (0 if not neighbors)
- Added `pickByAffinity(candidates, adjacentPubkey, anchor, ...)` —
picks best candidate: affinity-neighbor first (highest score), then
geo-distance fallback
- Modified forward and backward passes in `resolve()` to track the
previously-resolved pubkey and use `pickByAffinity` instead of raw
geo-sort
### `public/live.js`
- Added `fetchAffinityData()` — fetches `/api/analytics/neighbor-graph`
once and calls `HopResolver.setAffinity()`
- Added `startAffinityRefresh()` — refreshes affinity data every 60
seconds
- Both are called from `loadNodes()` after HopResolver is initialized
### `test-hop-resolver-affinity.js` (new)
- Affinity prefers neighbor candidate over geo-closest
- Cold start (no affinity data) falls back to geo-closest
- Null/undefined affinity doesn't crash
- Bidirectional score lookup
- Highest affinity score wins among multiple neighbors
- Unambiguous hops unaffected by affinity
## Performance
- API calls: 1 at load + 1 per 60s (no per-packet calls)
- Per-packet resolve: O(1) Map lookups, <0.5ms
- Memory: ~50KB for 2K-node graph
---------
Co-authored-by: you <you@example.com>
Fixes #441
## Summary
Hash collision analysis was including ALL node types, inflating
collision counts with irrelevant data. Per MeshCore firmware analysis,
**only repeaters matter for collision analysis** — they're the only role
that forwards packets and appears in routing `path[]` arrays.
## Root Causes Fixed
1. **`hash_size==0` nodes counted in all buckets** — nodes with unknown
hash size were included via `cn.HashSize == bytes || cn.HashSize == 0`,
polluting every bucket
2. **Non-repeater roles included** — companions, rooms, sensors, and
observers were counted even though their hash collisions never cause
routing ambiguity
## Fix
Changed `computeHashCollisions()` filter from:
```go
// Before: include everything except companions
if cn.HashSize == bytes && cn.Role != "companion" {
```
To:
```go
// After: only include repeaters (per firmware analysis)
if cn.HashSize == bytes && cn.Role == "repeater" {
```
## Why only repeaters?
From [MeshCore firmware
analysis](https://github.com/Kpa-clawbot/CoreScope/issues/441#issuecomment-4185218547):
- Only repeaters override `allowPacketForward()` to return `true`
- Only repeaters append their hash to `path[]` during relay
- Companions, rooms, sensors, observers never forward packets
- Cross-role collisions are benign (companion silently drops, real
repeater still forwards)
## Tests
- `TestHashCollisionsOnlyRepeaters` — verifies companions, rooms,
sensors, and hash_size==0 nodes are all excluded
---------
Co-authored-by: you <you@example.com>
## Summary
VCR replay functions (`vcrReplayFromTs`, `vcrRewind`,
`fetchNextReplayPage`) fetch up to 10K packets and process them all
synchronously on the main thread via `expandToBufferEntries`, causing
multi-second UI freezes — especially on mobile.
## Fix
- Added `expandToBufferEntriesAsync()` — processes packets in chunks of
200, yielding to the event loop via `setTimeout(0)` between chunks
- Updated all three VCR replay callers to use the async variant
- Kept the synchronous `expandToBufferEntries()` for backward
compatibility (tests, small datasets)
- Exposed `_liveExpandToBufferEntriesAsync` on window for test access
## Perf justification
- **Before:** 10K packets × ~2 observations = 20K+ objects created
synchronously, blocking the main thread for 1-3 seconds on mobile
- **After:** Same work split into chunks of 200 packets (~400 entries)
with event loop yields between chunks. Each chunk takes <5ms, keeping
the UI responsive (well under the 16ms frame budget)
- Chunk size of 200 is tunable via `VCR_CHUNK_SIZE`
## Tests
- Added regression test: sync expand correctness at scale (500 packets →
1000 entries)
- Added structural test: verifies `VCR_CHUNK_SIZE` exists and async
function yields via `setTimeout`
- All existing tests pass (`npm test`)
Fixes #395
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #410 — virtual scroll height miscalculation for expanded group
rows.
## Root Cause
When WebSocket messages add children to an already-expanded packet
group, `_rowCounts` becomes stale during the 200ms render debounce
window. Scroll events during this window call `renderVisibleRows()` with
stale row counts, causing wrong total height, spacer heights, and
visible range calculations.
## Changes
**public/packets.js:**
- Added `_rowCountsDirty` flag to track when row counts need
recomputation
- Added `_invalidateRowCounts()` — marks row counts as stale and clears
cumulative cache
- Added `_refreshRowCountsIfDirty()` — lazily recomputes `_rowCounts`
from `_displayPackets`
- Called `_invalidateRowCounts()` when WS handler adds children to
expanded groups (line ~402)
- Called `_refreshRowCountsIfDirty()` at top of `renderVisibleRows()`
before using row counts
- Reset `_rowCountsDirty` in all cleanup paths (destroy, empty display)
**test-packets.js:**
- Added 4 regression tests for `_invalidateRowCounts` /
`_refreshRowCountsIfDirty`
## Complexity
O(n) recomputation of `_rowCounts` when dirty (same as existing
`renderTableRows` path). Only triggers when WS modifies expanded group
children, which is infrequent relative to scroll events.
Co-authored-by: you <you@example.com>
## Summary
Fixes #533 — server cache hit rate always 0%.
## Root Cause
`invalidateCachesFor()` is called at the end of every
`IngestNewFromDB()` and `IngestNewObservations()` cycle (~2-5s). Since
new data arrives continuously, caches are cleared faster than any
analytics request can hit them, resulting in a permanent 0% cache hit
rate. The cache TTL (15s/60s) is irrelevant because entries are evicted
by invalidation long before they expire.
## Fix
Rate-limit cache invalidation with a 10-second cooldown:
- First call after cooldown goes through immediately
- Subsequent calls during cooldown accumulate dirty flags in
`pendingInv`
- Next call after cooldown merges pending + current flags and applies
them
- Eviction bypasses cooldown (data removal requires immediate clearing)
Analytics data may be at most ~10s stale, which is acceptable for a
dashboard.
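A rough sketch of the cooldown logic follows. The `pendingInv` idea comes from this PR, but the `cacheGate` type, the `invFlags` bitmask, and the `at` helper are hypothetical. How eviction interacts with pending flags is an assumption here: the sketch flushes them and leaves the cooldown clock untouched.

```go
package main

import (
	"fmt"
	"time"
)

// invFlags is a hypothetical bitmask naming which caches are dirty.
type invFlags uint8

const (
	invPackets invFlags = 1 << iota
	invObservers
)

// cacheGate rate-limits invalidation; names are illustrative only.
type cacheGate struct {
	cooldown time.Duration
	last     time.Time // time of the last applied non-eviction invalidation
	pending  invFlags  // flags accumulated during the cooldown
}

func newCacheGate(cooldownSec int64) *cacheGate {
	return &cacheGate{cooldown: time.Duration(cooldownSec) * time.Second}
}

// at builds a timestamp from whole seconds to keep the demo deterministic.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// invalidate reports which flags to apply now. ok is false when the call
// was deferred into pending. Eviction always applies immediately.
func (g *cacheGate) invalidate(now time.Time, f invFlags, eviction bool) (apply invFlags, ok bool) {
	if !eviction && now.Sub(g.last) < g.cooldown {
		g.pending |= f
		return 0, false
	}
	apply = g.pending | f // merge pending + current flags
	g.pending = 0
	if !eviction {
		g.last = now
	}
	return apply, true
}

func main() {
	g := newCacheGate(10)
	fmt.Println(g.invalidate(at(100), invPackets, false))   // applied
	fmt.Println(g.invalidate(at(103), invObservers, false)) // deferred
	fmt.Println(g.invalidate(at(111), 0, false))            // pending flushed
}
```

Deferring rather than dropping the flags is what bounds staleness: nothing is lost, it is only applied on the next call after the cooldown.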
## Changes
- **`store.go`**: Added `lastInvalidated`, `pendingInv`, `invCooldown`
fields. Refactored `invalidateCachesFor()` to rate-limit non-eviction
invalidation. Extracted `applyCacheInvalidation()` helper.
- **`cache_invalidation_test.go`**: Added 4 new tests:
- `TestInvalidationRateLimited` — verifies caches survive during
cooldown
- `TestInvalidationCooldownAccumulatesFlags` — verifies flag merging
- `TestEvictionBypassesCooldown` — verifies eviction always clears
immediately
- `BenchmarkCacheHitDuringIngestion` — confirms 100% hit rate during
rapid ingestion (was 0%)
## Perf Proof
```
BenchmarkCacheHitDuringIngestion-16 3467889 1018 ns/op 100.0 hit%
```
Before: 0% hit rate under continuous ingestion. After: 100% hit rate
during cooldown periods.
Co-authored-by: you <you@example.com>
## Summary
`GetPerfStoreStats()` and `GetPerfStoreStatsTyped()` iterated **all**
ADVERT packets and called `json.Unmarshal` on each one — under a read
lock — on every `/api/perf` and `/api/health` request. With 5K+ adverts,
each health check triggered thousands of JSON parses.
## Fix
Added a refcounted `advertPubkeys map[string]int` to `PacketStore` that
tracks distinct pubkeys incrementally during `Load()`,
`IngestNewFromDB()`, and eviction. The perf/health handlers now just
read `len(s.advertPubkeys)` — O(1) with zero allocations.
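The refcounting idea can be sketched like this. The `pubkeyCounter` type and its method names are hypothetical; the real code keeps the map directly on `PacketStore` and extracts the pubkey from each advert's decoded JSON, which is omitted here.

```go
package main

import "fmt"

// pubkeyCounter is a hypothetical wrapper around a refcounted map.
type pubkeyCounter struct {
	refs map[string]int // pubkey -> number of stored adverts carrying it
}

func newPubkeyCounter() *pubkeyCounter {
	return &pubkeyCounter{refs: make(map[string]int)}
}

// add is called when an advert with this pubkey is loaded or ingested.
func (c *pubkeyCounter) add(pubkey string) { c.refs[pubkey]++ }

// remove is called on eviction. The key is deleted when the last
// reference goes away, so len(refs) stays equal to the distinct count.
func (c *pubkeyCounter) remove(pubkey string) {
	n, ok := c.refs[pubkey]
	if !ok {
		return
	}
	if n <= 1 {
		delete(c.refs, pubkey)
		return
	}
	c.refs[pubkey] = n - 1
}

// distinct is the O(1) read used in place of re-parsing every advert.
func (c *pubkeyCounter) distinct() int { return len(c.refs) }

func main() {
	c := newPubkeyCounter()
	c.add("aa")
	c.add("aa")
	c.add("bb")
	fmt.Println(c.distinct())
	c.remove("aa")
	c.remove("aa")
	fmt.Println(c.distinct())
}
```

A plain set would not be enough here: evicting one of two adverts from the same node must not drop that node from the count.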
## Benchmark Results (5K adverts, 200 distinct pubkeys)
| Method | ns/op | allocs/op |
|--------|-------|-----------|
| `GetPerfStoreStatsTyped` | **78** | **0** |
| `GetPerfStoreStats` | **2,565** | **9** |
Before this change, both methods performed O(N) JSON unmarshals per
call.
## Tests Added
- `TestAdvertPubkeyTracking` — verifies incremental tracking through
add/evict lifecycle
- `TestAdvertPubkeyPublicKeyField` — covers the `public_key` JSON field
variant
- `TestAdvertPubkeyNonAdvert` — ensures non-ADVERT packets don't affect
count
- `BenchmarkGetPerfStoreStats` — 5K adverts benchmark
- `BenchmarkGetPerfStoreStatsTyped` — 5K adverts benchmark
Fixes #360
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #534 — mobile filter dropdown doesn't expand on packets page.
## Root Cause
CSS specificity battle in the mobile media query. The hide rule uses
`:not()` pseudo-classes, whose class arguments count toward specificity:
```css
/* Higher specificity due to :not() */
.filter-bar > *:not(.filter-toggle-btn):not(.col-toggle-wrap) { display: none; }
/* Lower specificity — loses even with .filters-expanded */
.filter-bar.filters-expanded > * { display: inline-flex; }
```
The JS toggle correctly adds/removes `.filters-expanded`, but the CSS
expanded rule could never win.
## Fix
Match the `:not()` selectors in the expanded rule so `.filters-expanded`
makes it strictly more specific:
```css
.filter-bar.filters-expanded > *:not(.filter-toggle-btn):not(.col-toggle-wrap) { display: inline-flex; }
```
Added a comment explaining the specificity dependency so future devs
don't repeat this.
## Tests
Added Playwright E2E test: mobile viewport (480×800), navigates to
packets page, clicks filter toggle, verifies filter inputs become
visible.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #355 — replaces O(n²) observation dedup in `Load()`,
`IngestNewFromDB()`, and `IngestNewObservations()` with an O(1)
map-based lookup.
## Changes
- Added `obsKeys map[string]bool` field to `StoreTx` for O(1) dedup
keyed on `observerID + "|" + pathJSON`
- Replaced all 3 linear-scan dedup sites in `store.go` with map lookups
- Lazy-init `obsKeys` for transmissions created before this change (in
`IngestNewFromDB` and `IngestNewObservations`)
- Added regression test (`TestObsDedupCorrectness`) verifying dedup
correctness
- Added nil-map safety test (`TestObsDedupNilMapSafety`)
- Added benchmark comparing map vs linear scan
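A minimal sketch of the map-based dedup, including the lazy initialization for transmissions that predate the field. The `storeTx` type is trimmed down and `addObservation` is a hypothetical helper; the key format `observerID + "|" + pathJSON` is the one stated above.

```go
package main

import "fmt"

// storeTx is a simplified, hypothetical stand-in for StoreTx.
type storeTx struct {
	obsKeys  map[string]bool // dedup keys already seen
	obsCount int
}

// addObservation returns false when the (observer, path) pair is a duplicate.
func (tx *storeTx) addObservation(observerID, pathJSON string) bool {
	if tx.obsKeys == nil {
		// Lazy init keeps zero-value or older transmissions safe.
		tx.obsKeys = make(map[string]bool)
	}
	key := observerID + "|" + pathJSON
	if tx.obsKeys[key] {
		return false
	}
	tx.obsKeys[key] = true
	tx.obsCount++
	return true
}

func main() {
	var tx storeTx // zero value: obsKeys is nil
	fmt.Println(tx.addObservation("obs1", "[aa,bb]")) // new
	fmt.Println(tx.addObservation("obs1", "[aa,bb]")) // duplicate
	fmt.Println(tx.addObservation("obs2", "[aa,bb]")) // different observer
	fmt.Println(tx.obsCount)
}
```

Reading from a nil map is safe in Go, but writing to one panics, which is why the lazy init sits before the first insert.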
## Benchmark Results (ARM64, 16 cores)
| Observations | Map (O(1)) | Linear (O(n)) | Speedup |
|---|---|---|---|
| 10 | 34 ns/op | 41 ns/op | 1.2x |
| 50 | 34 ns/op | 186 ns/op | 5.5x |
| 100 | 34 ns/op | 361 ns/op | 10.6x |
| 500 | 34 ns/op | 4,903 ns/op | **146x** |
Map lookup is constant time regardless of observation count. The linear
scan's per-check cost grows with the number of observations, which makes
the total dedup cost quadratic. At 500 observations per transmission
(realistic for popular packets seen by many observers), the old code is
146x slower per dedup check.
All existing tests pass.
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #354
Replaces the O(n²) selection sort in `sortedCopy()` with Go's built-in
`sort.Float64s()` (O(n log n)).
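For reference, a copy-then-sort helper along these lines could look as follows. The body is a sketch based on the description; the exact signature of `sortedCopy()` in `routes.go` is assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedCopy returns an ascending copy of in without modifying the input.
func sortedCopy(in []float64) []float64 {
	cp := make([]float64, len(in))
	copy(cp, in)
	sort.Float64s(cp) // standard library sort, O(n log n)
	return cp
}

func main() {
	in := []float64{3.5, 1.25, 2}
	fmt.Println(sortedCopy(in), in)
}
```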
## Changes
- **`cmd/server/routes.go`**: Replaced manual nested-loop selection sort
with `sort.Float64s(cp)`
- **`cmd/server/helpers_test.go`**: Added regression test with
1000-element random input + benchmark
## Benchmark Results (ARM64)
```
BenchmarkSortedCopy/n=256 ~16μs/op 1 alloc
BenchmarkSortedCopy/n=1000 ~95μs/op 1 alloc
BenchmarkSortedCopy/n=10000 ~1.3ms/op 1 alloc
```
With the old O(n²) sort, n=10000 would take ~50ms+. The new
implementation scales as O(n log n).
## Testing
- All existing `TestSortedCopy` tests pass (unchanged behavior)
- New `TestSortedCopyLarge` validates correctness on 1000 random
elements
- `go test ./...` passes in `cmd/server`
Co-authored-by: you <you@example.com>
Fixes #537
## Problem
Observer filter in grouped mode only checked `p.observer_id` (the
primary observer), ignoring child observations. Grouped packets seen by
multiple observers would be hidden when filtering for a non-primary
observer.
## Fix
Two filter paths updated to also check `p._children`:
1. **Client-side display filter** (line ~1293): removed the
`!groupByHash` guard and added `_children` check so grouped packets are
included when any child observation matches
2. **WS real-time filter** (line ~360): added `_children` fallback check
The grouped row rendering (line ~1042) already correctly uses
`_observerFilterSet` for child filtering — no changes needed there.
## Tests
Added 5 tests in `test-frontend-helpers.js`:
- Grouped packet with matching child observer is shown
- Grouped packet with no matching observers is hidden
- WS filter passes/rejects grouped packets correctly
- Source code assertions verifying both filter paths check `_children`
Co-authored-by: you <you@example.com>
## Summary
- When `groupByHash=true`, each group only carries its representative
(best-path) `observer_id`. The client-side filter was checking only that
field, silently dropping groups that were seen by the selected observer
but had a different representative.
- `loadPackets` now passes the `observer` param to the server so
`filterPackets`/`buildGroupedWhere` do the correct "any observation
matches" check.
- Client-side observer filter in `renderTableRows` is skipped for
grouped mode (server already filtered correctly).
- Both `db.go` and `store.go` observer filtering extended to support
comma-separated IDs (multi-select UI).
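The "any observation matches" check with comma-separated IDs might be sketched like this. Both function names are hypothetical, and the real filtering lives in `filterPackets`/`buildGroupedWhere`, which also has a SQL path not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// parseObserverFilter turns "obsA,obsB" into a set. Empty entries and
// surrounding whitespace are ignored (an assumption of this sketch).
func parseObserverFilter(param string) map[string]bool {
	set := make(map[string]bool)
	for _, id := range strings.Split(param, ",") {
		if id = strings.TrimSpace(id); id != "" {
			set[id] = true
		}
	}
	return set
}

// groupMatches reports whether any observation in the group came from a
// selected observer. An empty filter matches everything.
func groupMatches(observerIDs []string, filter map[string]bool) bool {
	if len(filter) == 0 {
		return true
	}
	for _, id := range observerIDs {
		if filter[id] {
			return true
		}
	}
	return false
}

func main() {
	filter := parseObserverFilter("obsA, obsB")
	// obsC is the representative, but obsB also saw the packet.
	fmt.Println(groupMatches([]string{"obsC", "obsB"}, filter))
	fmt.Println(groupMatches([]string{"obsC"}, filter))
}
```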
## Test plan
- [ ] Set an observer filter on the Packets screen with grouping enabled
— all groups that have **any** observation from the selected observer(s)
should appear, not just groups where that observer is the representative
- [ ] Multi-select two observers — groups seen by either should appear
- [ ] Toggle to flat (ungrouped) mode — per-observation filter still
works correctly
- [ ] Existing grouped packets tests pass: `cd cmd/server && go test
./...`
Fixes #464

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
Fixes #525
The `checklist()` function in `home.js` treated steps and FAQ/checklist
as mutually exclusive — if `homeCfg.checklist` existed, steps were
skipped entirely. Adding a single FAQ via the customizer made all intro
steps disappear.
Now renders steps first, then FAQ below with a '❓ FAQ' header. Falls
back to Bay Area hardcoded defaults only when neither exists.
---------
Co-authored-by: you <you@example.com>
## Summary
Part of #523 — fixes bugs 5 and 7 (bug 6 was a duplicate of bug 7).
### Bug 5: Show Neighbors button throws `window._mapSelectRefNode is not
a function`
**Root cause:** Map popup HTML used inline `onclick` calling
`window._mapSelectRefNode`, which was deleted on SPA page destroy. If a
popup persisted after navigation, clicks would throw.
**Fix:** Replaced inline `onclick` with event delegation. A
document-level click handler catches all `[data-show-neighbors]` clicks
and calls `selectReferenceNode` directly. The global
`window._mapSelectRefNode` is still exposed for existing Playwright
tests but is no longer relied upon by the UI.
### Bug 7: Blue text on dark blue background (dark mode contrast)
**Root cause:** Neighbor table cells inside `.node-detail-section` /
`.node-full-card` inherited accent/link color instead of using
`var(--text)`, making text unreadable in dark mode.
**Fix:** Added explicit `color: var(--text)` on `.node-detail-section
.data-table td` and `.node-full-card .data-table td`. Only `<a>` tags
within those cells retain `color: var(--accent)`.
### Files changed
- `public/map.js` — event delegation for Show Neighbors
- `public/style.css` — contrast fix for neighbor table cells
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #525: Customizer v2 home section shows empty fields and adding
FAQ kills steps.
## Root Cause
Server returned `home: null` from `/api/config/theme` when no home
config existed in config.json or theme.json. The customizer had no
built-in defaults, so all home fields appeared empty. When a user added
a single override (e.g. FAQ), `computeEffective` started from `home:
null`, created `home: {}`, and only applied the user's override — wiping
steps and everything else.
## Fix
### Server-side (primary)
In `handleConfigTheme()`, replaced the conditional `home` assignment
with `mergeMap` using built-in defaults matching what `home.js`
hardcodes:
- `heroTitle`: "CoreScope"
- `heroSubtitle`: "Real-time MeshCore LoRa mesh network analyzer"
- `steps`: 4 default getting-started steps
- `footerLinks`: Packets + Network Map links
Config/theme overrides merge on top, so customization still works.
### Client-side (defense-in-depth)
Added `DEFAULT_HOME` constant in `customize-v2.js`. `computeEffective()`
now falls back to these defaults when server returns `home: null`,
ensuring the customizer works even without server defaults.
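A simplified sketch of the client-side fallback, assuming a flat merge of the `home` object. The function name `computeEffectiveHome` is hypothetical, and the `steps` and `footerLinks` values are placeholders rather than the real defaults; only `heroTitle` and `heroSubtitle` are taken from the description above.
```js
// Assumed default values; only heroTitle and heroSubtitle are quoted from the PR text.
const DEFAULT_HOME = {
  heroTitle: 'CoreScope',
  heroSubtitle: 'Real-time MeshCore LoRa mesh network analyzer',
  steps: [{ title: 'Example step' }],                     // placeholder, not the real 4 steps
  footerLinks: [{ label: 'Packets', href: '#/packets' }], // placeholder link
};

// Simplified stand-in for computeEffective(): defaults, then server values, then user overrides.
function computeEffectiveHome(serverHome, userHome) {
  return { ...DEFAULT_HOME, ...(serverHome || {}), ...(userHome || {}) };
}

// With `home: null` from the server, adding one FAQ entry no longer wipes the steps.
const effective = computeEffectiveHome(null, { checklist: [{ q: 'Example?', a: 'Yes.' }] });
```
Starting from the defaults instead of from `{}` is what keeps `steps` present when the only user override is a checklist entry.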
## Tests
- **Go**: `TestConfigThemeHomeDefaults` — verifies `/api/config/theme`
returns non-null home with heroTitle, steps, footerLinks when no config
is set
- **JS**: Two new tests in `test-frontend-helpers.js` — verifies
`computeEffective` provides defaults when home is null, and that user
overrides merge correctly with defaults
Co-authored-by: you <you@example.com>
## Summary
Fixes the neighbor affinity graph returning empty results despite
abundant ADVERT data in the store.
**Root cause:** `extractFromNode()` in `neighbor_graph.go` only checked
for `"from_node"` and `"from"` fields in the decoded JSON, but real
ADVERT packets store the originator public key as `"pubKey"`. This meant
`fromNode` was always empty, so:
- Zero-hop edges (originator↔observer) were never created
- Originator↔path[0] edges were never created
- Only observer↔path[last] edges could be created (and only for
non-empty paths)
**Fix:** Check `"pubKey"` first in `extractFromNode()`, then fall
through to `"from_node"` and `"from"` for other packet types.
## Bugs Fixed
| Bug | Issue | Fix |
|-----|-------|-----|
| Empty graph results | #522 | `extractFromNode()` now reads `pubKey` field from ADVERTs |
| 3-4s response time | #523 comment | Graph was rebuilding correctly with 60s TTL cache; the slow response was due to iterating all packets finding zero matches. With edges now being found, the cache works as designed. |
| Incomplete visualization | #523 comment | Downstream of bug 1+2; fixed by fixing the builder |
| Accessibility | #523 comment | Added text-based neighbor list, dynamic aria-label, keyboard focus CSS, dashed lines for ambiguous edges, confidence symbols |
## Changes
- **`cmd/server/neighbor_graph.go`** — Fixed `extractFromNode()` to
check `pubKey` field (real ADVERT format)
- **`cmd/server/neighbor_graph_test.go`** — Added 2 new tests:
`TestBuildNeighborGraph_AdvertPubKeyField` (real ADVERT format) and
`TestBuildNeighborGraph_OneByteHashPrefixes` (1-byte prefix collision
scenario)
- **`public/analytics.js`** — Added accessible text-based neighbor list,
dynamic aria-label, dashed line pattern for ambiguous edges
- **`public/style.css`** — Added `:focus-visible` keyboard focus
indicator for canvas
## Testing
All Go tests pass (`go test ./... -count=1`). New tests verify the fix
prevents regression.
Fixes #523, Fixes #522
---------
Co-authored-by: you <you@example.com>
Fixes #518, Fixes #514, Fixes #515, Fixes #516
## Summary
Fixes all customizer v2 bugs from the consolidated tracker (#518). Both
server and client changes.
## Server Changes (`routes.go`)
- **typeColors defaults** — added all 10 type color defaults matching
`roles.js` `TYPE_COLORS`. Previously returned `{}`, causing all type
colors to render as black.
- **themeDark defaults** — added 22 dark mode color defaults matching
the Default preset. Previously returned `{}`, causing dark mode to have
no server-side defaults.
## Client Changes (`customize-v2.js`)
- [x] **P0: Phantom override cleanup on init** — new
`_cleanPhantomOverrides()` runs on startup, scanning
`cs-theme-overrides` and removing any values that match server defaults
(arrays via `JSON.stringify`, scalars via `===`).
- [x] **P1: `setOverride` auto-prunes matching defaults** — after
debounced write, iterates the delta and removes any key whose value
matches the server default. Prevents phantom overrides from
accumulating.
- [x] **P1: `_countOverrides` counts only real diffs** — now iterates
keys and calls `_isOverridden()` instead of blindly counting
`Object.keys().length`. Badge count reflects actual overrides only.
- [x] **P1: `_isOverridden` handles arrays/objects** — uses
`JSON.stringify` comparison for non-scalar values (home.steps,
home.checklist, etc.).
- [x] **P1: Type color fallback** — `_renderNodes()` falls back to
`window.TYPE_COLORS` when effective typeColors are empty, preventing
black color swatches.
- [x] **P1: Dark/light toggle re-renders panel** — MutationObserver on
`data-theme` now calls `_refreshPanel()` when panel is open, so
switching modes updates the Theme tab immediately.
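The phantom-override cleanup and the array-aware comparison can be illustrated with a small sketch. The helper names below are hypothetical and the two-level `{ section: { key: value } }` shape is an assumption about the delta format, not a copy of `customize-v2.js`.
```js
// Scalars compare with ===; arrays and objects compare by their JSON text.
function sameValue(a, b) {
  const nonScalar = (v) => v !== null && typeof v === 'object';
  if (nonScalar(a) || nonScalar(b)) return JSON.stringify(a) === JSON.stringify(b);
  return a === b;
}

// Hypothetical helper: drop every override whose value equals the server default.
function cleanPhantomOverrides(overrides, serverDefaults) {
  const cleaned = {};
  for (const [section, values] of Object.entries(overrides)) {
    const defaults = serverDefaults[section] || {};
    for (const [key, value] of Object.entries(values)) {
      if (sameValue(value, defaults[key])) continue; // phantom: matches the default
      (cleaned[section] = cleaned[section] || {})[key] = value;
    }
  }
  return cleaned;
}

// The badge count then reflects only real differences.
function countOverrides(overrides, serverDefaults) {
  const cleaned = cleanPhantomOverrides(overrides, serverDefaults);
  return Object.values(cleaned).reduce((n, s) => n + Object.keys(s).length, 0);
}
```
One caveat of comparing via `JSON.stringify` is that it is sensitive to key order, so two objects with the same fields in a different order would count as different.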
## Tests
6 new unit tests added to `test-customizer-v2.js`:
- Phantom scalar overrides cleaned on init
- Phantom array overrides cleaned on init
- Real overrides preserved after cleanup
- `isOverridden` handles matching arrays (returns false)
- `isOverridden` handles differing arrays (returns true)
- `setOverride` prunes value matching server default
All 48 tests pass. Go tests pass.
---------
Co-authored-by: you <you@example.com>
Removes a stale `<<<<<<< HEAD` conflict marker that was accidentally
left in during the PR #510 rebase. This breaks Playwright E2E tests in
CI.
One-line fix — line 1311 deletion.
Co-authored-by: you <you@example.com>
## Summary
Adds a **Neighbor Graph** tab to the Analytics page — an interactive
force-directed graph visualization of the mesh network's neighbor
affinity data.
Part of #482 (Milestone 7 — Analytics Graph Visualization)
## What's New
### Neighbor Graph Tab
- New "Neighbor Graph" tab in the analytics tab bar
- Force-directed graph layout using HTML5 Canvas (vanilla JS, no
external libs)
- Nodes rendered as circles, colored by role using existing
`ROLE_COLORS`
- Edges as lines with thickness proportional to affinity score
- Ambiguous edges highlighted in yellow
### Interactions
- **Click node** → navigates to node detail page (`#/nodes/{pubkey}`)
- **Hover node** → tooltip showing name, role, neighbor count
- **Drag nodes** → rearrange layout interactively
- **Mouse wheel** → zoom in/out (towards cursor position)
- **Drag background** → pan the view
### Filters
- **Role checkboxes** — toggle repeater, companion, room, sensor
visibility
- **Minimum score slider** — filter out weak edges (0.00–1.00)
- **Confidence filter** — show all / high confidence only / hide
ambiguous
### Stats Summary
Displays above the graph: total nodes, total edges, average score,
resolved %, ambiguous count
### Data Source
Uses `GET /api/analytics/neighbor-graph` endpoint from M2, with region
filtering via the shared RegionFilter component.
## Performance
- Canvas-based rendering (not SVG) for performance with large graphs
- Force simulation uses `requestAnimationFrame` with cooling/dampening —
stops iterating when layout stabilizes
- O(n²) repulsion is acceptable for typical mesh sizes (~500 nodes); for
larger meshes, a Barnes-Hut approximation could be added later
- Animation frame is properly cleaned up on page destroy
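To make the O(n²) repulsion and the cooling behaviour concrete, here is a rough sketch of a naive force-directed step. It is not the implementation in `analytics.js`: every constant and the `stepLayout`/`runLayout` names are illustrative assumptions, and a plain loop stands in for the `requestAnimationFrame` callbacks.
```js
// One step of a naive force-directed layout. All constants are illustrative guesses.
// nodes: [{ x, y, vx, vy }]; edges: [{ source, target, score }] with indices into nodes.
function stepLayout(nodes, edges, alpha) {
  const REPULSION = 500, SPRING = 0.05, REST_LENGTH = 60, DAMPING = 0.85;

  // O(n^2) pairwise repulsion: every pair of nodes pushes apart.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const a = nodes[i], b = nodes[j];
      const dx = b.x - a.x, dy = b.y - a.y;
      const distSq = Math.max(dx * dx + dy * dy, 0.01); // avoid division by zero
      const dist = Math.sqrt(distSq);
      const force = (REPULSION / distSq) * alpha;
      const fx = (dx / dist) * force, fy = (dy / dist) * force;
      a.vx -= fx; a.vy -= fy;
      b.vx += fx; b.vy += fy;
    }
  }

  // Spring force along edges, scaled by the affinity score.
  for (const e of edges) {
    const a = nodes[e.source], b = nodes[e.target];
    const dx = b.x - a.x, dy = b.y - a.y;
    const dist = Math.max(Math.hypot(dx, dy), 0.1);
    const pull = SPRING * (dist - REST_LENGTH) * (e.score ?? 1) * alpha;
    const fx = (dx / dist) * pull, fy = (dy / dist) * pull;
    a.vx += fx; a.vy += fy;
    b.vx -= fx; b.vy -= fy;
  }

  // Integrate with damping and report total movement for the stability check.
  let movement = 0;
  for (const n of nodes) {
    n.vx *= DAMPING; n.vy *= DAMPING;
    n.x += n.vx; n.y += n.vy;
    movement += Math.abs(n.vx) + Math.abs(n.vy);
  }
  return movement;
}

// Run until cooled or stable. In the browser each iteration would be one rAF callback.
function runLayout(nodes, edges, maxSteps = 300) {
  let alpha = 1;
  for (let step = 0; step < maxSteps; step++) {
    const movement = stepLayout(nodes, edges, alpha);
    alpha *= 0.98; // cooling
    if (alpha < 0.01 || movement < 0.01) return step + 1;
  }
  return maxSteps;
}
```
The nested loop is where the quadratic cost comes from; a Barnes-Hut approximation would replace that loop with a quadtree traversal.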
## Tests
- Updated tab count assertion (≥10 tabs)
- New Playwright test: tab loads, canvas renders, stats shown (≥3 stat
cards)
- New Playwright test: filter changes update stats
## Files Changed
- `public/analytics.js` — new tab + full graph visualization
implementation
- `test-e2e-playwright.js` — 2 new tests + updated assertion
---------
Co-authored-by: you <you@example.com>
## Summary
Milestone 4 of #482: adds affinity-aware hop resolution to improve
disambiguation accuracy across all hop resolution in the app.
### What changed
**Backend — `prefixMap.resolveWithContext()` (store.go)**
New method that applies a 4-tier disambiguation priority when multiple
nodes match a hop prefix:
| Priority | Strategy | When it wins |
|----------|----------|-------------|
| 1 | **Affinity graph score** | Neighbor graph has data, score ratio ≥ 3× runner-up |
| 2 | **Geographic proximity** | Context nodes have GPS, pick closest candidate |
| 3 | **GPS preference** | At least one candidate has coordinates |
| 4 | **First match** | No signal; current naive fallback |
The existing `resolve()` method is unchanged for backward compatibility.
New callers that have context (originator, observer, adjacent hops) can
use `resolveWithContext()` for better results.
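The tier order can be sketched as below. The real method is Go code in `store.go`; this is a JavaScript illustration only, kept in the same language as the other examples here. The data shapes, the `tier` labels, and the `approxDist` helper are all assumptions rather than the actual `resolveWithContext()` or `geoDistApprox` implementations.
```js
// Rough planar distance in degrees; adequate only for ranking nearby candidates.
function approxDist(a, b) {
  const dLat = a.lat - b.lat;
  const dLon = (a.lon - b.lon) * Math.cos((((a.lat + b.lat) / 2) * Math.PI) / 180);
  return Math.hypot(dLat, dLon);
}

const hasGps = (n) => Number.isFinite(n.lat) && Number.isFinite(n.lon);

// candidates: nodes matching the hop prefix. affinity: Map of pubkey -> score.
// contextNodes: originator, observer, adjacent hops. All shapes are assumed.
function resolveWithContext(candidates, affinity, contextNodes) {
  if (candidates.length === 0) return null;
  if (candidates.length === 1) return { node: candidates[0], tier: 'unique' };

  // Tier 1: affinity score, only when the best beats the runner-up by 3x or more.
  const ranked = candidates
    .map((node) => ({ node, score: affinity.get(node.pubkey) || 0 }))
    .sort((a, b) => b.score - a.score);
  if (ranked[0].score > 0 && ranked[0].score >= 3 * ranked[1].score) {
    return { node: ranked[0].node, tier: 'affinity' };
  }

  // Tier 2: candidate closest to any context node that has GPS.
  const anchors = contextNodes.filter(hasGps);
  const located = candidates.filter(hasGps);
  if (anchors.length > 0 && located.length > 0) {
    let best = null;
    for (const node of located) {
      const d = Math.min(...anchors.map((a) => approxDist(node, a)));
      if (best === null || d < best.d) best = { node, d };
    }
    return { node: best.node, tier: 'geo' };
  }

  // Tier 3: prefer any candidate that has coordinates at all.
  if (located.length > 0) return { node: located[0], tier: 'gps' };

  // Tier 4: no signal, fall back to the first match.
  return { node: candidates[0], tier: 'first' };
}
```
Requiring a clear margin in tier 1 means a close race between two candidates falls through to the geographic tiers instead of being decided by a small score difference.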
**API — `handleResolveHops` (routes.go)**
Enhanced `/api/resolve-hops` endpoint:
- New query params: `from_node`, `observer` — provide context for
affinity scoring
- New response fields on `HopCandidate`: `affinityScore` (float,
0.0–1.0)
- New response fields on `HopResolution`: `bestCandidate` (pubkey when
confident), `confidence` (one of `unique_prefix`, `neighbor_affinity`,
`ambiguous`)
- Backward compatible: without context params, behavior is identical to
before (just adds `confidence` field)
**Types (types.go)**
- `HopCandidate.AffinityScore *float64`
- `HopResolution.BestCandidate *string`
- `HopResolution.Confidence string`
### Tests
- 7 unit tests for `resolveWithContext` covering all 4 priority tiers +
edge cases
- 2 unit tests for `geoDistApprox`
- 4 API tests for enhanced `/api/resolve-hops` response shape
- All existing tests pass (no regressions)
### Impact
This improves ALL hop resolution across the app — analytics, route
display, subpath analysis, and any future feature that resolves hop
prefixes. The affinity graph (from M1/M2) now feeds directly into
disambiguation decisions.
Part of #482
---------
Co-authored-by: you <you@example.com>
## Summary
Replace broken client-side path walking in `selectReferenceNode()` with
server-side `/api/nodes/{pubkey}/neighbors` API call, fixing #484 where
Show Neighbors returned zero results due to hash collision
disambiguation failures.
**Fixes #484** | Part of #482
## What changed
### `public/map.js` — `selectReferenceNode()` function
**Before:** Client-side path walking — fetched
`/api/nodes/{pubkey}/paths`, walked each path to find hops adjacent to
the selected node by comparing full pubkeys. This fails on hash
collisions because path hops only contain short prefixes (1-2 bytes),
and the hop resolver can pick the wrong collision candidate.
**After:** Server-side affinity resolution — fetches
`/api/nodes/{pubkey}/neighbors?min_count=3` which uses the neighbor
affinity graph (built in M1/M2) to return disambiguated neighbors. For
ambiguous edges, all candidates are included in the neighbor set (better
to show extra markers than miss real neighbors).
**Fallback:** When the affinity API returns zero neighbors (cold start,
insufficient data), the function falls back to the original path-walking
approach. This ensures the feature works even before the affinity graph
has accumulated enough observations.
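The affinity-first, path-walking-second flow might look roughly like this. It is a sketch under stated assumptions: `loadNeighborPubkeys`, the injected `fetchJson` and `walkPaths` callbacks, and the `{ neighbors: [{ pubkey }] }` response shape are hypothetical, and treating a failed request like an empty result is a choice made here, not something the PR states.
```js
// fetchJson is injected so the sketch runs outside a browser; in map.js it would wrap fetch().
async function loadNeighborPubkeys(pubkey, fetchJson, walkPaths) {
  const url = `/api/nodes/${encodeURIComponent(pubkey)}/neighbors?min_count=3`;
  try {
    const data = await fetchJson(url);
    const neighbors = (data && data.neighbors) || []; // assumed response shape
    if (neighbors.length > 0) {
      return { source: 'affinity', pubkeys: new Set(neighbors.map((n) => n.pubkey)) };
    }
  } catch (err) {
    // Assumption of this sketch: a failed request is handled like an empty result.
  }
  // Cold start or insufficient data: fall back to the original path walking.
  const walked = await walkPaths(pubkey);
  return { source: 'paths', pubkeys: new Set(walked) };
}
```
Returning the `source` alongside the set makes it easy for a test to check which branch produced the result.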
## Tests
4 new Playwright E2E tests (in both `test-show-neighbors.js` and
`test-e2e-playwright.js`):
1. **Happy path** — Verifies the `/neighbors` API is called and the
reference node UI activates
2. **Hash collision disambiguation** — Two nodes sharing prefix "C0" get
different neighbor sets via the affinity API (THE critical test for
#484)
3. **Fallback to path walking** — Empty affinity response triggers
fallback to `/paths` API
4. **Ambiguous candidates** — Ambiguous edge candidates are included in
the neighbor set
All tests use Playwright route interception to mock API responses,
testing the frontend logic independently of server state.
## Spec reference
See [neighbor-affinity-graph.md](docs/specs/neighbor-affinity-graph.md),
sections:
- "Replacing Show Neighbors on the map" (lines ~461-504)
- "Milestone 3: Show Neighbors Fix (#484)" (lines ~1136-1152)
- Test specs a & b (lines ~754-800)
---------
Co-authored-by: you <you@example.com>
## Summary
Implements the customizer v2 per the [approved
spec](docs/specs/customizer-rework.md), replacing the v1 customizer's
scattered state management with a clean event-driven architecture.
Resolves #502.
## What Changed
### New: `public/customize-v2.js`
Complete rewrite of the customizer as a self-contained IIFE with:
- **Single localStorage key** (`cs-theme-overrides`) replacing 7
scattered keys
- **Three state layers:** server defaults (immutable) → user overrides
(delta) → effective config (computed)
- **Full data flow pipeline:** `write → read-back → merge → atomic
SITE_CONFIG assign → apply CSS → dispatch theme-changed`
- **Color picker optimistic CSS** (Decision #12): `input` events update
CSS directly for responsiveness; `change` events trigger the full
pipeline
- **Override indicator dots** (●) on each field — click to reset
individual values
- **Section-level override count badges** on tabs
- **Browser-local banner** in panel header: "These settings are saved in
your browser only"
- **Auto-save status indicator** in footer: "All changes saved" /
"Saving..." / "⚠️ Storage full"
- **Export/Import** with full shape validation (`validateShape()`)
- **Presets** flow through the standard pipeline
(`writeOverrides(presetData) → pipeline`)
- **One-time migration** from 7 legacy localStorage keys (exact field
mapping per spec)
- **Validation** on all writes: color format, opacity range, timestamp
enum values
- **QuotaExceededError handling** with visible user warning
### Modified: `public/app.js`
Replaced ~80 lines of inline theme application code with a 15-line
`_customizerV2.init(cfg)` call. The customizer v2 handles all merging,
CSS application, and global state updates.
### Modified: `public/index.html`
Swapped `customize.js` → `customize-v2.js` script tag.
### Added: `docs/specs/customizer-rework.md`
The full approved spec, included in the repo for reference.
## Migration
On first page load:
1. Checks if `cs-theme-overrides` already exists → skip if yes
2. Reads all 7 legacy keys (`meshcore-user-theme`,
`meshcore-timestamp-*`, `meshcore-heatmap-opacity`,
`meshcore-live-heatmap-opacity`)
3. Maps them to the new delta format per the spec's field-by-field
mapping
4. Writes to `cs-theme-overrides`, removes all legacy keys
5. Continues with normal init
Users with existing customizations will see them preserved
automatically.
## Dark/Light Mode
- `theme` section stores light mode overrides, `themeDark` stores dark
mode overrides
- `meshcore-theme` localStorage key remains **separate** (view
preference, not customization)
- Switching modes re-runs the full pipeline with the correct section
## Testing
- All existing tests pass (`test-packet-filter.js`, `test-aging.js`,
`test-frontend-helpers.js`)
- Old `customize.js` is NOT modified — left in place for reference but
no longer loaded
## Not in Scope (per spec)
- Undo/redo stack
- Cross-tab synchronization
- Server-side admin import endpoint
- Map config / geo-filter overrides
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #483: navigating away from the live page while matrix/hop
animations are running throws `TypeError: Cannot read properties of null
(reading 'addLayer')`.
## Root Cause
`destroy()` sets `animLayer = null` and `pathsLayer = null`, but
in-flight `requestAnimationFrame` callbacks continue executing and
attempt to call `.addTo(animLayer)` or `.removeLayer()` on the now-null
references.
The entry guards at the top of `drawMatrixLine()` and
`drawAnimatedLine()` only protect the initial call — not the rAF
continuation loops inside `tick()`, `fadeOut()`, `animateLine()`, and
`animateFade()`.
## Fix
Added null-guards (`if (!animLayer || !pathsLayer) return`) at the top
of all four rAF callback functions in `live.js`:
1. **`tick()`** (line ~2203) — matrix animation main loop
2. **`fadeOut()`** (line ~2253) — matrix animation fade-out
3. **`animateLine()`** (line ~2302) — standard line animation main loop
4. **`animateFade()`** (line ~2337) — standard line fade-out
This pattern is already used elsewhere in the file (e.g., line 1873,
1886) for the same purpose.
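The failure mode and the guard can be reproduced in a few lines. This is a minimal sketch with stand-ins: the `raf` queue replaces the browser's `requestAnimationFrame`, the layer objects are plain objects rather than Leaflet layers, and the loop body is invented for illustration.
```js
// Minimal stand-ins so the sketch runs under Node.
let animLayer = { items: [] };
let pathsLayer = {};
const pending = [];
const raf = (cb) => pending.push(cb);

let frames = 0;
function tick() {
  // Guard on every frame, not only on entry: destroy() may have run in between.
  if (!animLayer || !pathsLayer) return;
  animLayer.items.push(frames++);
  if (frames < 5) raf(tick);
}

function destroy() {
  animLayer = null; // in-flight callbacks must tolerate this
  pathsLayer = null;
}

tick();            // frame 0 runs and schedules frame 1
destroy();         // user navigates away mid-animation
pending.shift()(); // the queued frame fires and returns early instead of throwing
```
Without the guard, the queued frame would dereference `animLayer` after `destroy()` and throw the `TypeError` from the report.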
## Testing
- All unit tests pass (`npm test` — 0 failures)
- Go server tests pass (`cmd/server` + `cmd/ingestor`)
- Change is defensive only (early return on null) — no behavioral change
when layers exist
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #504: Expanding a packet in the packets UI showed the same path
on every observation instead of each observation's unique path.
## Root Cause
PR #400 (fixing #387) added caching of `JSON.parse` results as
`_parsedPath` and `_parsedDecoded` properties on packet objects. When
observation packets are created via object spread (`{...parentPacket,
...obs}`), these cache properties are copied from the parent. Subsequent
calls to `getParsedPath(obsPacket)` hit the stale cache and return the
parent's path, ignoring the observation's own `path_json`.
## Fix
After every object spread that creates an observation packet from a
parent packet, delete the cache properties so they get re-parsed from
the observation's own data:
```js
delete obsPacket._parsedPath;
delete obsPacket._parsedDecoded;
```
Applied to all 5 spread sites in `public/packets.js`:
- Line 271: detail pane observation selection
- Line 504: flat view observation expansion
- Line 840: grouped view observation expansion
- Line 1012: child observation selection in grouped view
- Line 1982: WebSocket live update observation expansion
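The stale-cache behaviour is easy to reproduce in isolation. The sketch below repeats the `getParsedPath` helper from the #387 change so it is self-contained; the packet and observation objects are made-up sample data.
```js
// Same helper as introduced for #387, repeated here so the sketch is self-contained.
function getParsedPath(p) {
  if (p._parsedPath === undefined) {
    try { p._parsedPath = JSON.parse(p.path_json || '[]'); } catch { p._parsedPath = []; }
  }
  return p._parsedPath;
}

const parentPacket = { hash: 'h1', path_json: '["AA","BB"]' };
getParsedPath(parentPacket); // populates parentPacket._parsedPath

const obs = { observer_id: 'obs-2', path_json: '["CC"]' }; // made-up observation

// Bug: the spread copies the parent's cache along with everything else.
const stale = { ...parentPacket, ...obs };
const stalePath = getParsedPath(stale); // parent's path, not ["CC"]

// Fix: drop the cache properties so the next call re-parses path_json.
const fresh = { ...parentPacket, ...obs };
delete fresh._parsedPath;
delete fresh._parsedDecoded;
const freshPath = getParsedPath(fresh); // the observation's own path
```
The spread copies `_parsedPath` because it is an ordinary enumerable property, which is why each spread site needs the explicit `delete`.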
## Tests
Added 2 new tests in `test-frontend-helpers.js`:
1. Verifies observation packets get their own path after cache
invalidation (not the parent's)
2. Verifies observation path differs from parent path after cache
invalidation
All 431 frontend helper tests pass. All 62 packet filter tests pass.
---------
Co-authored-by: you <you@example.com>
## Problem
As described in #387, `JSON.parse()` is called repeatedly on the same
packet data across render cycles. With 30K packets, each render cycle
parses 60K+ JSON strings unnecessarily.
## Analysis
The server sends `decoded_json` and `path_json` as JSON strings. The
frontend parses them on-demand in multiple locations:
- `renderTableRows()` — for every row, every render
- WebSocket handling — when processing filtered packets
- `loadPackets()` — during packet loading
- Detail view rendering — when showing packet details
This creates O(n×m) parsing overhead where n = packet count and m =
render cycles.
## Solution
Add cached parse helpers that store parsed results on the packet object:
```javascript
function getParsedPath(p) {
  if (p._parsedPath === undefined) {
    try { p._parsedPath = JSON.parse(p.path_json || '[]'); } catch { p._parsedPath = []; }
  }
  return p._parsedPath;
}
```
Same pattern for `getParsedDecoded()`.
## Changes
- `public/packets.js`: Add helpers + replace 15+ JSON.parse calls
- `public/live.js`: Add helpers + replace 5 JSON.parse calls
## Benchmarks
Before: 60K+ JSON.parse calls per render cycle (30K packets)
After: ~30K parse calls (one per packet, cached thereafter)
Memory impact: Negligible (stores parsed objects that were already
created temporarily)
## Notes
- Cache uses `undefined` check to distinguish "not cached" from "cached
empty result"
- Property names `_parsedPath` and `_parsedDecoded` prefixed to avoid
collision with server fields
- No breaking changes to existing code paths
Fixes #387
---------
Co-authored-by: P. Clawmogorov <262173731+Alm0stSurely@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
- `db.GetNodes` accepted a `region` param from the HTTP handler but
never used it — every region-filter selection was silently ignored and
all nodes were always returned
- Added a subquery filtering `nodes.public_key` against ADVERT
transmissions (payload_type=4) observed by observers with matching IATA
codes
- Handles both v2 (`observer_id TEXT`) and v3 (`observer_idx INT`)
schemas
## Test plan
- [x] 4 new subtests added to `TestGetNodesFiltering`: SJC (1 node), SFO
(1 node), SJC,SFO multi (1 node deduped), AMS unknown (0 nodes)
- [x] All existing Go tests still pass
- [x] Deploy to staging, open `/nodes`, select a region in the filter
bar — only nodes observed by observers in that region should appear
Closes #496
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `indexByNode` was calling `json.Unmarshal` for every packet during
`Load()` and `IngestNewFromDB()`, even channel messages and other
payloads that can never contain node pubkey fields
- All three target fields (`"pubKey"`, `"destPubKey"`, `"srcPubKey"`)
share the common substring `"ubKey"` — added a `strings.Contains`
pre-check that skips the JSON parse entirely for packets that don't
match
- At 30K+ packets on startup, this eliminates the majority of
`json.Unmarshal` calls in `indexByNode` (channel messages, status
packets, etc. all bypass it)
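The pre-check idea is shown here in JavaScript to stay consistent with the other examples; the actual change is Go code using `strings.Contains` ahead of `json.Unmarshal`. The function name `nodeKeysFromDecoded` and the `stats` counter are hypothetical, added only to make the skipped parse observable.
```js
// Hypothetical JS analogue of the Go pre-check. Returns the pubkeys found in a decoded JSON string.
function nodeKeysFromDecoded(decodedJson, stats) {
  // Cheap substring test: "pubKey", "destPubKey", and "srcPubKey" all contain "ubKey".
  if (!decodedJson || !decodedJson.includes('ubKey')) return [];
  stats.parsed++; // counts only the expensive path
  let decoded;
  try { decoded = JSON.parse(decodedJson); } catch { return []; }
  return ['pubKey', 'destPubKey', 'srcPubKey']
    .map((field) => decoded[field])
    .filter((v) => typeof v === 'string' && v !== '');
}
```
A payload that merely mentions `ubKey` in free text would pass the pre-check and be parsed as before, so the check can only skip work, never change the result.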
## Test plan
- [x] 5 new subtests in `TestIndexByNodePreCheck`: ADVERT with pubKey
indexed, destPubKey indexed, channel message skipped, empty JSON
skipped, duplicate hash deduped
- [x] All existing Go tests pass
- [x] Deploy to staging and verify node-filtered packet queries still
work correctly
Closes #376
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `bufferPacket()` was overwriting `_ts` with `Date.now()` (receive
time) for every live WS packet
- Packets arriving in the same batch all got identical timestamps,
making the message history show the same "Xs ago" for every entry (e.g.,
all show "5s ago")
- Fix: use `pkt.timestamp || pkt.created_at` (mirroring
`dbPacketToLive`) so each packet reflects its actual origination time,
falling back to `Date.now()` only when the packet has no timestamp
## Root cause
```js
// before
pkt._ts = Date.now();
// after
pkt._ts = new Date(pkt.timestamp || pkt.created_at || Date.now()).getTime();
```
The WS broadcast includes `timestamp` (= `tx.FirstSeen`) in the packet
map (store.go:1182), so the field is always present for real packets.
## Test plan
- [x] Open Live page, observe packets arriving — each should show its
own relative time, not all the same value
- [x] `node test-frontend-helpers.js` passes (235 tests, 0 failures)
Closes #475
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: you <you@example.com>
## Summary
- `BuildBreakdown` was never ported from the deleted Node.js
`decoder.js` to Go — the server has returned `breakdown: {}` since the
Go migration (commit `742ed865`), so `createColoredHexDump()` and
`buildHexLegend()` in the frontend always received an empty `ranges`
array and rendered everything as monochrome
- Implemented `BuildBreakdown()` in `decoder.go` — computes labeled byte
ranges matching the frontend's `LABEL_CLASS` map: `Header`, `Transport
Codes`, `Path Length`, `Path`, `Payload`; ADVERT packets get sub-ranges:
`PubKey`, `Timestamp`, `Signature`, `Flags`, `Latitude`, `Longitude`,
`Name`
- Wired into `handlePacketDetail` (was `struct{}{}`)
- Also adds per-section color classes to the field breakdown table
(`section-header`, `section-transport`, `section-path`,
`section-payload`) so the table rows get matching background tints
## Test plan
- [x] Open any packet detail pane — hex dump should show color-coded
sections (red header, orange path length, blue transport codes, green
path hops, yellow/colored payload)
- [x] Legend below action buttons should appear with color swatches
- [x] ADVERT packets: PubKey/Timestamp/Signature/Flags each get their
own distinct color
- [x] Field breakdown table section header rows should be tinted per
section
- [x] 8 new Go tests: all pass
Closes #329
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Addresses review feedback from PR #487 (nodes.js coverage).
### Changes
1. **Replace fragile `exportInternals` regex source patching with stable
test hooks** — `getStatusInfo` and `getStatusTooltip` are now exposed
via `window._nodesGetStatusInfo` and `window._nodesGetStatusTooltip`,
matching the existing pattern used by all other test-accessible
functions. The brittle regex `.replace()` approach that modified source
code at runtime has been removed entirely.
2. **Strengthen weak null assertion** — The `renderNodeTimestampHtml
handles null` test previously asserted `html.includes('—') ||
html.length > 0`, which is a near-tautology (any non-empty string
passes). Now strictly asserts `html.includes('—')`.
### Files changed
- `public/nodes.js` — 2 new test hook lines
- `test-frontend-helpers.js` — removed 21-line `exportInternals` branch,
updated tests to use hooks
### Testing
- All 309 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Closes review items from #487.
Co-authored-by: you <you@example.com>
## Summary
Add 67 new unit tests for `nodes.js`, raising frontend helper test count
from 233 to 300.
Part of #344 — nodes.js coverage.
## What's Tested
### Sort System (`toggleSort`, `sortNodes`, `sortArrow`)
- Direction toggling on same column (asc↔desc)
- Default sort directions per column type (name→asc, last_seen→desc,
advert_count→desc)
- localStorage persistence of sort state
- All 5 sort columns: `name`, `public_key`, `role`, `last_seen`,
`advert_count`
- Both ascending and descending for each column
- Case-insensitive name sorting
- Unnamed nodes sort last
- Timestamp fallback chain: `last_heard` → `last_seen` → 0
- Missing timestamp handling
- Empty array edge case
- Unknown column graceful handling
- `sortArrow` rendering for active (▲/▼) and inactive columns
### Status Calculation (`getStatusInfo`, `getStatusTooltip`)
- `_lastHeard` takes priority over `last_heard`
- `last_seen` used as fallback when `last_heard` missing
- No-timestamp nodes return stale with `lastHeardMs: 0`
- Infrastructure threshold (72h) for rooms
- Standard threshold (24h) for sensors and companions
- Explanation text varies by role and status
- Unknown role defaults to gray color `#6b7280`
- All role/status tooltip combinations
### Timestamp Rendering (`renderNodeTimestampHtml`, `renderNodeTimestampText`)
- HTML output includes tooltip and `timestamp-text` class
- Future timestamps show ⚠️ warning icon
- Null input produces dash
- Text output is plain (no HTML tags)
### Favorites Sync (`syncClaimedToFavorites`)
- Claimed pubkeys added to favorites
- No-op when all already synced
- Empty my-nodes handled
- Missing localStorage keys don't crash
## Implementation
- Added test hooks on `window` for closure-scoped functions
(non-invasive, follows existing pattern)
- Tests use `vm.createContext` to load real `nodes.js` code — no copies
- No new dependencies
## Test Results
```
Frontend helpers: 300 passed, 0 failed
```
---------
Co-authored-by: you <you@example.com>
## Problem
On a long-running session the packets page consumed 8 GB of browser
memory and 20%+ CPU on an 8-core machine. Root causes:
1. **Unbounded `packets` array growth via WebSocket** —
`packets.unshift()` was called for every new unique hash, but nothing
ever trimmed the array. After hours of live traffic the array grew well
past the initial 50 k load limit.
2. **Unbounded `pauseBuffer`** — all WS messages queued while paused, no
cap.
3. **Unbounded `_children` growth** — expanded groups received a
`unshift(p)` on every matching WS message with no size limit.
4. **O(n) `observers.find()` inside the O(n) render loop** — with 50 k
rows, each render triggered up to 50 k linear scans through the
observers list.
5. **Full DOM rebuild on every WS message** — `renderTableRows()` was
called synchronously on every WebSocket batch, reconstructing the entire
table on each incoming packet.
## Changes
- `packets[]` is now trimmed to `PACKET_LIMIT` after each WS batch;
evicted entries are also removed from `hashIndex` to prevent stale
references.
- `pauseBuffer` capped at 2 000 entries (oldest dropped).
- `_children` capped at 200 entries on WS prepend.
- `renderTableRows()` on the WS path is debounced to 200 ms, batching
rapid updates into a single redraw.
- `observersById = new Map()` pre-built from the observers array; all
`observers.find()` calls in the render loop and WS filter replaced with
O(1) `Map.get()`.
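The three bounding techniques listed above can be sketched as follows. This is an illustration under assumptions, not the code in `packets.js`: the helper names are hypothetical, `PACKET_LIMIT` is set to a tiny value for demonstration, and the sketch assumes the oldest packets sit at the end of the array because new ones are prepended.
```js
const PACKET_LIMIT = 3; // tiny for illustration; the real limit is far larger

// Prepend a batch, then trim from the tail and evict trimmed hashes from the index.
function addBatch(packets, hashIndex, batch) {
  for (const p of batch) {
    if (hashIndex.has(p.hash)) continue; // only new unique hashes are prepended
    packets.unshift(p);
    hashIndex.set(p.hash, p);
  }
  while (packets.length > PACKET_LIMIT) {
    const evicted = packets.pop(); // oldest entry sits at the end
    hashIndex.delete(evicted.hash);
  }
}

// Build the lookup once instead of calling observers.find() per row.
function buildObserversById(observers) {
  return new Map(observers.map((o) => [o.id, o]));
}

// Trailing debounce: many calls within waitMs collapse into one.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```
A WebSocket handler could then call `addBatch(...)` followed by a debounced render, for example `const scheduleRender = debounce(renderTableRows, 200)`. Evicting from `hashIndex` in the same loop is what keeps the index from holding references to rows that are no longer displayed.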
## Test plan
- [x] Load the packets page and leave it running for several minutes
with live WebSocket traffic — memory in DevTools should remain stable
rather than growing continuously
- [x] Pause live updates, wait for several messages, then resume —
buffer replays correctly and display updates
- [x] Expand a packet group and leave it open during live traffic —
children update but don't grow past 200
- [x] Region filter still works correctly (relies on the observer Map
lookup)
- [x] Observer name / IATA badge renders correctly in grouped and flat
mode
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
Fixes #485: the app version was derived from `package.json` via
Node.js, which is a meaningless artifact for this Go project. This
caused version mismatches (e.g., v3.3.0 release showing "3.2.0") when
someone forgot to bump `package.json`.
## Changes
### `manage.sh`
- **Line 43**: Replace `node -p "require('./package.json').version"`
with `git describe --tags --match "v*"` — version is now derived
automatically from git tags
- **Line 515**: Add `--force` to `git fetch origin --tags` in setup
command
- **Line 1320**: Add `--force` to `git fetch origin --tags` in update
command — prevents "would clobber existing tag" errors when tags are
moved
### `package.json`
- Version field set to `0.0.0-use-git-tags` to make it clear this is not
the source of truth. File kept because npm scripts and devDependencies
are still used for testing.
## How it works
`git describe --tags --match "v*"` produces:
- `v3.3.0` — when on an exact tag
- `v3.3.0-3-gabcdef1` — when 3 commits after a tag (useful for
debugging)
- Falls back to `unknown` if no tags exist
## Testing
- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- All frontend unit tests pass (254/254)
- No changes to application logic — only build-time version derivation
Co-authored-by: you <you@example.com>
## Problem
Every PR that touches `public/` files requires manually bumping cache
buster timestamps in `index.html` (e.g. `?v=1775111407`). Since all PRs
change the same lines in the same file, this causes **constant merge
conflicts** — it's been the #1 source of unnecessary PR friction.
## Solution
Replace all hardcoded `?v=TIMESTAMP` values in `index.html` with a
`?v=__BUST__` placeholder. The Go server replaces `__BUST__` with the
current Unix timestamp **once at startup** when it reads `index.html`,
then serves the pre-processed HTML from memory.
Every server restart automatically picks up fresh cache busters — no
manual intervention needed.
## What changed
| File | Change |
|------|--------|
| `public/index.html` | All `v=1775111407` → `v=__BUST__` (28 occurrences) |
| `cmd/server/main.go` | `spaHandler` reads index.html at init, replaces `__BUST__` with Unix timestamp, serves from memory for `/`, `/index.html`, and SPA fallback |
| `cmd/server/helpers_test.go` | New `TestSpaHandlerCacheBust`: verifies placeholder replacement works for root, SPA fallback, and direct `/index.html` requests. Also added tests for root `/` and `/index.html` routes |
| `AGENTS.md` | Rule 3 updated: cache busters are now automatic, agents should not manually edit them |
## Testing
- `go build ./...` — compiles cleanly
- `go test ./...` — all tests pass (including new cache-bust tests)
- `node test-frontend-helpers.js && node test-packet-filter.js && node
test-aging.js` — all frontend tests pass
- No hardcoded timestamps remain in `index.html`
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: you <you@example.com>
## Summary
Fixes #457. The "Show direct neighbors" checkbox on the map was a UI
stub that did nothing. This PR implements the full feature.
## What Changed
### `public/map.js`
- **New state**: `selectedReferenceNode` (pubkey) and `neighborPubkeys`
(Set) track which node is the reference and who its direct neighbors are
- **`selectReferenceNode(pubkey, name)`**: Fetches
`/api/nodes/{pubkey}/paths`, parses path hops to find all nodes directly
adjacent to the reference node in any observed path, then auto-enables
the neighbor filter
- **Neighbor filter in `_renderMarkersInner()`**: When
`filters.neighbors` is on and a reference node is selected, only the
reference node and its direct (1-hop) neighbors are shown on the map
- **Popup "Show Neighbors" link**: Each node popup now has a "Show
Neighbors" action that sets it as the reference node
- **Sidebar UI hints**: Shows the reference node name when selected, or
a hint to click a node when the filter is enabled without a reference
- **Cleanup on `destroy()`**: Clears reference state and global handler
### `test-frontend-helpers.js`
- 6 new unit tests covering:
- Filter off shows all nodes
- Filter on without reference shows all nodes (graceful no-op)
- Filter on with reference + neighbors filters correctly
- Filter on with empty neighbor set shows only reference
- Neighbor filter respects role filters
- Neighbor extraction from path data
### `public/index.html`
- Cache buster bump
## How It Works
1. User clicks a node marker on the map → popup shows "Show Neighbors"
link
2. Clicking "Show Neighbors" fetches that node's paths from
`/api/nodes/{pubkey}/paths`
3. Adjacent hops in each path are identified as direct neighbors
4. The map filters to show only the reference node + its neighbors
5. The sidebar shows which node is the reference
6. Unchecking the checkbox restores the full node view
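Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in `map.js`: it assumes each path is an ordered array of hop pubkeys and each node object has a `pubkey` field, and the function names `extractNeighbors` and `applyNeighborFilter` are hypothetical.

```js
// Assumption: each path is an ordered array of hop pubkeys.
// Collect every hop that sits directly before or after the reference node.
function extractNeighbors(referencePubkey, paths) {
  const neighbors = new Set();
  for (const hops of paths) {
    hops.forEach((hop, i) => {
      if (hop !== referencePubkey) return;
      if (i > 0) neighbors.add(hops[i - 1]);
      if (i < hops.length - 1) neighbors.add(hops[i + 1]);
    });
  }
  neighbors.delete(referencePubkey); // the reference is never its own neighbor
  return neighbors;
}

// Keep only the reference node and its 1-hop neighbors.
// With the filter off, or no reference selected, this is a no-op.
function applyNeighborFilter(nodes, filterOn, referencePubkey, neighbors) {
  if (!filterOn || !referencePubkey) return nodes;
  return nodes.filter(
    n => n.pubkey === referencePubkey || neighbors.has(n.pubkey)
  );
}

const samplePaths = [['A', 'B', 'C'], ['D', 'B', 'E'], ['B', 'A'], ['F', 'G']];
const neighborSet = extractNeighbors('B', samplePaths);
const sampleNodes = ['A', 'B', 'C', 'D', 'E', 'F', 'G'].map(pubkey => ({ pubkey }));
const shown = applyNeighborFilter(sampleNodes, true, 'B', neighborSet);
console.log(shown.map(n => n.pubkey).join(','));
```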
## Test Results
```
Frontend helpers: 250 passed, 0 failed
Packet filter: 62 passed, 0 failed
```
---------
Co-authored-by: you <you@example.com>
## Summary
Related to #463. This is a partial fix: it addresses the packet path,
while the status message path still needs investigation. Observers were
incorrectly showing as offline despite actively forwarding packets.
## Root Cause
Observer `last_seen` was only updated when status topic messages
(`meshcore/<region>/<observer_id>/status`) were received via
`UpsertObserver`. When packets were ingested from an observer, the
observer's `last_seen` was **not** updated — only the `observer_idx` was
resolved for the observation record.
This meant observers with low traffic that published status messages
less frequently than the 10-minute online threshold would appear offline
on the observers page, even though they were clearly alive and
forwarding packets.
## Changes
**`cmd/ingestor/db.go`:**
- Added `stmtUpdateObserverLastSeen` prepared statement: `UPDATE
observers SET last_seen = ? WHERE rowid = ?`
- In `InsertTransmission`, after resolving `observer_idx`, update the
observer's `last_seen` to the packet timestamp
- This ensures any observer actively forwarding traffic stays marked as
online
**`cmd/ingestor/db_test.go`:**
- Added `TestInsertTransmissionUpdatesObserverLastSeen` — verifies that
inserting a packet from an observer updates its `last_seen` from a
backdated value to the packet timestamp
## Performance
The added `UPDATE` is a single-row update by `rowid` (primary key): a
direct key lookup in SQLite's table B-tree, with no secondary index
involved in finding the row. It runs once per packet insertion when an
observer is resolved, a code path that already performs a `SELECT` by
`rowid`. No measurable impact on ingestion throughput.
## Test Results
All existing tests pass:
- `cmd/ingestor`: 26.6s ✅
- `cmd/server`: 3.7s ✅
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #433. Replace the inaccurate Euclidean distance approximation in
`analytics.js` hop distances with proper haversine calculation, matching
the server-side computation introduced in PR #415.
## Problem
PR #415 moved collision analysis server-side and switched from the
frontend's Euclidean approximation (`dLat×111, dLon×85`) to proper
haversine. However, the **hop distance** calculation in `analytics.js`
(subpath detail panel) still used the old Euclidean formula. This
caused:
- **Inconsistent distances** between hop distances and collision
distances
- **Significant errors at high latitudes** — e.g., Oslo→Stockholm:
Euclidean gives ~627km, haversine gives ~415km (51% error)
- The `dLon×85` constant assumes ~40° latitude; at 60° latitude the real
scale factor is ~55.5km/degree, not 85
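The divergence is easy to reproduce. The sketch below compares the old flat approximation with a standard haversine formula. The signature of `haversineKm` here is an assumption (the real `HopResolver.haversineKm()` may differ), the Earth radius of 6371 km is the usual mean value, and the city coordinates are approximate city-center values chosen for illustration.

```js
const EARTH_RADIUS_KM = 6371; // mean Earth radius
const toRad = deg => (deg * Math.PI) / 180;

// Great-circle distance between two lat/lon points, in kilometres.
function haversineKm(lat1, lon1, lat2, lon2) {
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// The old approximation: fixed km-per-degree factors (85 assumes ~40° latitude).
function euclideanKm(lat1, lon1, lat2, lon2) {
  return Math.hypot((lat2 - lat1) * 111, (lon2 - lon1) * 85);
}

// Approximate city-center coordinates (assumed for this example).
const oslo = [59.9139, 10.7522];
const stockholm = [59.3293, 18.0686];

const havDist = haversineKm(...oslo, ...stockholm);
const eucDist = euclideanKm(...oslo, ...stockholm);
console.log(`haversine: ${havDist.toFixed(0)} km, euclidean: ${eucDist.toFixed(0)} km`);
```

At this latitude the longitude term dominates the distance, so the fixed factor of 85 km/degree overstates it by roughly half.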
## Changes
| File | Change |
|------|--------|
| `public/analytics.js` | Replace `dLat*111, dLon*85` Euclidean with `HopResolver.haversineKm()` (with inline fallback) |
| `public/hop-resolver.js` | Export `haversineKm` in the public API for reuse |
| `test-frontend-helpers.js` | Add 4 tests: export check, zero distance, SF→LA accuracy, Euclidean vs haversine divergence |
| `cmd/server/helpers_test.go` | Add `TestHaversineKm`: zero, SF→LA, symmetry, Oslo→Stockholm accuracy |
| `public/index.html` | Cache buster bump |
## Performance
No performance impact — `haversineKm` replaces an inline arithmetic
expression with another inline arithmetic expression of identical O(1)
complexity. Only called per hop pair in the subpath detail panel
(typically <10 hops).
## Testing
- `node test-frontend-helpers.js` — 248 passed, 0 failed
- `go test -run TestHaversineKm` — PASS
Co-authored-by: you <you@example.com>
## Summary
The `/api/analytics/hash-collisions` endpoint always returned global
results, ignoring the active region filter. Every other analytics
endpoint (RF, topology, hash-sizes, channels, distance, subpaths)
respected the `?region=` query parameter — this was the only one that
didn't.
Fixes #438
## Changes
### Backend (`cmd/server/`)
- **routes.go**: Extract `region` query param and pass to
`GetAnalyticsHashCollisions(region)`
- **store.go**:
- `collisionCache` changed from `*cachedResult` →
`map[string]*cachedResult` (keyed by region, `""` = global) — consistent
with `rfCache`, `topoCache`, etc.
- `GetAnalyticsHashCollisions(region)` and
`computeHashCollisions(region)` now accept a region parameter
- When region is specified, resolves regional observers, scans packets
for nodes seen by those observers, and filters the node list before
computing collisions
- Cache invalidation updated to clear the map (not set to nil)
### Frontend (`public/`)
- **analytics.js**: The hash-collisions fetch was missing `+ sep` (the
region query string). All other fetches in the same `Promise.all` block
had it — this was simply overlooked in PR #415.
- **index.html**: Cache busters bumped
### Tests (`cmd/server/routes_test.go`)
- `TestHashCollisionsRegionParamIgnored` → renamed to
`TestHashCollisionsRegionParam` with updated comments reflecting that
region is now accepted (with no configured regional observers, results
match global — which the test verifies)
## Performance
No new hot-path work. Region filtering adds one scan of `s.packets`
(same as every other region-filtered analytics endpoint) only when
`?region=` is provided. Results are cached per-region with the existing
60s TTL. Without `?region=`, behavior is unchanged.
Co-authored-by: you <you@example.com>
## Summary
Removes an unreachable duplicate `return offsets;` statement in the
`_cumulativeRowOffsets()` function in `packets.js`. The second return
was dead code found during review of PR #402.
## Changes
- **`public/packets.js`**: Removed the duplicate `return offsets;` on
what was line 1137 (the line immediately after the first, reachable
`return offsets;`)
- **`public/index.html`**: Cache buster bump
## Testing
This is a dead code removal — the duplicate return was unreachable. No
behavior change. No new tests needed as existing tests already cover
`_cumulativeRowOffsets()` behavior.
Fixes #447
Co-authored-by: you <you@example.com>
## Summary
Fixes #472
The Docker build job on the self-hosted runner fails with `no space left
on device` because Docker build cache and Go module downloads accumulate
between runs. The existing cleanup (line ~330) runs in the **deploy**
step *after* the build — too late to help.
## Changes
- Added a "Free disk space" step at the start of the build job,
**before** "Build Go Docker image":
- `docker system prune -af` — removes all unused images, containers,
networks
- `docker builder prune -af` — clears the build cache
- `df -h /` — logs available disk space for visibility
- Kept the existing post-deploy cleanup as belt-and-suspenders
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Replace all `setInterval`-based animations in `live.js` with
`requestAnimationFrame` loops and add a concurrency cap to prevent
unbounded animation accumulation under high packet throughput.
Fixes #384
## Problem
Under high throughput (≥5 packets/sec), the live map accumulated
unbounded `setInterval` timers:
- `pulseNode()`: 26ms interval per pulse ring
- `drawAnimatedLine()`: 33ms interval per hop line + 52ms nested
interval for fade-out
- Ghost hop pulse: 600ms interval per ghost marker
At 5 pkts/sec × 3 hops = **15+ concurrent intervals**, climbing without
limit. This caused UI jank, rising CPU usage, and potential memory leaks
from leaked Leaflet markers.
## Changes
### `public/live.js`
| Function | Before | After |
|----------|--------|-------|
| `pulseNode()` | `setInterval` (26ms) + `setTimeout` safety | `requestAnimationFrame` loop, self-terminates at 2s or opacity ≤ 0 |
| `drawAnimatedLine()` | `setInterval` (33ms) for line + nested `setInterval` (52ms) for fade | Two `requestAnimationFrame` loops (line advance + fade-out) |
| Ghost hop pulse | `setInterval` (600ms) + `setTimeout` (3s) | `requestAnimationFrame` loop with 3s expiry |
| `animatePath()` | No concurrency limit | Returns early when `activeAnims >= MAX_CONCURRENT_ANIMS` (20) |
### `public/index.html`
- Cache buster version bump
### `test-live-anims.js` (new)
- 7 tests verifying:
- No `setInterval` in `pulseNode`, `drawAnimatedLine`, or `animatePath`
- `MAX_CONCURRENT_ANIMS` defined and set to 20
- Concurrency check present in `animatePath`
- No stale `setInterval` in animation hot paths
## Complexity & Scale
- **Time complexity**: O(1) per animation frame (no change in per-frame
work)
- **Concurrency**: Hard-capped at 20 simultaneous animations (previously
unbounded)
- **At 5 pkts/sec, 3 hops**: Excess animations silently dropped instead
of accumulating timers
- **rAF benefit**: Browser coalesces all animations into single paint
cycle; paused tabs stop animating automatically
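The cap plus self-terminating loop can be sketched as below. This is an illustration under assumptions, not the code in `live.js`: `startAnimation` is a hypothetical name, and the frame scheduler and clock are passed in as parameters (`raf`, `now`) so the sketch runs outside a browser. In the page these would be `requestAnimationFrame` and `performance.now`.

```js
const MAX_CONCURRENT_ANIMS = 20;
let activeAnims = 0;

// Start a time-based animation unless the cap is reached.
// Returns false when the animation is dropped.
function startAnimation(durationMs, onFrame, raf, now) {
  if (activeAnims >= MAX_CONCURRENT_ANIMS) return false; // drop excess work
  activeAnims++;
  const start = now();
  function step() {
    const progress = Math.min((now() - start) / durationMs, 1);
    onFrame(progress);
    if (progress < 1) raf(step); // schedule the next frame
    else activeAnims--;          // self-terminate and free the slot
  }
  raf(step);
  return true;
}

// Fake clock and frame queue, used only to demonstrate the behavior.
let clock = 0;
const frameQueue = [];
const fakeRaf = cb => frameQueue.push(cb);
const fakeNow = () => clock;

let started = 0;
for (let i = 0; i < 25; i++) {
  if (startAnimation(2000, () => {}, fakeRaf, fakeNow)) started++;
}
const peakAnims = activeAnims;

// Run simulated 16ms frames until every animation has finished.
while (frameQueue.length > 0) {
  clock += 16;
  frameQueue.splice(0).forEach(cb => cb());
}
console.log(started, peakAnims, activeAnims); // 20 20 0
```

Driving progress from elapsed time rather than a per-tick counter keeps the animation duration stable even when frames are skipped.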
## Test Results
```
=== Animation interval elimination ===
✅ pulseNode does not use setInterval
✅ drawAnimatedLine does not use setInterval
✅ ghost hop pulse does not use setInterval
=== Concurrency cap ===
✅ MAX_CONCURRENT_ANIMS is defined
✅ MAX_CONCURRENT_ANIMS is set to 20
✅ animatePath checks MAX_CONCURRENT_ANIMS before proceeding
=== Safety: no stale setInterval in animation functions ===
✅ no setInterval remains in animation hot path
7 passed, 0 failed
```
All existing tests pass (packet-filter: 62, aging: 29, frontend-helpers:
241).
## Performance Proof (Rule 0 compliance)
Benchmark: `node test-anim-perf.js` — simulates timer/animation
accumulation under realistic throughput.
### Timer count: old (setInterval) vs new (rAF + cap)
| Scenario | Old model (peak concurrent timers) | New model (peak concurrent animations) |
|----------|-----------------------------------:|---------------------------------------:|
| 5 pkt/s × 3 hops, 30s sustained | **123** | **20** |
| 5 pkt/s × 3 hops, 5min sustained | **123** | **20** |
| 20 pkt/s × 3 hops, 10s burst | **246** | **20** |
**Before:** Each hop spawns 3 `setInterval` timers (pulse 26ms, line
33ms, fade 52ms) that live 0.6–2s each. At 5 pkt/s × 3 hops = 15
timers/sec, peak concurrent timers reach **123** (limited only by timer
lifetime, not by any cap). Under burst traffic (20 pkt/s), this climbs
to **246+**.
**After:** `MAX_CONCURRENT_ANIMS = 20` hard-caps active animations.
Excess packets are silently dropped. rAF loops replace all `setInterval`
calls, coalescing into single paint cycles. Peak concurrent animations:
**always ≤ 20**, regardless of throughput or duration.
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes #465. Channel hash was displaying in decimal instead of
hexadecimal in `channels.js`.
## Changes
- Added `formatHashHex()` helper to `channels.js` that formats numeric
hashes as `0x` hex (e.g. `0x0A`) and passes string hashes through
unchanged
- Applied to both display sites: `renderChannelList` fallback name and
`selectChannel` header text
- Consistent with `packets.js` and `analytics.js` which already use
`.toString(16).padStart(2, '0').toUpperCase()`
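A helper with the behavior described above could look like this. It is a sketch based only on the description (numeric input formatted as `0x` plus two uppercase hex digits, strings passed through); the actual implementation in `channels.js` may differ in details.

```js
// Format numeric channel hashes as 0x-prefixed, zero-padded, uppercase hex.
// Non-numeric values (for example named channels) pass through unchanged.
function formatHashHex(hash) {
  if (typeof hash !== 'number') return hash;
  return '0x' + hash.toString(16).padStart(2, '0').toUpperCase();
}

console.log(formatHashHex(10));        // 0x0A
console.log(formatHashHex(255));       // 0xFF
console.log(formatHashHex('general')); // general
```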
## Tests
- 3 new tests in `test-frontend-helpers.js` verifying the helper exists,
is used at display sites, and produces correct output for numeric and
string inputs
- All 244 frontend tests pass, plus packet-filter (62) and aging (29)
tests
Co-authored-by: you <you@example.com>
## Summary
Fixes #361. `perfMiddleware()` wrote to shared `PerfStats` fields
(`Requests`, `TotalMs`, `Endpoints` map, `SlowQueries` slice) without
any synchronization, causing data races under concurrent HTTP requests.
## Changes
### `cmd/server/routes.go`
- **Added `sync.Mutex` to `PerfStats` struct** — single mutex protects
all fields
- **`perfMiddleware`** — all shared state mutations (counter increments,
endpoint map access, slice appends) now happen under lock. Key
normalization (regex, mux route lookup) moved outside the lock since it
uses no shared state
- **`handleHealth`** — snapshots `Requests`, `TotalMs`, `SlowQueries`
under lock before building response
- **`handlePerf`** — copies all endpoint data and slow queries under
lock into local snapshots, then does expensive work (sorting, percentile
calculation) outside the lock
- **`handlePerfReset`** — resets fields in-place instead of replacing
the pointer (avoids unlocking a different mutex)
### `cmd/server/perfstats_race_test.go` (new)
- Regression test: 50 concurrent writer goroutines + 10 concurrent
reader goroutines hammering `PerfStats` simultaneously
- Verifies no race conditions (via `-race` flag) and counter consistency
## Design Decisions
- **Single mutex over atomics**: The issue suggested `atomic.Int64` for
counters, but since slices/maps need a mutex anyway, a single mutex is
simpler and the critical section is small (microseconds). No measurable
contention at CoreScope's scale.
- **Copy-under-lock pattern**: Expensive operations (sorting, percentile
computation) happen outside the lock to minimize hold time.
- **In-place reset**: `handlePerfReset` clears fields rather than
replacing the `PerfStats` pointer, ensuring the mutex remains valid for
concurrent goroutines.
## Testing
- `go test -race -count=1 ./cmd/server/...` — **PASS** (all existing
tests + new race test)
- New `TestPerfStatsConcurrentAccess` specifically validates concurrent
access patterns
Co-authored-by: you <you@example.com>
## Summary
Replace all `observers.find()` linear scans in `packets.js` with O(1)
`Map.get()` lookups, eliminating ~300K comparisons per render cycle at
30K+ rows.
## Changes
- Added `observerMap` (`Map<id, observer>`) built once when observers
load
- Replaced all 6 `observers.find()` call sites with `observerMap.get()`:
- `obsName()` — called per row for observer name display
- Region filter check in packet filtering
- Observer dropdown label in filter UI
- Group header region lookup
- Child row region lookup
- Flat row region lookup
- Map is cleared on reset and rebuilt on each `loadObservers()` call
## Complexity
- **Before:** O(k) per row × 30K rows = O(30K × k) where k = observer
count (~10)
- **After:** O(1) per row × 30K rows = O(30K)
- Map construction: O(k) once, negligible
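The pattern is a one-time index build followed by keyed lookups. The sketch below assumes observers are objects with `id` and `name` fields and that `loadObservers` receives the already-fetched array; both are simplifications for illustration.

```js
let observerMap = new Map();

// Rebuild the index whenever the observer list is (re)loaded: O(k) once.
function loadObservers(observers) {
  observerMap = new Map(observers.map(o => [o.id, o]));
}

// Per-row lookup: O(1) instead of an observers.find() scan.
function obsName(id) {
  const observer = observerMap.get(id);
  return observer ? observer.name : id; // fall back to the raw id
}

loadObservers([
  { id: 'obs-1', name: 'Hilltop' },
  { id: 'obs-2', name: 'Harbor' },
]);
console.log(obsName('obs-2'));   // Harbor
console.log(obsName('missing')); // missing
```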
## Testing
- All Go tests pass (`cmd/server`, `cmd/ingestor`)
- All frontend tests pass (`test-packet-filter.js`: 62 passed,
`test-aging.js`: 29 passed, `test-frontend-helpers.js`: 241 passed)
Fixes #383
Co-authored-by: you <you@example.com>
## Problem
Fixes #324. The VCR LCD clock and timeline hover/touch tooltip always
showed local time, ignoring the UTC/local timezone setting in the
customizer Display tab.
## Root cause
Three sites in `live.js` bypassed the shared `getTimestampTimezone()`
utility:
- `updateVCRClock()` — used `d.getHours()` / `d.getMinutes()` /
`d.getSeconds()` (always local)
- Timeline mousemove tooltip — used `d.toLocaleTimeString()` (always
local)
- Timeline touchmove tooltip — same
## Fix
Added `vcrFormatTime(tsMs)` helper that checks `getTimestampTimezone()`
and uses `getUTC*` methods when set to `'utc'`, otherwise local `get*`.
Applied to all three sites. Exposed as `window._vcrFormatTime` for
testing.
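A minimal sketch of such a helper follows. Two things are assumptions: the output format (`HH:MM:SS`, inferred from the hours/minutes/seconds getters mentioned above) and the `timezone` parameter, which stands in for the value the real helper reads from `getTimestampTimezone()`.

```js
// Format a millisecond timestamp as HH:MM:SS in UTC or local time.
// `timezone` stands in for the result of getTimestampTimezone().
function vcrFormatTime(tsMs, timezone) {
  const d = new Date(tsMs);
  const useUtc = timezone === 'utc';
  const pad = n => String(n).padStart(2, '0');
  const h = useUtc ? d.getUTCHours() : d.getHours();
  const m = useUtc ? d.getUTCMinutes() : d.getMinutes();
  const s = useUtc ? d.getUTCSeconds() : d.getSeconds();
  return `${pad(h)}:${pad(m)}:${pad(s)}`;
}

const sampleTs = Date.UTC(2024, 0, 1, 3, 4, 5);
console.log(vcrFormatTime(sampleTs, 'utc'));   // 03:04:05
console.log(vcrFormatTime(sampleTs, 'local')); // depends on the machine's timezone
```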
## Tests
4 new unit tests in `test-frontend-helpers.js` covering UTC mode, local
mode, and zero-padding.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped (`v=1775073838`)
- [x] 233 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
- `nextHop()` schedules `setInterval`/`setTimeout` callbacks that can
fire after `destroy()` has set `animLayer = null` and removed DOM
elements
- This caused three console errors on the Live page when navigating away
mid-animation: `Cannot read properties of null (reading 'hasLayer')` and
`Cannot set properties of null (setting 'textContent')`
- Added null guards at each async callback site; no behavioral change
when the page is active
## Changes
- `public/live.js`: early return if `animLayer` is null at start of
`nextHop()`; null-safe `animLayer.hasLayer` checks in
`setInterval`/`setTimeout`; null-safe `liveAnimCount` element access
- `public/index.html`: cache buster bumped
- `test-frontend-helpers.js`: 4 source-inspection tests verifying the
null guards are present
## Test plan
- [ ] Open Live page, trigger some packet animations, navigate away
quickly — no console errors
- [ ] `node test-frontend-helpers.js` passes (233 tests, 0 failures)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Fixes #399. On every ADVERT WebSocket batch the nodes page invalidated
the entire `_allNodes` cache and triggered a full `/nodes?limit=5000`
fetch — even when every advertising node was already cached. The 90s API
TTL was actively bypassed.
## Root cause
```js
wsHandler = debouncedOnWS(function (msgs) {
if (msgs.some(isAdvertMessage)) {
_allNodes = null; // wipe cache unconditionally
invalidateApiCache('/nodes'); // bust API TTL
loadNodes(true); // full 5k fetch
}
}, 5000);
```
## Fix
ADVERT decoded payloads include `pubKey`, `name`, `lat`, `lon` — enough
to update known nodes in place:
- **Known node** (pubKey found in `_allNodes`): upsert `name`, `lat`,
`lon`, `last_seen` directly — no fetch, no cache bust, just re-render.
- **New node** (pubKey not in cache) or **no pubKey** in payload: fall
back to full reload as before.
This covers the common case on an active mesh: all advertising nodes are
already cached. The full reload path is preserved for node discovery.
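The decision logic can be sketched like this. It is an illustration, not the handler in `nodes.js`: `upsertAdverts` is a hypothetical name, cached nodes are assumed to carry their key in a `public_key` field, and the caller is assumed to pass the decoded payloads and a timestamp.

```js
// Update cached nodes in place from decoded ADVERT payloads.
// Returns true if every advert matched a cached node (no fetch needed),
// false if the caller should fall back to a full reload.
function upsertAdverts(allNodes, payloads, nowIso) {
  if (!Array.isArray(allNodes)) return false; // nothing cached yet
  const byKey = new Map(allNodes.map(n => [n.public_key, n]));
  for (const p of payloads) {
    const node = p && p.pubKey ? byKey.get(p.pubKey) : undefined;
    if (!node) return false; // new node or missing pubKey: full reload
    if (p.name != null) node.name = p.name;
    if (p.lat != null) node.lat = p.lat;
    if (p.lon != null) node.lon = p.lon;
    node.last_seen = nowIso;
  }
  return true;
}

const cachedNodes = [
  { public_key: 'aa11', name: 'Old name', lat: 1, lon: 2, last_seen: '2024-01-01T00:00:00Z' },
];
const handledInPlace = upsertAdverts(
  cachedNodes,
  [{ pubKey: 'aa11', name: 'New name', lat: 3, lon: 4 }],
  '2024-06-01T12:00:00Z'
);
const needsReload = !upsertAdverts(cachedNodes, [{ pubKey: 'zz99', name: 'Stranger' }], '2024-06-01T12:00:05Z');
console.log(handledInPlace, needsReload, cachedNodes[0].name);
```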
## Tests
2 new unit tests: known-node upsert (asserts 0 API calls, fields
updated) and unknown-node fallback (asserts full reload triggered). All
231 tests pass.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped
- [x] 231 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Problem
Fixes #325. Removing all home steps and clicking "Reset my theme" did
not restore them.
## Root cause
Two-part bug:
**1. `SITE_CONFIG.home` permanently mutated at page load**
`app.js` calls `mergeUserHomeConfig(SITE_CONFIG, userTheme)` which does
`SITE_CONFIG.home = Object.assign({}, serverHome, userTheme.home)`. If
the user had `steps: []` saved in localStorage, this sets
`SITE_CONFIG.home.steps = []` globally — permanently for the lifetime of
the page.
**2. `initState()` reads the contaminated config**
When the customizer opens (or Reset is clicked), `initState()` reads
`cfg = window.SITE_CONFIG`. Since `SITE_CONFIG.home.steps` is already
`[]`, `state.home.steps` stays `[]` even after
`localStorage.removeItem`. `autoSave()` then re-saves `steps: []`
straight back.
**Secondary issue:** `data-rm-step` / add / move handlers didn't call
`autoSave()`, making step persistence non-deterministic (only saved if a
text field edit happened to be pending).
## Fix
- **`app.js`**: snapshot `SITE_CONFIG.home` before `mergeUserHomeConfig`
→ `window._SITE_CONFIG_ORIGINAL_HOME`
- **`customize.js`**: `initState()` uses `_SITE_CONFIG_ORIGINAL_HOME`
instead of the contaminated `cfg.home`
- **`customize.js`**: add `autoSave()` to rm/move/add handlers for
steps, checklist, and footer links
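The mutation and the snapshot workaround can be reproduced in a few lines. The object shapes and the step strings below are made up for the example, and `initHomeState` is a hypothetical stand-in for the relevant part of `initState()`.

```js
// Server-provided defaults (example data).
const SITE_CONFIG = { home: { steps: ['Flash firmware', 'Join the mesh'] } };
// The user previously removed every step, so localStorage holds steps: [].
const userTheme = { home: { steps: [] } };

// Fix, part 1: snapshot the server defaults BEFORE merging user overrides.
const ORIGINAL_HOME = JSON.parse(JSON.stringify(SITE_CONFIG.home));

// The merge that caused the bug: the user's empty array replaces the defaults.
SITE_CONFIG.home = Object.assign({}, SITE_CONFIG.home, userTheme.home);

// Fix, part 2: on reset, read the snapshot rather than the merged config.
function initHomeState() {
  return JSON.parse(JSON.stringify(ORIGINAL_HOME)); // fresh copy each time
}

console.log(SITE_CONFIG.home.steps.length); // 0 (contaminated by the merge)
console.log(initHomeState().steps.length);  // 2 (defaults restored)
```

Returning a fresh copy from the snapshot matters: handing out the snapshot object itself would let later edits contaminate it the same way.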
## Tests
2 new unit tests covering the snapshot bypass and DEFAULTS fallback. 231
tests pass.
## Checklist
- [x] Branches from `upstream/master`
- [x] No Matomo or local-only commits
- [x] Cache busters bumped
- [x] 231 tests pass, 0 fail
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
Fixes #466. Staging config was not refreshed from prod due to a stale
`-nt` timestamp guard.
## Root Cause
`prepare_staging_config()` only copied prod config when staging was
missing or prod was newer by mtime. However, the `sed -i` that applies
the STAGING siteName updated staging's mtime, making it appear newer
than prod. Subsequent runs skipped the copy entirely.
## Changes
- **`manage.sh`**: Removed the `-nt` timestamp conditional in
`prepare_staging_config()`. Staging config is now always copied fresh
from prod with the STAGING siteName applied.
Note: `prepare_staging_db()` already copies unconditionally — no change
needed there.
Co-authored-by: you <you@example.com>
## Summary
`manage.sh update` now supports pinning to specific release tags instead
of always pulling tip of master.
Fixes #455
## Changes
### `cmd_update` — accepts optional version argument
- **No argument**: fetches tags, checks out latest release tag (`git tag
-l 'v*' --sort=-v:refname | head -1`)
- **`latest`**: explicit opt-in to tip of master (bleeding edge)
- **Specific tag** (e.g. `v3.1.0`): checks out that exact tag, with
error message + available tags if not found
### `cmd_setup` — defaults to latest tag
- After Docker check, fetches tags and pins to latest release tag
- Skips if already on the latest tag
- Uses state tracking (`version_pin`) so re-runs don't repeat
### `cmd_status` — shows version
- Displays current version (exact tag name or short commit hash) at the
top of status output
### Help text
- Updated to reflect new `update [version]` syntax
## Usage
```bash
./manage.sh update # checkout latest release tag (e.g. v3.2.0)
./manage.sh update v3.1.0 # pin to specific version
./manage.sh update latest # explicit tip of master (bleeding edge)
./manage.sh status # now shows "Version: v3.2.0"
```
## Testing
- `bash -n manage.sh` passes (syntax valid)
- Logic follows existing patterns (git fetch, checkout, rebuild,
restart)
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #450. Staging deployment was flaky because the container did not
shut down cleanly.
## Root Causes
1. **Server never closed DB on shutdown** — SQLite WAL lock held
indefinitely, blocking new container startup
2. **`httpServer.Close()` instead of `Shutdown()`** — abruptly kills
connections instead of draining them
3. **No `stop_grace_period` in compose configs**: Docker sends SIGTERM,
then SIGKILL once the default 10s grace period expires, which is often
not enough for a WAL checkpoint
4. **Supervisor didn't forward SIGTERM** — missing
`stopsignal`/`stopwaitsecs` meant Go processes got SIGKILL instead of
graceful shutdown
5. **Deploy scripts used default `docker stop` timeout** — only 10s
grace period
## Changes
### Go Server (`cmd/server/`)
- **Graceful HTTP shutdown**: `httpServer.Shutdown(ctx)` with 15s
context timeout — drains in-flight requests before closing
- **WebSocket cleanup**: New `Hub.Close()` method sends `CloseGoingAway`
frames to all connected clients
- **DB close on shutdown**: Explicitly closes DB after HTTP server stops
(was never closed before)
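- **WAL checkpoint**: `PRAGMA wal_checkpoint(TRUNCATE)` before DB close
flushes the WAL into the main DB file and truncates the WAL file to zero
bytes (the WAL/SHM files themselves are removed when the last connection
closes cleanly)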
### Go Ingestor (`cmd/ingestor/`)
- **WAL checkpoint on shutdown**: New `Store.Checkpoint()` method,
called before `Close()`
- **Longer MQTT disconnect timeout**: 5s (was 1s) to allow in-flight
messages to drain
### Docker Compose (all 4 variants)
- Added `stop_grace_period: 30s` and `stop_signal: SIGTERM`
### Supervisor Configs (both variants)
- Added `stopsignal=TERM` and `stopwaitsecs=20` to server and ingestor
programs
### Deploy Scripts
- `deploy-staging.sh`: `docker stop -t 30` with explicit grace period
- `deploy-live.sh`: `docker stop -t 30` with explicit grace period
## Shutdown Sequence (after fix)
1. Docker sends SIGTERM to supervisord (PID 1)
2. Supervisord forwards SIGTERM to server + ingestor (waits up to 20s
each)
3. Server: stops poller → drains HTTP (15s) → closes WS clients →
checkpoints WAL → closes DB
4. Ingestor: stops tickers → disconnects MQTT (5s) → checkpoints WAL →
closes DB
5. Docker waits up to 30s total before SIGKILL
## Tests
All existing tests pass:
- `cd cmd/server && go test ./...` ✅
- `cd cmd/ingestor && go test ./...` ✅
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes #451. Packet detail pane crash on direct routed packets where
`pathHops` is `null`.
## Root Cause
`JSON.parse(pkt.path_json)` can return literal `null` when the DB stores
`"null"` for direct routed packets. The existing code only had a catch
block for parse errors, but `null` is valid JSON — so the parse succeeds
and `pathHops` ends up `null` instead of `[]`.
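The distinction between "parse failed" and "parsed to `null`" is shown below. `parsePathHops` is a hypothetical wrapper used for illustration; the actual change adds `|| []` at the two existing call sites.

```js
// Always return an array, whether the JSON is invalid or literally "null".
function parsePathHops(pathJson) {
  try {
    return JSON.parse(pathJson) || []; // "null" parses fine, so guard the result
  } catch (e) {
    return []; // malformed JSON
  }
}

console.log(parsePathHops('["a1","b2"]')); // [ 'a1', 'b2' ]
console.log(parsePathHops('null'));        // []
console.log(parsePathHops('{not json'));   // []
```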
## Changes
- **`public/packets.js`**: Added `|| []` after `JSON.parse(...)` in both
`buildFlatRowHtml` (table rows) and the detail pane (`selectPacket`),
ensuring `pathHops` is always an array.
- **`test-frontend-helpers.js`**: Added 2 regression tests verifying the
null guards exist in both code paths.
- **`public/index.html`**: Cache buster bump.
## Testing
- All 229 frontend helper tests pass
- All 62 packet filter tests pass
- All 29 aging tests pass
Co-authored-by: you <you@example.com>
## Summary
Fixes the critical performance issue where `renderTableRows()` rebuilt
the **entire** table innerHTML (up to 50K rows) on every update —
WebSocket arrivals, filter changes, group expand/collapse, and theme
refreshes.
## Changes
### Lazy Row Generation (`renderVisibleRows`), fixes #422
- Row HTML strings are **only generated for the visible slice + 30-row
buffer** on each render
- `_displayPackets` stores the filtered data array;
`renderVisibleRows()` calls `buildGroupRowHtml`/`buildFlatRowHtml`
lazily for ~60-90 visible entries
- Previously, `displayPackets.map(buildGroupRowHtml)` built HTML for ALL
30K+ packets on every render — the expensive work (JSON.parse, observer
lookups, template literals) ran for every packet regardless of
visibility
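The core of the lazy approach is computing a window and building HTML only for that slice. The sketch below is simplified under stated assumptions: it treats every entry as exactly one DOM row (the real code accounts for expanded groups), uses the 36px row height and 30-row buffer mentioned in this description, and takes the row builder as a parameter.

```js
const VSCROLL_ROW_HEIGHT = 36; // px, as hardcoded per this description
const BUFFER_ROWS = 30;        // extra rows rendered above and below the viewport

// Compute the [start, end) slice of entries that needs real DOM rows.
function visibleRange(scrollTop, viewportHeight, totalRows) {
  const first = Math.floor(scrollTop / VSCROLL_ROW_HEIGHT);
  const visible = Math.ceil(viewportHeight / VSCROLL_ROW_HEIGHT);
  const start = Math.max(0, first - BUFFER_ROWS);
  const end = Math.min(totalRows, first + visible + BUFFER_ROWS);
  return { start, end };
}

// Build HTML only for the visible slice; the data array stays untouched.
function renderVisibleRows(packets, scrollTop, viewportHeight, buildRowHtml) {
  const { start, end } = visibleRange(scrollTop, viewportHeight, packets.length);
  return packets.slice(start, end).map(buildRowHtml).join('');
}

const demoPackets = Array.from({ length: 30000 }, (_, i) => ({ id: i }));
let buildCalls = 0;
const demoHtml = renderVisibleRows(demoPackets, 36000, 1080, p => {
  buildCalls++;
  return `<tr><td>${p.id}</td></tr>`;
});
console.log(buildCalls); // 90 row builds instead of 30000
```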
### Unified Row Count via `_getRowCount()`, fixes #424
- Single function `_getRowCount(p)` computes DOM row count for any entry
(1 for flat/collapsed, 1+children for expanded groups)
- Used by BOTH `_rowCounts` computation AND `renderVisibleRows` —
eliminates divergence risk between row counting and row building
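A row-count function with those semantics could look like the sketch below. The parameter list and the field names (`hash`, `_children`, `observer_id`) are assumptions made for illustration; the real `_getRowCount(p)` reads its grouping, expansion, and filter state from module-level variables.

```js
// DOM rows produced by one entry:
//   flat mode or collapsed group -> 1
//   expanded group               -> 1 header + visible children
function getRowCount(entry, grouped, expandedHashes, observerFilterSet) {
  if (!grouped || !expandedHashes.has(entry.hash)) return 1;
  let children = entry._children || []; // tolerate null/undefined children
  if (observerFilterSet) {
    children = children.filter(c => observerFilterSet.has(String(c.observer_id)));
  }
  return 1 + children.length;
}

const group = {
  hash: 'h1',
  _children: [{ observer_id: 'o1' }, { observer_id: 'o2' }, { observer_id: 'o1' }],
};
const expanded = new Set(['h1']);

console.log(getRowCount(group, false, expanded, null));             // 1 (flat mode)
console.log(getRowCount(group, true, new Set(), null));             // 1 (collapsed)
console.log(getRowCount(group, true, expanded, null));              // 4 (1 + 3 children)
console.log(getRowCount(group, true, expanded, new Set(['o1'])));   // 3 (1 + 2 matching)
```

Using the same function for both offset computation and row building is what keeps the spacer heights and the rendered rows from drifting apart.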
### Hoisted Observer Filter Set, fixes #427
- `_observerFilterSet` created once in `renderTableRows()`, reused
across `buildGroupRowHtml`, `_getRowCount`, and child filtering
- Previously, `new Set(filters.observer.split(','))` was created inside
`buildGroupRowHtml` for every packet AND again in the row count callback
### Dynamic Colspan, fixes #426
- `_getColCount()` reads column count from the thead instead of
hardcoded `colspan="11"`
- Spacers and empty-state messages use the actual column count
### Null-Safety in `buildFlatRowHtml`, fixes #430
- `p.decoded_json || '{}'` fallback added, matching
`buildGroupRowHtml`'s existing null-safety
- Prevents TypeError on null/undefined `decoded_json` in flat
(ungrouped) mode
### Behavioral Tests, fixes #428
- Replaced 5 source-grep tests with behavioral unit tests for
`_getRowCount`:
- Flat mode always returns 1
- Collapsed group returns 1
- Expanded group returns 1 + child count
- Observer filter correctly reduces child count
- Null `_children` handled gracefully
- Retained source-level assertions only where behavioral testing isn't
practical (e.g., verifying lazy generation pattern exists)
### Other Improvements
- Cumulative row offsets cached in `_cumulativeOffsetsCache`,
invalidated on row count changes
- Debounced WebSocket renders (200ms) coalesce rapid packet arrivals
- `destroy()` properly cleans up all virtual scroll state
## Performance Benchmarks, fixes #423
**Methodology:** Row building cost measured by counting
`buildGroupRowHtml` calls per render cycle on 30K grouped packets.
| Scenario | Before (eager) | After (lazy) | Improvement |
|----------|----------------|--------------|-------------|
| Initial render (30K packets) | 30,000 `buildGroupRowHtml` calls | ~90 calls (60 visible + 30 buffer) | **333× fewer calls** |
| Scroll event | 0 calls (pre-built) | ~90 calls (rebuild visible slice) | Trades O(1) scroll for O(n) initial savings |
| WS packet arrival | 30,000 calls (full rebuild) | ~90 calls (debounced + lazy) | **333× fewer calls** |
| Filter change | 30,000 calls | ~90 calls | **333× fewer calls** |
| Memory (row HTML cache) | ~2MB string array for 30K packets | 0 (no cache, build on demand) | **~2MB saved** |
**Per-call cost of `buildGroupRowHtml`:** Each call parses
`decoded_json` and `path_json` with `JSON.parse`, performs an
`observers.find()` lookup, and builds a template literal. At 30K
packets, the eager approach spent ~400-500ms on row building alone
(measured via `performance.now()` on staging data). The lazy approach
builds ~90 rows in ~1-2ms.
**Net effect:** `renderTableRows()` goes from O(n) string building +
O(1) DOM insertion to O(1) data assignment + O(visible) string building
+ O(visible) DOM insertion. For n=30K and visible≈60, this is ~333× less
work per render cycle.
**Trade-off:** Scrolling now rebuilds ~90 rows per RAF frame instead of
slicing pre-built strings. This costs ~1-2ms per scroll event, well
within the 16ms frame budget. The trade-off is overwhelmingly positive
since renders happen far more frequently than full-table scrolls.
## Tests
- 247 frontend helper tests pass (including 18 virtual scroll tests)
- 62 packet filter tests pass
- 29 aging tests pass
- Go backend tests pass
## Remaining Debt (tracked in issues)
- #425: Hardcoded `VSCROLL_ROW_HEIGHT=36` and `theadHeight=40` — should
be measured from DOM
- #429: 200ms WS debounce delay — value works well in practice but lacks
formal justification
- #431: No scroll position preservation on filter change or group
expand/collapse
Fixes #380
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Moves the hash collision analysis from the frontend to a new server-side
endpoint, eliminating a major performance bottleneck on the analytics
collision tab.
Fixes #386
## Problem
The collision tab was:
1. **Downloading all nodes** (`/nodes?limit=2000`) — ~500KB+ of data
2. **Running O(n²) pairwise distance calculations** on the browser main
thread (~2M comparisons with 2000 nodes)
3. **Building prefix maps client-side** (`buildOneBytePrefixMap`,
`buildTwoBytePrefixInfo`, `buildCollisionHops`) iterating all nodes
multiple times
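To illustrate the kind of client-side work being removed, here is a rough sketch of grouping nodes by hash prefix and reporting collisions. It is not the code of `buildOneBytePrefixMap`; the function names `buildPrefixMap` and `findCollisions` and the node shape `{ public_key }` are assumptions made for the example.

```javascript
// Hypothetical sketch: group nodes by the first N bytes of their public key.
function buildPrefixMap(nodes, prefixBytes) {
  const hexLen = prefixBytes * 2; // two hex characters per byte
  const map = new Map();
  for (const node of nodes) {
    const key = (node.public_key || '').slice(0, hexLen).toUpperCase();
    if (key.length < hexLen) continue; // skip keys too short to have this prefix
    if (!map.has(key)) map.set(key, []);
    map.get(key).push(node);
  }
  return map;
}

// A collision is any prefix shared by more than one node.
function findCollisions(nodes, prefixBytes) {
  const collisions = [];
  for (const [prefix, group] of buildPrefixMap(nodes, prefixBytes)) {
    if (group.length > 1) collisions.push({ prefix, nodes: group });
  }
  return collisions;
}
```

Each pass is O(n), but the tab ran several such passes plus the O(n²) distance step on the main thread, which is what the server endpoint now absorbs.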
## Solution
### New endpoint: `GET /api/analytics/hash-collisions`
Returns pre-computed collision analysis with:
- `inconsistent_nodes` — nodes with varying hash sizes
- `by_size` — per-byte-size (1, 2, 3) collision data:
  - `stats` — node counts, space usage, collision counts
  - `collisions` — pre-computed collisions with pairwise distances and
  classifications (local/regional/distant/incomplete)
  - `one_byte_cells` — 256-cell prefix map for 1-byte matrix rendering
  - `two_byte_cells` — first-byte-grouped data for 2-byte matrix rendering
### Caching
Uses the existing `cachedResult` pattern with a new `collisionCache`
map. Invalidated on `hasNewTransmissions` (same trigger as the
hash-sizes cache) and on eviction.
### Frontend changes
- `renderCollisionTab` now accepts pre-fetched `collisionData` from the
parallel API load
- New `renderHashMatrixFromServer` and `renderCollisionsFromServer`
functions consume server-computed data directly
- No more `/nodes?limit=2000` fetch from the collision tab
- Old client-side functions (`buildOneBytePrefixMap`, etc.) preserved
for test helper exports
## Test results
- `go test ./...` (server): ✅ pass
- `go test ./...` (ingestor): ✅ pass
- `test-packet-filter.js`: ✅ 62 passed
- `test-aging.js`: ✅ 29 passed
- `test-frontend-helpers.js`: ✅ 227 passed
## Performance impact
| Metric | Before | After |
|--------|--------|-------|
| Data transferred | ~500KB (all nodes) | ~50KB (collision data only) |
| Client computation | O(n²) distance calc | None (server-cached) |
| Main thread blocking | Yes (2000 nodes × pairwise) | No |
| Server caching | N/A | 15s TTL, invalidated on new transmissions |
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Summary
Fixes #381: the "My Nodes" filter in `packets.js` was making a **server
API call inside `renderTableRows()`** on every render cycle. With
WebSocket updates arriving every few seconds while the toggle was
active, this created continuous unnecessary server load.
## What Changed
**`public/packets.js`** — Replaced the `api('/packets?nodes=...')`
server call with a pure client-side filter:
```js
// Before: server round-trip on every render
const myData = await api('/packets?nodes=' + allKeys.join(',') + '&limit=500');
displayPackets = myData.packets || [];

// After: filter already-loaded packets client-side
displayPackets = displayPackets.filter(p => {
  const dj = p.decoded_json || '';
  return allKeys.some(k => dj.includes(k));
});
```
This uses the same matching logic as the server's
`QueryMultiNodePackets()` (a substring check on `decoded_json` for each
pubkey) but without the network round-trip.
**`test-frontend-helpers.js`** — Added 5 unit tests for the filter
logic:
- Single and multiple pubkey matching
- No matches / empty keys edge case
- Null/empty `decoded_json` handled gracefully
**`public/index.html`** — Cache busters bumped.
## Test Results
- Frontend helpers: **232 passed, 0 failed** (including 5 new tests)
- Packet filter: **62 passed, 0 failed**
- Aging: **29 passed, 0 failed**
Co-authored-by: you <you@example.com>
## Problem
Every time new data is ingested (`IngestNewFromDB`,
`IngestNewObservations`, `EvictStale`), **all 6 analytics caches** are
wiped by creating new empty maps — regardless of what kind of data
actually changed. With the poller running every 1 second, this means the
15s cache TTL is effectively bypassed because caches are cleared far
more frequently than they expire.
## Fix
Introduces a `cacheInvalidation` flags struct and
`invalidateCachesFor()` method that selectively clears only the caches
affected by the ingested data:
| Flag | Caches Cleared |
|------|----------------|
| `hasNewObservations` | RF (SNR/RSSI data changed) |
| `hasNewPaths` | Topology, Distance, Subpaths |
| `hasNewTransmissions` | Hash sizes |
| `hasChannelData` | Channels (GRP_TXT payload_type 5) + channels list cache |
| `eviction` | All (data removed, everything potentially stale) |
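The flag-to-cache mapping in the table can be sketched as a small pure function. The real `invalidateCachesFor()` is Go code in the server; this JavaScript version is only an illustration of the mapping, and the cache names and the `cachesToClear` function are made up for the example.

```javascript
// Illustrative only: cache names are assumptions, the real store is written in Go.
const ALL_CACHES = ['rf', 'topology', 'distance', 'subpaths', 'hashSizes', 'channels'];

// Return the set of caches that should be cleared for a given set of flags.
function cachesToClear(flags) {
  if (flags.eviction) return new Set(ALL_CACHES); // data removed: everything may be stale
  const cleared = new Set();
  if (flags.hasNewObservations) cleared.add('rf');
  if (flags.hasNewPaths) {
    cleared.add('topology');
    cleared.add('distance');
    cleared.add('subpaths');
  }
  if (flags.hasNewTransmissions) cleared.add('hashSizes');
  if (flags.hasChannelData) cleared.add('channels');
  return cleared;
}
```

An ingest cycle that sets no flags clears nothing, so the 15s TTL is the only thing expiring those caches.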
### Impact
For a typical ingest cycle with ADVERT/ACK/TXT_MSG packets (no GRP_TXT):
- **Before:** All 6 caches cleared every cycle
- **After:** Channel cache preserved (most common case), hash cache
preserved on observation-only ingestion
For observation-only ingestion (`IngestNewObservations`):
- **Before:** All 6 caches cleared
- **After:** Only RF cache cleared (+ topo/dist/subpath if paths
actually changed)
## Tests
7 new unit tests in `cache_invalidation_test.go` covering:
- Eviction clears all caches
- Observation-only ingest preserves non-RF caches
- Transmission-only ingest clears only hash cache
- Channel data clears only channel cache
- Path changes clear topo/dist/subpath
- Combined flags work correctly
- No flags = no invalidation
All existing tests pass.
### Post-rebase fix
Restored `channelsCacheRes` invalidation that was accidentally dropped
during the refactor. The old code cleared this separate channels list
cache on every ingest, but `invalidateCachesFor()` didn't include it.
Now cleared on `hasChannelData` and `eviction`.
Fixes #375
---------
Co-authored-by: you <you@example.com>
## Summary
Fixes #353: addresses all 5 findings from the CoreScope code analysis.
## Changes
### Finding 1 (Major): `score` field never extracted from MQTT
- Added `Score *float64` field to `PacketData` and `MQTTPacketMessage`
structs
- Extract `msg["score"]` with `msg["Score"]` case fallback via
`toFloat64` in all three MQTT handlers (raw packet, channel message,
direct message)
- Pass through to DB observation insert instead of hardcoded `nil`
### Finding 2 (Major): `direction` field never extracted from MQTT
- Added `Direction *string` field to `PacketData` and
`MQTTPacketMessage` structs
- Extract `msg["direction"]` with `msg["Direction"]` case fallback as
string in all three MQTT handlers
- Pass through to DB observation insert instead of hardcoded `nil`
### Finding 3 (Minor): `toFloat64` doesn't strip units
- Added `stripUnitSuffix()` that removes common RF/signal unit suffixes
(dBm, dB, mW, km, mi, m) case-insensitively before `ParseFloat`
- Values like `"-110dBm"` or `"5.5dB"` now parse correctly
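The suffix stripping can be sketched like this. The actual `stripUnitSuffix()` and `toFloat64` are Go functions in the ingestor; the JavaScript below is an assumed approximation that only illustrates the idea, including the detail that longer suffixes have to be tried before `m`.

```javascript
// Illustrative sketch; the real helpers are Go. Longest suffixes come first so
// that "dBm" and "km" are matched before the bare "m".
const UNIT_SUFFIXES = ['dbm', 'db', 'mw', 'km', 'mi', 'm'];

function stripUnitSuffix(s) {
  const trimmed = s.trim();
  const lower = trimmed.toLowerCase(); // case-insensitive match
  for (const suffix of UNIT_SUFFIXES) {
    if (lower.endsWith(suffix)) {
      return trimmed.slice(0, trimmed.length - suffix.length).trim();
    }
  }
  return trimmed; // passthrough when no known suffix is present
}

// Returns a number, or null when the value cannot be parsed.
function toFloat64(value) {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value !== 'string') return null;
  const stripped = stripUnitSuffix(value);
  if (stripped === '') return null; // Number('') would otherwise yield 0
  const n = Number(stripped);
  return Number.isFinite(n) ? n : null;
}
```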
### Finding 4 (Minor): Bare type assertions in store.go
- Changed `firstSeen` and `lastSeen` from `interface{}` to typed
`string` variables at `store.go:5020`
- Removed unsafe `.(string)` type assertions in comparisons
### Finding 5 (Minor): `distHopRecord.SNR` typed as `interface{}`
- Changed `distHopRecord.SNR` from `interface{}` to `*float64`
- Updated assignment (removed intermediate `snrVal` variable, pass
`tx.SNR` directly)
- Updated output serialization to use `floatPtrOrNil(h.SNR)` for
consistent JSON output
## Tests Added
- `TestBuildPacketDataScoreAndDirection` — verifies Score/Direction flow
through BuildPacketData
- `TestBuildPacketDataNilScoreDirection` — verifies nil handling when
fields absent
- `TestInsertTransmissionWithScoreAndDirection` — end-to-end: inserts
with score/direction, verifies DB values
- `TestStripUnitSuffix` — covers all supported suffixes, case
insensitivity, and passthrough
- `TestToFloat64WithUnits` — verifies unit-bearing strings parse
correctly
All existing tests pass.
Co-authored-by: you <you@example.com>
## Problem
On installations where the database predates the
`idx_observations_timestamp` index, `/api/stats` takes 30s+ because
`GetStoreStats()` runs two full table scans:
```sql
SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last hour
SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last 24h
```
The index is only created in the `if !obsExists` block, so any database
where the `observations` table already existed before that code was
added never gets it.
## Fix
Adds a one-time migration (`obs_timestamp_index_v1`) that runs at
ingestor startup:
```sql
CREATE INDEX IF NOT EXISTS idx_observations_timestamp ON observations(timestamp)
```
On large installations this index creation may take a few seconds on
first startup after the upgrade, but subsequent stats queries become
instant.
## Test plan
- [ ] Restart ingestor on an older database and confirm `[migration]
observations timestamp index created` appears in logs
- [ ] Confirm `/api/stats` response time drops from 30s+ to <100ms
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
Two endpoints were slow on larger installations:
**`/packets?limit=50000&groupByHash=true` — 16s+**
`QueryGroupedPackets` did two expensive things on every request:
1. O(n × observations) scan per packet to find `latest` timestamp
2. Held `s.mu.RLock()` during the O(n log n) sort, blocking all
concurrent reads
**`/channels` — 13s+**
`GetChannels` iterated all payload-type-5 packets and JSON-unmarshaled
each one while holding `s.mu.RLock()`, blocking all concurrent reads for
the full duration.
## Fix
**Packets (`QueryGroupedPackets`):**
- Add `LatestSeen string` to `StoreTx`, maintained incrementally in all
three observation write paths. Eliminates the per-packet observation
scan at query time.
- Build output maps under the read lock, sort the local copy after
releasing it.
- Cache the full sorted result for 3 seconds keyed by filter params.
**Channels (`GetChannels`):**
- Copy only the fields needed (firstSeen, decodedJSON, region match)
under the read lock, then release before JSON unmarshaling.
- Cache the result for 15 seconds keyed by region param.
- Invalidate cache on new packet ingestion.
## Test plan
- [ ] Open packets page on a large store — load time should drop from
16s to <1s
- [ ] Open channels page — should load in <100ms instead of 13s+
- [ ] `[SLOW API]` warnings gone for both endpoints
- [ ] Packet/channel data is correct (hashes, counts, observer counts)
- [ ] Filters (region, type, since/until) still work correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Priority+ Navigation Pattern for Tablet Viewports
Phase 2 of responsive nav improvements for #322.
### What this does
On **tablet viewports (768-1023px)**, implements the [Priority+
navigation
pattern](https://css-tricks.com/the-priority-plus-navigation-pattern/):
- **5 high-priority tabs** shown inline: Home, Nodes, Packets, Map, Live
- **6 low-priority tabs** collapse into a "More ▾" dropdown: Channels,
Traces, Observers, Analytics, Perf, Lab
- The "More" button highlights when a low-priority page is active
**Desktop (>=1024px)** and **mobile (<768px)** behavior is unchanged.
### Changes
| File | Change |
|------|--------|
| `public/index.html` | Added `data-priority="high"` to 5 primary nav links; added More button + dropdown menu |
| `public/style.css` | Split ≤1023px hamburger query into tablet Priority+ (768-1023px) and mobile hamburger (<768px); added More dropdown styles |
| `public/app.js` | Added `closeMoreMenu()`, More button toggle, outside-click/Escape close, active state on More button |
| Cache busters | Bumped in same commit |
### Accessibility
- `aria-haspopup="true"` and `aria-expanded` on More button
- `role="menu"` / `role="menuitem"` on dropdown
- Focus moves to first item on open
- Escape key closes dropdown
### Testing
- All 308 existing tests pass (217 frontend-helpers + 62 packet-filter +
29 aging)
- No new dependencies added
- No build step changes
### Breakpoint summary
| Viewport | Behavior |
|----------|----------|
| >= 1024px | Full horizontal nav (unchanged) |
| 768-1023px | Priority+ pattern: 5 tabs + More dropdown **← NEW** |
| < 768px | Hamburger drawer with all items (unchanged) |
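The breakpoints in this PR are implemented with CSS media queries, not JavaScript. Purely as an illustration of the three ranges and the high/low priority split, the same rules can be written as two small functions; `navModeFor`, `splitNavItems`, and the item shape are hypothetical.

```javascript
// Illustration only: the real behavior lives in CSS media queries in style.css.
function navModeFor(viewportWidth) {
  if (viewportWidth >= 1024) return 'full';          // full horizontal nav
  if (viewportWidth >= 768) return 'priority-plus';  // 5 tabs + More dropdown
  return 'hamburger';                                // drawer with all items
}

// Split nav items into inline tabs and the "More" dropdown by priority.
function splitNavItems(items) {
  return {
    inline: items.filter(item => item.priority === 'high'),
    more: items.filter(item => item.priority !== 'high'),
  };
}
```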
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When GetMaxTransmissionID() fails silently (e.g., corrupted DB returns 0
from COALESCE), the poller starts from ID 0 and replays the entire
database over WebSocket — broadcasting thousands of old packets per second.
Fix: after querying the DB, use the in-memory store's MaxTransmissionID
and MaxObservationID as a floor. Since Load() already read the full DB
successfully, the store has the correct max IDs.
Root cause discovered on staging: DB corruption caused MAX(id) query to
fail, returning 0. Poller log showed 'starting from transmission ID 0'
followed by 1000-2000 broadcasts per tick walking through 76K rows.
Also adds MaxObservationID() to PacketStore for observation cursor safety.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes the Playwright CI regression on master where the "Packets page
loads with filter" test times out after 15 seconds waiting for
`table tbody tr` to appear.
## Root Cause
Three packets tests used an `about:blank` round-trip pattern to force a
full page reload:
```
page.goto(BASE) → set localStorage → page.goto('about:blank') →
page.goto(BASE/#/packets)
```
This cross-origin round-trip through `about:blank` causes the SPA's config
fetch and router to not fire reliably in CI's headless Chromium, leaving
the page uninitialized past the 15-second timeout.
## Fix
Replace the `about:blank` pattern with `page.reload()` in all three affected
tests:
```
page.goto(BASE/#/packets) → set localStorage → page.reload()
```
This stays on the same origin throughout. Playwright handles same-origin
reloads predictably — the page fully re-initializes, the IIFE re-reads
localStorage, and loadPackets() uses the correct time window.
## Tests affected
| Test | Change |
|------|--------|
| Packets page loads with filter | `about:blank` → `page.reload()` |
| Packets initial fetch honors persisted time window | `about:blank` → `page.reload()` |
| Packets groupByHash toggle works | `about:blank` → `page.reload()` |
## Validation
- All 318 unit tests pass (packet-filter: 62, aging: 29, frontend: 227)
- No `public/` files changed, so no cache buster bump is needed
- Single file changed: `test-e2e-playwright.js` (9 insertions, 15
deletions)
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WS broadcast pushes all packets regardless of the selected time
window filter. This caused old packets to appear in the table even
when the API correctly returned zero results for the time range.
Add time window check to the WS packet filter — drops packets
with timestamps older than the selected window cutoff.
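A minimal sketch of such a check is shown below. The function name `isWithinTimeWindow`, the ISO-string `timestamp` field, and the choice to keep packets with unparseable timestamps are assumptions for the example; a window of 0 is treated as "All time".

```javascript
// Hypothetical sketch of a time-window check for packets arriving over WebSocket.
function isWithinTimeWindow(packet, windowMin, nowMs = Date.now()) {
  if (!windowMin || windowMin <= 0) return true; // 0 means "All time": keep everything
  const ts = Date.parse(packet.timestamp);       // assumed ISO 8601 string
  if (Number.isNaN(ts)) return true;             // assumption: do not drop unparseable packets
  const cutoff = nowMs - windowMin * 60000;      // minutes to milliseconds
  return ts >= cutoff;
}
```

Applied in the WS handler before inserting into the table, this keeps the live view consistent with what the API returns for the same window.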
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The outer IIFE declares `const isMobile` (line 27, from #340) and
`renderLeft()` declares its own `const isMobile` (line 821, pre-existing).
A `const` binding exists for the whole of its enclosing block but stays
uninitialized until its declaration runs (the temporal dead zone), so
referencing `isMobile` at line 574 (inside `renderLeft()` but before line
821) resolves to the inner binding and throws 'Cannot access isMobile
before initialization'.
Rename the inner declaration to isNarrow since it uses a different
breakpoint (640px for column hiding vs 1024px for packet limit).
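The shadowing problem can be reproduced in a few lines. This is a reduced illustration, not the code from the affected file; `renderLeftBroken` and `renderLeftFixed` are stand-in names.

```javascript
const isMobile = false; // stands in for the outer IIFE-level declaration

function renderLeftBroken() {
  let label;
  try {
    // Resolves to the inner isMobile declared below, which is still in its
    // temporal dead zone here, so this line throws a ReferenceError.
    label = isMobile ? 'mobile' : 'desktop';
  } catch (err) {
    label = err.name;
  }
  const isMobile = true; // inner declaration shadows the outer one for the whole function body
  return label;
}

function renderLeftFixed() {
  // With the inner constant renamed, isMobile resolves to the outer binding.
  const label = isMobile ? 'mobile' : 'desktop';
  const isNarrow = true;
  return label + (isNarrow ? ' (narrow)' : '');
}
```

Calling `renderLeftBroken()` yields the string `ReferenceError`, while `renderLeftFixed()` reads the outer value as intended.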
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes #326: the packets page crashes mobile browsers (iOS Safari, Edge)
by loading 50K+ packets when no time filter is persisted in
localStorage.
## Root Cause
Two problems in `public/packets.js`:
### Bug 1: `savedTimeWindowMin` defaults to 0 instead of 15
`localStorage.getItem('meshcore-time-window')` returns `null` when never
set, and `Number(null)` is `0`. The guard checked `< 0` but not `<= 0`, so
`savedTimeWindowMin = 0` meant "All time", fetching all 50K+ packets.
**Fix:** Changed `< 0` to `<= 0` in both the initialization guard (line 30)
and the change handler (line 758).
### Bug 2: No mobile protection against large packet loads
Even with valid large time windows, mobile browsers crash under the
weight of thousands of DOM rows and packet data (~1.4 GB WebKit memory
limit).
**Fix:**
- Detect mobile viewport: window.innerWidth <= 768
- Cap limit at 1000 on mobile (vs 50000 on desktop)
- Disable 6h/12h/24h options and hide "All time" on mobile
- Reset persisted windows >3h to 15 min on mobile
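Both fixes can be summarized in a short sketch. The function names and constants below are hypothetical and the real logic is inline in `public/packets.js`; the values (15 min default, 180 min mobile cap, 1000/50000 limits) come from this PR.

```javascript
const DEFAULT_WINDOW_MIN = 15;
const MOBILE_MAX_WINDOW_MIN = 180; // 3h: larger persisted windows are reset on mobile

// Bug 1: null, "0", negative, and NaN all fall back to the 15 minute default.
function resolveSavedTimeWindow(raw) {
  const n = Number(raw); // Number(null) is 0, which must not mean "All time"
  return Number.isFinite(n) && n > 0 ? n : DEFAULT_WINDOW_MIN;
}

// Bug 2: smaller fetch limit on mobile.
function packetLimit(isMobile) {
  return isMobile ? 1000 : 50000;
}

// Bug 2: reset oversized persisted windows on mobile.
function effectiveTimeWindow(raw, isMobile) {
  const saved = resolveSavedTimeWindow(raw);
  return isMobile && saved > MOBILE_MAX_WINDOW_MIN ? DEFAULT_WINDOW_MIN : saved;
}
```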
## Testing
Added 9 unit tests in `test-frontend-helpers.js` covering:
- savedTimeWindowMin defaults to 15 when localStorage returns null
- savedTimeWindowMin defaults to 15 when localStorage returns "0"
- Valid values (60) are preserved
- Negative and NaN values default to 15
- PACKET_LIMIT is 1000 on mobile, 50000 on desktop
- Mobile caps large time windows (1440 → 15) but allows 180
All 218 frontend helper tests pass. Packet filter (62) and aging (29)
tests also pass.
## Changes
| File | Change |
|------|--------|
| `public/packets.js` | Fix `<= 0` guard, add mobile detection, cap limit, restrict time options |
| `public/index.html` | Cache buster bump |
| `test-frontend-helpers.js` | 9 new regression tests for time window defaults and mobile caps |
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Extends the hamburger menu activation breakpoint from max-width: 640px
to max-width: 1023px, making all 11 nav items accessible on tablets and
small laptops where they were previously clipped/invisible.
Fixes #322
## Changes
### public/style.css
- New @media (max-width: 1023px) block activates the hamburger menu and
vertical drawer
- Drawer has max-height: calc(100dvh - 52px) with overflow-y: auto for
scrollability
- z-index set to 1100 (consistent with nav layer)
- `body.nav-open` locks background scroll when drawer is open
- Mobile-only rules (brand-text hidden, tighter nav-right gap) remain at
640px
### public/app.js
- Extracted closeNav() helper for consistent drawer close behavior
- Hamburger toggle now adds/removes the `body.nav-open` class
- Drawer closes on: nav link click, Escape key, and route change (SPA
navigation)
### public/index.html
- Cache busters bumped for all CSS/JS assets
## What's NOT changed
- Desktop layout (>=1024px) is completely untouched
- No Priority+ pattern (Phase 2)
- No map layout changes (Phase 3)
- No new dependencies
## Testing
- All 308 frontend tests pass (`test-frontend-helpers.js`,
`test-packet-filter.js`, `test-aging.js`)
- Visual verification: hamburger activates at <=1023px, full bar at
>=1024px
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes #303: repeater hash stats now reflect the **latest advert**
instead of the historical mode (most frequent).
When a node is reconfigured (e.g. from 1-byte to 2-byte hash size), the
analytics and node detail pages now show the updated value immediately
after the next advert is received.
## Changes
### cmd/server/store.go
1. **`computeNodeHashSizeInfo`**: Changed hash size determination from
statistical mode to latest advert. The most recent advert in
chronological order now determines `hash_size`. The `hash_sizes_seen` and
`hash_size_inconsistent` tracking is preserved for multi-byte analytics.
2. **`computeAnalyticsHashSizes`**: Two fixes:
   - **`byNode` keyed by pubKey** instead of name, so same-name nodes with
   different public keys are counted separately in `distributionByRepeaters`.
   - **Zero-hop adverts included**: advert originator tracking now happens
   before the hops check, so zero-hop adverts contribute to per-node stats.
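The "latest advert wins" rule can be sketched as below. The real `computeNodeHashSizeInfo` is Go code in `cmd/server/store.go`; this JavaScript version is illustrative only, and the advert shape `{ timestamp, hashSize }` and the function name `hashSizeInfo` are assumptions.

```javascript
// Illustrative sketch: pick hash_size from the most recent advert, not the mode.
function hashSizeInfo(adverts) {
  if (adverts.length === 0) return null; // no ADVERT packets: nothing to report
  const sorted = [...adverts].sort((a, b) => a.timestamp - b.timestamp);
  const seen = [...new Set(sorted.map(a => a.hashSize))].sort((a, b) => a - b);
  return {
    hash_size: sorted[sorted.length - 1].hashSize, // latest advert wins
    hash_sizes_seen: seen,
    hash_size_inconsistent: seen.length > 1,
  };
}
```

With four old 1-byte adverts and one newer 2-byte advert, the mode would still say 1, while this rule reports 2 and flags the node as inconsistent.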
### cmd/server/routes_test.go
Added 4 new tests:
- TestGetNodeHashSizeInfoLatestWins — 4 historical 1-byte adverts + 1
recent 2-byte advert → hash size should be 2 (not 1 from mode)
- TestGetNodeHashSizeInfoNoAdverts — node with no ADVERT packets →
graceful nil, no crash
- TestAnalyticsHashSizeSameNameDifferentPubkey — two nodes named
"SameName" with different pubkeys → counted as 2 separate entries
- Updated TestGetNodeHashSizeInfoDominant comment to reflect new
behavior
## Context
Community report from contributor @kizniche: after reconfiguring a
repeater from 1-byte to 2-byte hash and sending a flood advert, the
analytics page still showed 1-byte. Root cause was the mode-based
computation which required many new adverts to shift the majority. The
upstream firmware bug causing stale path bytes
(meshcore-dev/MeshCore#2154) has been fixed, making the latest advert
reliable.
## Testing
- `go vet ./...` — clean
- `go test ./... -count=1` — all tests pass (including 4 new ones)
- `cmd/ingestor` tests — pass
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes the poor contrast in the node side pane's "Paths through this
node" section in dark mode.
## Root Cause
.node-detail-section (side pane) had no background or border — it
inherited the lighter --detail-bg (#232340) from .panel-right. The same
content on the full detail page sits inside .node-full-card which uses
the darker --card-bg (#1a1a2e) + a visible border, giving it proper
contrast.
| Context | Container | Background | Contrast |
|---------|-----------|------------|----------|
| Full detail page | `.node-full-card` | `--card-bg` (darker) | ✅ Good |
| Side pane | `.node-detail-section` | inherited `--detail-bg` (lighter) | ❌ Poor |
## Fix
Give `.node-detail-section` the same card treatment as `.node-full-card`:
```css
.node-detail-section {
  background: var(--card-bg);
  border: 1px solid var(--border);
  border-radius: 8px;
  padding: 12px;
  margin-bottom: 8px;
}
```
- All colors use CSS variables — no hardcoded hex values
- Both light and dark themes benefit from the card treatment
- No JS changes needed — CSS-only fix
- Cache busters bumped in the same commit
Fixes #334
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Surfaces transport route types in the packets view by adding a **"T"
badge** next to the payload type badge for packets with
`TRANSPORT_FLOOD` (route type 0) or `TRANSPORT_DIRECT` (route type 3)
routes.
This helps mesh analysis — communities can quickly identify transported
packets and gain insights into scope usage adoption.
Closes #241
## What Changed
### Frontend (`public/`)
- **app.js**: Added `isTransportRoute(rt)` and `transportBadge(rt)`
helper functions that render a `<span class="badge
badge-transport">T</span>` badge with the full route type name as a
tooltip
- **packets.js**: Applied `transportBadge()` in all three packet row
render paths:
  - Flat (ungrouped) packet rows
  - Grouped packet header rows
  - Grouped packet child rows
- **style.css**: Added `.badge-transport` class with amber styling and
CSS variable support (`--transport-badge-bg`, `--transport-badge-fg`)
for theme customization
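A possible shape for the two helpers is sketched below. The function names and the badge markup come from this PR, but the bodies are assumptions, including the use of a `title` attribute for the tooltip.

```javascript
// Route type names taken from this PR; only the transport variants are listed.
const TRANSPORT_ROUTE_NAMES = { 0: 'TRANSPORT_FLOOD', 3: 'TRANSPORT_DIRECT' };

// True for route types 0 (TRANSPORT_FLOOD) and 3 (TRANSPORT_DIRECT).
function isTransportRoute(rt) {
  return rt === 0 || rt === 3;
}

// Returns the "T" badge HTML for transport routes, or '' for everything else.
function transportBadge(rt) {
  if (!isTransportRoute(rt)) return '';
  // Assumed: the tooltip is provided through the title attribute.
  return '<span class="badge badge-transport" title="' +
    TRANSPORT_ROUTE_NAMES[rt] + '">T</span>';
}
```

Returning an empty string for non-transport routes lets the row templates concatenate the result unconditionally.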
### Backend (`cmd/server/`)
- **decoder_test.go**: Added 6 new tests covering:
  - `TestDecodeHeader_TransportFlood` — verifies route type 0 decodes as
  TRANSPORT_FLOOD
  - `TestDecodeHeader_TransportDirect` — verifies route type 3 decodes as
  TRANSPORT_DIRECT
  - `TestDecodeHeader_Flood` — verifies route type 1 (non-transport)
  decodes correctly
  - `TestIsTransportRoute` — verifies the helper identifies transport vs
  non-transport routes
  - `TestDecodePacket_TransportFloodHasCodes` — verifies transport codes
  are extracted from T_FLOOD packets
  - `TestDecodePacket_FloodHasNoCodes` — verifies FLOOD packets have no
  transport codes
## Visual
In the packets table Type column, transport packets now show:
```
[Channel Msg] [T] ← transport packet
[Channel Msg] ← normal flood packet
```
The "T" badge has an amber color scheme and shows the full route type
name on hover.
## Tests
- All Go tests pass (`cmd/server` and `cmd/ingestor`)
- All frontend tests pass (`test-packet-filter.js`, `test-aging.js`,
`test-frontend-helpers.js`)
- Cache busters bumped in `index.html`
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
Two data integrity bugs in the Go ingestor cause observer metadata and
signal quality data to be missing for all Go-backend users.
### #320 — Observer metadata never populated
`extractObserverMeta()` reads `battery_mv`, `uptime_secs`, and
`noise_floor` from the **top level** of the MQTT status message.
However, the actual MQTT payload nests these under a `stats` object:
```json
{
  "status": "online",
  "origin": "ObserverName",
  "model": "Heltec V3",
  "firmware_version": "v1.14.0-9f1a3ea",
  "stats": {
    "battery_mv": 4174,
    "uptime_secs": 80277,
    "noise_floor": -110
  }
}
```
Result: battery, uptime, and noise floor are always NULL in the
database.
### #321 — SNR and RSSI always missing on raw packets
The raw packet handler reads `msg["SNR"]` and `msg["RSSI"]` (uppercase
only). Some MQTT bridges send these as lowercase `snr`/`rssi`. The
companion BLE handler already has a case-insensitive fallback — the raw
packet path did not.
Result: SNR/RSSI are NULL for all raw packet observations from bridges
that use lowercase keys.
## Fix
### #320 — Nested stats with top-level fallback
- Added `nestedOrTopLevel()` helper that checks `msg["stats"][key]`
first, then `msg[key]`
- `extractObserverMeta` now uses this helper for `battery_mv`,
`uptime_secs`, `noise_floor`
- Top-level fallback preserved for backward compatibility with bridges
that flatten the structure
- Safe type assertion: `stats, _ :=
msg["stats"].(map[string]interface{})` — no crash if stats is missing or
wrong type
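The lookup order can be sketched as follows. The real `nestedOrTopLevel()` is a Go helper in the ingestor; this JavaScript version is only an illustration of the nested-first, top-level-fallback behavior, and returning `null` for a missing key is an assumption.

```javascript
// Illustrative sketch: prefer msg.stats[key], fall back to msg[key], else null.
function nestedOrTopLevel(msg, key) {
  const has = (obj, k) => Object.prototype.hasOwnProperty.call(obj, k);
  const stats = msg.stats;
  // Only treat stats as a map when it is a plain, non-null, non-array object.
  const statsIsMap = stats !== null && typeof stats === 'object' && !Array.isArray(stats);
  if (statsIsMap && has(stats, key)) return stats[key];
  return has(msg, key) ? msg[key] : null;
}
```

When `stats` is missing or has the wrong type, the lookup degrades to the flat structure without raising an error.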
### #321 — Lowercase SNR/RSSI fallback
- Raw packet handler now uses `else if` to check lowercase `snr`/`rssi`
when uppercase keys are absent
- Matches the pattern already used in the companion channel and direct
message handlers
## Tests
10 new test cases added:
| Test | What it verifies |
|------|-----------------|
| `TestExtractObserverMetaNestedStats` | All 5 fields populated from nested stats object |
| `TestExtractObserverMetaNestedStatsPrecedence` | Nested stats wins over top-level when both present |
| `TestExtractObserverMetaFlatFallback` | Flat structure still works (backward compat) |
| `TestExtractObserverMetaEmptyStats` | Empty stats object: no crash, model still works |
| `TestExtractObserverMetaStatsNotAMap` | stats is a string: no crash, falls back to top-level |
| `TestExtractObserverMetaNoiseFloorFloat` | Float precision preserved (noise_floor REAL migration) |
| `TestHandleMessageWithLowercaseSNRRSSI` | Lowercase snr/rssi both stored correctly |
| `TestHandleMessageSNRRSSIUppercaseWins` | When both cases present, uppercase takes precedence |
| `TestHandleMessageNoSNRRSSI` | Neither key present: nil, no crash |
| Existing `TestExtractObserverMeta` | Still passes (flat structure backward compat) |
All tests pass: `go test ./... -count=1` and `go vet ./...` clean.
Closes #320
Closes #321
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
The self-hosted runner (`meshcore-runner-2`) filled its 29GB disk to
100%, blocking all CI runs:
```
Filesystem Size Used Avail Use%
/dev/root 29G 29G 2.3M 100%
Docker Images: 67 total, 2 active, 18.83GB reclaimable (99%)
```
Root cause: no Docker image cleanup after builds. Each CI run builds a
new image but never prunes old ones.
## Fix
### 1. Docker image cleanup after deploy (`deploy` job)
- Runs with `if: always()` so it executes even if deploy fails
- `docker image prune -af --filter "until=24h"` — removes images older
than 24h (safe: current build is minutes old)
- `docker builder prune -f --keep-storage=1GB` — caps build cache
- Logs before/after `docker system df` for visibility
### 2. Runner log cleanup at start of E2E job
- Prunes runner diagnostic logs older than 3 days (was 53MB and growing)
- Reports `df -h` for disk visibility in CI output
## Impact
After manual cleanup today, disk went from 100% → 35% (19GB free). This
PR prevents recurrence.
## Test plan
- [x] Manual cleanup verified on runner via `az vm run-command`
- [ ] Next CI run should show cleanup step output in deploy job logs
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Several features and fixes from a live deployment of the Go v3.0.0
backend.
### geo_filter — full enforcement
- **Go backend config** (`cmd/server/config.go`,
`cmd/ingestor/config.go`): added `GeoFilterConfig` struct so
`geo_filter.polygon` and `bufferKm` from `config.json` are parsed by
both the server and ingestor
- **Ingestor** (`cmd/ingestor/geo_filter.go`, `cmd/ingestor/main.go`):
ADVERT packets from nodes outside the configured polygon + buffer are
dropped *before* any DB write — no transmission, node, or observation
data is stored
- **Server API** (`cmd/server/geo_filter.go`, `cmd/server/routes.go`):
`GET /api/config/geo-filter` endpoint returns the polygon + bufferKm to
the frontend; `/api/nodes` responses filter out any out-of-area nodes
already in the DB
- **Frontend** (`public/map.js`, `public/live.js`): blue polygon overlay
(solid inner + dashed buffer zone) on Map and Live pages, toggled via
"Mesh live area" checkbox, state shared via localStorage
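The core of a polygon filter is a point-in-polygon test. The sketch below uses the standard ray-casting algorithm as one way to do this; it is not taken from `geo_filter.go`, it ignores `bufferKm` entirely, it treats coordinates as planar, and the `[lat, lon]` pair format for polygon vertices is an assumption.

```javascript
// Ray-casting point-in-polygon test (buffer handling deliberately omitted).
// polygon is an array of [lat, lon] vertex pairs (assumed format).
function pointInPolygon(lat, lon, polygon) {
  let inside = false;
  for (let i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    const [latI, lonI] = polygon[i];
    const [latJ, lonJ] = polygon[j];
    // Does the edge (j -> i) straddle the point's longitude, and is the point
    // on the low-latitude side of the edge at that longitude?
    const crosses = (lonI > lon) !== (lonJ > lon) &&
      lat < ((latJ - latI) * (lon - lonI)) / (lonJ - lonI) + latI;
    if (crosses) inside = !inside; // each crossing toggles inside/outside
  }
  return inside;
}
```

An ingestor-side check along these lines would run on the advert's coordinates before any DB write, so rejected nodes leave no trace.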
### Automatic DB pruning
- Add `retention.packetDays` to `config.json` to delete transmissions +
observations older than N days on a daily schedule (1 min after startup,
then every 24h). Nodes and observers are never pruned.
- `POST /api/admin/prune?days=N` for manual runs (requires `X-API-Key`
header if `apiKey` is set)
```json
"retention": {
  "nodeDays": 7,
  "packetDays": 30
}
```
### tools/geofilter-builder.html
Standalone HTML tool (no server needed) — open in browser, click to
place polygon points on a Leaflet map, set `bufferKm`, copy the
generated `geo_filter` JSON block into `config.json`.
### scripts/prune-nodes-outside-geo-filter.py
Utility script to clean existing out-of-area nodes from the database
(dry-run + confirm). Useful after first enabling geo_filter on a
populated DB.
### HB column in packets table
Shows the hop hash size in bytes (1–4) decoded from the path byte of
each packet's raw hex. Displayed as **HB** between Size and Type
columns, hidden on small screens.
## Test plan
- [x] ADVERT from node outside polygon is not stored (no new row in
nodes or transmissions)
- [x] `GET /api/config/geo-filter` returns polygon + bufferKm when
configured, `{polygon: null, bufferKm: 0}` when not
- [x] `/api/nodes` excludes nodes outside polygon even if present in DB
- [x] Map and Live pages show blue polygon overlay when configured;
checkbox toggles it
- [x] `retention.packetDays: 30` deletes old transmissions/observations
on startup and daily
- [x] `POST /api/admin/prune?days=30` returns `{deleted: N, days: 30}`
- [x] `tools/geofilter-builder.html` opens standalone, draws polygon,
copies valid JSON
- [x] HB column shows 1–4 for all packets in grouped and flat view
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- fix packets initial load to honor persisted `meshcore-time-window`
before the filter UI is rendered
- keep the dropdown and effective query window in sync via a shared
`savedTimeWindowMin` value
- add a frontend regression test to ensure `loadPackets()` falls back to
persisted time window when `#fTimeWindow` is not yet present
- bump cache busters in `public/index.html`
## Root cause
`loadPackets()` could run before the filter bar existed, so
`document.getElementById('fTimeWindow')` was null and it fell back to
`15` minutes even though localStorage had a different saved value.
## Testing
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Fix: unprotect /api/decode from API key auth
Fixes #304
### Problem
PR #283 applied `requireAPIKey` to all POST endpoints including
`/api/decode`. But BYOP decode is a stateless read-only decoder — it
never writes to the database. Users see "write endpoints disabled" when
trying to decode packets.
### Fix
- Removed `requireAPIKey` wrapper from `/api/decode` in
`cmd/server/routes.go`
- Updated auth tests to use `/api/perf/reset` (actual write endpoint)
instead of `/api/decode`
- Added tests proving `/api/decode` works without API key, even when
apiKey is configured or empty
### Note
Decoder consolidation (`internal/decoder/` shared package) is tracked
separately and not included here to keep the PR clean.
### Tests
- `cd cmd/server && go test ./...` ✅
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- show relative build age next to the commit hash in the nav stats
version badge (e.g. `abc1234 (3h ago)`)
- use `stats.buildTime` from `/api/stats` and existing `timeAgo()`
formatting in `public/app.js`
- keep behavior unchanged when `buildTime` is missing/unknown
## What changed
- updated `formatVersionBadge()` signature to accept `buildTime`
- appended a `build-age` span after the commit link when `buildTime` is
valid
- passed `stats.buildTime` from `updateNavStats()`
- updated frontend helper tests for the new function signature
- added regression tests for build-age rendering/skip behavior
- bumped cache busters in `public/index.html`
## API check
- verified Go server already exposes `buildTime` on `/api/stats` and
`/api/health` via `cmd/server/routes.go`
- no backend API changes required
## Tests
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
All passed locally.
## Browser validation
- Not run in this environment (no browser session available).
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass region through channel message routes, apply DB/store filtering, normalize IATA at read and write boundaries, and add regression coverage for routes/server/ingestor.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds rule 13 to AGENTS.md: mandatory git worktree isolation for parallel
agents.
- Implementation agents must work in dedicated worktrees, never the main
checkout
- Review agents must read remote branches via `git show
origin/<branch>:`, never the working tree
- Prevents the stale-worktree bug that caused multiple incorrect reviews
tonight
## Summary
This PR performs a **one-time line-ending normalization** with **no
functional changes**.
- Normalized previously CRLF-indexed files to LF using `git add
--renormalize .`
- Added `.git-blame-ignore-revs` to preserve usable blame history for
this bulk formatting-only commit
- Added explicit LF enforcement for shell scripts in `.gitattributes`:
- `manage.sh text eol=lf`
- `*.sh text eol=lf`
## Why this is safe
- This is a text normalization pass only (CRLF → LF)
- No logic, behavior, APIs, or runtime paths were changed
- Review diff noise is expected due to line-ending-only rewrites
## File-count context
- There were **148 known CRLF-indexed files** targeted for normalization
- The renormalization pass touched **151 files total** in this
repository snapshot
## Blame preservation
GitHub reads `.git-blame-ignore-revs` natively, so blame views can skip
the normalization commit.
For local git blame setup:
```bash
git config blame.ignoreRevsFile .git-blame-ignore-revs
```
## Validation
Executed required no-regression checks:
```bash
node test-frontend-helpers.js && node test-packet-filter.js && node test-aging.js
```
All passed.
## Summary
- fix safe .env parser in manage.sh to expand a leading ~ before export
- ensure setup-time PROD_DATA_DIR read from .env also expands ~
- keep behavior unchanged for non-tilde values
## Why
A quoted `export` assignment does not perform tilde expansion, so values
like `PROD_DATA_DIR=~/meshcore-data` stayed literal and broke path-based
operations in manage.sh.
## Validation
- `bash -n manage.sh`
- manual reasoning: PROD_DATA_DIR=~/meshcore-data now resolves to
$HOME/meshcore-data
## Notes
Docker Compose handling is unchanged (compose already expands ~); this
PR only fixes manage.sh runtime parsing.
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- add `DISABLE_MOSQUITTO` support in container startup by switching
supervisord config when disabled
- add a no-mosquitto supervisord config
(`docker/supervisord-go-no-mosquitto.conf`)
- fix Compose port mapping regression so host ports map to fixed
internal listener ports (`80`, `443`, `1883`)
- add compose variants without MQTT port publishing
(`docker-compose.no-mosquitto.yml`,
`docker-compose.staging.no-mosquitto.yml`)
- update `manage.sh` setup flow to ask `Use built-in MQTT broker?
[Y/n]`, skip MQTT port prompt when disabled, persist
`DISABLE_MOSQUITTO`, and use no-mosquitto compose files when
starting/stopping/restarting
- align `.env.example` staging keys with compose
(`STAGING_GO_HTTP_PORT`, `STAGING_GO_MQTT_PORT`)
- fix staging Caddyfile generation to use `STAGING_GO_HTTP_PORT`
- fix `.env.example` staging default comments to match actual values
(82/1885)
## Validation performed
- ✅ `bash -n manage.sh` passes.
- ✅ With `DISABLE_MOSQUITTO=true`, no-mosquitto compose overrides are
selected, Mosquitto is not started, and MQTT port is not published.
- ✅ With `DISABLE_MOSQUITTO=false`, standard compose files are used,
Mosquitto starts, and MQTT port mapping is present.
- ℹ️ Runtime Docker validation requires a running Docker host.
Fixes #267
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes frontend region crosstalk on Channels page by applying region
filtering to message fetches and live WS GRP_TXT handling.
## Changes
- Append `region` query param to channel message API calls in
`selectChannel` and `refreshMessages`.
- Add WS region guard in `public/channels.js` using observer→IATA map
with selected-region snapshot at handler entry.
- On region switch, reload channels and re-fetch selected channel
messages; if empty under selected region, clear pane and show `Channel
not available in selected region`.
- Bump cache busters in `public/index.html`.
- Add frontend helper tests for extracted WS region filter helper in
`test-frontend-helpers.js`.
## Validation
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
Refs #280
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- fix observer upsert write path in `cmd/ingestor` to persist identity
fields
- map status payload fields into observer metadata: `model`,
`firmware`/`firmware_version`, `client_version`/`clientVersion`, `radio`
- keep NULL-safe behavior when identity fields are missing
- add regression tests for identity persistence and missing-field
handling
## Root cause
The ingestor only wrote telemetry (`battery_mv`, `uptime_secs`,
`noise_floor`) and never included observer identity columns in the
upsert statement, leaving `model`, `firmware`, `client_version`, and
`radio` NULL on fresh DBs.
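A minimal sketch of the NULL-safe field mapping described above, not the ingestor's actual code. The helper name `firstString` and the `map[string]any` payload shape are assumptions for illustration; only the key names (`firmware`/`firmware_version`, `client_version`/`clientVersion`, `radio`) come from this PR.

```go
package main

import "fmt"

// firstString returns a pointer to the first non-empty string found under any
// of the given keys, or nil when none is present. The caller is expected to
// bind a nil result as SQL NULL. (Hypothetical helper, for illustration only.)
func firstString(payload map[string]any, keys ...string) *string {
	for _, k := range keys {
		if v, ok := payload[k].(string); ok && v != "" {
			return &v
		}
	}
	return nil
}

func main() {
	// Example status payload; the values are made up.
	status := map[string]any{
		"firmware_version": "1.2.3",
		"clientVersion":    "0.9",
		"battery_mv":       4100,
	}

	firmware := firstString(status, "firmware", "firmware_version")
	client := firstString(status, "client_version", "clientVersion")
	radio := firstString(status, "radio") // absent, stays nil

	fmt.Println(*firmware, *client, radio == nil)
}
```

Returning a pointer keeps "missing" distinct from "empty string", so an upsert can leave the column NULL instead of writing `""`.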
## Testing
- `cd cmd/ingestor && go test ./...`
- `cd cmd/server && go test ./...`
Fixes #295
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Fixes BYOP modal stacking on the Packets page by preventing duplicate
global click handlers and enforcing a single BYOP overlay instance.
## Root cause
Packets page init could register document-level click handlers
repeatedly across SPA navigations. Clicking BYOP then spawned multiple
overlays, and each close action removed only one layer.
## Changes
- `public/packets.js`
- Added `bindDocumentHandler(...)` to de-duplicate document click
handlers.
- Applied it to packets action delegation, filter menu outside-click
close, and column menu close.
- Added `removeAllByopOverlays()` and call it before opening BYOP.
- Tagged BYOP overlay with `.byop-overlay` class.
- Updated close logic to remove all BYOP overlays in one click.
- Scoped BYOP result lookup to the active overlay
(`overlay.querySelector`).
- Added destroy cleanup for document handlers and stray BYOP overlays.
- `test-frontend-helpers.js`
- Added regression tests for:
- BYOP singleton overlay behavior
- one-click close removing all overlays
- document click handler de-dup logic
- `public/index.html`
- Bumped cache busters for JS/CSS assets.
## Validation
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
All passed locally.
Fixes #249
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Fix: Enforce LF line endings repo-wide
### Problem
Windows-based agents produce CRLF line endings, causing git diffs to
show every line as changed (1000+ line "rewrites" that are actually
20-line patches). This has hit us on `manage.sh`, `deploy.yml`, and
multiple PRs.
### Fix
Added `* text=auto eol=lf` to `.gitattributes`. Git will now:
- Store all text files as LF in the repo
- Convert CRLF to LF on commit (regardless of OS)
- Check out as LF on all platforms
Also marks common binary formats explicitly.
### Impact
Existing files with CRLF will be normalized on their next commit. No
functional changes.
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Implements issue #286 P1 frontend timestamp features on top of P0:
- Added global timestamp timezone toggle in Display tab (`Local time` /
`UTC`)
- Added absolute-mode timestamp format presets (`iso`, `iso-seconds`,
`locale`)
- Added optional custom format input (only when
`SITE_CONFIG.timestamps.allowCustomFormat === true`)
- Extended `formatTimestamp()` / `formatTimestampWithTooltip()` behavior
to honor timezone + format settings
- Preserved server defaults with localStorage override precedence
- Bumped `public/index.html` cache busters in same commit
## Details
### 1) Timezone toggle
- New Display tab control persisted to `meshcore-timestamp-timezone`
- Reads server default from `window.SITE_CONFIG.timestamps.timezone`
with fallback to `local`
- Formatting logic now supports both local and UTC absolute rendering
### 2) Format presets (absolute mode only)
- New Display tab preset dropdown (shown only when timestamp mode =
`absolute`)
- Presets implemented:
- `iso` → `YYYY-MM-DD HH:mm:ss`
- `iso-seconds` → `YYYY-MM-DD HH:mm:ss.SSS`
- `locale` → `toLocaleString()` (or UTC locale when timezone=utc)
- Persisted to `meshcore-timestamp-format`
- Reads server default from `window.SITE_CONFIG.timestamps.formatPreset`
(fallback `iso`)
### 3) Custom format string (guarded)
- Text input only renders when
`window.SITE_CONFIG.timestamps.allowCustomFormat` is `true`
- Persisted to `meshcore-timestamp-custom-format`
- If non-empty and enabled, custom format overrides preset
- Frontend intentionally does not hard-validate the format string;
unsupported patterns fall back to preset behavior
## Tests
Executed required test commands:
```bash
node test-frontend-helpers.js
node test-packet-filter.js
node test-aging.js
```
Added coverage in `test-frontend-helpers.js` for:
- UTC output behavior
- Local output behavior
- `iso-seconds` includes milliseconds
- `locale` format behavior
All passed locally.
Refs #286
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
## Fix: FAQ save no longer wipes other home page sections
Fixes #284
### Problem
Editing FAQ in the customizer and saving caused other home page sections
(steps, footer links, hero text) to disappear on reload. Colors could
also reset.
### Root cause
`initState()` in `customize.js` used `||` (OR) logic for the `home`
object — if localStorage had *any* `home.checklist`, it took that and
ignored the server config for other fields. Partial localStorage data
replaced the full server config instead of merging on top.
### Fix
Changed `initState()` to properly layer: `DEFAULTS → server config →
localStorage` for all sections. Each field merges independently — a
partial localStorage save (e.g., only checklist) no longer wipes steps,
footerLinks, or hero fields. Same merge pattern applied to all theme
sections for consistency.
### Files changed
- `public/customize.js` — `initState()` merge logic
- `public/index.html` — cache buster bump
- `test-frontend-helpers.js` — regression tests:
1. Partial localStorage (checklist only) preserves steps/footerLinks
2. Server config values survive partial local overrides
3. Full localStorage properly overrides server config
### Testing
- `node test-frontend-helpers.js` ✅
- `node test-packet-filter.js` ✅
- `node test-aging.js` ✅
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Add a new **Display** tab to the customizer tab bar (between Home Page
and Export / Save).
- Move timestamp-related **UI Settings** out of Branding into the new
Display tab.
- Keep Branding focused on site identity fields (name, tagline, logo,
favicon).
- Bump `public/index.html` cache busters so updated frontend assets load
immediately.
## Testing
- `node test-frontend-helpers.js`
- `node test-packet-filter.js`
- `node test-aging.js`
Fixes #293
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
Repeaters with 2-byte adverts occasionally appear as 1-byte on the map
and in stats.
**Root cause:** `computeNodeHashSizeInfo()` sets `HashSize` by
overwriting on every packet (`ni.HashSize = hs`), so the last advert
processed wins — regardless of how many previous packets correctly
showed 2-byte.
When a node sends an ADVERT directly (no relay hops), the path byte
encodes `hashCount=0`. Some firmware sets the full path byte to `0x00`
in this case, which decodes as `hashSize=1` even if the node normally
uses 2-byte hashes. If this packet happens to be the last one iterated,
the node shows as 1-byte.
## Fix
Compute the **mode** (most frequent hash size) across all observed
adverts instead of using the last-seen value. On a tie, prefer the
larger value.
```go
counts := make(map[int]int, len(ni.AllSizes))
for _, hs := range ni.Seq {
	counts[hs]++
}
best, bestCount := 1, 0
for hs, cnt := range counts {
	if cnt > bestCount || (cnt == bestCount && hs > best) {
		best = hs
		bestCount = cnt
	}
}
ni.HashSize = best
```
A node with 4× hashSize=2 and 1× hashSize=1 now correctly reports
`HashSize=2`.
## Test
`TestGetNodeHashSizeInfoDominant`: seeds 5 adverts (4× 2-byte, 1×
1-byte) and asserts `HashSize=2`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Adds `GeoFilter` struct to `Config` in `cmd/server/config.go` so
`geo_filter.polygon` and `bufferKm` from `config.json` are parsed by the
Go backend
- Adds `GET /api/config/geo-filter` endpoint in `cmd/server/routes.go`
returning the polygon + bufferKm to the frontend
- Restores the blue polygon overlay (solid inner + dashed buffer zone)
on the **Map** page (`public/map.js`)
- Restores the same overlay on the **Live** page (`public/live.js`),
toggled via the "Mesh live area" checkbox
## Test plan
- [x] `GET /api/config/geo-filter` returns `{ polygon: [...], bufferKm:
N }` when configured
- [x] `GET /api/config/geo-filter` returns `{ polygon: null, bufferKm: 0
}` when not configured
- [x] Map page shows blue polygon overlay when `geo_filter.polygon` is
set in config
- [x] Live page shows same overlay, checkbox state shared via
localStorage
- [x] Checkbox is hidden when no polygon is configured
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- added API key middleware for write routes in cmd/server/routes.go
- protected all current non-GET API routes (POST /api/packets, POST
/api/perf/reset, POST /api/decode)
- middleware enforces X-API-Key against cfg.APIKey and returns 401 JSON
error on missing/wrong key
- preserves backward compatibility: if `apiKey` is empty, requests pass
through
- added startup warning log in cmd/server/main.go when no API key is
configured:
- [security] WARNING: no apiKey configured — write endpoints are
unprotected
- added route tests for missing/wrong/correct key and empty-apiKey
compatibility
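The middleware behaviour listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `cmd/server/routes.go`: the JSON error body, the `statusFor` demo helper, and the handler wiring are invented here; only the `X-API-Key` header, the 401 response, and the empty-key pass-through come from this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireAPIKey wraps next so that requests must carry a matching X-API-Key
// header. An empty configured key disables the check (backward compatible).
func requireAPIKey(apiKey string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if apiKey != "" && r.Header.Get("X-API-Key") != apiKey {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusUnauthorized)
			fmt.Fprint(w, `{"error":"unauthorized"}`) // error body is an assumption
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor is a demo helper: it sends one POST through the middleware and
// returns the resulting status code.
func statusFor(configuredKey, sentKey string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodPost, "/api/perf/reset", nil)
	if sentKey != "" {
		req.Header.Set("X-API-Key", sentKey)
	}
	rec := httptest.NewRecorder()
	requireAPIKey(configuredKey, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("secret", ""))       // missing key
	fmt.Println(statusFor("secret", "wrong"))  // wrong key
	fmt.Println(statusFor("secret", "secret")) // correct key
	fmt.Println(statusFor("", ""))             // no key configured
}
```

A production version might compare keys with `crypto/subtle.ConstantTimeCompare` to avoid timing leaks; the plain comparison here keeps the sketch short.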
## Validation
- cd cmd/server && go test ./... ✅
## Notes
- config.example.json already contains `apiKey`, so no changes were
required.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 1/2/3-byte selector to Hash Issues analytics page
- 1-byte and 2-byte modes show 16×16 matrix with stat cards (nodes
tracked, using N-byte ID, prefix space used, prefix collisions)
- 3-byte mode shows summary stat cards instead of unrenderable grid
- Fix "Nodes tracked" to always show total node count across all modes
- Use CSS variable colours for matrix cells (light/dark mode compatible)
- Replace native title tooltips with custom styled popovers
- Hide collision risk card when 3-byte mode is selected
- Fix double-tooltip bug on mode switch via _matrixTipInit guard
- Fix tooltip persisting outside matrix grid on mouseleave
https://dev.ve7kod.ca/#/analytics
Hash Issues
---------
Co-authored-by: Jesse <your@email.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #276
## Root cause
TRACE packets store hop IDs in the payload (bytes 9+) rather than in the
header path field. The header path field is overloaded in TRACE packets
to carry RSSI values instead of repeater IDs (as noted in the issue
comments). This meant `Path.Hops` was always empty for TRACE packets —
the raw bytes ended up as an opaque `PathData` hex string with no
structure.
The hashSize encoded in the header path byte (bits 6–7) is still valid
for TRACE and is used to split the payload path bytes into individual
hop prefixes.
## Fix
After decoding a TRACE payload, if `PathData` is non-empty, parse it
into individual hops using `path.HashSize`:
```go
if header.PayloadType == PayloadTRACE && payload.PathData != "" {
	pathBytes, err := hex.DecodeString(payload.PathData)
	if err == nil && path.HashSize > 0 {
		for i := 0; i+path.HashSize <= len(pathBytes); i += path.HashSize {
			path.Hops = append(path.Hops, ...)
		}
	}
}
```
Applied to both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go`.
## Verification
Packet from the issue: `260001807dca00000000007d547d`
| | Before | After |
|---|---|---|
| `Path.Hops` | `[]` | `["7D", "54", "7D"]` |
| `Path.HashCount` | `0` | `3` |
New test `TestDecodeTracePathParsing` covers this exact packet.
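For readers who want to reproduce the table by hand, here is a standalone sketch of the hop split. It is not the decoder's code: `splitHops` and `traceHops` are hypothetical names, and the layout is simplified to packets like the one above (one header byte, one path byte with an empty header path, no transport codes). The `value + 1` reading of bits 6–7 is inferred from the 1–4 byte range mentioned elsewhere in this log.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// splitHops cuts pathBytes into hashSize-byte hop prefixes, rendered as
// upper-case hex. Trailing bytes that do not fill a whole hop are ignored.
func splitHops(pathBytes []byte, hashSize int) []string {
	var hops []string
	if hashSize <= 0 {
		return hops
	}
	for i := 0; i+hashSize <= len(pathBytes); i += hashSize {
		hops = append(hops, strings.ToUpper(hex.EncodeToString(pathBytes[i:i+hashSize])))
	}
	return hops
}

// traceHops decodes a raw TRACE packet under the simplified layout described
// above: byte 0 is the header, byte 1 the path byte, and the payload follows
// immediately, with hop IDs starting at payload offset 9.
func traceHops(rawHex string) ([]string, error) {
	raw, err := hex.DecodeString(rawHex)
	if err != nil {
		return nil, err
	}
	if len(raw) < 2+9 {
		return nil, fmt.Errorf("packet too short: %d bytes", len(raw))
	}
	hashSize := int(raw[1]>>6) + 1 // bits 6-7 of the path byte (assumed: value + 1)
	payload := raw[2:]
	return splitHops(payload[9:], hashSize), nil
}

func main() {
	hops, err := traceHops("260001807dca00000000007d547d")
	fmt.Println(hops, err)
}
```

With the packet from the issue, the path byte is `0x00`, so the hash size is 1 and the three trailing payload bytes become three one-byte hops.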
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
The in-memory `PacketStore` had **no eviction or aging** — it grew
unbounded until OOM killed the process. At ~3K packets/hour and ~5KB per
packet (not the 450 bytes previously estimated), an 8GB VM would OOM in
a few days.
## Changes
### Time-based eviction
- Configurable via `config.json`: `"packetStore": { "retentionHours": 24
}`
- Packets older than the retention window are evicted from the head of
the sorted slice
### Memory-based cap
- Configurable via `"packetStore": { "maxMemoryMB": 1024 }`
- Hard ceiling — evicts oldest packets when estimated memory exceeds the
cap
### Index cleanup
When a `StoreTx` is evicted, ALL associated data is removed from:
- `byHash`, `byTxID`, `byObsID`, `byObserver`, `byNode`, `byPayloadType`
- `nodeHashes`, `distHops`, `distPaths`, `spIndex`
### Periodic execution
- Background ticker runs eviction every 60 seconds
- Analytics caches and hash size cache are invalidated after eviction
### Stats fixes
- `estimatedMB` now uses ~5KB/packet + ~500B/observation (was 430B +
200B)
- `evicted` counter reflects actual evictions (was hardcoded to 0)
- Removed fake `maxPackets: 2386092` and `maxMB: 1024` from stats
### Config example
```json
{
  "packetStore": {
    "retentionHours": 24,
    "maxMemoryMB": 1024
  }
}
```
Both values default to 0 (unlimited) for backward compatibility.
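As a rough illustration of how the two limits interact, the sketch below applies the retention window first and the memory cap second, treating 0 as unlimited. The `packet` type, the `evict` function, and the `countAfterEvict` helper are hypothetical; the estimate uses only the ~5KB/packet figure and ignores observations and index cleanup, which the real store also handles.

```go
package main

import (
	"fmt"
	"time"
)

// packet is a stand-in for a stored transmission (hypothetical shape).
type packet struct {
	FirstSeen time.Time
}

const estBytesPerPacket = 5 * 1024 // ~5KB per packet; observations ignored here

// evict drops entries from the head of packets (sorted oldest first) that are
// older than the retention window or that push the estimate over the cap.
// A value of 0 for either limit means "unlimited".
func evict(packets []packet, now time.Time, retentionHours, maxMemoryMB int) []packet {
	if retentionHours > 0 {
		cutoff := now.Add(-time.Duration(retentionHours) * time.Hour)
		i := 0
		for i < len(packets) && packets[i].FirstSeen.Before(cutoff) {
			i++
		}
		packets = packets[i:]
	}
	if maxMemoryMB > 0 {
		maxPackets := maxMemoryMB * 1024 * 1024 / estBytesPerPacket
		if len(packets) > maxPackets {
			packets = packets[len(packets)-maxPackets:]
		}
	}
	return packets
}

// countAfterEvict builds packets with the given ages in hours (oldest first)
// and returns how many survive eviction. Demo helper only.
func countAfterEvict(agesHours []int, retentionHours, maxMemoryMB int) int {
	now := time.Date(2024, 1, 2, 0, 0, 0, 0, time.UTC)
	packets := make([]packet, 0, len(agesHours))
	for _, age := range agesHours {
		packets = append(packets, packet{FirstSeen: now.Add(-time.Duration(age) * time.Hour)})
	}
	return len(evict(packets, now, retentionHours, maxMemoryMB))
}

func main() {
	ages := []int{48, 30, 10, 2, 1}
	fmt.Println(countAfterEvict(ages, 0, 0))  // both limits disabled
	fmt.Println(countAfterEvict(ages, 24, 0)) // 24h retention only
}
```

Re-slicing alone does not release the evicted entries' memory while the backing array is still referenced, which is one reason the actual change also removes evicted data from every index.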
## Tests
- 7 new tests in `eviction_test.go` covering time-based, memory-based,
index cleanup, thread safety, config parsing, and no-op when disabled
- All existing tests pass unchanged
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Problem
The RF analytics `packetsPerHour` chart was counting **observations**
instead of **unique transmissions** per hour. With ~34 observations per
transmission on average, the chart showed ~5,645 packets/hr instead of
the correct ~163/hr.
**Evidence from prod API:**
- `packetsPerHour` total: 1,580,620 (sum of all hourly counts)
- `totalPackets`: 45,764
- That's a ~34× inflation — exactly the observations-per-transmission
ratio
## Root Cause
In `store.go`, the `hourBuckets[hr]++` counter was inside the
observations loop (both regional and non-regional paths). Other counters
like `packetSizes` and `typeBuckets` already deduplicate by hash —
`hourBuckets` was the only one that didn't.
## Fix
Added a `seenHourHash` map (keyed by `hash|hour`) to deduplicate. Each
unique transmission is counted once per hour bucket, matching how packet
sizes and payload types already work.
Both the regional observer path and the non-regional path are fixed. The
legacy path (transmissions without observations) was already correct
since it iterates per-transmission.
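The deduplication idea can be shown in isolation. This is a simplified sketch, not the code in `store.go`: the `observation` type and `packetsPerHour` function are hypothetical, and the hour bucket is just a string here. Only the `hash|hour` key and the `seenHourHash` / `hourBuckets` names come from the fix.

```go
package main

import "fmt"

// observation is one sighting of a transmission by an observer
// (hypothetical shape for this sketch).
type observation struct {
	Hash string // transmission hash shared by all observations of one packet
	Hour string // hour bucket label; the format is illustrative
}

// packetsPerHour counts each unique transmission once per hour bucket,
// no matter how many observers reported it.
func packetsPerHour(obs []observation) map[string]int {
	hourBuckets := make(map[string]int)
	seenHourHash := make(map[string]struct{})
	for _, o := range obs {
		key := o.Hash + "|" + o.Hour
		if _, dup := seenHourHash[key]; dup {
			continue // already counted this transmission in this hour
		}
		seenHourHash[key] = struct{}{}
		hourBuckets[o.Hour]++
	}
	return hourBuckets
}

func main() {
	obs := []observation{
		{"a", "10"}, {"a", "10"}, {"a", "10"}, // one transmission, three observers
		{"b", "10"},
		{"a", "11"}, // same hash seen again in the next hour
	}
	counts := packetsPerHour(obs)
	fmt.Println(counts["10"], counts["11"])
}
```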
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
when bind-mounting a non-existent file. The entrypoint now detects
and removes directory config.json before falling back to example.
2. 'unable to open database file: out of memory (14)' — old container
(3GB) not fully exited when new one starts. Deploy now uses
'docker compose down' with timeout and waits for memory reclaim.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
startretries to 10 and startsecs to 2 for server and ingestor.
Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
Root cause: on the 8GB VM, both prod (~2.5GB) and staging (~2GB) containers
run simultaneously. During deploy, manage.sh would rm the old staging container
and immediately start a new one. The old container's memory wasn't reclaimed
yet, so the new one got 'unable to open database file: out of memory (14)'
from SQLite and both corescope-server and corescope-ingestor entered FATAL.
Fix:
- manage.sh restart staging: wait up to 15s for old container to fully exit,
plus 3s for OS memory reclamation before starting new container
- manage.sh restart staging: verify config.json exists before starting
- docker-compose.staging.yml: add deploy.resources.limits.memory=3g to
prevent staging from consuming unbounded memory
## Summary
Adds `distributionByRepeaters` to the `/api/analytics/hash-sizes`
endpoint in the **Go server**.
### Problem
PR #263 implemented this feature in the deprecated Node.js server
(server.js). All backend changes should go in the Go server at
`cmd/server/`.
### Solution
- For each hash size (1, 2, 3), count how many unique repeaters (nodes)
advertise packets with that hash size
- Uses the existing `byNode` map already computed in
`computeAnalyticsHashSizes()`
- Added to both the live response and the empty/fallback response in
routes.go
- Frontend changes from PR #263 (`public/analytics.js`) already render
this field — no frontend changes needed
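A small sketch of the counting step, assuming `byNode` can be reduced to a map from node key to hash size. That shape, the node keys, and the function name are assumptions for illustration; the real map in `computeAnalyticsHashSizes()` may carry more per-node data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// distributionByRepeaters counts unique nodes per hash size (1, 2, 3).
// Sizes outside that range are ignored. Keys are strings to match the
// JSON object shape of the response.
func distributionByRepeaters(byNode map[string]int) map[string]int {
	dist := map[string]int{"1": 0, "2": 0, "3": 0}
	for _, hashSize := range byNode {
		key := strconv.Itoa(hashSize)
		if _, ok := dist[key]; ok {
			dist[key]++
		}
	}
	return dist
}

func main() {
	// Hypothetical node keys mapped to their hash size.
	byNode := map[string]int{"nodeA": 1, "nodeB": 1, "nodeC": 2}
	out, _ := json.Marshal(map[string]any{
		"distributionByRepeaters": distributionByRepeaters(byNode),
	})
	fmt.Println(string(out))
}
```

Pre-seeding all three keys with 0 means the field always has the same keys, so an empty store and a live one can return the same structure.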
### Response shape
```json
{
  "distributionByRepeaters": { "1": 42, "2": 7, "3": 2 },
  ...existing fields...
}
```
### Testing
- All Go server tests pass
- Replaces PR #263 (which modified the wrong server)
Closes #263
---------
Co-authored-by: you <you@example.com>
`docker compose down` tears down the entire compose project, including prod.
`rm -sf` stops and removes just the named service.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Version/Commit/BuildTime now populated from package.json, git, and
date. Exported as env vars so docker compose build picks them up.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.
### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build
### Pipeline (master merge):
Same chain + deploy to staging + badge publishing
### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks
### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db
-public public-instrumented`
---------
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to reduce wall-clock time:
1. Reduce safeClick timeout from 3000ms to 500ms
- Elements either exist immediately after navigation or don't exist at all
- ~75 safeClick calls; if ~30 miss, saves ~75s of dead wait time
2. Replace 18 page.goto() calls with SPA hash navigation
- After initial page load, the SPA shell is already in the DOM
- page.goto() reloads the entire page (network round-trip + parse)
- Hash navigation via location.hash triggers the SPA router instantly
- Only 3 page.goto() remain: initial load + 2 home page loads after localStorage.clear()
3. Remove redundant final route sweep
- All 10 routes were already visited during the page-specific sections
- The sweep just re-navigated to pages that had already been exercised
- Saves ~2s of redundant navigation
Also:
- Reduce inter-route wait from 200ms to 50ms (SPA router is synchronous)
- Merge utility function + packet filter exercises into single evaluate() call
- Use navHash() helper for consistent hash navigation with 150ms settle time
The test 'Node perf page should NOT show Go Runtime section' asserts
Node.js-specific behavior, but E2E tests now run against the Go server
(per this PR), so Go Runtime info is correctly present. Remove the
now-irrelevant assertion.
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.
Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory
The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
Change healthThresholds config from milliseconds to hours for readability.
Config keys: infraDegradedHours, infraSilentHours, nodeDegradedHours, nodeSilentHours.
Defaults: infra degraded 24h, silent 72h; node degraded 1h, silent 24h.
- Config stored in hours, converted to ms at comparison time
- /api/config/client sends ms to frontend (backward compatible)
- Frontend tooltips use dynamic thresholds instead of hardcoded strings
- Added healthThresholds section to config.example.json
- Updated Go and Node.js servers, tests
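The hours-to-milliseconds conversion can be sketched as below. The function names, the `"active"` label, and the use of `>=` at the boundaries are assumptions; the threshold values are the defaults listed in this commit.

```go
package main

import "fmt"

// hoursToMs converts a threshold stored in hours to milliseconds, the unit
// used at comparison time and sent to the frontend.
func hoursToMs(hours float64) int64 {
	return int64(hours * 3600 * 1000)
}

// status classifies a node by how long ago it was last heard (in ms).
// The "active" label and the inclusive boundaries are assumptions.
func status(ageMs int64, degradedHours, silentHours float64) string {
	switch {
	case ageMs >= hoursToMs(silentHours):
		return "silent"
	case ageMs >= hoursToMs(degradedHours):
		return "degraded"
	default:
		return "active"
	}
}

func main() {
	// Defaults from this commit: infra 24h/72h, node 1h/24h.
	fmt.Println(hoursToMs(24))                  // infraDegradedHours in ms
	fmt.Println(status(hoursToMs(2), 1, 24))    // regular node last heard 2h ago
	fmt.Println(status(hoursToMs(2), 24, 72))   // infra node last heard 2h ago
	fmt.Println(status(hoursToMs(100), 24, 72)) // infra node last heard 100h ago
}
```

Using `float64` for the hour values lets a config express fractional thresholds such as `0.5` without changing the conversion.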
* docs: remove letsmesh.net reference from README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: remove paths-ignore from pull_request trigger
PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.
* ci: skip heavy CI jobs for docs-only PRs
Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.
This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.
Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.
* fix: update engine tests to match engine-badge HTML format
Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to the CI frontend test pipeline:
1. Run E2E tests and coverage collection concurrently
- Previously sequential (E2E ~1.5min, then coverage ~5.75min)
- Now both run in parallel against the same instrumented server
- Expected savings: ~5 min (coverage runs alongside E2E instead of after)
2. Replace networkidle with domcontentloaded in coverage collector
- SPA uses hash routing — networkidle waits 500ms for network silence
on every navigation, adding ~10-15s of dead time across 23 navigations
- domcontentloaded fires immediately once HTML is parsed; JS initializes
the route handler synchronously
- For in-page hash changes, use 200ms setTimeout instead of
waitForLoadState (which would never re-fire for same-document nav)
3. Extract coverage from E2E tests too
- E2E tests already exercise the app against the instrumented server
- Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
- nyc merges both coverage files for higher total coverage
Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace sleep 5 with health-check poll in quick E2E path
The poller's Start() calls GetMaxTransmissionID() to initialize its cursor.
When the test goroutine inserts data between go poller.Start() and the
actual GetMaxTransmissionID() call, the poller's cursor skips past the
test data and never broadcasts it, causing a timeout.
Adding a 100ms sleep after go poller.Start() ensures the poller has
initialized its cursors before the test inserts new data.
SQLite :memory: databases create separate databases per connection.
When the connection pool opens multiple connections (e.g. poller goroutine
vs main test goroutine), tables created on one connection are invisible
to others. Setting MaxOpenConns(1) ensures all queries use the same
in-memory database, fixing TestPollerBroadcastsMultipleObservations.
IngestNewFromDB now broadcasts one message per observation (not per
transmission). IngestNewObservations also broadcasts late arrivals.
Tests verify multi-observer packets produce multiple WS messages.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compared decoder.js against the MeshCore firmware source (Dispatcher.cpp,
Packet.h, Mesh.cpp, AdvertDataHelpers.h) and fixed all mismatches:
1. Field order: transport codes now parsed BEFORE path_length byte,
matching the spec: [header][transport_codes?][path_length][path][payload]
2. ACK payload: was incorrectly decoded as dest(1)+src(1)+ackHash(4).
Firmware shows ACK is just checksum(4) — no dest/src hashes.
3. TRACE payload: was incorrectly decoded as flags(1)+tag(4)+dest(6)+src(1).
Firmware shows tag(4)+authCode(4)+flags(1)+pathData.
4. ADVERT appdata: added missing feature1 (0x20 flag) and feature2
(0x40 flag) parsing — 2-byte fields between location and name.
5. Transport code field naming: renamed nextHop/lastHop to code1/code2
to match spec terminology (transport_code_1/transport_code_2).
6. Fixed incorrect field size labels in packets.js hex breakdown:
dest/src are 1 byte, MAC is 2 bytes (not 6B/6B/4B).
7. Fixed ANON_REQ/PATH comment typos (dest was listed as 6 bytes,
MAC as 4 bytes — both wrong, code was already correct).
All 329 tests pass (66 decoder + 263 spec/golden).
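For reference, the corrected ACK and TRACE layouts can be sketched as below. This is a Go illustration rather than the decoder.js code itself; the function names, the struct, and the little-endian byte order are assumptions of the sketch, while the field sizes come from the list above.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// TracePayload mirrors the layout listed above:
// tag(4) + authCode(4) + flags(1) + pathData.
type TracePayload struct {
	Tag      uint32
	AuthCode uint32
	Flags    uint8
	PathData []byte
}

var errShort = errors.New("payload too short")

// decodeAck reads the 4-byte checksum that makes up the whole ACK payload.
// Little-endian byte order is an assumption of this sketch.
func decodeAck(p []byte) (uint32, error) {
	if len(p) < 4 {
		return 0, errShort
	}
	return binary.LittleEndian.Uint32(p[:4]), nil
}

// decodeTrace splits a TRACE payload into its fixed 9-byte prefix and the
// trailing path data.
func decodeTrace(p []byte) (TracePayload, error) {
	if len(p) < 9 {
		return TracePayload{}, errShort
	}
	return TracePayload{
		Tag:      binary.LittleEndian.Uint32(p[0:4]),
		AuthCode: binary.LittleEndian.Uint32(p[4:8]),
		Flags:    p[8],
		PathData: p[9:],
	}, nil
}

func main() {
	t, err := decodeTrace([]byte{1, 0, 0, 0, 2, 0, 0, 0, 0x80, 0xAA, 0xBB})
	fmt.Printf("%+v %v\n", t, err)
}
```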
The Windows self-hosted runner picks up jobs and fails because bash
scripts run in PowerShell. Node.js tests need Chromium/Playwright
(Linux-only), and build/deploy/publish use Docker (Linux-only).
Changes:
- node-test: runs-on: [self-hosted, Linux]
- build: runs-on: [self-hosted, Linux]
- deploy: runs-on: [self-hosted, Linux]
- publish: runs-on: [self-hosted, Linux]
- go-test: unchanged (ubuntu-latest)
- perf.js: toFixed(1) on all ms/MB values in Go Runtime section
- style.css: white-space: nowrap on .nav-stats to prevent the · separator
from wrapping onto its own line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two cache bugs fixed:
1. Hit rate formula excluded stale hits — reported rate was artificially low
because stale-while-revalidate responses (which ARE cache hits from the
caller's perspective) were not counted. Changed formula from
hits/(hits+misses) to (hits+staleHits)/(hits+staleHits+misses).
2. Bulk-health cache invalidated on every advert packet — in a mesh with
dozens of nodes advertising every few seconds, this caused the expensive
bulk-health query to be recomputed on nearly every request, defeating
the cache entirely. Switched to 30s debounced invalidation via
debouncedInvalidateBulkHealth().
Added regression test for hit rate formula in test-server-routes.js.
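The corrected hit rate formula, sketched in Go for illustration (the actual change is in the Node.js server). The function name and the zero-request guard are assumptions of this sketch, not taken from the code.

```go
package main

import "fmt"

// hitRate counts stale-while-revalidate responses as hits:
// (hits + staleHits) / (hits + staleHits + misses).
// Returning 0 when there have been no requests is an assumption of this sketch.
func hitRate(hits, staleHits, misses uint64) float64 {
	total := hits + staleHits + misses
	if total == 0 {
		return 0
	}
	return float64(hits+staleHits) / float64(total)
}

func main() {
	// 60 fresh hits, 30 stale hits, 10 misses.
	fmt.Printf("%.2f\n", hitRate(60, 30, 10)) // prints 0.90
}
```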
Match the C++ firmware wire format (Packet::writeTo/readFrom):
1. Field order: transport codes are parsed BEFORE path_length byte,
matching firmware's header → transport_codes → path_len → path → payload
2. ACK payload: just 4-byte CRC checksum, not dest+src+ackHash.
Firmware createAck() writes only ack_crc (4 bytes).
3. TRACE payload: tag(4) + authCode(4) + flags(1) + pathData,
matching firmware createTrace() and onRecvPacket() TRACE handler.
4. ADVERT features: parse feat1 (0x20) and feat2 (0x40) optional
2-byte fields between location and name, matching AdvertDataBuilder
and AdvertDataParser in the firmware.
5. Transport code naming: code1/code2 instead of nextHop/lastHop,
matching firmware's transport_codes[0]/transport_codes[1] naming.
Fixes applied to both cmd/ingestor/decoder.go and cmd/server/decoder.go.
Tests updated to match new behavior.
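A minimal sketch of the corrected field order, header → transport_codes → path_len → path → payload. Everything beyond that order is an assumption for illustration: whether transport codes are present is passed in as a flag rather than derived from the header, each code is taken to be 2 bytes little-endian, and path_len is treated as a plain byte count. Names are hypothetical and do not match decoder.go.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Packet holds the fields in wire order. The 2-byte width of each
// transport code is an assumption of this sketch.
type Packet struct {
	Header  byte
	Code1   uint16 // transport_codes[0]
	Code2   uint16 // transport_codes[1]
	Path    []byte
	Payload []byte
}

var errTruncated = errors.New("packet truncated")

// parsePacket reads transport codes BEFORE the path_len byte.
// hasTransport is supplied by the caller (hypothetical simplification).
func parsePacket(buf []byte, hasTransport bool) (Packet, error) {
	var p Packet
	if len(buf) < 1 {
		return p, errTruncated
	}
	p.Header = buf[0]
	off := 1
	if hasTransport {
		if len(buf) < off+4 {
			return p, errTruncated
		}
		p.Code1 = binary.LittleEndian.Uint16(buf[off:])
		p.Code2 = binary.LittleEndian.Uint16(buf[off+2:])
		off += 4
	}
	if len(buf) < off+1 {
		return p, errTruncated
	}
	pathLen := int(buf[off])
	off++
	if len(buf) < off+pathLen {
		return p, errTruncated
	}
	p.Path = buf[off : off+pathLen]
	p.Payload = buf[off+pathLen:]
	return p, nil
}

func main() {
	raw := []byte{0x11, 0x34, 0x12, 0x78, 0x56, 0x02, 0xAA, 0xBB, 0xDE, 0xAD}
	p, err := parsePacket(raw, true)
	fmt.Printf("%+v %v\n", p, err)
}
```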
When go-test or node-test fails, the workflow run is now cancelled
via the GitHub API so the sibling job doesn't sit queued/running.
Also fixed build job to need both go-test AND node-test (was only
waiting on go-test despite the pipeline comment saying both gate it).
The flat 'deploy' concurrency group caused ALL PRs to share one queue,
so pushing to any PR would cancel CI runs on other PRs.
Changed to deploy-${{ github.event.pull_request.number || github.ref }}
so each PR gets its own concurrency group while re-pushes to the same
PR still cancel the previous run.
- /api/stats: 10s server-side cache — was running 5 SQLite COUNT queries
on every call, taking ~1500ms with 28 concurrent WS clients polling every 15s
- GetNodeHashSizeInfo: 15s cache — was doing a full O(n) scan + JSON unmarshal
of all advert packets in memory on every /nodes request, taking ~1200ms
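The caching idea can be sketched as a small time-based cache. This is an illustration only: the type, constructor, and constant names are hypothetical and the real server code may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// statsTTL matches the 10s figure above; the constant name is hypothetical.
const statsTTL = 10 * time.Second

// ttlCache holds one computed value and recomputes it once the TTL expires.
type ttlCache[T any] struct {
	mu      sync.Mutex
	ttl     time.Duration
	value   T
	expires time.Time
}

func newTTLCache[T any](ttl time.Duration) *ttlCache[T] {
	return &ttlCache[T]{ttl: ttl}
}

// Get returns the cached value while it is fresh, otherwise calls compute.
// Holding the lock during compute means concurrent callers wait for a single
// recomputation instead of each running the expensive query.
func (c *ttlCache[T]) Get(compute func() T) T {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expires) {
		return c.value
	}
	c.value = compute()
	c.expires = time.Now().Add(c.ttl)
	return c.value
}

func main() {
	queries := 0
	cache := newTTLCache[int](statsTTL)
	countRows := func() int { queries++; return 1234 } // stand-in for the COUNT queries
	cache.Get(countRows)
	cache.Get(countRows)
	fmt.Println("queries run:", queries) // prints "queries run: 1"
}
```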
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Linux Docker doesn't resolve host.docker.internal by default.
Required when MQTT sources in config.json point to the host machine.
Harmless on Docker Desktop where it already works.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without build: directive, docker compose tries to pull corescope:latest
from Docker Hub instead of building locally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add preflight check for 'docker compose' in manage.sh (catches plugin missing)
- Document named Caddy volumes as cert storage, not user data
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all legacy docker run code paths. manage.sh is now a pure
docker compose wrapper with no dual-mode branching.
Removed:
- COMPOSE_MODE flag and all if/else branches
- get_docker_run_args(), get_data_mount_args(), recreate_container()
- get_required_ports(), get_current_ports(), check_port_match()
- CONTAINER_NAME, DATA_VOLUME, CADDY_VOLUME variables
- All direct docker run/stop/start/rm invocations
All commands now delegate to docker compose:
- start → docker compose up -d prod
- stop → docker compose down / docker compose stop
- restart → docker compose up -d --force-recreate
- update → docker compose build prod + up -d --force-recreate
- reset → docker compose down --rmi local
- backup/restore use bind mount path from .env (PROD_DATA_DIR)
- verify_health, mqtt-test, status all use corescope-prod
Net result: -248 lines, zero dual-mode logic, identical behavior
to running docker compose directly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace all hardcoded ~/meshcore-data paths with a variable
- The variable resolves from .env or defaults to ~/meshcore-data
- Updated get_data_mount_args(), cmd_backup(), cmd_restore(), cmd_reset()
- Enhanced .env.example with detailed comments for each variable
- Both docker compose and manage.sh now read same .env file
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PROBLEM:
manage.sh was using named Docker volumes (meshcore-data) as the default,
which hides the database and theme files inside Docker's internal storage.
Users couldn't find their DB on the filesystem for backups or inspection.
The function get_data_mount_args() had conditional logic that only used
bind mounts IF it detected an existing ~/meshcore-data with a DB file.
For new installs, it fell through to the named volume — silently hiding
all data in /var/lib/docker/volumes/.
FIXES:
1. get_data_mount_args() — Always use bind mount to ~/meshcore-data
- Creates the directory if it doesn't exist
- Removes all conditional logic and the named volume fallback
2. cmd_backup() — Use direct path ~/meshcore-data/meshcore.db
- No longer tries to inspect the named volume
- Consistent with the bind mount approach
3. cmd_restore() — Use direct path for restore operations
- Ensures directory exists before restoring files
- No fallback to docker cp
4. cmd_reset() — Updated message to reflect bind mount location
- Changed from 'docker volume rm' to '~/meshcore-data (not removed)'
5. docker-compose.yml — Added documentation comment
- Clarifies that bind mounts are intentional, not named volumes
- Ensures future changes maintain this pattern
VALIDATION:
- docker-compose.yml already used bind mounts correctly
- Legacy 'docker run' mode now matches compose behavior
- All backup/restore operations reference the same bind mount path
DATABASE LOCATION:
- Always: ~/meshcore-data/meshcore.db
- Never: Hidden in Docker's volume storage
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Requested-by: Kpa-clawbot
- Load() SQL: keep o.timestamp DESC (consistent with IngestNewFromDB) so
pickBestObservation tie-breaking is identical on both load paths
- GetTimestamps: scan from tail instead of head (was breaking on first item
assuming it was the newest, now correctly reads from newest end)
- QueryMultiNodePackets: apply same DESC/ASC tail-read pagination as
QueryPackets (was sorting for ASC and assuming DESC as-is)
- GetNodeHealth recentPackets: read from tail to return 20 newest items
(was reading from head = 20 oldest items)
- Remove stale "Prepend (newest first)" comments, replace with accurate
"oldest-first; new items go to tail" wording
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
s.packets and s.byPayloadType[t] were prepended on every new packet
to maintain newest-first order, copying the entire slice each time.
With 2-3M packets in memory this meant ~24MB of pointer copies per
ingest cycle, causing sustained high CPU and GC pressure.
Fix: store both slices oldest-first (append to tail). Load() SQL
changed to ASC ordering. QueryPackets DESC pagination now reads from
the tail in O(page_size) with no sort; GetChannelMessages switches
from reverse-iteration to forward-iteration.
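A minimal sketch of the tail-read pagination, assuming a slice stored oldest-first. The function name and the offset/limit parameters are hypothetical and not the actual QueryPackets signature.

```go
package main

import "fmt"

// pageNewestFirst returns one page in newest-first order from a slice that is
// stored oldest-first. It only touches the requested window, so the cost is
// O(limit) regardless of how many items are stored.
func pageNewestFirst[T any](items []T, offset, limit int) []T {
	n := len(items)
	if offset < 0 || limit <= 0 || offset >= n {
		return nil
	}
	end := n - offset // exclusive upper bound in storage order
	start := end - limit
	if start < 0 {
		start = 0
	}
	out := make([]T, 0, end-start)
	for i := end - 1; i >= start; i-- {
		out = append(out, items[i])
	}
	return out
}

func main() {
	packets := []int{1, 2, 3, 4, 5} // oldest-first; new items are appended
	fmt.Println(pageNewestFirst(packets, 0, 2)) // prints [5 4]
	fmt.Println(pageNewestFirst(packets, 2, 2)) // prints [3 2]
	fmt.Println(pageNewestFirst(packets, 4, 2)) // prints [1]
}
```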
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: remove all packets_v SQL fallbacks — store handles all queries
Remove DB fallback paths from all route handlers. The in-memory
PacketStore now handles all packet/node/analytics queries. Handlers
return empty results or 404 when no store is available instead of
falling back to direct DB queries.
- Remove else-DB branches from handlePacketDetail, handleNodeHealth,
handleNodeAnalytics, handleBulkHealth, handlePacketTimestamps, etc.
- Remove unused DB methods (GetPacketByHash, GetTransmissionByID,
GetPacketByID, GetObservationsForHash, GetTimestamps, GetNodeHealth,
GetNodeAnalytics, GetBulkHealth, etc.)
- Remove packets_v VIEW creation from schema
- Update tests for new behavior (no-store returns 404/empty, not 500)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address PR #220 review comments
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: KpaBap <kpabap@gmail.com>
- Add pull_request trigger for PRs against master
- Add 'if: github.event_name == push' to build/deploy/publish jobs
- Test jobs (go-test, node-test) now run on both push and PRs
- Build/deploy/publish only run on push to master
This fixes the chicken-and-egg problem where branch protection requires
CI checks but CI doesn't run on PRs. Now PRs get test validation before
merge while keeping production deployments only on master pushes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The glob trick COPY .git-commi[t] only works with BuildKit.
manage.sh uses legacy docker build. Just create a default via RUN.
Commit hash comes through --build-arg ldflags anyway.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents what existing users need to update when the rename
from MeshCore Analyzer to CoreScope lands:
- Git remote URL update
- Docker image/container name changes
- Config branding.siteName (if customized)
- CI/CD references (if applicable)
- Confirms data dirs, MQTT, browser state unchanged
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Server defaults to 6060, ingestor to 6061. Removed shared PPROF_PORT
env var. Bind failure logs warning instead of log.Fatal killing the process.
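A minimal sketch of the non-fatal bind behavior. The function name and the choice to bind on localhost are assumptions of this sketch; ENABLE_PPROF and port 6060 are the server values described in the pprof commits.

```go
package main

import (
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
	"os"
)

// startPprof binds addr and serves pprof in the background.
// On bind failure it logs a warning and returns false instead of exiting.
func startPprof(addr string) bool {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		log.Printf("warning: pprof disabled, cannot bind %s: %v", addr, err)
		return false
	}
	go func() {
		if err := http.Serve(ln, nil); err != nil {
			log.Printf("warning: pprof server stopped: %v", err)
		}
	}()
	return true
}

func main() {
	if os.Getenv("ENABLE_PPROF") == "true" {
		startPprof("localhost:6060")
	}
	// The real binary would continue with its normal startup here.
}
```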
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sensor nodes embed telemetry (battery_mv, temperature_c) in their advert
appdata after the null-terminated name. This commit adds decoding and
storage for both the Go ingestor and Node.js backend.
Changes:
- decoder.go/decoder.js: Parse telemetry bytes from advert appdata
(battery_mv as uint16 LE millivolts, temperature_c as int16 LE /100)
- db.go/db.js: Add battery_mv INTEGER and temperature_c REAL columns
to nodes and inactive_nodes tables, with migration for existing DBs
- main.go/server.js: Update node telemetry on advert processing
- server db.go: Include battery_mv/temperature_c in node API responses
- Tests: Decoder telemetry tests (positive, negative temp, no telemetry),
DB migration test, node telemetry update test, server API shape tests
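The telemetry encoding can be sketched as below. The field types come from the list above; the exact layout (battery_mv first, then temperature_c, immediately after the name terminator) and all names are assumptions of this sketch, not the decoder.go code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Telemetry holds the two decoded sensor values.
type Telemetry struct {
	BatteryMv    int     // uint16 LE, millivolts
	TemperatureC float64 // int16 LE, hundredths of a degree
}

// parseNameAndTelemetry splits "name\0" + optional 4 telemetry bytes.
// The battery-then-temperature order is an assumption of this sketch.
func parseNameAndTelemetry(b []byte) (string, *Telemetry) {
	i := bytes.IndexByte(b, 0)
	if i < 0 {
		return string(b), nil // no terminator, so nothing can follow the name
	}
	name := string(b[:i])
	rest := b[i+1:]
	if len(rest) < 4 {
		return name, nil // no telemetry present
	}
	mv := binary.LittleEndian.Uint16(rest[0:2])
	raw := int16(binary.LittleEndian.Uint16(rest[2:4])) // reinterpret as signed
	return name, &Telemetry{BatteryMv: int(mv), TemperatureC: float64(raw) / 100}
}

func main() {
	// "S1", NUL, 3700 mV, -5.25 °C
	data := []byte{'S', '1', 0, 0x74, 0x0E, 0xF3, 0xFD}
	name, tel := parseNameAndTelemetry(data)
	fmt.Println(name, tel.BatteryMv, tel.TemperatureC) // prints S1 3700 -5.25
}
```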
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fresh Go installs failed with 'no such table: packets_v' because the
ingestor created tables but never the VIEW that the Go server queries.
Add DROP VIEW IF EXISTS + CREATE VIEW packets_v to applySchema(), using
the v3 definition (observer_idx → observers.rowid JOIN). The view is
rebuilt on every startup to stay current with any definition changes.
Add tests: verify view exists after OpenStore, and verify it returns
correct observer_id/observer_name via the LEFT JOIN.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add net/http/pprof support to both Go server (default port 6060) and
ingestor (default port 6061). Profiling is off by default — only
starts the pprof HTTP listener when ENABLE_PPROF=true.
PPROF_PORT env var overrides the default port for each binary.
Enable on staging-go in docker-compose with exposed ports 6060/6061.
Not enabled on prod.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SQLite stores these as REAL on some instances. Go *int scan silently
fails, dropping the entire observer row (404 on detail, missing from list).
Reported for YC-Base-Repeater and YC-Work-Repeater.
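One way to tolerate both storage classes is a custom sql.Scanner. The sketch below illustrates that approach and is not necessarily what this commit does; the flexInt type is hypothetical.

```go
package main

import (
	"database/sql"
	"fmt"
)

// flexInt is a hypothetical nullable integer that accepts a column value
// stored as either INTEGER or REAL.
type flexInt struct {
	Int   int64
	Valid bool
}

// Scan implements sql.Scanner.
func (f *flexInt) Scan(src any) error {
	switch v := src.(type) {
	case nil:
		*f = flexInt{}
	case int64:
		*f = flexInt{Int: v, Valid: true}
	case float64:
		*f = flexInt{Int: int64(v), Valid: true} // truncates any fractional part
	default:
		return fmt.Errorf("flexInt: unsupported type %T", src)
	}
	return nil
}

var _ sql.Scanner = (*flexInt)(nil) // compile-time interface check

func main() {
	var n flexInt
	err := n.Scan(float64(87)) // value as a driver would hand over a REAL column
	fmt.Println(n.Int, n.Valid, err)
}
```

Passing `&n` to `rows.Scan` in place of a `*int` would then keep the row when the value arrives as a float.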
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Observability:
- Add DBStats struct with atomic counters for tx_inserted, tx_dupes,
obs_inserted, node_upserts, observer_upserts, write_errors
- Log SQLite config on startup (busy_timeout, max_open_conns, journal)
- Periodic stats logging every 5 minutes + final stats on shutdown
- Instrument all write paths with counter increments
Tests:
- TestConcurrentWrites: 20 goroutines × 50 writes (1000 total) with
interleaved InsertTransmission + UpsertNode + UpsertObserver calls.
Verifies zero errors and data integrity under concurrent load.
- TestDBStats: verifies counter accuracy for inserts, duplicates,
upserts, and that LogStats does not panic
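A minimal sketch of a counter struct along these lines, using sync/atomic (Go 1.19+ for atomic.Int64). The Go field names are guesses derived from the counter names above, and the String method stands in for whatever LogStats prints.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// DBStats holds write-path counters that are safe to update from many goroutines.
type DBStats struct {
	TxInserted      atomic.Int64
	TxDupes         atomic.Int64
	ObsInserted     atomic.Int64
	NodeUpserts     atomic.Int64
	ObserverUpserts atomic.Int64
	WriteErrors     atomic.Int64
}

// String formats a point-in-time snapshot of all counters.
func (s *DBStats) String() string {
	return fmt.Sprintf(
		"tx_inserted=%d tx_dupes=%d obs_inserted=%d node_upserts=%d observer_upserts=%d write_errors=%d",
		s.TxInserted.Load(), s.TxDupes.Load(), s.ObsInserted.Load(),
		s.NodeUpserts.Load(), s.ObserverUpserts.Load(), s.WriteErrors.Load())
}

func main() {
	var stats DBStats
	var wg sync.WaitGroup
	// Same shape as TestConcurrentWrites: 20 goroutines x 50 writes.
	for g := 0; g < 20; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				stats.TxInserted.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(stats.String())
}
```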
Three changes to eliminate concurrent write collisions:
1. Add _busy_timeout=5000 to ingestor SQLite DSN (matches server)
- SQLite will wait up to 5s for the write lock instead of
immediately returning SQLITE_BUSY
2. Set SetMaxOpenConns(1) on ingestor DB connection pool
- Serializes all DB access at the Go sql.DB level
- Prevents multiple goroutines from opening overlapping writes
3. Change SetOrderMatters(false) to SetOrderMatters(true)
- MQTT handlers now run sequentially per client
- Eliminates concurrent handler execution that caused
overlapping multi-statement write flows
Root cause: concurrent MQTT handlers (SetOrderMatters=false) each
performed multiple separate writes (transmission lookup/insert,
observation insert, node upsert, observer upsert) without transactions
or connection limits. SQLite only permits one writer at a time, so
under bursty MQTT traffic the ingestor was competing with itself.
#210: Add role="img" aria-label to 9 Chart.js canvases in node-analytics.js
and observer-detail.js with descriptive labels.
#211: Add scope="col" to all <th> elements across analytics.js, audio-lab.js,
compare.js, node-analytics.js, nodes.js, observer-detail.js, observers.js,
and packets.js (40+ headers).
#212: Add aria-label to packet filter input and time window select in
packets.js. Add for/id associations to all customize.js inputs: branding,
theme colors, node/type colors, heatmap sliders, onboarding fields, and
export controls.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#203: Live page node detail panel becomes a bottom-sheet on mobile
(width:100%, bottom:0, max-height:60vh, rounded top corners).
#204: Perf page reduces padding to 12px, perf-cards stack in 2-col
grid, tables get smaller font/padding on mobile.
#205: Nodes table hides Public Key column on mobile via .col-pubkey
class (same pattern as packets page .col-region/.col-rpt).
Cache busters bumped in index.html.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
description: "A specialized agent for reviewing pull requests in the meshcore-analyzer repository. It focuses on SOLID, DRY, testing, Go best practices, frontend testability, observability, and performance to prevent regressions and maintain high code quality."
model: "gpt-5.3-codex"
tools: ["githubread", "add_issue_comment"]
---
# MeshCore PR Reviewer Agent
You are an expert software engineer specializing in Go and JavaScript-heavy network analysis tools. Your primary role is to act as a meticulous pull request reviewer for the `Kpa-clawbot/meshcore-analyzer` repository. You are deeply familiar with its architecture, as outlined in `AGENTS.md`, and you enforce its rules rigorously.
Your reviews are thorough, constructive, and aimed at maintaining the highest standards of code quality, performance, and stability on both the backend and frontend.
## Core Principles
1. **Context is King**: Before any review, consult the `AGENTS.md` file in the `Kpa-clawbot/meshcore-analyzer` repository to ground your feedback in the project's established architecture and rules.
2. **Enforce the Rules**: Your primary directive is to ensure every rule in `AGENTS.md` is followed. Call out any deviation.
3. **Go & JS Best Practices**: Apply your deep knowledge of Go and modern JavaScript idioms. Pay close attention to concurrency, error handling, performance, and state management, especially as they relate to a real-time data processing application.
4. **Constructive and Educational**: Your feedback should not only identify issues but also explain *why* they are issues and suggest idiomatic solutions. Your goal is to mentor and elevate the codebase and its contributors.
5. **Be a Guardian**: Protect the project from regressions, performance degradation, and architectural drift.
## Review Focus Areas
You will pay special attention to the following areas during your review:
- **SOLID & DRY**: Does the change adhere to SOLID principles? Is there duplicated logic that could be refactored? Does it respect the existing separation of concerns?
- **Project Architecture**: Does the PR respect the single Node.js server + static frontend architecture? Are changes in the right place?
### 2. Testing and Validation
- **No commit without tests**: Is the backend logic change covered by unit tests? Is `test-packet-filter.js` or `test-aging.js` updated if necessary?
- **Browser Validation**: Has the contributor confirmed the change works in a browser? Is there a screenshot for visual changes?
- **Cache Busters**: If any `public/` assets (`.js`, `.css`) were modified, has the cache buster in `public/index.html` been bumped in the *same commit*? This is critical.
### 3. Go-Specific Concerns
- **Concurrency**: Are goroutines used safely? Are there potential race conditions? Is synchronization used correctly?
- **Error Handling**: Is error handling explicit and clear? Are errors wrapped with context where appropriate?
- **Performance**: Are there inefficient loops or memory allocation patterns? Scrutinize any new data processing logic.
- **Go Idioms**: Does the code follow standard Go idioms and formatting (`gofmt`)?
### 4. Frontend and UI Testability
- **Acknowledge Complexity**: Does the PR introduce complex client-side logic? Recognize that browser-based functionality is difficult to unit test.
- **Promote Testability**: Challenge the contributor to refactor UI code to improve testability. Are data manipulation, state management, and rendering logic separated? Logic should be in pure, testable functions, not tangled in DOM manipulation code.
- **UI Logic Purity**: Scrutinize client-side JavaScript. Are there large, monolithic functions? Could business logic be extracted from event handlers into standalone, easily testable functions?
- **State Management**: How is client-side state managed? Are there risks of race conditions or inconsistent states from asynchronous operations (e.g., API calls)?
### 5. Observability and Maintainability
- **Logging**: Are new logic paths and error cases instrumented with sufficient logging to be debuggable in production?
- **Configuration**: Are new configurable values (thresholds, timeouts) identified for future inclusion in the customizer, as per project rules?
- **Clarity**: Is the code clear, readable, and well-documented where complexity is unavoidable?
### 6. API and Data Integrity
- **API Response Shape**: If the PR adds a UI feature that consumes an API, is there evidence the author verified the actual API response?
- **Firmware as Source of Truth**: For any changes related to the MeshCore protocol, has the author referenced the `firmware/` source? Challenge any "magic numbers" or assumptions about packet structure.
## Review Process
1. **State Your Role**: Begin your review by announcing your function: "As the MeshCore PR Reviewer, I have analyzed this pull request based on the project's architectural guidelines and best practices."
2. **Provide a Summary**: Give a high-level summary of your findings (e.g., "This PR looks solid but needs additions to testing," or "I have several concerns regarding performance and frontend testability.").
3. **Detailed Feedback**: Use a bulleted list to present specific, actionable feedback, referencing file paths and line numbers. For each point, cite the relevant principle or project rule (e.g., "Missing Test Coverage (Rule #1)", "UI Logic Purity (Focus Area #4)").
4. **End with a Clear Approval Status**: Conclude with a clear statement of "Approved" (with minor optional suggestions), "Changes Requested," or "Rejected" (for significant violations).
MeshCore Analyzer has 14 test files, 4,290 lines of test code. Backend coverage 85%+, frontend 42%+. Tests use Node.js native runner, Playwright for E2E, c8/nyc for coverage, supertest for API routes. vm.createContext pattern used for testing frontend helpers in Node.js.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- E2E run 2026-03-26: 12/16 passed, 4 failed. Results:
- ✅ Home page loads
- ✅ Nodes page loads with data
- ❌ Map page loads with markers — No markers found (empty DB, no geo data)
- ✅ Packets page loads with filter
- ✅ Node detail loads
- ✅ Theme customizer opens
- ✅ Dark mode toggle
- ✅ Analytics page loads
- ✅ Map heat checkbox persists in localStorage
- ✅ Map heat checkbox is clickable
- ✅ Live heat disabled when ghosts mode active
- ✅ Live heat checkbox persists in localStorage
- ✅ Heatmap opacity persists in localStorage
- ❌ Live heatmap opacity persists — browser closed before test ran (bug: browser.close() on line 274 is before tests 14-16)
- ❌ Customizer has separate map/live opacity sliders — same browser-closed bug
- ❌ Map re-renders on resize — same browser-closed bug
- BUG FOUND: test-e2e-playwright.js line 274 calls `await browser.close()` before tests 14, 15, 16 execute. Those 3 tests will always fail. The `browser.close()` must be moved after all tests.
- The "Map page loads with markers" failure is expected with an empty local DB — no nodes with coordinates exist to render markers.
- FIX APPLIED 2026-03-26: Moved `browser.close()` from between test 13 and test 14 to after test 16 (just before the summary). Tests 14 ("Live heatmap opacity persists") and 15 ("Customizer has separate map/live opacity sliders") now pass. Test 16 ("Map re-renders on resize") now runs but fails due to empty DB (no markers to count) — same root cause as test 3. Result: 14/16 pass, 2 fail (both map-marker tests, expected with empty DB).
- TESTS ADDED 2026-03-26: Issue #127 (copyToClipboard) — 8 unit tests in test-frontend-helpers.js using vm.createContext + DOM/clipboard mocks. Tests cover: fallback path (execCommand success/fail/throw), clipboard API path, null/undefined input, textarea lifecycle, no-callback usage. Pattern: `makeClipboardSandbox(opts)` helper builds sandbox with configurable navigator.clipboard and document.execCommand mocks. Total frontend helper tests: 47→55.
- TESTS ADDED 2026-03-26: Issue #125 (packet detail dismiss) — 1 E2E test in test-e2e-playwright.js. Tests: click row → pane opens (empty class removed) → click ✕ → pane closes (empty class restored). Skips gracefully when DB has no packets. Inserted before analytics group, before browser.close().
- E2E SPEED OPTIMIZATION 2026-03-26: Rewrote test-e2e-playwright.js for performance per Kobayashi's audit. Changes:
  - Replaced ALL 19 `waitUntil: 'networkidle'` → `'domcontentloaded'` + targeted `waitForSelector`/`waitForFunction`. networkidle stalls ~500ms+ per navigation due to persistent WebSocket + Leaflet tiles.
  - Eliminated 11 of 12 `waitForTimeout` sleeps → event-driven waits (waitForSelector, waitForFunction). Only 1 remains: 500ms for packet filter debounce (was 1500ms).
  - Reordered tests into page groups to eliminate 7 redundant navigations (page.goto 14→7): Home(1,6,7), Nodes(2,5), Map(3,9,10,13,16), Packets(4), Analytics(8), Live(11,12), NoNav(14,15).
  - Reduced default timeout from 15s to 10s.
  - All 17 test names and assertions preserved unchanged.
  - Verified: 17/17 tests pass against local server with generated test data.
- TOTAL: ~757s (~12.6 min locally). CI reports ~13 min (matches).
- ROOT CAUSE: collect-frontend-coverage.js is a 978-line script that launches a SECOND Playwright browser and exhaustively clicks every UI element on every page to maximize code coverage. It contains:
  - 169 explicit `waitForTimeout()` calls totaling 104.1s (1.74 min) of hard sleep
- REGRESSION TESTS ADDED 2026-03-27: Memory optimization (observation deduplication). 8 new tests in test-packet-store.js under "=== Observation deduplication (transmission_id refs) ===" section. Tests verify: (1) observations don't duplicate raw_hex/decoded_json, (2) transmission fields accessible via store.byTxId.get(obs.transmission_id), (3) query() and all() still return transmission fields for backward compat, (4) multiple observations share one transmission_id, (5) getSiblings works after dedup, (6) queryGrouped returns transmission fields, (7) memory estimate reflects dedup savings. 4 tests fail pre-fix (expected — Hicks hasn't applied changes yet), 4 pass (backward compat). Pattern: use hasOwnProperty() to distinguish own vs inherited/absent fields.
- REVIEW 2026-03-27: Hicks RAM fix (observation dedup). REJECTED. Tests pass (42 packet-store + 204 route), but 5 server.js consumers access `.hash`, `.raw_hex`, `.decoded_json`, `.payload_type` on lean observations from `byObserver.get()` or `tx.observations` without enrichment. Broken endpoints: (1) `/api/nodes/bulk-health` line 1141 `o.hash` undefined, (2) `/api/nodes/network-status` line 1220 `o.hash` undefined, (3) `/api/analytics/signal` lines 1298+1306 `p.hash`/`p.raw_hex` undefined, (4) `/api/observers/:id/analytics` lines 2320+2329+2361 `p.payload_type`/`p.decoded_json` undefined + lean objects sent to client as recentPackets, (5) `/api/analytics/subpaths` line 2711 `o.hash` undefined. All are regional filtering or analytics code paths that use `byObserver` directly. Fix: either enrich at these call sites or store `hash` on observations (it's small). The enrichment pattern works for `getById()`, `getSiblings()`, and `/api/packets/:id` but was not applied to the 5 other consumers. Route tests pass because they don't assert on these specific field values in analytics responses.
- BATCH REVIEW 2026-03-27: Reviewed 6 issue fixes pushed without sign-off. Full suite: 971 tests, 0 failures across 11 test files. Cache busters uniform (v=1774625000). Verdicts:
  - #133 (phantom nodes): ✅ APPROVED. 12 assertions on removePhantomNodes, real db.js code, edge cases (idempotency, real node preserved, stats filtering).
  - #123 (channel hash): ⚠️ APPROVED WITH NOTES. 6 new decoder tests cover channelHashHex (zero-padding) and decryptionStatus (no_key ×3, decryption_failed). Missing: `decrypted` status untested (needs valid crypto key), frontend rendering of "Ch 0xXX (no key)" untested.
  - #130 (disappearing nodes): ✅ APPROVED. 8 pruneStaleNodes tests cover dim/restore/remove for API vs WS nodes. Real live.js via vm.createContext.
  - #131 (auto-updating nodes): ⚠️ APPROVED WITH NOTES. 8 solid isAdvertMessage tests (real code). BUT 5 WS handler tests are source-string-match checks (`src.includes('loadNodes(true)')`); these verify code exists but not that it works at runtime. No runtime test for debounce batching behavior.
  - #129 (observer comparison): ✅ APPROVED. 11 comprehensive tests for comparePacketSets: all edge cases, performance (10K hashes <500ms), mathematical invariant. Real compare.js via vm.createContext.
- NOTES FOR IMPROVEMENT: (1) #131 debounce behavior should get a runtime test via vm.createContext, not string checks. (2) #123 could benefit from a `decrypted` status test if crypto mocking is feasible. Neither is blocking.
- TEST GAP FIX 2026-03-27: Closed both noted gaps from batch review:
  - #123 (channel hash decryption `decrypted` status): 3 new tests in test-decoder.js. Used require.cache mocking to swap ChannelCrypto module with mock that returns `{success:true, data:{...}}`. Tests cover: (1) decrypted status with sender+message (text formatted as "Sender: message"), (2) decrypted without sender (text is just message), (3) multiple keys tried, first match wins (verifies iteration order + call count). All verify channelHashHex, type='CHAN', channel name, sender, timestamp, flags. require.cache is restored in finally block.
  - #131 (WS handler runtime tests): Rewrote 5 `src.includes()` string-match tests to use vm.createContext with runtime execution. Created `makeNodesWsSandbox()` helper that provides controllable setTimeout (timer queue), mock DOM, tracked api/invalidateApiCache calls, and real `debouncedOnWS` logic. Tests run actual nodes.js init() and verify: (1) ADVERT triggers refresh with 5s debounce, (2) non-ADVERT doesn't trigger refresh, (3) debounce collapses 3 ADVERTs into 1 API call, (4) _allNodes cache reset forces re-fetch, (5) scroll/selection preserved (panel innerHTML + scrollTop untouched by WS handler). Total: 87 frontend helper tests (same count: 5 replaced, not added), 61 decoder tests (+3).
- Technique learned: require.cache mocking is effective for testing code paths that depend on external modules (like ChannelCrypto). Store original, replace exports, restore in finally. Controllable setTimeout (capturing callbacks in array, firing manually) enables testing debounce logic without real timers.
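
The controllable-timer technique from the last note can be sketched as follows. This is a minimal, self-contained illustration and not the project's `makeNodesWsSandbox()` helper: `makeFakeTimers` and `debounce` are hypothetical names, and the 5000 ms delay only mirrors the 5s debounce mentioned above.

```javascript
// Hypothetical helper: a timer queue that captures callbacks instead of scheduling them.
function makeFakeTimers() {
  const queue = [];
  let nextId = 1;
  return {
    setTimeout(fn, ms) {
      const id = nextId++;
      queue.push({ id, fn, ms });
      return id;
    },
    clearTimeout(id) {
      const i = queue.findIndex((t) => t.id === id);
      if (i !== -1) queue.splice(i, 1);
    },
    // Fire every pending callback once, as if all delays had elapsed.
    flush() {
      const pending = queue.splice(0, queue.length);
      for (const t of pending) t.fn();
    },
    pending() {
      return queue.length;
    },
  };
}

// Hypothetical trailing-edge debounce that receives its timer functions as a parameter,
// so a test can inject the fake queue above.
function debounce(fn, ms, timers) {
  let handle = null;
  return function (...args) {
    if (handle !== null) timers.clearTimeout(handle);
    handle = timers.setTimeout(() => {
      handle = null;
      fn(...args);
    }, ms);
  };
}

const timers = makeFakeTimers();
let apiCalls = 0;
const onAdvert = debounce(() => { apiCalls += 1; }, 5000, timers);

// Three ADVERT-like events arrive back to back.
onAdvert();
onAdvert();
onAdvert();

const callsBeforeFlush = apiCalls;           // nothing has fired yet
const pendingBeforeFlush = timers.pending(); // only the last timer survives
timers.flush();
const callsAfterFlush = apiCalls;            // the three events collapsed into one call

console.log({ callsBeforeFlush, pendingBeforeFlush, callsAfterFlush });
```

Because the test decides when the queue is flushed, the debounce assertion runs instantly and deterministically instead of waiting on real timers.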
# Bishop — History
## Project Context
CoreScope has 14 test files, 4,290 lines of test code. Backend coverage 85%+, frontend 42%+. Tests use Node.js native runner, Playwright for E2E, c8/nyc for coverage, supertest for API routes. vm.createContext pattern used for testing frontend helpers in Node.js.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- E2E run 2026-03-26: 12/16 passed, 4 failed. Results:
  - ✅ Home page loads
  - ✅ Nodes page loads with data
  - ❌ Map page loads with markers: No markers found (empty DB, no geo data)
  - ✅ Packets page loads with filter
  - ✅ Node detail loads
  - ✅ Theme customizer opens
  - ✅ Dark mode toggle
  - ✅ Analytics page loads
  - ✅ Map heat checkbox persists in localStorage
  - ✅ Map heat checkbox is clickable
  - ✅ Live heat disabled when ghosts mode active
  - ✅ Live heat checkbox persists in localStorage
  - ✅ Heatmap opacity persists in localStorage
  - ❌ Live heatmap opacity persists: browser closed before test ran (bug: browser.close() on line 274 is before tests 14-16)
  - ❌ Customizer has separate map/live opacity sliders: same browser-closed bug
  - ❌ Map re-renders on resize: same browser-closed bug
- BUG FOUND: test-e2e-playwright.js line 274 calls `await browser.close()` before tests 14, 15, 16 execute. Those 3 tests will always fail. The `browser.close()` must be moved after all tests.
- The "Map page loads with markers" failure is expected with an empty local DB — no nodes with coordinates exist to render markers.
- FIX APPLIED 2026-03-26: Moved `browser.close()` from between test 13 and test 14 to after test 16 (just before the summary). Tests 14 ("Live heatmap opacity persists") and 15 ("Customizer has separate map/live opacity sliders") now pass. Test 16 ("Map re-renders on resize") now runs but fails due to empty DB (no markers to count) — same root cause as test 3. Result: 14/16 pass, 2 fail (both map-marker tests, expected with empty DB).
- TESTS ADDED 2026-03-26: Issue #127 (copyToClipboard) — 8 unit tests in test-frontend-helpers.js using vm.createContext + DOM/clipboard mocks. Tests cover: fallback path (execCommand success/fail/throw), clipboard API path, null/undefined input, textarea lifecycle, no-callback usage. Pattern: `makeClipboardSandbox(opts)` helper builds sandbox with configurable navigator.clipboard and document.execCommand mocks. Total frontend helper tests: 47→55.
- TESTS ADDED 2026-03-26: Issue #125 (packet detail dismiss) — 1 E2E test in test-e2e-playwright.js. Tests: click row → pane opens (empty class removed) → click ✕ → pane closes (empty class restored). Skips gracefully when DB has no packets. Inserted before analytics group, before browser.close().
- E2E SPEED OPTIMIZATION 2026-03-26: Rewrote test-e2e-playwright.js for performance per Kobayashi's audit. Changes:
  - Replaced ALL 19 `waitUntil: 'networkidle'` → `'domcontentloaded'` + targeted `waitForSelector`/`waitForFunction`. networkidle stalls ~500ms+ per navigation due to persistent WebSocket + Leaflet tiles.
  - Eliminated 11 of 12 `waitForTimeout` sleeps → event-driven waits (waitForSelector, waitForFunction). Only 1 remains: 500ms for packet filter debounce (was 1500ms).
  - Reordered tests into page groups to eliminate 7 redundant navigations (page.goto 14→7): Home(1,6,7), Nodes(2,5), Map(3,9,10,13,16), Packets(4), Analytics(8), Live(11,12), NoNav(14,15).
  - Reduced default timeout from 15s to 10s.
  - All 17 test names and assertions preserved unchanged.
  - Verified: 17/17 tests pass against local server with generated test data.
- TOTAL: ~757s (~12.6 min locally). CI reports ~13 min (matches).
- ROOT CAUSE: collect-frontend-coverage.js is a 978-line script that launches a SECOND Playwright browser and exhaustively clicks every UI element on every page to maximize code coverage. It contains:
  - 169 explicit `waitForTimeout()` calls totaling 104.1s (1.74 min) of hard sleep
- REGRESSION TESTS ADDED 2026-03-27: Memory optimization (observation deduplication). 8 new tests in test-packet-store.js under "=== Observation deduplication (transmission_id refs) ===" section. Tests verify: (1) observations don't duplicate raw_hex/decoded_json, (2) transmission fields accessible via store.byTxId.get(obs.transmission_id), (3) query() and all() still return transmission fields for backward compat, (4) multiple observations share one transmission_id, (5) getSiblings works after dedup, (6) queryGrouped returns transmission fields, (7) memory estimate reflects dedup savings. 4 tests fail pre-fix (expected — Hicks hasn't applied changes yet), 4 pass (backward compat). Pattern: use hasOwnProperty() to distinguish own vs inherited/absent fields.
- REVIEW 2026-03-27: Hicks RAM fix (observation dedup). REJECTED. Tests pass (42 packet-store + 204 route), but 5 server.js consumers access `.hash`, `.raw_hex`, `.decoded_json`, `.payload_type` on lean observations from `byObserver.get()` or `tx.observations` without enrichment. Broken endpoints: (1) `/api/nodes/bulk-health` line 1141 `o.hash` undefined, (2) `/api/nodes/network-status` line 1220 `o.hash` undefined, (3) `/api/analytics/signal` lines 1298+1306 `p.hash`/`p.raw_hex` undefined, (4) `/api/observers/:id/analytics` lines 2320+2329+2361 `p.payload_type`/`p.decoded_json` undefined + lean objects sent to client as recentPackets, (5) `/api/analytics/subpaths` line 2711 `o.hash` undefined. All are regional filtering or analytics code paths that use `byObserver` directly. Fix: either enrich at these call sites or store `hash` on observations (it's small). The enrichment pattern works for `getById()`, `getSiblings()`, and `/api/packets/:id` but was not applied to the 5 other consumers. Route tests pass because they don't assert on these specific field values in analytics responses.
- BATCH REVIEW 2026-03-27: Reviewed 6 issue fixes pushed without sign-off. Full suite: 971 tests, 0 failures across 11 test files. Cache busters uniform (v=1774625000). Verdicts:
  - #133 (phantom nodes): ✅ APPROVED. 12 assertions on removePhantomNodes, real db.js code, edge cases (idempotency, real node preserved, stats filtering).
  - #123 (channel hash): ⚠️ APPROVED WITH NOTES. 6 new decoder tests cover channelHashHex (zero-padding) and decryptionStatus (no_key ×3, decryption_failed). Missing: `decrypted` status untested (needs valid crypto key), frontend rendering of "Ch 0xXX (no key)" untested.
  - #130 (disappearing nodes): ✅ APPROVED. 8 pruneStaleNodes tests cover dim/restore/remove for API vs WS nodes. Real live.js via vm.createContext.
  - #131 (auto-updating nodes): ⚠️ APPROVED WITH NOTES. 8 solid isAdvertMessage tests (real code). BUT 5 WS handler tests are source-string-match checks (`src.includes('loadNodes(true)')`); these verify code exists but not that it works at runtime. No runtime test for debounce batching behavior.
  - #129 (observer comparison): ✅ APPROVED. 11 comprehensive tests for comparePacketSets: all edge cases, performance (10K hashes <500ms), mathematical invariant. Real compare.js via vm.createContext.
- NOTES FOR IMPROVEMENT: (1) #131 debounce behavior should get a runtime test via vm.createContext, not string checks. (2) #123 could benefit from a `decrypted` status test if crypto mocking is feasible. Neither is blocking.
- TEST GAP FIX 2026-03-27: Closed both noted gaps from batch review:
  - #123 (channel hash decryption `decrypted` status): 3 new tests in test-decoder.js. Used require.cache mocking to swap ChannelCrypto module with mock that returns `{success:true, data:{...}}`. Tests cover: (1) decrypted status with sender+message (text formatted as "Sender: message"), (2) decrypted without sender (text is just message), (3) multiple keys tried, first match wins (verifies iteration order + call count). All verify channelHashHex, type='CHAN', channel name, sender, timestamp, flags. require.cache is restored in finally block.
  - #131 (WS handler runtime tests): Rewrote 5 `src.includes()` string-match tests to use vm.createContext with runtime execution. Created `makeNodesWsSandbox()` helper that provides controllable setTimeout (timer queue), mock DOM, tracked api/invalidateApiCache calls, and real `debouncedOnWS` logic. Tests run actual nodes.js init() and verify: (1) ADVERT triggers refresh with 5s debounce, (2) non-ADVERT doesn't trigger refresh, (3) debounce collapses 3 ADVERTs into 1 API call, (4) _allNodes cache reset forces re-fetch, (5) scroll/selection preserved (panel innerHTML + scrollTop untouched by WS handler). Total: 87 frontend helper tests (same count: 5 replaced, not added), 61 decoder tests (+3).
- Technique learned: require.cache mocking is effective for testing code paths that depend on external modules (like ChannelCrypto). Store original, replace exports, restore in finally. Controllable setTimeout (capturing callbacks in array, firing manually) enables testing debounce logic without real timers.
- **Massive session 2026-03-27 (FULL DAY):** Reviewed and approved all 6 fixes, closed 2 test gaps, validated E2E:
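
The `require.cache` mocking technique recorded in these notes (store the original, replace the exports, restore in `finally`) might look roughly like this. It is a sketch under assumptions: the temp-file module stands in for an external dependency such as ChannelCrypto, and `decodeStatus` is a hypothetical function, not the project's decoder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Stand-in dependency, written to a temp dir so the sketch is self-contained.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-mock-'));
const cryptoPath = path.join(dir, 'channel-crypto.js');
fs.writeFileSync(
  cryptoPath,
  'module.exports = { decrypt: () => ({ success: false }) };\n'
);

// Hypothetical code under test: it looks the dependency up at call time,
// so it sees whatever exports are currently in require.cache.
function decodeStatus(payload) {
  const channelCrypto = require(cryptoPath);
  return channelCrypto.decrypt(payload).success ? 'decrypted' : 'decryption_failed';
}

require(cryptoPath); // load once so a require.cache entry exists
const key = require.resolve(cryptoPath);
const original = require.cache[key].exports;

let mockedStatus;
try {
  // Swap the cached exports for a mock that always succeeds.
  require.cache[key].exports = {
    decrypt: () => ({ success: true, data: { text: 'hi' } }),
  };
  mockedStatus = decodeStatus('00');
} finally {
  // Always restore, even if the assertion path throws.
  require.cache[key].exports = original;
}
const restoredStatus = decodeStatus('00');

console.log(mockedStatus, restoredStatus); // decrypted decryption_failed

fs.rmSync(dir, { recursive: true, force: true }); // clean up the temp module
```

Restoring in `finally` matters because `require.cache` is process-wide: a leaked mock would silently affect every later test in the same run.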
MeshCore Analyzer is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend. Custom decoder.js fixes path_length bug from upstream library. In-memory packet store provides O(1) lookups for 30K+ packets. TTL response cache achieves 7,000× speedup on bulk health endpoint.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- Split the monolithic "Frontend coverage (instrumented Playwright)" CI step into 5 discrete steps: Instrument frontend JS, Start test server (with health-check poll replacing sleep 5), Run Playwright E2E tests, Extract coverage + generate report, Stop test server. Cleanup/report steps use `if: always()` so server shutdown happens even on test failure. Server PID shared across steps via .server.pid file. "Frontend E2E only" fast-path left untouched.
- Fixed memory explosion in packet-store.js: observations no longer duplicate transmission fields (hash, raw_hex, decoded_json, payload_type, route_type). Instead, observations store only `transmission_id` as a reference. Added `_enrichObs()` to hydrate observations at API boundaries (getById, getSiblings, enrichObservations). Replaced `.all()` with `.iterate()` for streaming load. Updated `_transmissionsForObserver()` to use transmission_id instead of hash. For a 185MB DB with 50K transmissions × 23 observations avg, this eliminates ~1.17M copies of hex dumps and JSON — projected ~2GB RAM savings.
- Built standalone Go MQTT ingestor (`cmd/ingestor/`). Ported decoder.js → Go (header parsing, path extraction, all payload types, advert decoding with flags/lat/lon/name). Ported db.js v3 schema (transmissions + observations + nodes + observers). Ported computeContentHash (SHA-256 based, path-independent). Uses modernc.org/sqlite (pure Go, no CGO) and paho.mqtt.golang. 25 tests passing (decoder golden fixtures from production data + DB schema compatibility). Supports same config.json format as Node.js server. Handles Format 1 (raw packet) messages; companion bridge format deferred. System Go was 1.17 — installed Go 1.22.5 to support modern dependencies.
- Built standalone Go web server (`cmd/server/`) — READ side of the Go rewrite. 35+ REST API endpoints ported from server.js. All queries go directly to SQLite (no in-memory packet store). WebSocket broadcast via SQLite polling. Static file server with SPA fallback. Uses gorilla/mux for routing, gorilla/websocket for WS, modernc.org/sqlite for DB. 42 tests passing (20 DB query tests, 20+ route integration tests, 2 WebSocket tests). `go vet` clean. Binary compiles to single executable. Analytics endpoints that required Node.js in-memory store (topology, distance, hash-sizes, subpaths) return structural stubs — core data (RF stats, channels, node health, etc.) fully functional via SQL. System Go 1.17 → installed Go 1.22 for build. Each cmd/* module has its own go.mod (no root-level go.mod).
- Go server API parity fix: Rewrote QueryPackets from observation-centric (packets_v view) to transmission-centric (transmissions table + correlated subqueries). This fixes both performance (9s to sub-100ms for unfiltered queries on 1.2M rows) and response shape. Packets now return first_seen, timestamp (= first_seen), observation_count, and NOT created_at/payload_version/score. Node responses now include last_heard (= last_seen fallback), hash_size (null), hash_size_inconsistent (false). Added schema version detection (v2 vs v3 observations table). Fixed QueryGroupedPackets first_seen. Added GetRecentTransmissionsForNode. All tests pass, build clean with Go 1.22.
- Fixed #133 (node count keeps climbing): `db.getStats().totalNodes` used `SELECT COUNT(*) FROM nodes` which counts every node ever seen — 6800+ on a ~200-400 node mesh. Changed `totalNodes` to count only nodes with `last_seen` within 7 days. Added `totalNodesAllTime` for the full historical count. Also filtered role counts in `/api/stats` to the same 7-day window. Added `countActiveNodes` and `countActiveNodesByRole` prepared statements in db.js. 6 new tests (95 total in test-db.js). The existing `idx_nodes_last_seen` index covers the new queries.
- Go server FULL API parity: Rewrote QueryGroupedPackets from packets_v VIEW scan (8s on 1.2M rows) to transmission-centric query (<100ms). Fixed GetStats to use 7-day window for totalNodes + added totalNodesAllTime. Split GetRoleCounts into 7-day (for /api/stats) and all-time (for /api/nodes). Added packetsLastHour + node lat/lon/role to /api/observers via batch queries (GetObserverPacketCounts, GetNodeLocations). Added multi-node filter support (/api/packets?nodes=pk1,pk2). Fixed /api/packets/:id to return parsed path_json in path field. Populated bulk-health per-node stats from SQL. Updated test seed data to use dynamic timestamps for 7-day filter compatibility. All 42+ tests pass, go vet clean.
- Fixed #133 ROOT CAUSE (phantom nodes): `autoLearnHopNodes` in server.js was calling `db.upsertNode()` for every unresolved hop prefix, creating thousands of fake "repeater" nodes with short public_keys (just the 2-4 byte hop prefix). Removed the `upsertNode` call entirely — unresolved hops are now simply cached to skip repeat DB lookups, and display as raw hex prefixes via hop-resolver. Added `db.removePhantomNodes()` that deletes nodes with `LENGTH(public_key) <= 16` (real pubkeys are 64 hex chars). Called at server startup to purge existing phantoms. 14 new test assertions (109 total in test-db.js).
- Fixed #126 (offline node showing on map due to hash prefix collision): `updatePathSeenTimestamps()` and `autoLearnHopNodes()` used `LIKE prefix%` DB queries that non-deterministically picked the first match when multiple nodes shared a hash prefix (e.g. `1CC4` and `1C82` both start with `1C` under 1-byte hash_size). Extracted `resolveUniquePrefixMatch()` that checks for uniqueness — ambiguous prefixes (matching 2+ nodes) are skipped and cached in a negative-cache Set. This prevents dead nodes from getting `last_heard` updates from packets that actually belong to a different node. 3 new tests (207 total in test-server-routes.js).
- Fixed #123 (channel hash for undecrypted GRP_TXT): Added `channelHashHex` (zero-padded uppercase hex) and `decryptionStatus` ('decrypted'|'no_key'|'decryption_failed') fields to `decodeGrpTxt` in decoder.js. Distinguishes between "no channel keys configured" vs "keys tried but decryption failed." Frontend packets.js updated: list preview shows "🔒 Ch 0xXX (status)", detail pane hex breakdown and message area show channel hash with status label. 6 new tests (58 total in test-decoder.js).
- Ported in-memory packet store to Go (`cmd/server/store.go`). PacketStore loads all transmissions + observations from SQLite at startup via streaming query (no .all()), builds 5 indexes (byHash, byTxID, byObsID, byObserver, byNode), picks longest-path observation per transmission for display fields. QueryPackets and QueryGroupedPackets serve from memory with full filter support (type, route, observer, hash, since, until, region, node). Poller ingests new transmissions into store via IngestNewFromDB. Server/routes fall back to direct DB queries when store is nil (backward-compatible with tests). All 42+ existing tests pass, go vet clean, go build clean. System Go 1.17 requires using Go 1.22.5 at C:\go1.22\go\bin.
- Fixed 3 critically slow Go endpoints by switching from SQLite queries against packets_v VIEW (1.2M rows) to in-memory PacketStore queries. `/api/channels` 7.2s→37ms (195×), `/api/channels/:hash/messages` 8.2s→36ms (228×), `/api/analytics/rf` 4.2s→90ms avg (47×). Key optimizations: (1) byPayloadType index reduces channels scan from 52K to 17K packets, (2) struct-based JSON decode avoids map[string]interface{} allocations, (3) per-transmission work hoisted out of 1.2M observation loop for RF, (4) eliminated second-pass time.Parse over 1.2M observations (track min/max timestamps as strings instead), (5) pre-allocated slices with capacity hints, (6) 15-second TTL cache for RF analytics (separate mutex to avoid contention with store RWMutex). Cache invalidation is TTL-only because live mesh generates continuous ingest events. Also fixed `/api/analytics/channels` to use store. All handlers fall back to DB when store is nil (test compat).
- **#133 PHANTOM NODES (ROOT CAUSE):** Removed the `upsertNode` call from backend `autoLearnHopNodes()`. Added `db.removePhantomNodes()` (deletes nodes whose pubkey is ≤16 chars), called at startup. Cascadia: 7,308 → ~200-400 active nodes. 14 new tests, all passing.
- **#133 ACTIVE WINDOW:** `/api/stats` `totalNodes` now uses a 7-day window. Added `totalNodesAllTime` for the historical count. Role counts filtered to the same 7-day window. Go server GetStats updated for parity.
- **#126 AMBIGUOUS PREFIXES:** `resolveUniquePrefixMatch()` requires unique prefix match. Ambiguous prefixes skipped, cached in negative-cache. Prevents dead nodes from wrong packet attribution.
- **Go API Parity:** QueryGroupedPackets transmission-centric 8s→<100ms. Response shapes match Node.js exactly. All 42+ Go tests passing.
- **Database merge:** Staging 185MB (50K tx + 1.2M obs) merged into prod 21MB. 0 data loss. Merged DB 51,723 tx + 1,237,186 obs. Deploy time 8,491ms, memory 860MiB RSS (vs. 2.7GB pre-RAM-fix). Backups retained 7 days.
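
The observation-dedup design described earlier in this list (observations keep only `transmission_id` and are hydrated at API boundaries) can be sketched like this. It is a simplified illustration, not the actual packet-store.js code: `byTxId`, `enrichObs`, and the sample records are assumptions.

```javascript
// Transmission fields are stored once, keyed by transmission id.
const byTxId = new Map([
  [1, {
    id: 1,
    hash: 'abc123',
    raw_hex: '0a0b0c',
    decoded_json: '{"type":"ADVERT"}',
    payload_type: 4,
    route_type: 1,
  }],
]);

// Lean observations: a reference to the transmission plus per-observer data only.
const observations = [
  { id: 10, transmission_id: 1, observer_id: 'obs-A', snr: 7.5 },
  { id: 11, transmission_id: 1, observer_id: 'obs-B', snr: -2.0 },
];

// Hydrate one observation with its transmission's fields (hypothetical helper).
function enrichObs(obs) {
  const tx = byTxId.get(obs.transmission_id);
  if (!tx) return { ...obs };
  const { hash, raw_hex, decoded_json, payload_type, route_type } = tx;
  return { ...obs, hash, raw_hex, decoded_json, payload_type, route_type };
}

// Enrich only where data leaves the store, e.g. just before building an API response.
const enriched = observations.map(enrichObs);

// hasOwnProperty distinguishes "field stored on the observation" from "field absent".
const leanHasHash = Object.prototype.hasOwnProperty.call(observations[0], 'hash');

console.log(leanHasHash, observations[0].hash, enriched[0].hash);
```

The trade-off is that any consumer reading a lean observation directly sees `undefined` for `hash` or `raw_hex`, so every code path that bypasses the enrichment step needs to be audited.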
# Hicks — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend. Custom decoder.js fixes path_length bug from upstream library. In-memory packet store provides O(1) lookups for 30K+ packets. TTL response cache achieves 7,000× speedup on bulk health endpoint.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- Split the monolithic "Frontend coverage (instrumented Playwright)" CI step into 5 discrete steps: Instrument frontend JS, Start test server (with health-check poll replacing sleep 5), Run Playwright E2E tests, Extract coverage + generate report, Stop test server. Cleanup/report steps use `if: always()` so server shutdown happens even on test failure. Server PID shared across steps via .server.pid file. "Frontend E2E only" fast-path left untouched.
- Fixed memory explosion in packet-store.js: observations no longer duplicate transmission fields (hash, raw_hex, decoded_json, payload_type, route_type). Instead, observations store only `transmission_id` as a reference. Added `_enrichObs()` to hydrate observations at API boundaries (getById, getSiblings, enrichObservations). Replaced `.all()` with `.iterate()` for streaming load. Updated `_transmissionsForObserver()` to use transmission_id instead of hash. For a 185MB DB with 50K transmissions × 23 observations avg, this eliminates ~1.17M copies of hex dumps and JSON — projected ~2GB RAM savings.
- Built standalone Go MQTT ingestor (`cmd/ingestor/`). Ported decoder.js → Go (header parsing, path extraction, all payload types, advert decoding with flags/lat/lon/name). Ported db.js v3 schema (transmissions + observations + nodes + observers). Ported computeContentHash (SHA-256 based, path-independent). Uses modernc.org/sqlite (pure Go, no CGO) and paho.mqtt.golang. 25 tests passing (decoder golden fixtures from production data + DB schema compatibility). Supports same config.json format as Node.js server. Handles Format 1 (raw packet) messages; companion bridge format deferred. System Go was 1.17 — installed Go 1.22.5 to support modern dependencies.
- Built standalone Go web server (`cmd/server/`) — READ side of the Go rewrite. 35+ REST API endpoints ported from server.js. All queries go directly to SQLite (no in-memory packet store). WebSocket broadcast via SQLite polling. Static file server with SPA fallback. Uses gorilla/mux for routing, gorilla/websocket for WS, modernc.org/sqlite for DB. 42 tests passing (20 DB query tests, 20+ route integration tests, 2 WebSocket tests). `go vet` clean. Binary compiles to single executable. Analytics endpoints that required Node.js in-memory store (topology, distance, hash-sizes, subpaths) return structural stubs — core data (RF stats, channels, node health, etc.) fully functional via SQL. System Go 1.17 → installed Go 1.22 for build. Each cmd/* module has its own go.mod (no root-level go.mod).
- Go server API parity fix: Rewrote QueryPackets from observation-centric (packets_v view) to transmission-centric (transmissions table + correlated subqueries). This fixes both performance (9s to sub-100ms for unfiltered queries on 1.2M rows) and response shape. Packets now return first_seen, timestamp (= first_seen), observation_count, and NOT created_at/payload_version/score. Node responses now include last_heard (= last_seen fallback), hash_size (null), hash_size_inconsistent (false). Added schema version detection (v2 vs v3 observations table). Fixed QueryGroupedPackets first_seen. Added GetRecentTransmissionsForNode. All tests pass, build clean with Go 1.22.
- Fixed #133 (node count keeps climbing): `db.getStats().totalNodes` used `SELECT COUNT(*) FROM nodes` which counts every node ever seen — 6800+ on a ~200-400 node mesh. Changed `totalNodes` to count only nodes with `last_seen` within 7 days. Added `totalNodesAllTime` for the full historical count. Also filtered role counts in `/api/stats` to the same 7-day window. Added `countActiveNodes` and `countActiveNodesByRole` prepared statements in db.js. 6 new tests (95 total in test-db.js). The existing `idx_nodes_last_seen` index covers the new queries.
- Go server FULL API parity: Rewrote QueryGroupedPackets from packets_v VIEW scan (8s on 1.2M rows) to transmission-centric query (<100ms). Fixed GetStats to use 7-day window for totalNodes + added totalNodesAllTime. Split GetRoleCounts into 7-day (for /api/stats) and all-time (for /api/nodes). Added packetsLastHour + node lat/lon/role to /api/observers via batch queries (GetObserverPacketCounts, GetNodeLocations). Added multi-node filter support (/api/packets?nodes=pk1,pk2). Fixed /api/packets/:id to return parsed path_json in path field. Populated bulk-health per-node stats from SQL. Updated test seed data to use dynamic timestamps for 7-day filter compatibility. All 42+ tests pass, go vet clean.
- Fixed #133 ROOT CAUSE (phantom nodes): `autoLearnHopNodes` in server.js was calling `db.upsertNode()` for every unresolved hop prefix, creating thousands of fake "repeater" nodes with short public_keys (just the 2-4 byte hop prefix). Removed the `upsertNode` call entirely — unresolved hops are now simply cached to skip repeat DB lookups, and display as raw hex prefixes via hop-resolver. Added `db.removePhantomNodes()` that deletes nodes with `LENGTH(public_key) <= 16` (real pubkeys are 64 hex chars). Called at server startup to purge existing phantoms. 14 new test assertions (109 total in test-db.js).
- Fixed #126 (offline node showing on map due to hash prefix collision): `updatePathSeenTimestamps()` and `autoLearnHopNodes()` used `LIKE prefix%` DB queries that non-deterministically picked the first match when multiple nodes shared a hash prefix (e.g. `1CC4` and `1C82` both start with `1C` under 1-byte hash_size). Extracted `resolveUniquePrefixMatch()` that checks for uniqueness — ambiguous prefixes (matching 2+ nodes) are skipped and cached in a negative-cache Set. This prevents dead nodes from getting `last_heard` updates from packets that actually belong to a different node. 3 new tests (207 total in test-server-routes.js).
- Fixed #123 (channel hash for undecrypted GRP_TXT): Added `channelHashHex` (zero-padded uppercase hex) and `decryptionStatus` ('decrypted'|'no_key'|'decryption_failed') fields to `decodeGrpTxt` in decoder.js. Distinguishes between "no channel keys configured" vs "keys tried but decryption failed." Frontend packets.js updated: list preview shows "🔒 Ch 0xXX (status)", detail pane hex breakdown and message area show channel hash with status label. 6 new tests (58 total in test-decoder.js).
- Ported in-memory packet store to Go (`cmd/server/store.go`). PacketStore loads all transmissions + observations from SQLite at startup via streaming query (no .all()), builds 5 indexes (byHash, byTxID, byObsID, byObserver, byNode), picks longest-path observation per transmission for display fields. QueryPackets and QueryGroupedPackets serve from memory with full filter support (type, route, observer, hash, since, until, region, node). Poller ingests new transmissions into store via IngestNewFromDB. Server/routes fall back to direct DB queries when store is nil (backward-compatible with tests). All 42+ existing tests pass, go vet clean, go build clean. System Go 1.17 requires using Go 1.22.5 at C:\go1.22\go\bin.
- Fixed 3 critically slow Go endpoints by switching from SQLite queries against packets_v VIEW (1.2M rows) to in-memory PacketStore queries. `/api/channels` 7.2s→37ms (195×), `/api/channels/:hash/messages` 8.2s→36ms (228×), `/api/analytics/rf` 4.2s→90ms avg (47×). Key optimizations: (1) byPayloadType index reduces channels scan from 52K to 17K packets, (2) struct-based JSON decode avoids map[string]interface{} allocations, (3) per-transmission work hoisted out of 1.2M observation loop for RF, (4) eliminated second-pass time.Parse over 1.2M observations (track min/max timestamps as strings instead), (5) pre-allocated slices with capacity hints, (6) 15-second TTL cache for RF analytics (separate mutex to avoid contention with store RWMutex). Cache invalidation is TTL-only because live mesh generates continuous ingest events. Also fixed `/api/analytics/channels` to use store. All handlers fall back to DB when store is nil (test compat).
- **#133 PHANTOM NODES (ROOT CAUSE):** Removed the `upsertNode` call from backend `autoLearnHopNodes()`. Added `db.removePhantomNodes()` (pubkey ≤16 chars), called at startup. Cascadia: 7,308 → ~200-400 active nodes. 14 new tests, all passing.
- **#133 ACTIVE WINDOW:** `/api/stats` `totalNodes` now uses a 7-day window. Added `totalNodesAllTime` for historical counts. Role counts filtered to the same 7-day window. Go server `GetStats` updated for parity.
- **#126 AMBIGUOUS PREFIXES:** `resolveUniquePrefixMatch()` requires unique prefix match. Ambiguous prefixes skipped, cached in negative-cache. Prevents dead nodes from wrong packet attribution.
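The `channelHashHex` and `decryptionStatus` fields from the #123 entry in this list can be illustrated with a small sketch. This is not the code in `decoder.js`: `describeGrpTxtChannel`, `formatListPreview`, and the `tryDecrypt` callback are hypothetical names, and a one-byte channel hash is assumed from the `0xXX` preview format.

```javascript
// Hypothetical helper, not the real decodeGrpTxt.
// channelHash: assumed to be a single byte (0-255).
// channelKeys: array of configured channel keys (may be empty).
// tryDecrypt: caller-supplied callback that returns true when a key decrypts the payload.
function describeGrpTxtChannel(channelHash, channelKeys, tryDecrypt) {
  // Zero-padded uppercase hex, e.g. 5 -> "05", 226 -> "E2".
  const channelHashHex = channelHash.toString(16).toUpperCase().padStart(2, '0');

  let decryptionStatus;
  if (!channelKeys || channelKeys.length === 0) {
    decryptionStatus = 'no_key'; // nothing configured, so nothing was tried
  } else if (channelKeys.some((key) => tryDecrypt(key))) {
    decryptionStatus = 'decrypted';
  } else {
    decryptionStatus = 'decryption_failed'; // keys were tried and none worked
  }
  return { channelHashHex, decryptionStatus };
}

// Builds the list-preview string in the "Ch 0xXX (status)" shape described above.
function formatListPreview(info) {
  return '🔒 Ch 0x' + info.channelHashHex + ' (' + info.decryptionStatus + ')';
}
```

Keeping `no_key` separate from `decryption_failed` is what lets the UI tell "no channel keys configured" apart from "keys tried but decryption failed".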
MeshCore Analyzer is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend with Leaflet maps, WebSocket live feed, MQTT ingestion. Production at v2.6.0, ~18K lines, 85%+ backend test coverage.
# Kobayashi — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer. Node.js + Express + SQLite backend, vanilla JS SPA frontend with Leaflet maps, WebSocket live feed, MQTT ingestion. Production at v2.6.0, ~18K lines, 85%+ backend test coverage.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **E2E Playwright performance audit (2026-03-26):** 16 tests, single browser/context/page (good). Key bottlenecks: (1) `waitUntil: 'networkidle'` used ~20 times — catastrophic for SPA with WebSocket + map tiles, (2) ~17s of hardcoded `waitForTimeout` sleeps, (3) redundant `page.goto()` to same routes across tests, (4) CI installs Playwright browser on every run with no caching, (5) coverage collection launches a second full browser session, (6) `sleep 5` server startup instead of health-check polling. Estimated 40-50% total runtime reduction achievable.
- **Issue triage session (2026-03-27):** Triaged 4 open issues, assigned to team:
- **#131** (Feature: Auto-update nodes tab) → Newt (⚛️). Requires WebSocket real-time updates in nodes.js, similar to existing packets feed.
- **#130** (Bug: Disappearing nodes on live map) → Newt (⚛️). High severity, multiple Cascadia Mesh community reports. Likely status calculation or map filter bug. Nodes visible in static list but vanishing from live map.
- **#129** (Feature: Packet comparison between observers) → Newt (⚛️). Feature request from letsmesh analyzer. Side-by-side packet filtering for two repeaters to diagnose repeater issues.
- **#123** (Feature: Show channel hash on decrypt failure) → Hicks (🔧). Core contributor (lincomatic) request. Decoder needs to track why decrypt failed (no key vs. corruption) and expose channel hash + reason in API response.
- **Massive session — 2026-03-27 (full day):**
- **#133 root cause (phantom nodes):** `autoLearnHopNodes()` creates stub nodes for unresolved hop prefixes (2-8 hex chars). Cascadia showed 7,308 nodes (6,638 repeaters) when real size ~200-400. With `hash_size=1`, collision rate high → infinite phantom generation.
- **DB merge decision:** Staging DB (185MB, 50K transmissions, 1.2M observations) is superset. Use as merge base. Transmissions dedup by hash (unique), observations all preserved (unique by observer), nodes/observers latest-wins + sum counts. 6-phase execution plan: pre-flight, backup, merge, deploy, validate, cleanup.
- **Outcome:** All 4 triaged issues fixed (#131, #130, #129, #123), #133 (phantom nodes) fully resolved, #126 (ambiguous hop prefixes) fixed as bonus, database merged successfully (0 data loss, 2 min downtime, 51,723 tx + 1.237M obs), Go rewrite (MQTT ingestor + web server) completed and ready for staging.
- **Team expanded:** Hudson joined for DevOps work, Ripley joined as Support Engineer.
- **Go staging bug triage (2026-03-28):** Filed 8 issues for Go staging bugs missed during API parity work. All found by actually loading the analytics page in a browser — none caught by endpoint-level parity checks.
- **Post-mortem:** Parity was verified by comparing individual endpoint response shapes in isolation. Nobody loaded the analytics page in a browser and looked at it. The agents tested API responses without browser validation of the full UI — exactly the failure mode AGENTS.md rule #2 exists to prevent.
MeshCore Analyzer is a real-time LoRa mesh packet analyzer with a vanilla JS SPA frontend. 22 frontend modules, Leaflet maps, WebSocket live feed, VCR playback, Canvas animations, theme customizer with CSS variables. No build step, no framework. ES5/6 for broad browser support.
# Newt — History
## Project Context
CoreScope is a real-time LoRa mesh packet analyzer with a vanilla JS SPA frontend. 22 frontend modules, Leaflet maps, WebSocket live feed, VCR playback, Canvas animations, theme customizer with CSS variables. No build step, no framework. ES5/6 for broad browser support.
User: User
## Learnings
- Session started 2026-03-26. Team formed: Kobayashi (Lead), Hicks (Backend), Newt (Frontend), Bishop (Tester).
- **Issue #127 fix:** Firefox clipboard API fails silently when `navigator.clipboard.writeText()` is called outside a secure context or without proper user gesture handling. Added `window.copyToClipboard()` shared helper to `roles.js` that tries Clipboard API first, falls back to hidden textarea + `document.execCommand('copy')`. Updated all 3 clipboard call sites: `nodes.js` (Copy URL — the reported bug), `packets.js` (Copy Link — had ugly `prompt()` fallback), `customize.js` (Copy to Clipboard — already worked but now uses shared helper). Cache busters bumped. All tests pass (47 frontend, 62 packet-filter).
- **Issue #125 fix:** Added dismiss/close button (✕) to the packet detail pane on desktop. Extracted `closeDetailPanel()` shared helper and `PANEL_CLOSE_HTML` constant — DRY: Escape handler and click handler both call it. Close button uses event delegation on `#pktRight`, styled with CSS variables (`--text-muted`, `--text`, `--surface-1`) matching the mobile `.mobile-sheet-close` pattern. Hidden when panel is in `.empty` state. Clicking a different row still re-opens with new data. Files changed: `public/packets.js`, `public/style.css`. Cache busters NOT bumped (another agent editing index.html).
- **Issue #122 fix:** Node tooltip (line 45) and node detail panel (line 120) in `channels.js` used `last_seen` alone for "Last seen" display. Changed both to `last_heard || last_seen` per AGENTS.md pitfall. Pattern: always prefer `last_heard || last_seen` for any time-ago display. **Server note for Hicks:** `/api/nodes/search` and `/api/nodes/:pubkey` endpoints don't return `last_heard` — only the bulk `/api/nodes` list endpoint computes it from the in-memory packet store. These endpoints need the same `last_heard` enrichment for the frontend fix to fully take effect. Also, `/api/analytics/channels` has a separate bug: `lastActivity` is overwritten unconditionally (no `>=` check) so it shows the oldest packet's timestamp, not the newest.
- **Issue #130 fix:** Live map `pruneStaleNodes()` (added for #133) was completely removing stale nodes from the map, while the static map dims them with CSS. Root cause: API-loaded nodes and WS-only nodes were treated identically — both got deleted when stale. Fix: mark API-loaded nodes with `_fromAPI = true` in `loadNodes()`. `pruneStaleNodes()` now dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing them, and restores full opacity when they become active again. WS-only dynamic nodes are still removed to prevent memory leaks. Pattern: **live map should match static map behavior** — never remove database-loaded nodes, only change their visual state. 3 new tests added (63 total frontend tests passing).
- **Issue #129 fix:** Added observer packet comparison feature (`#/compare` page). Users select two observers from dropdowns, click Compare, and see which packets each observer saw in the last 24 hours. Data flow: fetches packets per observer via existing `/api/packets?observer=X&limit=10000&since=24h`, computes set intersection/difference client-side using `comparePacketSets()` (O(n) via Set lookups — no nested loops). UI: three summary cards (both/only-A/only-B with counts and percentages), horizontal stacked bar chart, packet type breakdown for shared packets, and tabbed detail tables (up to 200 rows each, clickable to packet detail). URL is shareable: `#/compare?a=ID1&b=ID2`. Added 🔍 compare button to observers page header. Pure function `comparePacketSets` exposed on `window` for testability. 11 new tests (87 total frontend tests). Files: `public/compare.js` (new), `public/style.css`, `public/observers.js`, `public/index.html`, `test-frontend-helpers.js`. Cache busters bumped.
- **Browser validation of 6 fixes (2026-03-27):** Validated against live prod at `https://analyzer.00id.net`. Results: ✅ #133 (phantom nodes) — API returns 50 nodes, reasonable count, no runaway growth. ✅ #123 (channel hash on undecrypted) — GRP_TXT packets with `decryption_failed` status show `channelHashHex` field; packet detail renders `🔒 Channel Hash: 0xE2 (decryption failed)` via `packets.js:1254-1259`. ⏭ #126 (offline node on map) — skipped, requires specific dead node. ✅ #130 (disappearing nodes on live map) — `pruneStaleNodes()` confirmed at `live.js:1474` dims API-loaded nodes (`fillOpacity:0.25`) instead of removing; `_fromAPI=true` flag set at `live.js:1279`. ✅ #131 (auto-updating node list) — `nodes.js:210-216` wires `debouncedOnWS` handler that triggers `loadNodes(true)` on ADVERT messages; `isAdvertMessage()` at `nodes.js:852` checks `payload_type===4`. ✅ #129 (observer comparison) — `compare.js` deployed with full UI: observer dropdowns, `comparePacketSets()` Set logic, summary cards, bar chart, type breakdown. 16 observers available in prod. Pattern: always verify deployed JS matches source — cache buster `v=1774625000` confirmed consistent across all script tags.
- **Packet detail pane fresh-load fix:** The `detail-collapsed` class added for issue #125's close button wasn't applied on initial render, so the empty right panel was visible on fresh page load. Fix: added `detail-collapsed` to the `split-layout` div in the initial `innerHTML` template (packets.js:183). Pattern: when adding a CSS toggle class, always consider the initial DOM state — if nothing is selected, the default state must match "nothing selected." 3 tests added (90 total frontend). Cache busters bumped.
- **#130 LIVE MAP STALE DIMMING:** `pruneStaleNodes()` distinguishes API-loaded (`_fromAPI`) from WS-only. Dims API nodes (fillOpacity 0.25, opacity 0.15) instead of removing. Matches static map behavior. 3 new tests, all passing.
- **#131 NODES TAB WS AUTO-UPDATE:** `loadNodes(refreshOnly)` pattern resets cache + invalidateApiCache + re-fetches. Preserves scroll/selection/listeners. WS handler now triggers on ADVERT messages (payload_type===4). All tests passing.
- **#129 OBSERVER COMPARISON PAGE:** New `#/compare` route with shareable params `?a=ID1&b=ID2`. `comparePacketSets()` pure function (O(n) Set operations). UI: summary cards, bar chart, type breakdown, detail tables. 🔍 compare button on observers header.
- **#133 LIVE PAGE NODE PRUNING:** Prune every 60s using `getNodeStatus()` from roles.js (per-role health thresholds: 24h companions/sensors, 72h infrastructure). `_liveSeen` timestamp set on insert, updated on re-observation. Bounded memory usage.
- **Database merge:** All frontend endpoints working with merged 1.237M observation DB. Load speed verified. All 4 fixes tested end-to-end in browser.
Deep knowledge of every frontend behavior, API response, and user-facing feature in MeshCore Analyzer. Fields community questions, triages bug reports, and explains "why does X look like Y."
1. Read the relevant frontend code FIRST — don't guess
2. Check the live API data if applicable (analyzer.00id.net is public)
3. Explain in user-friendly terms, not code jargon
4. If it's a bug, route to the right squad member
5. If it's expected behavior, explain WHY
## Model
Preferred: auto
# Ripley — Support Engineer
Deep knowledge of every frontend behavior, API response, and user-facing feature in CoreScope. Fields community questions, triages bug reports, and explains "why does X look like Y."
**Decision:** CI pipeline should check if `docker compose` (v2 plugin) is installed on the self-hosted runner and install it if needed, as part of the deploy job itself.
**Rationale:** Self-healing CI is preferred over manual VM setup; the VM may not have docker compose v2 installed.
### 2026-03-27T04:39 — Staging DB: Use Old Problematic DB
**By:** User (via Copilot)
**Decision:** Staging environment's primary purpose is debugging the problematic DB that caused 100% CPU on prod. Use the old DB (`~/meshcore-data-old/` on the VM) for staging. Prod keeps its current (new) DB. Never put the problematic DB on prod.
**Rationale:** This is the reason the staging environment was built.
### 2026-03-27T06:09 — Plan Go Rewrite (MQTT Separation)
**By:** User (via Copilot)
**Decision:** Start planning a Go rewrite. First step: separate MQTT ingestion (writes to DB) from the web server (reads from DB + serves API/frontend). Two separate services.
**Rationale:** Node.js single-thread + V8 heap limitations cause fragility at scale (185MB DB → 2.7GB heap → OOM). Go eliminates heap cap problem and enables real concurrency.
### 2026-03-27T06:31 — NO PII in Git
**By:** User (via Copilot)
**Decision:** NEVER write real names, usernames, email addresses, or any PII to files committed to git. Use "User" for attribution and "deploy" for SSH/server references. This is a PUBLIC repo.
**Rationale:** PII was leaked to the public repo and required a full git history rewrite to remove.
### 2026-03-27T02:19 — Production/Infrastructure Touches: Hudson Only
**By:** User (via Copilot)
**Decision:** Production/infrastructure touches (SSH, DB ops, server restarts, Azure operations) should only be done by Hudson (DevOps). No other agents should touch prod directly.
**Rationale:** Separation of concerns — dev agents write code, DevOps deploys and manages prod.
1. No Docker named volumes — always bind mount from `~/meshcore-data` (host location, easy to access)
2. Staging container runs on plaintext port (e.g., port 81, no HTTPS)
3. Use Docker Compose to orchestrate prod + staging containers on the same VM
4. `manage.sh` supports launching prod only OR prod+staging with clear messaging
5. Ports must be configurable via `manage.sh` or environment, with sane defaults
### 2026-03-27T03:43 — Staging Refinements: Shared Data
**By:** User (via Copilot)
**Decision:**
1. Staging copies prod DB on launch (snapshot into staging data dir when started)
2. Staging connects to SAME MQTT broker as prod (not its own Mosquitto)
**Rationale:** Staging needs real data (prod-like conditions) to be useful for testing.
### 2026-03-27T17:13 — Scribe Auto-Run After Agent Batches
**By:** User (via Copilot)
**Decision:** Scribe must run after EVERY batch of agent work automatically. No manual triggers. No reminders needed. This is a process guarantee, not a suggestion.
**Rationale:** Coordinator has been forgetting to spawn Scribe after agent batches complete. This is a process failure. Scribe auto-spawn ends the forgetfulness.
---
## Decision: Technical Fixes
### Issue #126 — Skip Ambiguous Hop Prefixes
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
When resolving hop prefixes to full node pubkeys, require a **unique match**. If prefix matches 2+ nodes in DB, skip it and cache in `ambiguousHopPrefixes` (negative cache). Prevents hash prefix collisions (e.g., `1CC4` vs `1C82` sharing prefix `1C` under 1-byte hash_size) from attributing packets to wrong nodes.
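A minimal sketch of the unique-match rule, assuming a flat list of unique hex pubkeys. The names `resolveUniquePrefixMatch` and `ambiguousHopPrefixes` come from these notes, but the signature and the in-memory `Set` are assumptions for illustration, not the actual implementation.

```javascript
// Negative cache: prefixes already known to match more than one node.
const ambiguousHopPrefixes = new Set();

// Sketch only: the real function may take different arguments.
// prefix: hop prefix as a hex string.
// pubkeys: iterable of full public keys as hex strings (assumed unique).
// Returns the single matching pubkey, or null when there is no match or the match is ambiguous.
function resolveUniquePrefixMatch(prefix, pubkeys) {
  const wanted = prefix.toUpperCase();
  if (ambiguousHopPrefixes.has(wanted)) return null; // known collision, skip the scan

  let match = null;
  for (const key of pubkeys) {
    if (!key.toUpperCase().startsWith(wanted)) continue;
    if (match !== null) {
      // Second hit: the prefix is ambiguous, so remember that and attribute nothing.
      ambiguousHopPrefixes.add(wanted);
      return null;
    }
    match = key;
  }
  return match;
}
```

With keys starting `1CC4` and `1C82`, the prefix `1C` resolves to `null` and is cached as ambiguous, while `1CC4` still resolves to its single node.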
**Impact:**
- Hop prefixes that collide won't update `lastPathSeenMap` for any node (conservative, correct)
- `disambiguateHops()` still does geometric disambiguation for route visualization
- `autoLearnHopNodes()` no longer calls `db.upsertNode()` for unresolved hops
- Added `db.removePhantomNodes()` — deletes nodes where `LENGTH(public_key) <= 16` (real keys are 64 hex chars)
- Called at startup to purge existing phantoms from prior behavior
- Hop-resolver still handles unresolved prefixes gracefully
**Part 2: totalNodes now 7-day active window**
- `/api/stats` `totalNodes` returns only nodes seen in last 7 days (was all-time)
- New field `totalNodesAllTime` for historical tracking
- Role counts (repeaters, rooms, companions, sensors) also filtered to 7-day window
- Frontend: no changes needed (same field name, smaller correct number)
**Impact:** Frontend `totalNodes` now reflects active mesh size. Go server should apply same 7-day filter when querying.
---
### Issue #123 — Channel Hash on Undecrypted Messages
**By:** Hicks
**Status:** Implemented
Fixed test coverage for decrypted status tracking on channel messages.
---
### Issue #130 — Live Map: Dim Stale Nodes, Don't Remove
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
`pruneStaleNodes()` in `live.js` now distinguishes API-loaded nodes (`_fromAPI`) from WS-only dynamic nodes. API nodes dimmed (reduced opacity) when stale instead of removed. WS-only nodes still pruned to prevent memory leaks.
**Rationale:** Static map shows stale nodes with faded markers; live map was deleting them, causing user-reported disappearing nodes. Parity expected.
**Pattern:** Database-loaded nodes never removed from map during session. Future live map features should respect `_fromAPI` flag.
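The dim-versus-remove split can be sketched as follows. This is an illustration under assumptions, not the code in `live.js`: the `isStale` predicate is injected, the node map shape is hypothetical, and the "active" style values are assumed. The stale values (fillOpacity 0.25, opacity 0.15) are the ones recorded in the frontend notes. `setStyle()` and `remove()` are the methods Leaflet circle markers expose.

```javascript
// Stale values are taken from the notes; the active values are an assumption.
const STALE_STYLE = { fillOpacity: 0.25, opacity: 0.15 };
const ACTIVE_STYLE = { fillOpacity: 1, opacity: 1 };

// nodes: Map of key -> { marker, _fromAPI, ... } (hypothetical shape).
// isStale: caller-supplied predicate standing in for the real status check.
function pruneStaleNodes(nodes, isStale) {
  for (const [key, node] of nodes) {
    if (!isStale(node)) {
      // Active again: undo any earlier dimming on database-loaded nodes.
      if (node._fromAPI) node.marker.setStyle(ACTIVE_STYLE);
      continue;
    }
    if (node._fromAPI) {
      node.marker.setStyle(STALE_STYLE); // dim, never remove
    } else {
      node.marker.remove(); // WS-only node: take the marker off the map
      nodes.delete(key);    // and drop the entry so memory stays bounded
    }
  }
}
```

Deleting the current entry while iterating a `Map` is safe in JavaScript, so no second pass is needed.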
---
### Issue #131 — Nodes Tab Auto-Update via WebSocket
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
WS-driven page updates must reset local caches: (1) set local cache to null, (2) call `invalidateApiCache()`, (3) re-fetch. New `loadNodes(refreshOnly)` pattern skips full DOM rebuild, only updates data rows. Preserves scroll, selection, listeners.
**Trap:** Two-layer caching (local variable + API cache) prevents re-fetches. All three reset steps required.
**Pattern:** Other pages doing WS-driven updates should follow same approach.
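The three reset steps can be sketched with injected dependencies so the example runs outside the browser. `createNodesLoader` and the hooks `fetchNodes`, `renderRows`, and `renderPage` are hypothetical names. Only `loadNodes(refreshOnly)` and `invalidateApiCache()` appear in the notes, and the real function is not necessarily structured this way.

```javascript
// Sketch only. deps is a hypothetical bundle of hooks:
//   fetchNodes(): Promise resolving to the node list
//   invalidateApiCache(): clears the shared API-layer cache
//   renderRows(nodes): updates data rows in place
//   renderPage(nodes): full DOM build
function createNodesLoader(deps) {
  let cachedNodes = null; // local cache layer (the first of the two layers)

  return async function loadNodes(refreshOnly) {
    if (refreshOnly) {
      cachedNodes = null;        // (1) drop the local cache
      deps.invalidateApiCache(); // (2) drop the API-layer cache
    }
    if (cachedNodes === null) {
      cachedNodes = await deps.fetchNodes(); // (3) re-fetch
    }
    if (refreshOnly) {
      deps.renderRows(cachedNodes); // keeps scroll, selection, listeners
    } else {
      deps.renderPage(cachedNodes);
    }
    return cachedNodes;
  };
}
```

Skipping step (1) or (2) would leave one cache layer serving the old list, which is the trap described above.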
---
### Issue #129 — Observer Comparison Page
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Added `comparePacketSets(hashesA, hashesB)` as standalone pure function exposed on `window` for testability. Computes `{ onlyA, onlyB, both }` via Set operations (O(n)).
**Pattern:** Comparison logic decoupled from UI, reusable. Client-side diff avoids new server endpoint. 24-hour window keeps data size reasonable (~10K packets max).
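A minimal sketch of the Set-based diff, assuming both inputs are arrays of packet hash strings. The function name and the `{ onlyA, onlyB, both }` return shape are taken from the notes above; the body is an illustration and may differ from `public/compare.js`.

```javascript
// Sketch only: computes which hashes appear in A only, B only, or both.
// Each hash is visited a constant number of times, so the work is linear
// in the total input size (Set lookups instead of nested loops).
function comparePacketSets(hashesA, hashesB) {
  const setA = new Set(hashesA); // also de-duplicates repeated hashes
  const setB = new Set(hashesB);
  const onlyA = [];
  const onlyB = [];
  const both = [];

  for (const hash of setA) {
    if (setB.has(hash)) both.push(hash);
    else onlyA.push(hash);
  }
  for (const hash of setB) {
    if (!setA.has(hash)) onlyB.push(hash);
  }
  return { onlyA, onlyB, both };
}
```

Because the function takes plain arrays and touches no DOM, it can be unit-tested without a browser.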
---
### Issue #132 — Detail Pane Collapse
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Detail pane collapse uses CSS class on parent container. Add `detail-collapsed` class to `.split-layout`, which sets `.panel-right` to `display: none`. `.panel-left` with `flex: 1` fills 100% width naturally.
**Pattern:** CSS class toggling on parent cleaner than inline styles, easier to animate, keeps layout logic in CSS.
---
## Decision: Infrastructure & Deployment
### Database Merge — Prod + Staging
**By:** Kobayashi (Lead) / Hudson (DevOps)
**Date:** 2026-03-27
**Status:** ✅ Complete
Merged staging DB (185MB, 50K transmissions + 1.2M observations) into prod DB (21MB). Dedup strategy:
- **Transmissions:** `INSERT OR IGNORE` on `hash` (unique key)
- **Observations:** All unique by observer, all preserved
- **Nodes/Observers:** Latest `last_seen` wins, sum counts
- Backups: Retained at `/home/deploy/backups/pre-merge-20260327-071425/` until 2026-04-03
---
### Unified Docker Volume Paths
**By:** Hudson (DevOps)
**Date:** 2026-03-27
**Status:** Applied
Reconciled `manage.sh` and `docker-compose.yml` Docker volume names:
- Caddy volume: `caddy-data` everywhere (prod); `caddy-data-staging` for staging
- Data directory: Bind mount via `PROD_DATA_DIR` env var, default `~/meshcore-data`
- Config/Caddyfile: Mounted from repo checkout for prod, staging data dir for staging
- Removed deprecated `version` key from docker-compose.yml
**Consequence:** `./manage.sh start` and `docker compose up prod` now produce identical mounts. For anyone with data in the old `caddy-data-prod` volume, Caddy will need to re-provision TLS certs, which it does automatically.
Standalone Go web server replacing Node.js server's READ side (REST API + WebSocket). Two-component rewrite: ingestor (MQTT writes), server (REST/WS reads).
**Architecture Decisions:**
1. **Direct SQLite queries** — No in-memory packet store; all reads via `packets_v` view (v3 schema)
2. **Per-module go.mod** — Each `cmd/*` directory has own `go.mod`
3. **gorilla/mux for routing** — Handles 35+ parameterized routes cleanly
4. **SQLite polling for WebSocket** — Polls for new transmission IDs every 1s (decouples from MQTT)
**Future Work:** Full analytics via SQL, TTL response cache, shared `internal/db/` package, TLS, region-aware filtering.
---
### Go API Parity: Transmission-Centric Queries
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented, all 42+ tests pass
Go server rewrote packet list queries from VIEW-based (slow, wrong shape) to **transmission-centric** with correlated subqueries. Schema version detection (`isV3` flag) handles both v2 and v3 schemas.
- `/api/packets` — Multi-node filter support (`nodes` query param, comma-separated pubkeys)
---
### Go In-Memory Packet Store (cmd/server/store.go)
**By:** Hicks (Backend Dev)
**Date:** 2026-03-26
**Status:** Implemented
Port of `packet-store.js` with streaming load, 5 indexes, lean observation structs (only observation-specific fields). `QueryPackets` handles type, route, observer, hash, since, until, region, node. `IngestNewFromDB()` streams new transmissions from DB into memory.
- Startup: One-time load adds few seconds (acceptable)
- DB still used for: analytics, node/observer queries, role counts, region resolution
---
### Observation RAM Optimization
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
Observation objects in in-memory packet store now store only `transmission_id` reference instead of copying `hash`, `raw_hex`, `decoded_json`, `payload_type`, `route_type` from parent. API boundary methods (`getById`, `getSiblings`, `enrichObservations`) hydrate on demand. Load uses `.iterate()` instead of `.all()` to avoid materializing full JOIN.
**Impact:** Eliminates ~1.17M redundant string copies, avoids 1.17M-row array during startup. 2.7GB RAM → acceptable levels with 185MB database.
**Code Pattern:** Any code reading observation objects from `tx.observations` directly must use `pktStore.enrichObservations()` if it needs transmission fields. Internal iteration over observations for observer_id, snr, rssi, path_json works unchanged.
**Status:** Proposed — awaiting user sign-off before implementation
Playwright E2E tests (16 tests in `test-e2e-playwright.js`) are slow in CI. Analysis identified ~40-50% potential runtime reduction.
### Recommendations (prioritized)
#### HIGH impact (30%+ improvement)
1.**Replace `waitUntil: 'networkidle'` with `'domcontentloaded'` + targeted waits** — used ~20 times; `networkidle` worst-case for SPAs with persistent WebSocket + Leaflet tile loading. Each navigation pays 500ms+ penalty.
2.**Eliminate redundant navigations** — group tests by route; navigate once, run all assertions for that route.
3.**Cache Playwright browser install in CI** — `npx playwright install chromium --with-deps` runs every frontend push. Self-hosted runner should retain browser between runs.
#### MEDIUM impact (10-30%)
4.**Replace hardcoded `waitForTimeout` with event-driven waits** — ~17s scattered. Replace with `waitForSelector`, `waitForFunction`, or `page.waitForResponse`.
5.**Merge coverage collection into E2E run** — `collect-frontend-coverage.js` launches second browser. Extract `window.__coverage__` at E2E end instead.
6.**Replace `sleep 5` server startup with health-check polling** — Start tests as soon as `/api/stats` responsive (~1-2s savings).
#### LOW impact (<10%)
7.**Block unnecessary resources for non-visual tests** — use `page.route()` to abort map tiles, fonts.
8.**Reduce default timeout 15s → 10s** — sufficient for local CI.
### Implementation notes
- Items 1-2 are test-file-only (Bishop/Newt scope)
- Items 3, 5-6 are CI pipeline (Hicks scope)
- No architectural changes; all incremental
- All assertions remain identical — only wait strategies change
---
### 2026-03-27T20:56:00Z — Protobuf API Contract (Merged)
**By:** Kpa-clawbot (via Copilot)
**Decision:**
1. All frontend/backend interfaces get protobuf definitions as single source of truth
2. Go generates structs with JSON tags from protos; Node stays unchanged — protos derived from Node's current JSON shapes
3. Proto definitions MUST use inheritance and composition (no repeating field definitions)
4. Data flow: SQLite → proto struct → JSON; JSON blobs from DB deserialize against proto structs for validation
5. CI pipeline's proto fixture capture runs against prod (stable reference), not staging
**Rationale:** Eliminates parity bugs between Node and Go. Compiler-enforced contract. Prod is known-good baseline.
**Decision:** CI pipeline should check if `docker compose` (v2 plugin) is installed on the self-hosted runner and install it if needed, as part of the deploy job itself.
**Rationale:** Self-healing CI is preferred over manual VM setup; the VM may not have docker compose v2 installed.
### 2026-03-27T04:39 — Staging DB: Use Old Problematic DB
**By:** User (via Copilot)
**Decision:** Staging environment's primary purpose is debugging the problematic DB that caused 100% CPU on prod. Use the old DB (`~/meshcore-data-old/` on the VM) for staging. Prod keeps its current (new) DB. Never put the problematic DB on prod.
**Rationale:** This is the reason the staging environment was built.
### 2026-03-27T06:09 — Plan Go Rewrite (MQTT Separation)
**By:** User (via Copilot)
**Decision:** Start planning a Go rewrite. First step: separate MQTT ingestion (writes to DB) from the web server (reads from DB + serves API/frontend). Two separate services.
**Rationale:** Node.js single-thread + V8 heap limitations cause fragility at scale (185MB DB → 2.7GB heap → OOM). Go eliminates heap cap problem and enables real concurrency.
### 2026-03-27T06:31 — NO PII in Git
**By:** User (via Copilot)
**Decision:** NEVER write real names, usernames, email addresses, or any PII to files committed to git. Use "User" for attribution and "deploy" for SSH/server references. This is a PUBLIC repo.
**Rationale:** PII was leaked to the public repo and required a full git history rewrite to remove.
### 2026-03-27T02:19 — Production/Infrastructure Touches: Hudson Only
**By:** User (via Copilot)
**Decision:** Production/infrastructure touches (SSH, DB ops, server restarts, Azure operations) should only be done by Hudson (DevOps). No other agents should touch prod directly.
**Rationale:** Separation of concerns — dev agents write code, DevOps deploys and manages prod.
1. No Docker named volumes — always bind mount from `~/meshcore-data` (host location, easy to access)
2. Staging container runs on plaintext port (e.g., port 81, no HTTPS)
3. Use Docker Compose to orchestrate prod + staging containers on the same VM
4. `manage.sh` supports launching prod only OR prod+staging with clear messaging
5. Ports must be configurable via `manage.sh` or environment, with sane defaults
### 2026-03-27T03:43 — Staging Refinements: Shared Data
**By:** User (via Copilot)
**Decision:**
1. Staging copies prod DB on launch (snapshot into staging data dir when started)
2. Staging connects to SAME MQTT broker as prod (not its own Mosquitto)
**Rationale:** Staging needs real data (prod-like conditions) to be useful for testing.
### 2026-03-27T17:13 — Scribe Auto-Run After Agent Batches
**By:** User (via Copilot)
**Decision:** Scribe must run after EVERY batch of agent work automatically. No manual triggers. No reminders needed. This is a process guarantee, not a suggestion.
**Rationale:** Coordinator has been forgetting to spawn Scribe after agent batches complete. This is a process failure. Scribe auto-spawn ends the forgetfulness.
---
## Decision: Technical Fixes
### Issue #126 — Skip Ambiguous Hop Prefixes
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
When resolving hop prefixes to full node pubkeys, require a **unique match**. If prefix matches 2+ nodes in DB, skip it and cache in `ambiguousHopPrefixes` (negative cache). Prevents hash prefix collisions (e.g., `1CC4` vs `1C82` sharing prefix `1C` under 1-byte hash_size) from attributing packets to wrong nodes.
**Impact:**
- Hop prefixes that collide won't update `lastPathSeenMap` for any node (conservative, correct)
- `disambiguateHops()` still does geometric disambiguation for route visualization
- `autoLearnHopNodes()` no longer calls `db.upsertNode()` for unresolved hops
- Added `db.removePhantomNodes()` — deletes nodes where `LENGTH(public_key) <= 16` (real keys are 64 hex chars)
- Called at startup to purge existing phantoms from prior behavior
- Hop-resolver still handles unresolved prefixes gracefully
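The unique-match rule with a negative cache can be sketched roughly as follows. This is an illustration only: `resolveHopPrefix` and the sample keys are hypothetical, and the real lookup runs against the DB rather than an in-memory array. Only `ambiguousHopPrefixes` is a name taken from the notes above.

```javascript
// Negative cache of prefixes known to match more than one node.
const ambiguousHopPrefixes = new Set();

// Hypothetical helper: returns the full pubkey only when exactly one node matches.
function resolveHopPrefix(prefix, knownPubkeys) {
  const key = prefix.toUpperCase();
  if (ambiguousHopPrefixes.has(key)) return null; // known collision, skip early
  const matches = knownPubkeys.filter((pk) => pk.toUpperCase().startsWith(key));
  if (matches.length === 1) return matches[0]; // unique match: safe to attribute
  if (matches.length > 1) ambiguousHopPrefixes.add(key); // remember the collision
  return null; // zero or several matches: attribute to no node
}

// Shortened sample keys (real keys are 64 hex chars).
const pubkeys = ['1CC4AA', '1C82BB', '7F01CC'];
const colliding = resolveHopPrefix('1C', pubkeys); // two candidates, so skipped
const unique = resolveHopPrefix('7F', pubkeys);    // single candidate
```

Returning `null` for both "no match" and "ambiguous" keeps callers conservative: they update nothing unless the attribution is certain.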
**Part 2: totalNodes now 7-day active window**
- `/api/stats` `totalNodes` returns only nodes seen in last 7 days (was all-time)
- New field `totalNodesAllTime` for historical tracking
- Role counts (repeaters, rooms, companions, sensors) also filtered to 7-day window
- Frontend: no changes needed (same field name, smaller correct number)
**Impact:** Frontend `totalNodes` now reflects active mesh size. Go server should apply same 7-day filter when querying.
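A minimal sketch of the 7-day window, assuming each node record carries a millisecond timestamp. The `lastSeenMs` field and the `countNodes` helper are hypothetical; the actual implementation presumably filters in SQL.

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: derives both stats fields from one node list.
function countNodes(nodes, nowMs) {
  const cutoff = nowMs - SEVEN_DAYS_MS;
  const active = nodes.filter((n) => n.lastSeenMs >= cutoff);
  return {
    totalNodes: active.length,       // active mesh size (7-day window)
    totalNodesAllTime: nodes.length, // historical count
  };
}

const now = Date.UTC(2026, 2, 27); // fixed "now" so the example is deterministic
const sampleNodes = [
  { name: 'recent', lastSeenMs: now - 1 * 24 * 60 * 60 * 1000 },
  { name: 'stale', lastSeenMs: now - 30 * 24 * 60 * 60 * 1000 },
];
const stats = countNodes(sampleNodes, now);
```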
---
### Issue #123 — Channel Hash on Undecrypted Messages
**By:** Hicks
**Status:** Implemented
Fixed test coverage for decrypted status tracking on channel messages.
---
### Issue #130 — Live Map: Dim Stale Nodes, Don't Remove
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
`pruneStaleNodes()` in `live.js` now distinguishes API-loaded nodes (`_fromAPI`) from WS-only dynamic nodes. API nodes dimmed (reduced opacity) when stale instead of removed. WS-only nodes still pruned to prevent memory leaks.
**Rationale:** Static map shows stale nodes with faded markers; live map was deleting them, causing user-reported disappearing nodes. Parity expected.
**Pattern:** Database-loaded nodes never removed from map during session. Future live map features should respect `_fromAPI` flag.
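The dim-versus-remove split might look like the sketch below. It is not the code in `live.js`: the marker shape, the `lastSeenMs` field, and the 0.35 opacity value are assumptions for illustration. Only the `_fromAPI` flag comes from the notes.

```javascript
// Sketch: markers is a Map of nodeId -> { _fromAPI, lastSeenMs, opacity }.
function pruneStaleNodes(markers, nowMs, staleAfterMs) {
  for (const [id, marker] of markers) {
    const isStale = nowMs - marker.lastSeenMs > staleAfterMs;
    if (!isStale) {
      marker.opacity = 1; // fresh again: restore full opacity
    } else if (marker._fromAPI) {
      marker.opacity = 0.35; // database-loaded node: dim, never remove
    } else {
      markers.delete(id); // WS-only node: drop it to bound memory
    }
  }
}

const markers = new Map([
  ['api-node', { _fromAPI: true, lastSeenMs: 0, opacity: 1 }],
  ['ws-node', { _fromAPI: false, lastSeenMs: 0, opacity: 1 }],
  ['fresh-node', { _fromAPI: false, lastSeenMs: 9000, opacity: 1 }],
]);
pruneStaleNodes(markers, 10000, 5000);
```

Deleting from a `Map` while iterating over it is well defined in JavaScript, so no separate collection pass is needed.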
---
### Issue #131 — Nodes Tab Auto-Update via WebSocket
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
WS-driven page updates must reset local caches: (1) set local cache to null, (2) call `invalidateApiCache()`, (3) re-fetch. New `loadNodes(refreshOnly)` pattern skips full DOM rebuild, only updates data rows. Preserves scroll, selection, listeners.
**Trap:** Two-layer caching (local variable + API cache) prevents re-fetches. All three reset steps required.
**Pattern:** Other pages doing WS-driven updates should follow same approach.
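The three reset steps can be illustrated with two stand-in cache layers. Everything here is a sketch: `cachedGet`, `onNodeUpdate`, the `/api/nodes` URL, and the body of `invalidateApiCache()` are assumptions, and the fetch is synchronous only to keep the example short.

```javascript
const apiCache = new Map(); // layer 2: shared API response cache (stand-in)
let nodesCache = null;      // layer 1: page-local variable

function invalidateApiCache() {
  apiCache.clear();
}

// Synchronous for brevity; a real fetch would be async.
function cachedGet(url, fetchJson) {
  if (!apiCache.has(url)) apiCache.set(url, fetchJson(url));
  return apiCache.get(url);
}

function loadNodes(fetchJson) {
  if (nodesCache === null) nodesCache = cachedGet('/api/nodes', fetchJson);
  return nodesCache;
}

// WS-driven update: skipping any one step leaves a stale layer in place.
function onNodeUpdate(fetchJson) {
  nodesCache = null;           // (1) drop the local copy
  invalidateApiCache();        // (2) drop the API-layer copy
  return loadNodes(fetchJson); // (3) re-fetch
}

let version = 0;
const fakeFetch = () => ({ version: ++version }); // each real fetch bumps the version
const first = loadNodes(fakeFetch);
const cached = loadNodes(fakeFetch);
const refreshed = onNodeUpdate(fakeFetch);
```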
---
### Issue #129 — Observer Comparison Page
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Added `comparePacketSets(hashesA, hashesB)` as standalone pure function exposed on `window` for testability. Computes `{ onlyA, onlyB, both }` via Set operations (O(n)).
**Pattern:** Comparison logic decoupled from UI, reusable. Client-side diff avoids new server endpoint. 24-hour window keeps data size reasonable (~10K packets max).
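A sketch of how such a pure function could be written with `Set` lookups. The signature and the `{ onlyA, onlyB, both }` keys follow the description above; returning plain arrays is an assumption, and the shipped function may differ.

```javascript
// Pure function: no DOM or network access, so it is easy to unit test.
function comparePacketSets(hashesA, hashesB) {
  const setA = new Set(hashesA);
  const setB = new Set(hashesB);
  const onlyA = [];
  const onlyB = [];
  const both = [];
  for (const hash of setA) (setB.has(hash) ? both : onlyA).push(hash);
  for (const hash of setB) if (!setA.has(hash)) onlyB.push(hash);
  return { onlyA, onlyB, both };
}

const diff = comparePacketSets(['a1', 'b2', 'c3'], ['b2', 'c3', 'd4']);
```

Each hash is checked with a constant-time `Set.has`, which is what keeps the whole comparison linear in the number of packets.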
---
### Issue #132 — Detail Pane Collapse
**By:** Newt (Frontend)
**Date:** 2026-03-27
**Status:** Implemented
Detail pane collapse uses CSS class on parent container. Add `detail-collapsed` class to `.split-layout`, which sets `.panel-right` to `display: none`. `.panel-left` with `flex: 1` fills 100% width naturally.
**Pattern:** CSS class toggling on parent cleaner than inline styles, easier to animate, keeps layout logic in CSS.
---
## Decision: Infrastructure & Deployment
### Database Merge — Prod + Staging
**By:** Kobayashi (Lead) / Hudson (DevOps)
**Date:** 2026-03-27
**Status:** ✅ Complete
Merged staging DB (185MB, 50K transmissions + 1.2M observations) into prod DB (21MB). Dedup strategy:
- **Transmissions:** `INSERT OR IGNORE` on `hash` (unique key)
- **Observations:** All unique by observer, all preserved
- **Nodes/Observers:** Latest `last_seen` wins, sum counts
- Backups: Retained at `/home/deploy/backups/pre-merge-20260327-071425/` until 2026-04-03
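The "latest `last_seen` wins, sum counts" rule for nodes and observers can be sketched in memory as below. The actual merge was done in SQLite; `mergeNodeRows` and the `advert_count` column are hypothetical names used only to show the rule.

```javascript
// Sketch: merge two row sets keyed by public_key.
function mergeNodeRows(prodRows, stagingRows) {
  const merged = new Map();
  for (const row of [...prodRows, ...stagingRows]) {
    const prev = merged.get(row.public_key);
    if (!prev) {
      merged.set(row.public_key, { ...row });
      continue;
    }
    // ISO-8601 UTC timestamps in the same format compare correctly as strings.
    const newer = row.last_seen > prev.last_seen ? row : prev;
    merged.set(row.public_key, {
      ...newer,
      advert_count: prev.advert_count + row.advert_count, // counts are summed
    });
  }
  return [...merged.values()];
}

const prod = [{ public_key: 'aa', name: 'old', last_seen: '2026-03-20T00:00:00Z', advert_count: 2 }];
const staging = [
  { public_key: 'aa', name: 'new', last_seen: '2026-03-26T00:00:00Z', advert_count: 5 },
  { public_key: 'bb', name: 'other', last_seen: '2026-03-25T00:00:00Z', advert_count: 1 },
];
const mergedRows = mergeNodeRows(prod, staging);
```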
---
### Unified Docker Volume Paths
**By:** Hudson (DevOps)
**Date:** 2026-03-27
**Status:** Applied
Reconciled `manage.sh` and `docker-compose.yml` Docker volume names:
- Caddy volume: `caddy-data` everywhere (prod); `caddy-data-staging` for staging
- Data directory: Bind mount via `PROD_DATA_DIR` env var, default `~/meshcore-data`
- Config/Caddyfile: Mounted from repo checkout for prod, staging data dir for staging
- Removed deprecated `version` key from docker-compose.yml
**Consequence:** `./manage.sh start` and `docker compose up prod` now produce identical mounts. For anyone with data in the old `caddy-data-prod` volume, Caddy will re-provision TLS certs automatically.
Standalone Go web server replacing Node.js server's READ side (REST API + WebSocket). Two-component rewrite: ingestor (MQTT writes), server (REST/WS reads).
**Architecture Decisions:**
1. **Direct SQLite queries** — No in-memory packet store; all reads via `packets_v` view (v3 schema)
2. **Per-module go.mod** — Each `cmd/*` directory has own `go.mod`
3. **gorilla/mux for routing** — Handles 35+ parameterized routes cleanly
4. **SQLite polling for WebSocket** — Polls for new transmission IDs every 1s (decouples from MQTT)
**Future Work:** Full analytics via SQL, TTL response cache, shared `internal/db/` package, TLS, region-aware filtering.
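The polling decision (item 4) boils down to remembering the highest transmission ID already broadcast. The sketch below shows that idea in JavaScript rather than Go, with hypothetical names (`createPoller`, `queryNewerThan`, `broadcast`) and an in-memory array standing in for the SQLite table.

```javascript
// Sketch: returns a pollOnce() function that pushes only unseen rows.
function createPoller(queryNewerThan, broadcast) {
  let lastId = 0; // highest transmission id already sent to clients
  return function pollOnce() {
    const rows = queryNewerThan(lastId); // rows with id > lastId
    for (const row of rows) {
      broadcast(row);
      lastId = Math.max(lastId, row.id);
    }
    return rows.length;
  };
}

const table = [{ id: 1 }, { id: 2 }]; // stand-in for the transmissions table
const sent = [];
const pollOnce = createPoller(
  (lastId) => table.filter((row) => row.id > lastId),
  (row) => sent.push(row.id),
);
const firstBatch = pollOnce();  // picks up ids 1 and 2
const secondBatch = pollOnce(); // nothing new
table.push({ id: 3 });
const thirdBatch = pollOnce();  // picks up id 3 only
// A server would call pollOnce on a timer, e.g. setInterval(pollOnce, 1000).
```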
---
### Go API Parity: Transmission-Centric Queries
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented, all 42+ tests pass
Go server rewrote packet list queries from VIEW-based (slow, wrong shape) to **transmission-centric** with correlated subqueries. Schema version detection (`isV3` flag) handles both v2 and v3 schemas.
- `/api/packets` — Multi-node filter support (`nodes` query param, comma-separated pubkeys)
---
### Go In-Memory Packet Store (cmd/server/store.go)
**By:** Hicks (Backend Dev)
**Date:** 2026-03-26
**Status:** Implemented
Port of `packet-store.js` with streaming load, 5 indexes, lean observation structs (only observation-specific fields). `QueryPackets` handles type, route, observer, hash, since, until, region, node. `IngestNewFromDB()` streams new transmissions from DB into memory.
- Startup: One-time load adds few seconds (acceptable)
- DB still used for: analytics, node/observer queries, role counts, region resolution
---
### Observation RAM Optimization
**By:** Hicks (Backend Dev)
**Date:** 2026-03-27
**Status:** Implemented
Observation objects in in-memory packet store now store only `transmission_id` reference instead of copying `hash`, `raw_hex`, `decoded_json`, `payload_type`, `route_type` from parent. API boundary methods (`getById`, `getSiblings`, `enrichObservations`) hydrate on demand. Load uses `.iterate()` instead of `.all()` to avoid materializing full JOIN.
**Impact:** Eliminates ~1.17M redundant string copies, avoids 1.17M-row array during startup. 2.7GB RAM → acceptable levels with 185MB database.
**Code Pattern:** Any code reading observation objects from `tx.observations` directly must use `pktStore.enrichObservations()` if it needs transmission fields. Internal iteration over observations for observer_id, snr, rssi, path_json works unchanged.
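Hydration on demand can be pictured as a lookup through `transmission_id`. This is a sketch under assumptions: the `transmissions` map and the standalone `enrichObservations` function are illustrative and not the real `pktStore` method.

```javascript
// Stand-in for the store's index of parent transmissions, keyed by id.
const transmissions = new Map([
  [7, { hash: 'abc123', raw_hex: '0a0b', decoded_json: '{}', payload_type: 4, route_type: 1 }],
]);

// Lean observations hold only their own fields plus a transmission_id reference.
function enrichObservations(observations) {
  return observations.map((obs) => {
    const tx = transmissions.get(obs.transmission_id);
    if (!tx) return { ...obs }; // parent missing: return the lean copy unchanged
    return {
      ...obs,
      hash: tx.hash,
      raw_hex: tx.raw_hex,
      decoded_json: tx.decoded_json,
      payload_type: tx.payload_type,
      route_type: tx.route_type,
    };
  });
}

const lean = [{ transmission_id: 7, observer_id: 'obs-1', snr: 9.5, rssi: -80 }];
const hydrated = enrichObservations(lean);
```

The stored objects stay small because the parent fields are copied only into the short-lived result returned at the API boundary.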
**Status:** Proposed — awaiting user sign-off before implementation
Playwright E2E tests (16 tests in `test-e2e-playwright.js`) are slow in CI. Analysis identified ~40-50% potential runtime reduction.
### Recommendations (prioritized)
#### HIGH impact (30%+ improvement)
1. **Replace `waitUntil: 'networkidle'` with `'domcontentloaded'` + targeted waits** — used ~20 times; `networkidle` worst-case for SPAs with persistent WebSocket + Leaflet tile loading. Each navigation pays 500ms+ penalty.
2. **Eliminate redundant navigations** — group tests by route; navigate once, run all assertions for that route.
3. **Cache Playwright browser install in CI** — `npx playwright install chromium --with-deps` runs every frontend push. Self-hosted runner should retain browser between runs.
#### MEDIUM impact (10-30%)
4. **Replace hardcoded `waitForTimeout` with event-driven waits** — ~17s scattered. Replace with `waitForSelector`, `waitForFunction`, or `page.waitForResponse`.
5. **Merge coverage collection into E2E run** — `collect-frontend-coverage.js` launches second browser. Extract `window.__coverage__` at E2E end instead.
6. **Replace `sleep 5` server startup with health-check polling** — Start tests as soon as `/api/stats` responsive (~1-2s savings).
#### LOW impact (<10%)
7. **Block unnecessary resources for non-visual tests** — use `page.route()` to abort map tiles, fonts.
8. **Reduce default timeout 15s → 10s** — sufficient for local CI.
### Implementation notes
- Items 1-2 are test-file-only (Bishop/Newt scope)
- Items 3, 5-6 are CI pipeline (Hicks scope)
- No architectural changes; all incremental
- All assertions remain identical — only wait strategies change
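Recommendation 6 (health-check polling instead of `sleep 5`) could be sketched as a small generic helper. `waitUntilReady` and its option names are hypothetical, and the probe is injected so the helper does not assume a particular port or HTTP client.

```javascript
// Sketch: call probe() until it returns true, or fail once the timeout is exceeded.
async function waitUntilReady(probe, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (true) {
    attempts += 1;
    if (await probe()) return attempts; // ready: report how many probes it took
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`server not ready after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Example probe against the stats endpoint (base URL is an assumption):
//   const probe = () => fetch('http://localhost:3000/api/stats').then((r) => r.ok, () => false);
//   await waitUntilReady(probe);

let probeCalls = 0;
const readyAfterThree = waitUntilReady(() => ++probeCalls >= 3, { intervalMs: 1 });
```

Tests start as soon as the probe succeeds, so the fixed worst-case wait becomes an upper bound rather than the normal cost.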
---
### 2026-03-27T20:56:00Z — Protobuf API Contract (Merged)
**By:** Kpa-clawbot (via Copilot)
**Decision:**
1. All frontend/backend interfaces get protobuf definitions as single source of truth
2. Go generates structs with JSON tags from protos; Node stays unchanged — protos derived from Node's current JSON shapes
3. Proto definitions MUST use inheritance and composition (no repeating field definitions)
4. Data flow: SQLite → proto struct → JSON; JSON blobs from DB deserialize against proto structs for validation
5. CI pipeline's proto fixture capture runs against prod (stable reference), not staging
**Rationale:** Eliminates parity bugs between Node and Go. Compiler-enforced contract. Prod is known-good baseline.
- `kobayashi-2026-03-27.md` (27 lines) — Root cause analysis, DB merge plan, coordination
- `hudson-2026-03-27.md` (117 lines) — DB merge execution, Docker Compose migration, staging setup, CI pipeline
- `ripley-2026-03-27.md` (30 lines) — Support onboarding, health threshold documentation
**Entry Total:** 448 lines of orchestration logs covering 28 issues, 2 Go services, database merge, staging deployment, CI pipeline updates, 42 E2E tests, 19 backend fixes
---
## Decisions.md Review
Current decisions.md (342 lines) contains authoritative log of all technical + infrastructure + deployment decisions made during #151-160 session. No archival needed (well under 20KB threshold). Organized by:
1. User Directives (process decisions)
2. Technical Fixes (bug fixes with rationale)
3. Infrastructure & Deployment (ops decisions)
4. Go Rewrite — API & Storage (architecture decisions)
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
## Rules
1. **Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2. **Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3. **Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4. **When two agents could handle it**, pick the one whose domain is the primary concern.
5. **"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6. **Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7. **Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
**Description:** Self-hosted alternative to analyzer.letsmesh.net. Ingests MeshCore mesh network packets via MQTT, decodes with custom parser (decoder.js), stores in SQLite with in-memory indexing (packet-store.js), and serves a rich SPA with live visualization, packet analysis, node analytics, channel chat, observer health, and theme customizer. ~18K lines, 14 test files, 85%+ backend coverage. Production at v2.6.0.
# {Name} — {Role}
> {One-line personality statement — what makes this person tick}
## Identity
- **Name:** {Name}
- **Role:** {Role title}
- **Expertise:** {2-3 specific skills relevant to the project}
- **Style:** {How they communicate — direct? thorough? opinionated?}
## What I Own
- {Area of responsibility 1}
- {Area of responsibility 2}
- {Area of responsibility 3}
## How I Work
- {Key approach or principle 1}
- {Key approach or principle 2}
- {Pattern or convention I follow}
## Boundaries
**I handle:** {types of work this agent does}
**I don't handle:** {types of work that belong to other team members}
**When I'm unsure:** I say so and suggest who might know.
**If I review others' work:** On rejection, I may require a different agent to revise (not the original author) or request a new specialist be spawned. The Coordinator enforces this.
## Model
- **Preferred:** auto
- **Rationale:** Coordinator selects the best model based on task type — cost first unless writing code
- **Fallback:** Standard chain — the coordinator handles fallback automatically
## Collaboration
Before starting work, run `git rev-parse --show-toplevel` to find the repo root, or use the `TEAM ROOT` provided in the spawn prompt. All `.squad/` paths must be resolved relative to this root — do not assume CWD is the repo root (you may be in a worktree or subdirectory).
Before starting work, read `.squad/decisions.md` for team decisions that affect me.
After making a decision others should know, write it to `.squad/decisions/inbox/{my-name}-{brief-slug}.md` — the Scribe will merge it.
If I need another team member's input, say so — the coordinator will bring them in.
## Voice
{1-2 sentences describing personality. Not generic — specific. This agent has OPINIONS.
They have preferences. They push back. They have a style that's distinctly theirs.
Example: "Opinionated about test coverage. Will push back if tests are skipped.
Prefers integration tests over mocks. Thinks 80% coverage is the floor, not the ceiling."}
# Constraint Budget Tracking
When the user or system imposes constraints (question limits, revision limits, time budgets), maintain a visible counter in your responses and in the artifact.
## Format
```
📊 Clarifying questions used: 2 / 3
```
## Rules
- Update the counter each time the constraint is consumed
- When a constraint is exhausted, state it: `📊 Question budget exhausted (3/3). Proceeding with current information.`
- If no constraints are active, do not display counters
- Include the final constraint status in multi-agent artifacts
## Example Session
```
Coordinator: Spawning agents to analyze requirements...
📊 Clarifying questions used: 0 / 3
Agent asks clarification: "Should we support OAuth?"
Coordinator: Checking with user...
📊 Clarifying questions used: 1 / 3
Agent asks clarification: "What's the rate limit?"
Coordinator: Checking with user...
📊 Clarifying questions used: 2 / 3
Agent asks clarification: "Do we need RBAC?"
Coordinator: Checking with user...
📊 Clarifying questions used: 3 / 3
Agent asks clarification: "Should we cache responses?"
Coordinator: 📊 Question budget exhausted (3/3). Proceeding without clarification.
```
# Cooperative Rate Limiting for Multi-Agent Deployments
> Coordinate API quota across multiple Ralph instances to prevent cascading failures.
## Problem
The [circuit breaker template](ralph-circuit-breaker.md) handles single-instance rate limiting well. But when multiple Ralphs run across machines (or pods on K8s), each instance independently hits API limits:
- **No coordination** — 5 Ralphs each think they have full API quota
- **Thundering herd** — All Ralphs retry simultaneously after rate limit resets
- **Priority inversion** — Low-priority work exhausts quota before critical work runs
- **Reactive only** — Circuit opens AFTER 429, wasting the failed request
## Solution: 6-Pattern Architecture
These patterns layer on top of the existing circuit breaker. Each is independent — adopt one or all.
4. **Kubernetes:** Add KEDA scaler for automatic pod scaling
## References
- [Circuit Breaker Template](ralph-circuit-breaker.md) — Foundation patterns
- [Squad on AKS](https://github.com/tamirdresher/squad-on-aks) — Production K8s deployment
- [KEDA Copilot Scaler](https://github.com/tamirdresher/keda-copilot-scaler) — Custom KEDA external scaler
# Copilot Coding Agent — Squad Instructions
You are working on a project that uses **Squad**, an AI team framework. When picking up issues autonomously, follow these guidelines.
## Team Context
Before starting work on any issue:
1. Read `.squad/team.md` for the team roster, member roles, and your capability profile.
2. Read `.squad/routing.md` for work routing rules.
3. If the issue has a `squad:{member}` label, read that member's charter at `.squad/agents/{member}/charter.md` to understand their domain expertise and coding style — work in their voice.
## Capability Self-Check
Before starting work, check your capability profile in `.squad/team.md` under the **Coding Agent → Capabilities** section.
- **🟢 Good fit** — proceed autonomously.
- **🟡 Needs review** — proceed, but note in the PR description that a squad member should review.
- **🔴 Not suitable** — do NOT start work. Instead, comment on the issue:
```
🤖 This issue doesn't match my capability profile (reason: {why}). Suggesting reassignment to a squad member.
```
## Branch Naming
Use the squad branch convention:
```
squad/{issue-number}-{kebab-case-slug}
```
Example: `squad/42-fix-login-validation`
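As an illustration of the convention, a branch name can be derived from an issue number and title like this. `squadBranchName` is a hypothetical helper, not part of Squad, and the slug rules (lowercase, non-alphanumerics collapsed to single hyphens) are an assumption about what "kebab-case" should mean here.

```javascript
// Hypothetical helper: builds squad/{issue-number}-{kebab-case-slug}.
function squadBranchName(issueNumber, title) {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into one hyphen
    .replace(/^-+|-+$/g, '');    // trim leading and trailing hyphens
  return `squad/${issueNumber}-${slug}`;
}

const branch = squadBranchName(42, 'Fix login validation');
```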
## PR Guidelines
When opening a PR:
- Reference the issue: `Closes #{issue-number}`
- If the issue had a `squad:{member}` label, mention the member: `Working as {member} ({role})`
- If this is a 🟡 needs-review task, add to the PR description: `⚠️ This task was flagged as "needs review" — please have a squad member review before merging.`
- Follow any project conventions in `.squad/decisions.md`
## Decisions
If you make a decision that affects other team members, write it to:
```
.squad/decisions/inbox/copilot-{brief-slug}.md
```
The Scribe will merge it into the shared decisions file.
# KEDA External Scaler for GitHub Issue-Driven Agent Autoscaling
> Scale agent pods to zero when idle, up when work arrives — driven by GitHub Issues.
## Overview
When running Squad on Kubernetes, agent pods sit idle when no work exists. [KEDA](https://keda.sh) (Kubernetes Event-Driven Autoscaler) solves this for queue-based workloads, but GitHub Issues isn't a native KEDA trigger.
The `keda-copilot-scaler` is a KEDA External Scaler (gRPC) that bridges this gap:
1. Polls GitHub API for issues matching specific labels (e.g., `squad:copilot`)
> Enable Ralph to skip issues requiring capabilities the current machine lacks.
## Overview
When running Squad across multiple machines (laptops, DevBoxes, GPU servers, Kubernetes nodes), each machine has different tooling. The capability system lets you declare what each machine can do, and Ralph automatically routes work accordingly.
## Setup
### 1. Create a Capabilities Manifest
Create `~/.squad/machine-capabilities.json` (user-wide) or `.squad/machine-capabilities.json` (project-local):
1. Ralph loads `machine-capabilities.json` at startup
2. For each open issue, Ralph extracts `needs:*` labels
3. If any required capability is missing, the issue is skipped
4. Issues without `needs:*` labels are always processed (opt-in system)
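The matching steps above can be sketched as a set comparison. This is an illustration only: it assumes the manifest has already been reduced to a set of capability names, and the function names and the `docker` capability are hypothetical.

```python
def required_capabilities(labels):
    """Return the capability names carried by `needs:*` labels."""
    prefix = "needs:"
    return {label[len(prefix):] for label in labels if label.startswith(prefix)}


def should_process(labels, machine_capabilities):
    """True when every required capability is present on this machine.

    Issues with no `needs:*` labels require nothing, so they always pass.
    """
    return required_capabilities(labels) <= set(machine_capabilities)


machine = {"gpu", "docker"}  # hypothetical capabilities for this machine
print(should_process(["bug", "needs:gpu"], machine))            # True
print(should_process(["needs:gpu", "needs:browser"], machine))  # False
print(should_process(["bug"], machine))                         # True
```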
## Kubernetes Integration
On Kubernetes, machine capabilities map to node labels:
```yaml
# Node labels (set by capability DaemonSet or manually)
node.squad.dev/gpu: "true"
node.squad.dev/browser: "true"

# Pod spec uses nodeSelector
spec:
  nodeSelector:
    node.squad.dev/gpu: "true"
```
A DaemonSet can run capability discovery on each node and maintain labels automatically. See the [squad-on-aks](https://github.com/tamirdresher/squad-on-aks) project for a complete Kubernetes deployment example.
# MCP Integration — Configuration and Samples
MCP (Model Context Protocol) servers extend Squad with tools for external services — Trello, Aspire dashboards, Azure, Notion, and more. The user configures MCP servers in their environment; Squad discovers and uses them.
> **Full patterns:** Read `.squad/skills/mcp-tool-discovery/SKILL.md` for discovery patterns, domain-specific usage, and graceful degradation.
## Config File Locations
Users configure MCP servers at these locations (checked in priority order):
1. **Repository-level:** `.copilot/mcp-config.json` (team-shared, committed to repo)
- **GitHub MCP requires a separate token** from the `gh` CLI auth. Generate at https://github.com/settings/tokens
- **Trello requires API key + token** from https://trello.com/power-ups/admin
- **Azure requires service principal credentials** — see Azure docs for setup
- **Aspire uses the dashboard URL** — typically `http://localhost:18888` during local dev
Auth is a real blocker for some MCP servers. Users need separate tokens for GitHub MCP, Azure MCP, Trello MCP, etc. This is a documentation problem, not a code problem.
# Multi-Agent Artifact Format
When multiple agents contribute to a final artifact (document, analysis, design), use this format. The assembled result must include:
- Termination condition
- Constraint budgets (if active)
- Reviewer verdicts (if any)
- Raw agent outputs appendix
## Assembly Structure
The assembled result goes at the top. Below it, include:
```
## APPENDIX: RAW AGENT OUTPUTS
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
### {Name} ({Role}) — Raw Output
{Paste agent's verbatim response here, unedited}
```
## Appendix Rules
This appendix is for diagnostic integrity. Do not edit, summarize, or polish the raw outputs. The Coordinator may not rewrite raw agent outputs; it may only paste them verbatim and assemble the final artifact above.
See `.squad/templates/run-output.md` for the complete output format template.
# Plugin Marketplace
Plugins are curated agent templates, skills, instructions, and prompts shared by the community via GitHub repositories (e.g., `github/awesome-copilot`, `anthropics/skills`). They provide ready-made expertise for common domains — cloud platforms, frameworks, testing strategies, etc.
## Marketplace State
Registered marketplace sources are stored in `.squad/plugins/marketplaces.json`:
```json
{
  "marketplaces": [
    {
      "name": "awesome-copilot",
      "source": "github/awesome-copilot",
      "added_at": "2026-02-14T00:00:00Z"
    }
  ]
}
```
## CLI Commands
Users manage marketplaces via the CLI:
- `squad plugin marketplace add {owner/repo}` — Register a GitHub repo as a marketplace source
- `squad plugin marketplace remove {name}` — Remove a registered marketplace
- `squad plugin marketplace list` — List registered marketplaces
- `squad plugin marketplace browse {name}` — List available plugins in a marketplace
## When to Browse
During the **Adding Team Members** flow, AFTER allocating a name but BEFORE generating the charter:
1. Read `.squad/plugins/marketplaces.json`. If the file doesn't exist or `marketplaces` is empty, skip silently.
2. For each registered marketplace, search for plugins whose name or description matches the new member's role or domain keywords.
3. Present matching plugins to the user: *"Found '{plugin-name}' in {marketplace} marketplace — want me to install it as a skill for {CastName}?"*
4. If the user accepts, install the plugin (see below). If they decline or skip, proceed without it.
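Steps 1 and 2 can be sketched as follows. This is a rough illustration: the function names are hypothetical, and it assumes each plugin listing is a dict with `name` and `description` keys, which the marketplace format shown here does not confirm. Fetching the listing from a marketplace repository is left out.

```python
import json
from pathlib import Path


def load_marketplaces(team_root):
    """Return registered marketplaces, or [] if the file is missing or empty."""
    path = Path(team_root) / ".squad" / "plugins" / "marketplaces.json"
    if not path.is_file():
        return []  # step 1: skip silently, no warning
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("marketplaces") or []


def matching_plugins(plugins, keywords):
    """Keep plugins whose name or description mentions any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    matches = []
    for plugin in plugins:
        haystack = (plugin.get("name", "") + " " + plugin.get("description", "")).lower()
        if any(k in haystack for k in wanted):
            matches.append(plugin)
    return matches
```

An empty result from `load_marketplaces` means the marketplace check is skipped entirely, which matches the "skip silently" rule in step 1.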
## How to Install a Plugin
1. Read the plugin content from the marketplace repository (the plugin's `SKILL.md` or equivalent).
2. Copy it into the agent's skills directory: `.squad/skills/{plugin-name}/SKILL.md`
3. If the plugin includes charter-level instructions (role boundaries, tool preferences), merge those into the agent's `charter.md`.
4. Log the installation in the agent's `history.md`: *"📦 Plugin '{plugin-name}' installed from {marketplace}."*
## Graceful Degradation
- **No marketplaces configured:** Skip the marketplace check entirely. No warning, no prompt.
- **Marketplace unreachable:** Warn the user (*"⚠ Couldn't reach {marketplace} — continuing without it"*) and proceed with team member creation normally.
- **No matching plugins:** Inform the user (*"No matching plugins found in configured marketplaces"*) and proceed.
| `squad:{name}` | Pick up issue and complete the work | Named member |
### How Issue Assignment Works
1. When a GitHub issue gets the `squad` label, the **Lead** triages it — analyzing content, assigning the right `squad:{member}` label, and commenting with triage notes.
2. When a `squad:{member}` label is applied, that member picks up the issue in their next session.
3. Members can reassign by removing their label and adding another member's label.
4. The `squad` label is the "inbox" — untriaged issues waiting for Lead review.
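The label routing described above could be expressed roughly like this. The function name, the `"lead"` return value, and the choice to take the first member label when several are present are all assumptions made for illustration.

```python
def route_issue(labels):
    """Decide who should act on an issue from its squad labels.

    - "squad:{member}" -> that member picks it up
    - bare "squad"     -> "lead" (untriaged inbox)
    - neither          -> None (not a squad issue)
    """
    members = [label.split(":", 1)[1] for label in labels if label.startswith("squad:")]
    if members:
        return members[0]  # assumption: first member label wins
    if "squad" in labels:
        return "lead"
    return None


print(route_issue(["bug", "squad:test-agent-1"]))  # test-agent-1
print(route_issue(["squad"]))                      # lead
```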
## Rules
1. **Eager by default** — spawn all agents who could usefully start work, including anticipatory downstream work.
2. **Scribe always runs** after substantial work, always as `mode: "background"`. Never blocks.
3. **Quick facts → coordinator answers directly.** Don't spawn an agent for "what port does the server run on?"
4. **When two agents could handle it**, pick the one whose domain is the primary concern.
5. **"Team, ..." → fan-out.** Spawn all relevant agents in parallel as `mode: "background"`.
6. **Anticipate downstream work.** If a feature is being built, spawn the tester to write test cases from requirements simultaneously.
7. **Issue-labeled work** — when a `squad:{member}` label is applied to an issue, route to that member. The Lead handles all `squad` (base label) triage.
- **Style:** Silent. Never speaks to the user. Works in the background.
- **Mode:** Always spawned as `mode: "background"`. Never blocks the conversation.
## What I Own
- `.squad/log/` — session logs (what happened, who worked, what was decided)
- `.squad/decisions.md` — the shared decision log all agents read (canonical, merged)
- `.squad/decisions/inbox/` — decision drop-box (agents write here, I merge)
- Cross-agent context propagation — when one agent's decision affects another
## How I Work
**Worktree awareness:** Use the `TEAM ROOT` provided in the spawn prompt to resolve all `.squad/` paths. If no TEAM ROOT is given, run `git rev-parse --show-toplevel` as fallback. Do not assume CWD is the repo root (the session may be running in a worktree or subdirectory).
After every substantial work session:
1. **Log the session** to `.squad/log/{timestamp}-{topic}.md`:
   - Who worked
   - What was done
   - Decisions made
   - Key outcomes
   - Brief. Facts only.
2. **Merge the decision inbox:**
   - Read all files in `.squad/decisions/inbox/`
   - APPEND each decision's contents to `.squad/decisions.md`
   - Delete each inbox file after merging
3. **Deduplicate and consolidate decisions.md:**
   - Parse the file into decision blocks (each block starts with `### `).
   - **Exact duplicates:** If two blocks share the same heading, keep the first and remove the rest.
   - **Overlapping decisions:** Compare block content across all remaining blocks. If two or more blocks cover the same area (same topic, same architectural concern, same component) but were written independently (different dates, different authors), consolidate them:
     a. Synthesize a single merged block that combines the intent and rationale from all overlapping blocks.
     b. Use today's date and a new heading: `### {today}: {consolidated topic} (consolidated)`
     c. Credit all original authors: `**By:** {Name1}, {Name2}`
     d. Under **What:**, combine the decisions. Note any differences or evolution.
     e. Under **Why:**, merge the rationale, preserving unique reasoning from each.
     f. Remove the original overlapping blocks.
   - Write the updated file back. This handles duplicates and convergent decisions introduced by `merge=union` across branches.
4. **Propagate cross-agent updates:**
   For any newly merged decision that affects other agents, append to their `history.md`:
   ```
   📌 Team update ({timestamp}): {summary} — decided by {Name}
   ```
5. **Commit `.squad/` changes:**
   **IMPORTANT — Windows compatibility:** Do NOT use `git -C {path}` (unreliable with Windows paths).
   Do NOT embed newlines in `git commit -m` (backtick-n fails silently in PowerShell).
   Instead:
   - `cd` into the team root first.
   - Stage all `.squad/` files: `git add .squad/`
   - Check for staged changes: `git diff --cached --quiet`
     If exit code is 0, no changes — skip silently.
   - Write the commit message to a temp file, then commit with `-F`:
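A minimal sketch of step 3 (exact duplicates only) and step 5, assuming Python with `git` on the PATH. The function names and the injectable `run` parameter are hypothetical. Consolidating overlapping decisions is not attempted here because it needs judgment rather than string matching.

```python
import os
import subprocess
import tempfile


def dedupe_exact(decisions_text):
    """Drop later blocks whose `### ` heading repeats an earlier one."""
    blocks, current = [], []
    for line in decisions_text.splitlines(keepends=True):
        if line.startswith("### ") and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)

    seen, kept = set(), []
    for block in blocks:
        heading = block[0].strip() if block[0].startswith("### ") else None
        if heading is not None and heading in seen:
            continue  # same heading already kept: remove this copy
        if heading is not None:
            seen.add(heading)
        kept.append("".join(block))
    return "".join(kept)


def commit_squad_changes(team_root, message, run=subprocess.run):
    """Stage .squad/ and commit through a message file if anything is staged.

    Returns True if a commit was made, False if there was nothing to commit.
    """
    # cwd= stands in for cd-ing into the team root, so `git -C` is not needed.
    run(["git", "add", ".squad/"], cwd=team_root, check=True)
    staged = run(["git", "diff", "--cached", "--quiet"], cwd=team_root)
    if staged.returncode == 0:
        return False  # exit code 0 means no staged changes
    fd, msg_path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(message)  # newlines live in the file, not in -m
        run(["git", "commit", "-F", msg_path], cwd=team_root, check=True)
    finally:
        os.remove(msg_path)
    return True
```

Passing the commit message through a file sidesteps the PowerShell newline problem, and the `run` parameter exists only so the sequence can be exercised without a real repository.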
---
name: "agent-collaboration"
description: "Standard collaboration patterns for all squad agents — worktree awareness, decisions, cross-agent communication"
domain: "team-workflow"
confidence: "high"
source: "extracted from charter boilerplate — identical content in 18+ agent charters"
---
## Context
Every agent on the team follows identical collaboration patterns for worktree awareness, decision recording, and cross-agent communication. These were previously duplicated in every charter's Collaboration section (~300 bytes × 18 agents = ~5.4KB of redundant context). Now centralized here.
The coordinator's spawn prompt already instructs agents to read decisions.md and their history.md. This skill adds the patterns for WRITING decisions and requesting help.
## Patterns
### Worktree Awareness
Use the `TEAM ROOT` path provided in your spawn prompt. All `.squad/` paths are relative to this root. If TEAM ROOT is not provided (rare), run `git rev-parse --show-toplevel` as fallback. Never assume CWD is the repo root.
### Decision Recording
After making a decision that affects other team members, write it to:
If you need another team member's input, say so in your response. The coordinator will bring them in. Don't try to do work outside your domain.
### Reviewer Protocol
If you have reviewer authority and reject work: the original author is locked out from revising that artifact. A different agent must own the revision. State who should revise in your rejection response.
## Anti-Patterns
- Don't read all agent charters — you only need your own context + decisions.md
- Don't write directly to `.squad/decisions.md` — always use the inbox drop-box
- Don't assume CWD is the repo root — always use TEAM ROOT
---
name: "agent-conduct"
description: "Shared hard rules enforced across all squad agents"
domain: "team-governance"
confidence: "high"
source: "reskill extraction — Product Isolation Rule and Peer Quality Check appeared in all 20 agent charters"
---
## Context
Every squad agent must follow these two hard rules. They were previously duplicated in every charter. Now they live here as a shared skill, loaded once.
## Patterns
### Product Isolation Rule (hard rule)
Tests, CI workflows, and product code must NEVER depend on specific agent names from any particular squad. "Our squad" must not impact "the squad." No hardcoded references to agent names (Flight, EECOM, FIDO, etc.) in test assertions, CI configs, or product logic. Use generic/parameterized values. If a test needs agent names, use obviously-fake test fixtures (e.g., "test-agent-1", "TestBot").
### Peer Quality Check (hard rule)
Before finishing work, verify your changes don't break existing tests. Run the test suite for files you touched. If CI has been failing, check your changes aren't contributing to the problem. When you learn from mistakes, update your history.md.
## Anti-Patterns
- Don't hardcode dev team agent names in product code or tests
- Don't skip test verification before declaring work done
- Don't ignore pre-existing CI failures that your changes may worsen
source: "extracted from Drucker and Trejo charters — earned knowledge from v0.8.22 release incident"
---
## Context
CI workflows must be defensive. These patterns were learned from the v0.8.22 release disaster where invalid semver, wrong token types, missing retry logic, and draft releases caused a multi-hour outage. Both Drucker (CI/CD) and Trejo (Release Manager) carried this knowledge in their charters — now centralized here.
## Patterns
### Semver Validation Gate
Every publish workflow MUST validate version format before `npm publish`. 4-part versions (e.g., 0.8.21.4) are NOT valid semver — npm mangles them.
```yaml
- name: Validate semver
  run: |
    VERSION="${{ github.event.release.tag_name }}"
    VERSION="${VERSION#v}"
    if ! npx semver "$VERSION" > /dev/null 2>&1; then
      echo "❌ Invalid semver: $VERSION"
      echo "Only 3-part versions (X.Y.Z) or prerelease (X.Y.Z-tag.N) are valid."
      exit 1
    fi
    echo "✅ Valid semver: $VERSION"
```
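To make the rule concrete, here is a rough Python stand-in for the same gate. It uses a simplified pattern rather than the `semver` package the workflow calls, so treat it as an illustration of what passes and fails, not as a replacement. The name `is_valid_semver` is hypothetical.

```python
import re

# Simplified semver pattern: X.Y.Z with optional -prerelease and +build parts.
# It does not enforce every rule of the spec (for example, leading zeros in
# numeric prerelease identifiers are not rejected).
_SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)


def is_valid_semver(tag):
    """Strip one leading 'v' (like ${VERSION#v}) and test the simplified pattern."""
    version = tag[1:] if tag.startswith("v") else tag
    return _SEMVER.fullmatch(version) is not None


print(is_valid_semver("v0.8.22"))          # True
print(is_valid_semver("0.8.21.4"))         # False: 4-part version
print(is_valid_semver("1.0.0-preview.1"))  # True
```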
### NPM Token Type Verification
NPM_TOKEN MUST be an Automation token, not a User token with 2FA:
- User tokens require OTP — CI can't provide it → EOTP error
- If using workflow_dispatch: verify release is published via GitHub API before proceeding
### Build Script Protection
Set `SKIP_BUILD_BUMP=1` (or `$env:SKIP_BUILD_BUMP = "1"` on Windows) before ANY release build. bump-build.mjs is for dev builds ONLY — it silently mutates versions.
## Known Failure Modes (v0.8.22 Incident)
| # | What Happened | Root Cause | Prevention |
|---|---------------|-----------|------------|
| 1 | 4-part version published, npm mangled it | No semver validation gate | `npx semver` check before every publish |
| 2 | CI failed 5+ times with EOTP | User token with 2FA | Automation token only |
| 3 | Verify returned false 404 | No retry logic for propagation | 5 attempts, 15s intervals |
| 4 | Workflow never triggered | Draft release doesn't emit event | Never create draft releases |
| 5 | Version mutated during release | bump-build.mjs ran in release | SKIP_BUILD_BUMP=1 |
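The retry prevention in row 3 (5 attempts, 15 s apart) can be sketched as a small loop. The `check` callable is hypothetical: it stands for whatever confirms the published version is visible, and `sleep` is injectable only so the loop can be tested without waiting.

```python
import time


def verify_with_retry(check, attempts=5, delay_seconds=15, sleep=time.sleep):
    """Call check() until it returns True, pausing between failed attempts."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # allow time for propagation before retrying
    return False
```

With the defaults, a package that becomes visible on the third attempt succeeds after two 15-second waits. A check that never succeeds gives up after four waits and returns `False`.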
## Anti-Patterns
- ❌ Publishing without semver validation gate
- ❌ Single-shot verification without retry
- ❌ Hard-coded secrets in workflows
- ❌ Silent CI failures — every error needs actionable output with remediation
description: "Platform detection and adaptive spawning for CLI vs VS Code vs other surfaces"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
Squad runs on multiple Copilot surfaces (CLI, VS Code, JetBrains, GitHub.com). The coordinator must detect its platform and adapt spawning behavior accordingly. Different tools are available on different platforms, requiring conditional logic for agent spawning, SQL usage, and response timing.
## Patterns
### Platform Detection
Before spawning agents, determine the platform by checking available tools:
1. **CLI mode** — `task` tool is available → full spawning control. Use `task` with `agent_type`, `mode`, `model`, `description`, `prompt` parameters. Collect results via `read_agent`.
2. **VS Code mode** — `runSubagent` or `agent` tool is available → conditional behavior. Use `runSubagent` with the task prompt. Drop `agent_type`, `mode`, and `model` parameters. Multiple subagents in one turn run concurrently (equivalent to background mode). Results return automatically — no `read_agent` needed.
3. **Fallback mode** — neither `task` nor `runSubagent`/`agent` available → work inline. Do not apologize or explain the limitation. Execute the task directly.
If both `task` and `runSubagent` are available, prefer `task` (richer parameter surface).
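The detection order above reduces to a short decision function. This is a sketch only: the function name and the returned mode strings are hypothetical labels, and the input is assumed to be the set of tool names visible to the coordinator.

```python
def detect_platform(available_tools):
    """Map the visible tool names to a spawning mode."""
    tools = set(available_tools)
    if "task" in tools:
        return "cli"  # preferred even when runSubagent is also present
    if "runSubagent" in tools or "agent" in tools:
        return "vscode"
    return "fallback"  # no spawning tool: do the work inline


print(detect_platform({"task", "runSubagent"}))  # cli
print(detect_platform({"agent"}))                # vscode
print(detect_platform(set()))                    # fallback
```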
### VS Code Spawn Adaptations
When in VS Code mode, the coordinator changes behavior in these ways:
- **Spawning tool:** Use `runSubagent` instead of `task`. The prompt is the only required parameter — pass the full agent prompt (charter, identity, task, hygiene, response order) exactly as you would on CLI.
- **Parallelism:** Spawn ALL concurrent agents in a SINGLE turn. They run in parallel automatically. This replaces `mode: "background"` + `read_agent` polling.
- **Model selection:** Accept the session model. Do NOT attempt per-spawn model selection or fallback chains — they only work on CLI. In Phase 1, all subagents use whatever model the user selected in VS Code's model picker.
- **Scribe:** Cannot fire-and-forget. Batch Scribe as the LAST subagent in any parallel group. Scribe is light work (file ops only), so the blocking is tolerable.
- **Launch table:** Skip it. Results arrive with the response, not separately. By the time the coordinator speaks, the work is already done.
- **`read_agent`:** Skip entirely. Results return automatically when subagents complete.
- **`agent_type`:** Drop it. All VS Code subagents have full tool access by default. Subagents inherit the parent's tools.
- **`description`:** Drop it. The agent name is already in the prompt.
- **Prompt content:** Keep ALL prompt structure — charter, identity, task, hygiene, response order blocks are surface-independent.
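As a rough sketch of these adaptations, the function below converts a group of CLI-style spawn requests into `runSubagent` arguments. The interfaces are assumptions made for illustration (the real tool schemas are not shown in this document), and the `"sync"` mode value and the `isScribe` callback are hypothetical.

```typescript
// Hypothetical shapes; the real `task` and `runSubagent` schemas may differ.
interface CliSpawn {
  agent_type: string;
  mode: "background" | "sync"; // "sync" is an assumed value
  model: string;
  description: string;
  prompt: string;
}

interface VsCodeSpawn {
  prompt: string; // the only parameter that carries over
}

// Keeps the full prompt, drops agent_type/mode/model/description,
// and moves Scribe to the end of the parallel group.
function toVsCodeSpawns(
  group: CliSpawn[],
  isScribe: (spawn: CliSpawn) => boolean,
): VsCodeSpawn[] {
  const workers = group.filter((spawn) => !isScribe(spawn));
  const scribes = group.filter(isScribe);
  return [...workers, ...scribes].map((spawn) => ({ prompt: spawn.prompt }));
}
```

All returned entries would then be issued in a single turn so they run concurrently.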
### Feature Degradation Table
| Feature | CLI | VS Code | Degradation |
|---------|-----|---------|-------------|
| Parallel fan-out | `mode: "background"` + `read_agent` | Multiple subagents in one turn | None — equivalent concurrency |
| Model selection | Per-spawn `model` param (4-layer hierarchy) | Session model only (Phase 1) | Accept session model, log intent |
| Scribe fire-and-forget | Background, never read | Sync, must wait | Batch with last parallel group |
| Launch table UX | Show table → results later | Skip table → results with response | UX only — results are correct |
| SQL tool | Available | Not available | Avoid SQL in cross-platform code paths |
| Response order bug | Critical workaround | Possibly necessary (unverified) | Keep the block — harmless if unnecessary |
### SQL Tool Caveat
The `sql` tool is **CLI-only**. It does not exist on VS Code, JetBrains, or GitHub.com. Any coordinator logic or agent workflow that depends on SQL (todo tracking, batch processing, session state) will silently fail on non-CLI surfaces. Cross-platform code paths must not depend on SQL. Use filesystem-based state (`.squad/` files) for anything that must work everywhere.
## Examples
**Example 1: CLI parallel spawn**
```typescript
// Coordinator detects task tool available → CLI mode
```
**Example 2: VS Code parallel spawn**
```typescript
// Coordinator detects runSubagent available → VS Code mode
runSubagent({ prompt: "...Fenster charter + task..." })
runSubagent({ prompt: "...Hockney charter + task..." })
runSubagent({ prompt: "...Scribe charter + task..." }) // Last in group
// Results return automatically, no read_agent
```
**Example 3: Fallback mode**
```typescript
// Neither task nor runSubagent available → work inline
// Coordinator executes the task directly without spawning
```
## Anti-Patterns
- ❌ Using SQL tool in cross-platform workflows (breaks on VS Code/JetBrains/GitHub.com)
- ❌ Attempting per-spawn model selection on VS Code (Phase 1 — only session model works)
- ❌ Fire-and-forget Scribe on VS Code (must batch as last subagent)
- ❌ Showing launch table on VS Code (results already inline)
- ❌ Apologizing or explaining platform limitations to the user
- ❌ Using `task` when only `runSubagent` is available
- ❌ Dropping prompt structure (charter/identity/task) on non-CLI platforms
---
name: "client-compatibility"
description: "Platform detection and adaptive spawning for CLI vs VS Code vs other surfaces"
domain: "orchestration"
confidence: "high"
source: "extracted"
---
## Context
Squad runs on multiple Copilot surfaces (CLI, VS Code, JetBrains, GitHub.com). The coordinator must detect its platform and adapt spawning behavior accordingly. Different tools are available on different platforms, requiring conditional logic for agent spawning, SQL usage, and response timing.
## Patterns
### Platform Detection
Before spawning agents, determine the platform by checking available tools:
1. **CLI mode** — `task` tool is available → full spawning control. Use `task` with `agent_type`, `mode`, `model`, `description`, `prompt` parameters. Collect results via `read_agent`.
2. **VS Code mode** — `runSubagent` or `agent` tool is available → conditional behavior. Use `runSubagent` with the task prompt. Drop `agent_type`, `mode`, and `model` parameters. Multiple subagents in one turn run concurrently (equivalent to background mode). Results return automatically — no `read_agent` needed.
3. **Fallback mode** — neither `task` nor `runSubagent`/`agent` available → work inline. Do not apologize or explain the limitation. Execute the task directly.
If both `task` and `runSubagent` are available, prefer `task` (richer parameter surface).
### VS Code Spawn Adaptations
When in VS Code mode, the coordinator changes behavior in these ways:
- **Spawning tool:** Use `runSubagent` instead of `task`. The prompt is the only required parameter — pass the full agent prompt (charter, identity, task, hygiene, response order) exactly as you would on CLI.
- **Parallelism:** Spawn ALL concurrent agents in a SINGLE turn. They run in parallel automatically. This replaces `mode: "background"` + `read_agent` polling.
- **Model selection:** Accept the session model. Do NOT attempt per-spawn model selection or fallback chains — they only work on CLI. In Phase 1, all subagents use whatever model the user selected in VS Code's model picker.
- **Scribe:** Cannot fire-and-forget. Batch Scribe as the LAST subagent in any parallel group. Scribe is light work (file ops only), so the blocking is tolerable.
- **Launch table:** Skip it. Results arrive with the response, not separately. By the time the coordinator speaks, the work is already done.
- **`read_agent`:** Skip entirely. Results return automatically when subagents complete.
- **`agent_type`:** Drop it. All VS Code subagents have full tool access by default. Subagents inherit the parent's tools.
- **`description`:** Drop it. The agent name is already in the prompt.
- **Prompt content:** Keep ALL prompt structure — charter, identity, task, hygiene, response order blocks are surface-independent.
### Feature Degradation Table
| Feature | CLI | VS Code | Degradation |
|---------|-----|---------|-------------|
| Parallel fan-out | `mode: "background"` + `read_agent` | Multiple subagents in one turn | None — equivalent concurrency |
| Model selection | Per-spawn `model` param (4-layer hierarchy) | Session model only (Phase 1) | Accept session model, log intent |
| Scribe fire-and-forget | Background, never read | Sync, must wait | Batch with last parallel group |
| Launch table UX | Show table → results later | Skip table → results with response | UX only — results are correct |
| SQL tool | Available | Not available | Avoid SQL in cross-platform code paths |
| Response order bug | Critical workaround | Possibly necessary (unverified) | Keep the block — harmless if unnecessary |
### SQL Tool Caveat
The `sql` tool is **CLI-only**. It does not exist on VS Code, JetBrains, or GitHub.com. Any coordinator logic or agent workflow that depends on SQL (todo tracking, batch processing, session state) will silently fail on non-CLI surfaces. Cross-platform code paths must not depend on SQL. Use filesystem-based state (`.squad/` files) for anything that must work everywhere.
## Examples
**Example 1: CLI parallel spawn**
```typescript
// Coordinator detects task tool available → CLI mode
```
description: "Coordinating work across multiple Squad instances"
domain: "orchestration"
confidence: "medium"
source: "manual"
tools:
- name: "squad-discover"
description: "List known squads and their capabilities"
when: "When you need to find which squad can handle a task"
- name: "squad-delegate"
description: "Create work in another squad's repository"
when: "When a task belongs to another squad's domain"
---
## Context
When an organization runs multiple Squad instances (e.g., platform-squad, frontend-squad, data-squad), those squads need to discover each other, share context, and hand off work across repository boundaries. This skill teaches agents how to coordinate across squads without creating tight coupling.
Cross-squad orchestration applies when:
- A task requires capabilities owned by another squad
- An architectural decision affects multiple squads
- A feature spans multiple repositories with different squads
- A squad needs to request infrastructure, tooling, or support from another squad
## Patterns
### Discovery via Manifest
Each squad publishes a `.squad/manifest.json` declaring its name, capabilities, and contact information. Squads discover each other through:
1. **Well-known paths**: Check `.squad/manifest.json` in known org repos
2. **Upstream config**: Squads already listed in `.squad/upstream.json` are checked for manifests
3. **Explicit registry**: A central `squad-registry.json` can list all squads in an org
- **Direct file writes across repos** — Never modify another squad's `.squad/` directory. Use issues and PRs as the communication protocol.
- **Tight coupling** — Don't depend on another squad's internal structure. Use the manifest as the public API contract.
- **Unbounded delegation** — Always include acceptance criteria and a timeout. Don't create open-ended requests.
- **Skipping discovery** — Don't hardcode squad locations. Use manifests and the discovery protocol.
- **Sharing secrets** — Never include credentials, tokens, or internal URLs in cross-squad issues.
- **Circular delegation** — Track delegation chains. If squad A delegates to B which delegates back to A, something is wrong.
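One way to track delegation chains is to carry the list of squads a request has already passed through and refuse to delegate to any squad already on it. The sketch below assumes that chain is available as an array of squad names; `wouldCreateCycle` and `delegate` are hypothetical helpers, not part of the `squad-delegate` tool.

```typescript
// Hypothetical helpers; `chain` lists the squads a request has visited, oldest first.
function wouldCreateCycle(chain: string[], nextSquad: string): boolean {
  return chain.includes(nextSquad);
}

// Returns the extended chain, or throws if the hand-off would loop back.
function delegate(chain: string[], nextSquad: string): string[] {
  if (wouldCreateCycle(chain, nextSquad)) {
    const path = [...chain, nextSquad].join(" -> ");
    throw new Error(`Circular delegation: ${path}`);
  }
  return [...chain, nextSquad];
}
```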
---
name: "cross-squad"
description: "Coordinating work across multiple Squad instances"
domain: "orchestration"
confidence: "medium"
source: "manual"
tools:
- name: "squad-discover"
description: "List known squads and their capabilities"
when: "When you need to find which squad can handle a task"
- name: "squad-delegate"
description: "Create work in another squad's repository"
when: "When a task belongs to another squad's domain"
---
## Context
When an organization runs multiple Squad instances (e.g., platform-squad, frontend-squad, data-squad), those squads need to discover each other, share context, and hand off work across repository boundaries. This skill teaches agents how to coordinate across squads without creating tight coupling.
Cross-squad orchestration applies when:
- A task requires capabilities owned by another squad
- An architectural decision affects multiple squads
- A feature spans multiple repositories with different squads
- A squad needs to request infrastructure, tooling, or support from another squad
## Patterns
### Discovery via Manifest
Each squad publishes a `.squad/manifest.json` declaring its name, capabilities, and contact information. Squads discover each other through:
1. **Well-known paths**: Check `.squad/manifest.json` in known org repos
2. **Upstream config**: Squads already listed in `.squad/upstream.json` are checked for manifests
3. **Explicit registry**: A central `squad-registry.json` can list all squads in an org
**✅ THIS SKILL PRODUCES (exactly these, nothing more):**
1. **`mesh.json`** — Generated from user answers about zones and squads (which squads participate, what zone each is in, paths/URLs for each), using `mesh.json.example` in this skill's directory as the schema template
2. **`sync-mesh.sh` and `sync-mesh.ps1`** — Copied from this skill's directory into the project root (these are bundled resources, NOT generated code)
3. **Zone 2 state repo initialization** (if applicable) — If the user specified a Zone 2 shared state repo, run `sync-mesh.sh --init` to scaffold the state repo structure
4. **A decision entry** in `.squad/decisions/inbox/` documenting the mesh configuration for team awareness
**❌ THIS SKILL DOES NOT PRODUCE:**
- **No application code** — No validators, libraries, or modules of any kind
- **No test files** — No test suites, test cases, or test scaffolding
- **No generated sync scripts** — They are bundled with this skill as pre-built resources. COPY them, don't generate them.
- **No daemons or services** — No background processes, servers, or persistent runtimes
- **No modifications to existing squad files** beyond the decision entry (no changes to team.md, routing.md, agent charters, etc.)
**Your role:** Configure the mesh topology and install the bundled sync scripts. Nothing more.
## Context
When squads are on different machines (developer laptops, CI runners, cloud VMs, partner orgs), the local file-reading convention still works — but remote files need to arrive on your disk first. This skill teaches the pattern for distributed squad communication.
**When this applies:**
- Squads span multiple machines, VMs, or CI runners
- Squads span organizations or companies
- An agent needs context from a squad whose files aren't on the local filesystem
**When this does NOT apply:**
- All squads are on the same machine (just read the files directly)
## Patterns
### The Core Principle
> "The filesystem is the mesh, and git is how the mesh crosses machine boundaries."
The agent interface never changes. Agents always read local files. The distributed layer's only job is to make remote files appear locally before the agent reads them.
### Three Zones of Communication
**Zone 1 — Local:** Same filesystem. Read files directly. Zero transport.
**Zone 2 — Remote-Trusted:** Different host, same org, shared git auth. Transport: `git pull` from a shared repo. This collapses Zone 2 into Zone 1 — files materialize on disk, agent reads them normally.
**Zone 3 — Remote-Opaque:** Different org, no shared auth. Transport: `curl` to fetch published contracts (SUMMARY.md). One-way visibility — you see only what they publish.
```
2. READ: cat .mesh/**/state.md — all files are local now
3. WORK: do their assigned work (the agent's normal task, NOT mesh-building)
4. WRITE: update own billboard, log, drops
5. PUBLISH: git add + commit + push — share state with remote peers
```
Steps 2–4 are identical to local-only. Steps 1 and 5 are the entire distributed extension. **Note:** "WORK" means the agent performs its normal squad duties — it does NOT mean "build mesh infrastructure."
Three zone types, one file. Local squads need only a path. Remote-trusted squads need a git URL. Remote-opaque squads need an HTTPS URL.
### Write Partitioning
Each squad writes only to its own directory (`boards/{self}.md`, `squads/{self}/*`, `drops/{date}-{self}-*.md`). Because no two squads write to the same file, pulls and rebases never produce content conflicts. A push can still be rejected when the remote has moved ahead ("branch is behind"); the fix is always `git pull --rebase && git push`.
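A write-partition check could look like the sketch below, which tests whether a repo-relative path falls inside a squad's own partition. `isOwnPath` is a hypothetical helper, and the `YYYY-MM-DD` date format for drops is an assumption; the text only specifies `{date}`.

```typescript
// Hypothetical helper; the drop date format (YYYY-MM-DD) is an assumption.
function isOwnPath(self: string, path: string): boolean {
  // Escape regex metacharacters so the squad name is matched literally.
  const escaped = self.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const patterns = [
    new RegExp(`^boards/${escaped}\\.md$`), // boards/{self}.md
    new RegExp(`^squads/${escaped}/.+$`), // squads/{self}/*
    new RegExp(`^drops/\\d{4}-\\d{2}-\\d{2}-${escaped}-.*\\.md$`), // drops/{date}-{self}-*.md
  ];
  return patterns.some((pattern) => pattern.test(path));
}
```

A sync script or pre-commit hook could run a check like this over staged files and refuse to commit anything outside the squad's partition.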
### Trust Boundaries
Trust maps to git permissions:
- **Same repo access** = full mesh visibility
- **Read-only access** = can observe, can't write
- **No access** = invisible (correct behavior)
For selective visibility, use separate repos per audience (internal, partner, public). Git permissions ARE the trust negotiation.
### Phased Rollout
- **Phase 0:** Convention only — document zones, agree on mesh.json fields, manually run `git pull`/`git push`. Zero new code.
- **Phase 1:** Sync script (~30 lines bash or PowerShell) when manual sync gets tedious.
- **Phase 2:** Published contracts + curl fetch when a Zone 3 partner appears.
- **Phase 3:** Never. No MCP federation, A2A, service discovery, message queues.
**Important:** Phases are NOT auto-advanced. These are project-level decisions — you start at Phase 0 (manual sync) and only move forward when the team decides complexity is justified.
### Mesh State Repo
The shared mesh state repo is a plain git repository — NOT a Squad project. It holds:
- One directory per participating squad
- Each directory contains at minimum a SUMMARY.md with the squad's current state
- A root README explaining what the repo is and who participates
No `.squad/` folder, no agents, no automation. Write partitioning means each squad only pushes to its own directory. The repo is a rendezvous point, not an intelligent system.
If you want a squad that *observes* mesh health, that's a separate Squad project that lists the state repo as a Zone 2 remote in its `mesh.json` — it does NOT live inside the state repo.
## Examples
### Developer Laptop + CI Squad (Zone 2)
Auth-squad agent wakes up. `git pull` brings ci-squad's latest results. Agent reads: "3 test failures in auth module." Adjusts work. Pushes results when done. **Overhead: one `git pull`, one `git push`.**
### Two Orgs Collaborating (Zone 3)
Payment-squad fetches partner's published SUMMARY.md via curl. Reads: "Risk scoring v3 API deprecated April 15. New field `device_fingerprint` required." The consuming agent (in payment-squad's team) reads this information and uses it to inform its work — for example, updating payment integration code to include the new field. Partner can't see payment-squad's internals.
### Same Org, Shared Mesh Repo (Zone 2)
Three squads on different machines. One shared git repo holds the mesh. Each squad: `git pull` before work, `git push` after. Write partitioning ensures zero merge conflicts.
## AGENT WORKFLOW (Deterministic Setup)
When a user invokes this skill to set up a distributed mesh, follow these steps **exactly, in order:**
### Step 1: ASK the user for mesh topology
Ask these questions (adapt phrasing naturally, but get these answers):
1. **Which squads are participating?** (List of squad names)
2. **For each squad, which zone is it in?**
   - `local` — same filesystem (just need a path)
   - `remote-trusted` — different machine, same org, shared git access (need git URL + ref)
   - `remote-opaque` — different org, no shared auth (need HTTPS URL to published contract)
3. **For each squad, what's the connection info?**
   - Local: relative or absolute path to their `.mesh/` directory
   - Remote-trusted: git URL (SSH or HTTPS), ref (branch/tag), and where to sync it to locally
   - Remote-opaque: HTTPS URL to their SUMMARY.md, where to sync it, and auth type (none/bearer)
4. **Where should the shared state live?** (For Zone 2 squads: git repo URL for the mesh state, or confirm each squad syncs independently)
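The answers to these questions map naturally onto one record per squad, with required fields depending on the zone. The sketch below shows one way to check that the answers are complete before writing anything. All field names (`path`, `gitUrl`, `ref`, `syncTo`, `url`, `auth`) are hypothetical; the authoritative schema is `mesh.json.example`, which is not reproduced here.

```typescript
// Field names are hypothetical; mesh.json.example defines the real schema.
type SquadEntry =
  | { zone: "local"; path: string }
  | { zone: "remote-trusted"; gitUrl: string; ref: string; syncTo: string }
  | { zone: "remote-opaque"; url: string; syncTo: string; auth: "none" | "bearer" };

// Returns a list of human-readable problems; an empty list means the entry is complete.
function validateEntry(name: string, entry: SquadEntry): string[] {
  const problems: string[] = [];
  switch (entry.zone) {
    case "local":
      if (!entry.path) problems.push(`${name}: local squads need a path`);
      break;
    case "remote-trusted":
      if (!entry.gitUrl) problems.push(`${name}: remote-trusted squads need a git URL`);
      if (!entry.ref) problems.push(`${name}: remote-trusted squads need a ref`);
      if (!entry.syncTo) problems.push(`${name}: missing local sync location`);
      break;
    case "remote-opaque":
      if (!entry.url.startsWith("https://")) problems.push(`${name}: remote-opaque squads need an HTTPS URL`);
      if (!entry.syncTo) problems.push(`${name}: missing local sync location`);
      break;
  }
  return problems;
}
```

If any entry reports problems, the natural next step is to go back to the user with a follow-up question rather than guess the missing value.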
### Step 2: GENERATE `mesh.json`
Using the answers from Step 1, create a `mesh.json` file at the project root. Use `mesh.json.example` from THIS skill's directory (`.squad/skills/distributed-mesh/mesh.json.example`) as the schema template.
Squad documentation follows the Microsoft Style Guide with Squad-specific conventions. Consistency across docs builds trust and improves discoverability.
## Patterns
### Microsoft Style Guide Rules
- **Sentence-case headings:** "Getting started" not "Getting Started"
- **Active voice:** "Run the command" not "The command should be run"
- **Second person:** "You can configure..." not "Users can configure..."
- **Present tense:** "The system routes..." not "The system will route..."
- **No ampersands in prose:** "and" not "&" (except in code, brand names, or UI elements)
### Squad Formatting Patterns
- **Scannability first:** Paragraphs for narrative (3-4 sentences max), bullets for scannable lists, tables for structured data
- **"Try this" prompts at top:** Start feature/scenario pages with practical prompts users can copy
- **Experimental warnings:** Features in preview get callout at top
- **Cross-references at bottom:** Related pages linked after main content
- **Always update test assertions:** When adding docs pages to `features/`, `scenarios/`, `guides/`, update corresponding `EXPECTED_*` arrays in `test/docs-build.test.ts` in the same commit
## Examples
✓ **Correct:**
```markdown
# Getting started with Squad
> ⚠️ **Experimental:** This feature is in preview.
Try this:
\`\`\`bash
squad init
\`\`\`
Squad helps you build AI teams...
---
## Install Squad
Run the following command...
```
✗ **Incorrect:**
```markdown
# Getting Started With Squad // Title case
Squad is a tool which will help users... // Third person, future tense
You can install Squad with npm & configure it... // Ampersand in prose
```
## Anti-Patterns
- Title-casing headings because "it looks nicer"
- Writing in passive voice or third person
- Long paragraphs of dense text (breaks scannability)
- Adding doc pages without updating test assertions
description: "Shifts Layer 3 model selection to cost-optimized alternatives when economy mode is active."
domain: "model-selection"
confidence: "low"
source: "manual"
---
## SCOPE
✅ THIS SKILL PRODUCES:
- A modified Layer 3 model selection table applied when economy mode is active
- `economyMode: true` written to `.squad/config.json` when activated persistently
- Spawn acknowledgments with `💰` indicator when economy mode is active
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Cost reports or billing artifacts
- Changes to Layer 0, Layer 1, or Layer 2 resolution (user intent always wins)
## Context
Economy mode shifts Layer 3 (Task-Aware Auto-Selection) to lower-cost alternatives. It does NOT override persistent config (`defaultModel`, `agentModelOverrides`) or per-agent charter preferences — those represent explicit user intent and always take priority.
Use this skill when the user wants to reduce costs across an entire session or permanently, without manually specifying models for each agent.
**Prefer `gpt-4.1` over `gpt-5-mini`** when the task involves structured output or agentic tool use. Prefer `gpt-5-mini` for pure text generation tasks where latency matters.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `economyMode: true` — if present, activate economy mode for the session
2. ACKNOWLEDGE: `✅ Economy mode active — using cost-optimized models this session. (Layer 0 and Layer 2 preferences still apply)`
**Persistent:** "always use economy mode", "save economy mode"
1. WRITE `economyMode: true` to `.squad/config.json` (merge, don't overwrite other fields)
2. ACKNOWLEDGE: `✅ Economy mode saved — cost-optimized models will be used until disabled.`
### On Every Agent Spawn (Economy Mode Active)
1. CHECK Layer 0a/0b first (agentModelOverrides, defaultModel) — if set, use that. Economy mode does NOT override Layer 0.
2. CHECK Layer 1 (session directive for a specific model) — if set, use that. Economy mode does NOT override explicit session directives.
3. CHECK Layer 2 (charter preference) — if set, use that. Economy mode does NOT override charter preferences.
4. APPLY economy table at Layer 3 instead of normal table.
5. INCLUDE `💰` in spawn acknowledgment: `🔧 {Name} ({model} · 💰 economy) — {task}`
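The layer order in these steps can be sketched as a single resolution function. This is an illustration under assumptions, not the coordinator's implementation: the `SpawnContext` shape, the `taskType` key, and both tables are hypothetical, since the actual Layer 3 tables are not reproduced here.

```typescript
// Hypothetical shapes; only the layer order is taken from the steps above.
interface SpawnContext {
  agentName: string;
  taskType: string;
  agentModelOverrides?: Record<string, string>; // Layer 0a
  defaultModel?: string; // Layer 0b
  sessionDirective?: string; // Layer 1
  charterPreference?: string; // Layer 2
  economyMode: boolean;
}

type ModelTable = Record<string, string>; // task type -> model (Layer 3)

function resolveModel(
  ctx: SpawnContext,
  normalTable: ModelTable,
  economyTable: ModelTable,
): { model: string; usedEconomyTable: boolean } {
  // Layers 0 to 2 express explicit user intent and always win.
  const explicit =
    ctx.agentModelOverrides?.[ctx.agentName] ??
    ctx.defaultModel ??
    ctx.sessionDirective ??
    ctx.charterPreference;
  if (explicit !== undefined) return { model: explicit, usedEconomyTable: false };

  // Layer 3: economy mode only swaps which table is consulted.
  const table = ctx.economyMode ? economyTable : normalTable;
  const model = table[ctx.taskType];
  if (model === undefined) throw new Error(`No Layer 3 entry for task type "${ctx.taskType}"`);
  return { model, usedEconomyTable: ctx.economyMode };
}
```

Because economy mode is consulted only after every explicit layer has been checked, it cannot override user intent by construction.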
### On Deactivation
**Trigger phrases:** "turn off economy mode", "disable economy mode", "use normal models"
1. REMOVE `economyMode` from `.squad/config.json` (if it was persisted)
2. CLEAR session economy mode state
3. ACKNOWLEDGE: `✅ Economy mode disabled — returning to standard model selection.`
### STOP
After updating economy mode state and including the `💰` indicator in spawn acknowledgments, this skill is done. Do NOT:
- Change Layer 0, Layer 1, or Layer 2 model choices
- Override charter-specified models
- Generate cost reports or comparisons
- Fall back to premium models via economy mode (economy mode never bumps UP)
## Config Schema
`.squad/config.json` economy-related fields:
```json
{
"version":1,
"economyMode":true
}
```
-`economyMode` — when `true`, Layer 3 uses the economy table. Optional; absent = economy mode off.
- Combines with `defaultModel` and `agentModelOverrides` — Layer 0 always wins.
## Anti-Patterns
- **Don't override Layer 0 in economy mode.** If the user set `defaultModel: "claude-opus-4.6"`, they want quality. Economy mode only affects Layer 3 auto-selection.
- **Don't silently apply economy mode.** Always acknowledge when activated or deactivated.
- **Don't treat economy mode as permanent by default.** Session phrases activate session-only; only "always" or `config.json` persist it.
- **Don't bump premium tasks down too far.** Architecture and security reviews shift from opus to sonnet in economy mode — they do NOT go to fast/cheap models.
---
name: "economy-mode"
description: "Shifts Layer 3 model selection to cost-optimized alternatives when economy mode is active."
domain: "model-selection"
confidence: "low"
source: "manual"
---
## SCOPE
✅ THIS SKILL PRODUCES:
- A modified Layer 3 model selection table applied when economy mode is active
- `economyMode: true` written to `.squad/config.json` when activated persistently
- Spawn acknowledgments with `💰` indicator when economy mode is active
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Cost reports or billing artifacts
- Changes to Layer 0, Layer 1, or Layer 2 resolution (user intent always wins)
## Context
Economy mode shifts Layer 3 (Task-Aware Auto-Selection) to lower-cost alternatives. It does NOT override persistent config (`defaultModel`, `agentModelOverrides`) or per-agent charter preferences — those represent explicit user intent and always take priority.
Use this skill when the user wants to reduce costs across an entire session or permanently, without manually specifying models for each agent.
**Prefer `gpt-4.1` over `gpt-5-mini`** when the task involves structured output or agentic tool use. Prefer `gpt-5-mini` for pure text generation tasks where latency matters.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `economyMode: true` — if present, activate economy mode for the session
3. ACKNOWLEDGE: `✅ Economy mode active — using cost-optimized models this session. (Layer 0 and Layer 2 preferences still apply)`
**Persistent:** "always use economy mode", "save economy mode"
1. WRITE `economyMode: true` to `.squad/config.json` (merge, don't overwrite other fields)
2. ACKNOWLEDGE: `✅ Economy mode saved — cost-optimized models will be used until disabled.`
### On Every Agent Spawn (Economy Mode Active)
1. CHECK Layer 0a/0b first (agentModelOverrides, defaultModel) — if set, use that. Economy mode does NOT override Layer 0.
2. CHECK Layer 1 (session directive for a specific model) — if set, use that. Economy mode does NOT override explicit session directives.
3. CHECK Layer 2 (charter preference) — if set, use that. Economy mode does NOT override charter preferences.
4. APPLY economy table at Layer 3 instead of normal table.
5. INCLUDE `💰` in spawn acknowledgment: `🔧 {Name} ({model} · 💰 economy) — {task}`
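The spawn-time checks above can be sketched as a single resolution function. This is a minimal illustration, not Squad's actual implementation: the `SpawnContext` type, the `resolve_model` name, and both table contents (`premium-model`, `standard-model`, `cheap-model`) are hypothetical placeholders, since the real Layer 3 tables are defined elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical Layer 3 tables; the real tables are defined elsewhere in Squad.
NORMAL_TABLE = {"architecture": "premium-model", "docs": "standard-model"}
ECONOMY_TABLE = {"architecture": "standard-model", "docs": "cheap-model"}


@dataclass
class SpawnContext:
    task_type: str
    agent_name: str
    agent_model_overrides: Dict[str, str] = field(default_factory=dict)  # Layer 0a
    default_model: Optional[str] = None       # Layer 0b
    session_directive: Optional[str] = None   # Layer 1
    charter_preference: Optional[str] = None  # Layer 2
    economy_mode: bool = False


def resolve_model(ctx: SpawnContext) -> str:
    """Pick a model; Layers 0-2 always win, economy mode only swaps the Layer 3 table."""
    explicit = (
        ctx.agent_model_overrides.get(ctx.agent_name)
        or ctx.default_model
        or ctx.session_directive
        or ctx.charter_preference
    )
    if explicit:
        return explicit
    table = ECONOMY_TABLE if ctx.economy_mode else NORMAL_TABLE
    return table[ctx.task_type]
```

Keeping the economy switch confined to the final table lookup makes it structurally impossible for economy mode to override an explicit choice.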
### On Deactivation
**Trigger phrases:** "turn off economy mode", "disable economy mode", "use normal models"
1. REMOVE `economyMode` from `.squad/config.json` (if it was persisted)
2. CLEAR session economy mode state
3. ACKNOWLEDGE: `✅ Economy mode disabled — returning to standard model selection.`
### STOP
After updating economy mode state and including the `💰` indicator in spawn acknowledgments, this skill is done. Do NOT:
- Change Layer 0, Layer 1, or Layer 2 model choices
- Override charter-specified models
- Generate cost reports or comparisons
- Fall back to premium models via economy mode (economy mode never bumps UP)
## Config Schema
`.squad/config.json` economy-related fields:
```json
{
  "version": 1,
  "economyMode": true
}
```
- `economyMode` — when `true`, Layer 3 uses the economy table. Optional; absent = economy mode off.
- Combines with `defaultModel` and `agentModelOverrides` — Layer 0 always wins.
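A merge-style update of the config file might look like the sketch below. The `set_economy_mode` helper is a hypothetical name used for illustration; the only behavior taken from this skill is that `economyMode` is written or removed while all other fields are left untouched.

```python
import json
from pathlib import Path


def set_economy_mode(config_path: Path, enabled: bool) -> dict:
    """Persist or clear `economyMode` while preserving every other field."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    if enabled:
        config["economyMode"] = True
    else:
        config.pop("economyMode", None)  # absent means economy mode is off
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `set_economy_mode(Path(".squad/config.json"), True)` would cover the persistent activation path, and passing `False` the deactivation path.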
## Anti-Patterns
- **Don't override Layer 0 in economy mode.** If the user set `defaultModel: "claude-opus-4.6"`, they want quality. Economy mode only affects Layer 3 auto-selection.
- **Don't silently apply economy mode.** Always acknowledge when activated or deactivated.
- **Don't treat economy mode as permanent by default.** Session phrases activate session-only; only "always" or `config.json` persist it.
- **Don't bump premium tasks down too far.** Architecture and security reviews shift from opus to sonnet in economy mode — they do NOT go to fast/cheap models.
| Author disagrees with a decision or design | Empathetic Disagreement | T8 |
| Need more reproduction info or context | Information Request | T9 |
Use exactly one template as the base draft. Replace placeholders with issue-specific details, then apply the humanizer patterns. If the thread spans multiple signals, choose the highest-risk template and capture the nuance in the thread summary.
### Confidence Classification
| Confidence | Criteria | Example |
|-----------|----------|---------|
| 🟢 High | Answer exists in Squad docs or FAQ, similar question answered before, no technical ambiguity | "How do I install Squad?" |
| 🟡 Medium | Technical answer is sound but involves judgment calls, OR docs exist but don't perfectly match the question, OR tone is tricky | "Can Squad work with Azure DevOps?" (yes, but setup is nuanced) |
| 🔴 Needs Review | Technical uncertainty, policy/roadmap question, potential reputational risk, author is frustrated/angry, question about unreleased features | "When will Squad support Claude?" |
**Auto-escalation rules:**
- Any mention of competitors → 🔴
- Any mention of pricing/licensing → 🔴
- Author has >3 follow-up comments without resolution → 🔴
- Question references a closed-wontfix issue → 🔴
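The auto-escalation rules can be expressed as a small override applied after the base classification. This is a sketch under assumptions: the `ThreadSignals` fields and the string labels are hypothetical, and how each signal is detected (for example, spotting a competitor mention) is left out.

```python
from dataclasses import dataclass


@dataclass
class ThreadSignals:
    mentions_competitor: bool = False
    mentions_pricing_or_licensing: bool = False
    unresolved_followups: int = 0  # author follow-ups without resolution
    references_closed_wontfix: bool = False


def apply_auto_escalation(base: str, signals: ThreadSignals) -> str:
    """Return 'needs-review' if any escalation rule fires, else the base level."""
    escalate = (
        signals.mentions_competitor
        or signals.mentions_pricing_or_licensing
        or signals.unresolved_followups > 3
        or signals.references_closed_wontfix
    )
    return "needs-review" if escalate else base
```

Note that the follow-up rule is strict: exactly three unresolved follow-ups does not escalate, four does.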
### 3. Draft
Use the humanizer skill for every draft.
- Complete **Thread-Read Verification** before writing.
- Read the **full thread**, including all comments, before writing.
- Select the matching template from the **Template Selection Guide** and record the template ID in the review notes.
- Treat templates as reusable drafting assets: keep the structure, replace placeholders, and only improvise when the thread truly requires it.
- Validate the draft against the humanizer anti-patterns.
- Flag long threads (`>10` comments) with `⚠️`.
### Thread-Read Verification
Before drafting, PAO MUST verify complete thread coverage:
1. **Count verification:** Compare API comment count with actually-read comments. If mismatch, abort draft.
2. **Deleted comment check:** Use `gh api` timeline to detect deleted comments. If found, flag as ⚠️ in review table.
3. **Thread summary:** Include in every draft: "Thread: {N} comments, last activity {date}, {summary of key points}"
4. **Long thread flag:** If >10 comments, add ⚠️ to review table and include condensed thread summary
5. **Evidence line in review table:** Each draft row includes "Read: {N}/{total} comments" column
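The verification steps can be sketched as one gate that either aborts the draft or returns the flags and evidence line for the review table. The `ThreadCheck` type and `verify_thread` function are hypothetical names, and the comment counts and deleted-comment signal are assumed to have been fetched already.

```python
from dataclasses import dataclass
from typing import List

LONG_THREAD_THRESHOLD = 10  # threads with more comments than this are flagged


@dataclass
class ThreadCheck:
    ok: bool          # False means the draft must be aborted
    flags: List[str]  # entries for the review table
    evidence: str     # "Read: {N}/{total} comments"


def verify_thread(api_count: int, read_count: int, deleted_found: bool) -> ThreadCheck:
    """Apply count verification, the deleted-comment check, and the long-thread flag."""
    evidence = f"Read: {read_count}/{api_count} comments"
    if read_count != api_count:
        return ThreadCheck(False, ["count mismatch"], evidence)
    flags = []
    if deleted_found:
        flags.append("⚠️ deleted comments")
    if api_count > LONG_THREAD_THRESHOLD:
        flags.append("⚠️ long thread")
    return ThreadCheck(True, flags, evidence)
```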
Many developers use GitHub through an Enterprise Managed User (EMU) account at work while maintaining a personal GitHub account for open-source contributions. AI agents spawned by Squad inherit the shell's default `gh` authentication — which is usually the EMU account. This causes failures when agents try to push to personal repos, create PRs on forks, or interact with resources outside the enterprise org.
This skill teaches agents how to detect the active identity, switch contexts safely, and avoid mixing credentials across operations.
## Patterns
### Detect Current Identity
Before any GitHub operation, check which account is active:
```bash
gh auth status
```
Look for:
- `Logged in to github.com as USERNAME` — the active account
- `Token scopes: ...` — what permissions are available
- Multiple accounts will show separate entries
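An agent that needs the account list programmatically could parse the status text. The sketch below is an assumption-laden illustration: `parse_accounts` is a hypothetical helper, and the exact wording of `gh auth status` varies between releases (newer ones appear to print `account USERNAME` rather than `as USERNAME`), so the pattern accepts both.

```python
import re

# Accepts "Logged in to HOST as USER" (the wording quoted above) and
# "Logged in to HOST account USER", which some gh releases print instead.
_LOGIN_RE = re.compile(r"Logged in to (\S+) (?:as|account) (\S+)")


def parse_accounts(status_output: str) -> list:
    """Return (host, username) pairs found in `gh auth status` output."""
    return _LOGIN_RE.findall(status_output)
```

Since `gh auth status` has historically written to stderr in some versions, capturing both output streams before parsing is the safer choice.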
### Extract a Specific Account's Token
When you need to operate as a specific user (not the default):
```bash
# Get the personal account token (by username)
gh auth token --user personaluser
# Get the EMU account token
gh auth token --user corpalias_enterprise
```
**Use case:** Push to a personal fork while the default `gh` auth is the EMU account.
### Push to Personal Repos from EMU Shell
The most common scenario: your shell defaults to the EMU account, but you need to push to a personal GitHub repo.
**Why this works:** `gh auth token --user` reads from `gh`'s credential store without switching the active account. The token is used inline for a single operation and never persisted.
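The inline-token idea can be sketched as follows. This is not the skill's original snippet: `token_for` and `scoped_env` are hypothetical helpers, and the injectable `run` parameter exists only so the logic can be exercised without a real `gh` install. The sketch relies on `gh` honoring the `GH_TOKEN` environment variable, so the token lives only in the environment of one child process.

```python
import os
import subprocess


def token_for(user: str, run=subprocess.run) -> str:
    """Fetch one account's token via `gh auth token --user` without switching accounts."""
    result = run(
        ["gh", "auth", "token", "--user", user],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def scoped_env(user: str, base_env=None, run=subprocess.run) -> dict:
    """Build an environment for a single command that should act as `user`."""
    env = dict(os.environ if base_env is None else base_env)
    env["GH_TOKEN"] = token_for(user, run=run)  # never written to disk
    return env
```

A single operation would then be launched with something like `subprocess.run([...], env=scoped_env("personaluser"))`, leaving the shell's default account untouched.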
### Create PRs on Personal Forks
When the default `gh` context is EMU but you need to create a PR from a personal fork:
```bash
# Option 1: Use --repo flag (works if token has access)
When the coordinator routes multiple issues simultaneously (e.g., "fix bugs X, Y, and Z"), use `git worktree` to give each agent an isolated working directory. No filesystem collisions, no branch-switching overhead.
### When to Use Worktrees vs Sequential
| Scenario | Strategy |
|----------|----------|
| Single issue | Standard workflow above — no worktree needed |
| 2+ simultaneous issues in same repo | Worktrees — one per issue |
| Work spanning multiple repos | Separate clones as siblings (see Multi-Repo below) |
### Setup
From the main clone (it may be on `dev` or any other branch):
```bash
# Ensure dev is current
git fetch origin dev
# Create a worktree per issue — siblings to the main clone
All PRs target `dev` independently. Agents never interfere with each other's filesystem.
### .squad/ State in Worktrees
The `.squad/` directory exists in each worktree as a copy. This is safe because:
- `.gitattributes` declares `merge=union` on append-only files (history.md, decisions.md, logs)
- Each agent appends to its own section; union merge reconciles on PR merge to dev
- **Rule:** Never rewrite or reorder `.squad/` files in a worktree — append only
### Cleanup After Merge
After a worktree's PR is merged to dev:
```bash
# From the main clone
git worktree remove ../squad-195
git worktree prune # clean stale metadata
git branch -d squad/195-fix-stamp-bug
git push origin --delete squad/195-fix-stamp-bug
```
If a worktree directory was deleted manually (`rm -rf`), `git worktree prune` removes its stale administrative entry so Git's worktree list is consistent again.
---
## Multi-Repo Downstream Scenarios
When work spans multiple repositories (e.g., squad-cli changes need squad-sdk changes, or a user's app depends on squad):
### Setup
Clone downstream repos as siblings to the main repo:
```
~/work/
squad-pr/ # main repo
squad-sdk/ # downstream dependency
user-app/ # consumer project
```
Each repo gets its own issue branch following its own naming convention. If the downstream repo also uses Squad conventions, use `squad/{issue-number}-{slug}`.
### Coordinated PRs
- Create PRs in each repo independently
- Link them in PR descriptions:
```
Closes #42
**Depends on:** squad-sdk PR #17 (squad-sdk changes required for this feature)
```
- Merge order: dependencies first (e.g., squad-sdk), then dependents (e.g., squad-cli)
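The "dependencies first" rule is a topological ordering of the repos. A minimal sketch using the standard-library `graphlib` module (Python 3.9+) is shown below; the `merge_order` name and the example dependency map are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def merge_order(depends_on: dict) -> list:
    """Order repos so that every dependency is merged before its dependents.

    `depends_on` maps a repo name to the set of repos it depends on.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For a chain where `squad-cli` depends on `squad-sdk` and `user-app` depends on `squad-cli`, this yields the SDK first, then the CLI, then the app. A dependency cycle raises `graphlib.CycleError`, which is a useful signal that the PRs cannot be merged independently.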
### Local Linking for Testing
Before pushing, verify cross-repo changes work together:
description: Detect and set up account-locked gh aliases for multi-account GitHub. The AI reads this skill, detects accounts, asks the user which is personal/work, and runs the setup automatically.
---
name: history-hygiene
description: Record final outcomes to history.md, not intermediate requests or reversed decisions
domain: documentation, team-collaboration
confidence: high
source: earned (Kobayashi v0.6.0 incident, team intervention)
---
## Context
History files (.md files tracking decisions, spawns, outcomes) are read cold by future agents. Stale or incorrect entries poison decision-making downstream. The Kobayashi incident proved this: history said "Brady decided v0.6.0" when Brady had reversed that to v0.8.17. Future spawns read the wrong truth and repeated the mistake.
## Patterns
- **Record the final outcome**, not the initial request.
- **Wait for confirmation** before writing to history — don't log intermediate states.
- **If a decision reverses**, update the entry immediately — don't leave stale data.
- **One read = one truth.** A future agent should never need to cross-reference other files to understand what actually happened.
## Examples
✓ **Correct:**
- "Migration target: v0.8.17 (initially discussed as v0.6.0, corrected by Brady)"
- "Reverted to Node 18 per Brady's explicit request on 2024-01-15"
✗ **Incorrect:**
- "Brady directed v0.6.0" (when later reversed)
- Recording what was *requested* instead of what *actually happened*
- Logging entries before outcome is confirmed
## Anti-Patterns
- Writing intermediate or "for now" states to disk
- Attributing decisions without confirming final direction
- Treating history like a draft — history is the source of truth
- Assuming readers will cross-reference or verify; they won't
---
name: "humanizer"
description: "Tone enforcement patterns for external-facing community responses"
domain: "communication, tone, community"
confidence: "low"
source: "manual (RFC #426 — PAO External Communications)"
---
## Context
Use this skill whenever PAO drafts external-facing responses for issues or discussions.
- Tone must be warm, helpful, and human-sounding — never robotic or corporate.
- Brady's constraint applies everywhere: **Humanized tone is mandatory**.
- This applies to **all external-facing content** drafted by PAO in Phase 1 issues/discussions workflows.
## Patterns
1. **Warm opening** — Start with acknowledgment ("Thanks for reporting this", "Great question!")
2. **Active voice** — "We're looking into this" not "This is being investigated"
3. **Second person** — Address the person directly ("you" not "the user")
4. **Conversational connectors** — "That said...", "Here's what we found...", "Quick note:"
5. **Specific, not vague** — "This affects the casting module in v0.8.x" not "We are aware of issues"
6. **Empathy markers** — "I can see how that would be frustrating", "Good catch!"
7. **Action-oriented closes** — "Let us know if that helps!" not "Please advise if further assistance is required"
8. **Uncertainty is OK** — "We're not 100% sure yet, but here's what we think is happening..." is better than false confidence
9. **Profanity filter** — Never include profanity, slurs, or aggressive language, even when quoting
10. **Baseline comparison** — Responses should align with tone of 5-10 "gold standard" responses (>80% similarity threshold)
11. **Empathetic disagreement** — "We hear you. That's a fair concern." before explaining the reasoning
12. **Information request** — Ask for specific details, not open-ended "can you provide more info?"
13. **No link-dumping** — Don't just paste URLs. Provide context: "Check out the [getting started guide](url) — specifically the section on routing" not just a bare link
## Examples
### 1. Welcome
```text
Hey {author}! Welcome to Squad 👋 Thanks for opening this.
{substantive response}
Let us know if you have questions — happy to help!
```
### 2. Troubleshooting
```text
Thanks for the detailed report, {author}!
Here's what we think is happening: {explanation}
{steps or workaround}
Let us know if that helps, or if you're seeing something different.
```
### 3. Feature guidance
```text
Great question! {context on current state}
{guidance or workaround}
We've noted this as a potential improvement — {tracking info if applicable}.
```
### 4. Redirect
```text
Thanks for reaching out! This one is actually better suited for {correct location}.
{brief explanation of why}
Feel free to open it there — they'll be able to help!
```
### 5. Acknowledgment
```text
Good catch, {author}. We've confirmed this is a real issue.
{what we know so far}
We'll update this thread when we have a fix. Thanks for flagging it!
```
### 6. Closing
```text
This should be resolved in {version/PR}! 🎉
{brief summary of what changed}
Thanks for reporting this, {author} — it made Squad better.
```
### 7. Technical uncertainty
```text
Interesting find, {author}. We're not 100% sure what's causing this yet.
Here's what we've ruled out: {list}
We'd love more context if you have it — {specific ask}.
We'll dig deeper and update this thread.
```
## Anti-Patterns
- ❌ Corporate speak: "We appreciate your patience as we investigate this matter"
- ❌ Marketing hype: "Squad is the BEST way to..." or "This amazing feature..."
- ❌ Passive voice: "It has been determined that..." or "The issue is being tracked"
- ❌ Dismissive: "This works as designed" without empathy
- ❌ Over-promising: "We'll ship this next week" without commitment from the team
- ❌ Empty acknowledgment: "Thanks for your feedback" with no substance
- ❌ Robot signatures: "Best regards, PAO" or "Sincerely, The Squad Team"
- ❌ Excessive emoji: More than 1-2 emoji per response
- ❌ Quoting profanity: Even when the original issue contains it, paraphrase instead
- ❌ Link-dumping: Pasting URLs without context ("See: https://...")
- ❌ Open-ended info requests: "Can you provide more information?" without specifying what information
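Some of these anti-patterns are mechanical enough to check before a human reviews the draft. The sketch below is a rough first-pass filter, not a tone classifier: `lint_draft` is a hypothetical helper, the phrase lists are copied from the examples above, and anything subtler (dismissiveness, over-promising) still needs human judgment.

```python
import re

# Phrases copied from the anti-pattern examples above; extend as needed.
BANNED_PHRASES = {
    "corporate speak": ["we appreciate your patience"],
    "passive voice": ["it has been determined that", "the issue is being tracked"],
    "robot signature": ["best regards,", "sincerely,"],
    "open-ended info request": ["can you provide more information?"],
}
# A line consisting only of a URL, optionally prefixed with "See:".
_BARE_LINK = re.compile(r"^\s*(?:see:\s*)?https?://\S+\s*$", re.IGNORECASE | re.MULTILINE)


def lint_draft(draft: str) -> list:
    """Return the names of anti-patterns detected in a draft (empty list means none found)."""
    lowered = draft.lower()
    hits = [
        name
        for name, phrases in BANNED_PHRASES.items()
        if any(phrase in lowered for phrase in phrases)
    ]
    if _BARE_LINK.search(draft):
        hits.append("link-dumping")
    return hits
```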
description: "Confirm team roster with selectable menu"
when: "Phase 1 proposal — requires explicit user confirmation"
---
## Context
Init Mode activates when `.squad/team.md` does not exist, or exists but has zero roster entries under `## Members`. The coordinator proposes a team (Phase 1), waits for user confirmation, then creates the team structure (Phase 2).
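The activation condition can be sketched as a check over the contents of `team.md`. The `needs_init` helper is hypothetical, and it assumes the roster is a Markdown table whose first two rows are the header and separator; the exact roster format is not specified in this section.

```python
from typing import Optional


def needs_init(team_md_text: Optional[str]) -> bool:
    """True when team.md is missing (None) or `## Members` has no roster rows."""
    if team_md_text is None:
        return True
    in_members = False
    table_rows = 0
    for line in team_md_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Only the exact header counts; any other H2 ends the section.
            in_members = stripped == "## Members"
            continue
        if in_members and stripped.startswith("|"):
            table_rows += 1
    # Assumption: header row + separator row are not roster entries.
    return table_rows <= 2
```

Matching the header exactly means a file that uses a different title, such as `## Team Roster`, is treated as having no roster.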
## Patterns
### Phase 1: Propose the Team
No team exists yet. Propose one — but **DO NOT create any files until the user confirms.**
1. **Identify the user.** Run `git config user.name` to learn who you're working with. Use their name in conversation (e.g., *"Hey Brady, what are you building?"*). Store their name (NOT email) in `team.md` under Project Context. **Never read or store `git config user.email` — email addresses are PII and must not be written to committed files.**
2. Ask: *"What are you building? (language, stack, what it does)"*
3. **Cast the team.** Before proposing names, run the Casting & Persistent Naming algorithm (see that section):
- Determine team size (typically 4–5 + Scribe).
- Determine assignment shape from the user's project description.
- Derive resonance signals from the session and repo context.
- Select a universe. If the universe is custom, allocate character names from that universe based on the related list found in the `.squad/templates/casting/` directory. Prefer custom universes when available.
- Scribe is always "Scribe" — exempt from casting.
- Ralph is always "Ralph" — exempt from casting.
4. Propose the team with their cast names. Example (names will vary per cast):
```
🏗️  {CastName1} — Lead          Scope, decisions, code review
⚛️  {CastName2} — Frontend Dev  React, UI, components
🔧  {CastName3} — Backend Dev   APIs, database, services
🔄  Ralph       — (monitor)     Work queue, backlog, keep-alive
```
5. Use the `ask_user` tool to confirm the roster. Provide choices so the user sees a selectable menu:
- **question:** *"Look right?"*
- **choices:** `["Yes, hire this team", "Add someone", "Change a role"]`
**⚠️ STOP. Your response ENDS here. Do NOT proceed to Phase 2. Do NOT create any files or directories. Wait for the user's reply.**
### Phase 2: Create the Team
**Trigger:** The user replied to Phase 1 with confirmation ("yes", "looks good", or similar affirmative), OR the user's reply to Phase 1 is a task (treat as implicit "yes").
> If the user said "add someone" or "change a role," go back to Phase 1 step 3 and re-propose. Do NOT enter Phase 2 until the user confirms.
6. Create the `.squad/` directory structure (see `.squad/templates/` for format guides or use the standard structure: team.md, routing.md, ceremonies.md, decisions.md, decisions/inbox/, casting/, agents/, orchestration-log/, skills/, log/).
**Casting state initialization:** Copy `.squad/templates/casting-policy.json` to `.squad/casting/policy.json` (or create from defaults). Create `registry.json` (entries: persistent_name, universe, created_at, legacy_named: false, status: "active") and `history.json` (first assignment snapshot with unique assignment_id).
**Seeding:** Each agent's `history.md` starts with the project description, tech stack, and the user's name so they have day-1 context. Agent folder names are the cast name in lowercase (e.g., `.squad/agents/ripley/`). The Scribe's charter includes maintaining `decisions.md` and cross-agent context sharing.
**Team.md structure:** `team.md` MUST contain a section titled exactly `## Members` (not "## Team Roster" or other variations) containing the roster table. This header is hard-coded in GitHub workflows (`squad-heartbeat.yml`, `squad-issue-assign.yml`, `squad-triage.yml`, `sync-squad-labels.yml`) for label automation. If the header is missing or titled differently, label routing breaks.
**Merge driver for append-only files:** Create or update `.gitattributes` at the repo root to enable conflict-free merging of `.squad/` state across branches:
```
.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union
```
The `union` merge driver keeps all lines from both sides, which is correct for append-only files. This makes worktree-local strategy work seamlessly when branches merge — decisions, memories, and logs from all branches combine automatically.
7. Say: *"✅ Team hired. Try: '{FirstCastName}, set up the project structure'"*
8. **Post-setup input sources** (optional; ask after the team is created, not during casting):
- PRD/spec: *"Do you have a PRD or spec document? (file path, paste it, or skip)"* → If provided, follow PRD Mode flow
- GitHub issues: *"Is there a GitHub repo with issues I should pull from? (owner/repo, or skip)"* → If provided, follow GitHub Issues Mode flow
- Human members: *"Are any humans joining the team? (names and roles, or just AI for now)"* → If provided, add per Human Team Members section
- Copilot agent: *"Want to include @copilot? It can pick up issues autonomously. (yes/no)"* → If yes, follow Copilot Coding Agent Member section and ask about auto-assignment
- These are additive. Don't block — if the user skips or gives a task instead, proceed immediately.
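The `.gitattributes` step in Phase 2 says "create or update", so the write should be safe to repeat. The following is a minimal sketch of one way to do that, not Squad's actual implementation: the function name `ensure_union_rules` is hypothetical, and the four rules are copied from the snippet in step 6.

```python
from pathlib import Path

# Rules copied from the .gitattributes snippet in Phase 2, step 6.
UNION_RULES = [
    ".squad/decisions.md merge=union",
    ".squad/agents/*/history.md merge=union",
    ".squad/log/** merge=union",
    ".squad/orchestration-log/** merge=union",
]


def ensure_union_rules(repo_root: Path) -> list[str]:
    """Append any missing union-merge rules and return the rules that were added.

    Hypothetical helper: existing lines are preserved, so running it twice
    adds nothing the second time.
    """
    path = repo_root / ".gitattributes"
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    present = {line.strip() for line in existing.splitlines()}
    missing = [rule for rule in UNION_RULES if rule not in present]
    if missing:
        # Make sure the existing content ends with a newline before appending.
        if existing and not existing.endswith("\n"):
            existing += "\n"
        path.write_text(existing + "\n".join(missing) + "\n", encoding="utf-8")
    return missing
```

Comparing whole lines keeps the check simple; a rule written with different spacing would be treated as missing and appended again, which is harmless for Git but worth knowing.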
## Examples
**Example flow:**
1. Coordinator detects no team.md → Init Mode
2. Runs `git config user.name` → "Brady"
3. Asks: *"Hey Brady, what are you building?"*
4. User: *"TypeScript CLI tool with GitHub API integration"*
description: "Confirm team roster with selectable menu"
when: "Phase 1 proposal — requires explicit user confirmation"
---
## Context
Init Mode activates when `.squad/team.md` does not exist, or exists but has zero roster entries under `## Members`. The coordinator proposes a team (Phase 1), waits for user confirmation, then creates the team structure (Phase 2).
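The activation rule above can be sketched as a small check. This is an illustration under stated assumptions, not the coordinator's real logic: it assumes the roster is a Markdown pipe table with one header row, and the names `count_members` and `needs_init` are hypothetical.

```python
import re
from pathlib import Path


def count_members(team_md: str) -> int:
    """Count roster rows in the table under the exact `## Members` header.

    Assumes a Markdown pipe table whose first row is a column header.
    """
    in_members = False
    table_rows = 0
    for line in team_md.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Only the exact title counts; "## Team Roster" does not.
            in_members = stripped == "## Members"
            continue
        if in_members and stripped.startswith("|"):
            # Skip the separator row, e.g. |------|------|
            if re.fullmatch(r"[\s:|\-]+", stripped):
                continue
            table_rows += 1
    # Subtract the column-header row.
    return max(table_rows - 1, 0)


def needs_init(squad_dir: Path) -> bool:
    """Return True when team.md is missing or has zero roster entries."""
    team = squad_dir / "team.md"
    if not team.exists():
        return True
    return count_members(team.read_text(encoding="utf-8")) == 0
```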
## Patterns
### Phase 1: Propose the Team
No team exists yet. Propose one — but **DO NOT create any files until the user confirms.**
1. **Identify the user.** Run `git config user.name` to learn who you're working with. Use their name in conversation (e.g., *"Hey Brady, what are you building?"*). Store their name (NOT email) in `team.md` under Project Context. **Never read or store `git config user.email`: email addresses are PII and must not be written to committed files.**
2. Ask: *"What are you building? (language, stack, what it does)"*
3. **Cast the team.** Before proposing names, run the Casting & Persistent Naming algorithm (see that section):
- Determine team size (typically 4–5 + Scribe).
- Determine assignment shape from the user's project description.
- Derive resonance signals from the session and repo context.
- Select a universe. If the universe is custom, allocate character names from that universe based on the related list found in the `.squad/templates/casting/` directory. Prefer custom universes when available.
- Scribe is always "Scribe" — exempt from casting.
- Ralph is always "Ralph" — exempt from casting.
4. Propose the team with their cast names. Example (names will vary per cast):
```
🏗️ {CastName1} — Lead Scope, decisions, code review
⚛️ {CastName2} — Frontend Dev React, UI, components
🔧 {CastName3} — Backend Dev APIs, database, services
🔄 Ralph — (monitor) Work queue, backlog, keep-alive
```
5. Use the `ask_user` tool to confirm the roster. Provide choices so the user sees a selectable menu:
- **question:** *"Look right?"*
- **choices:** `["Yes, hire this team", "Add someone", "Change a role"]`
**⚠️ STOP. Your response ENDS here. Do NOT proceed to Phase 2. Do NOT create any files or directories. Wait for the user's reply.**
### Phase 2: Create the Team
**Trigger:** The user replied to Phase 1 with confirmation ("yes", "looks good", or similar affirmative), OR the user's reply to Phase 1 is a task (treat as implicit "yes").
> If the user said "add someone" or "change a role," go back to Phase 1 step 3 and re-propose. Do NOT enter Phase 2 until the user confirms.
6. Create the `.squad/` directory structure (see `.squad/templates/` for format guides or use the standard structure: team.md, routing.md, ceremonies.md, decisions.md, decisions/inbox/, casting/, agents/, orchestration-log/, skills/, log/).
**Casting state initialization:** Copy `.squad/templates/casting-policy.json` to `.squad/casting/policy.json` (or create from defaults). Create `registry.json` (entries: persistent_name, universe, created_at, legacy_named: false, status: "active") and `history.json` (first assignment snapshot with unique assignment_id).
**Seeding:** Each agent's `history.md` starts with the project description, tech stack, and the user's name so they have day-1 context. Agent folder names are the cast name in lowercase (e.g., `.squad/agents/ripley/`). The Scribe's charter includes maintaining `decisions.md` and cross-agent context sharing.
**Team.md structure:** `team.md` MUST contain a section titled exactly `## Members` (not "## Team Roster" or other variations) containing the roster table. This header is hard-coded in GitHub workflows (`squad-heartbeat.yml`, `squad-issue-assign.yml`, `squad-triage.yml`, `sync-squad-labels.yml`) for label automation. If the header is missing or titled differently, label routing breaks.
**Merge driver for append-only files:** Create or update `.gitattributes` at the repo root to enable conflict-free merging of `.squad/` state across branches:
```
.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union
```
The `union` merge driver keeps all lines from both sides, which is correct for append-only files. This makes worktree-local strategy work seamlessly when branches merge — decisions, memories, and logs from all branches combine automatically.
7. Say: *"✅ Team hired. Try: '{FirstCastName}, set up the project structure'"*
8. **Post-setup input sources** (optional; ask after the team is created, not during casting):
- PRD/spec: *"Do you have a PRD or spec document? (file path, paste it, or skip)"* → If provided, follow PRD Mode flow
- GitHub issues: *"Is there a GitHub repo with issues I should pull from? (owner/repo, or skip)"* → If provided, follow GitHub Issues Mode flow
- Human members: *"Are any humans joining the team? (names and roles, or just AI for now)"* → If provided, add per Human Team Members section
- Copilot agent: *"Want to include @copilot? It can pick up issues autonomously. (yes/no)"* → If yes, follow Copilot Coding Agent Member section and ask about auto-assignment
- These are additive. Don't block — if the user skips or gives a task instead, proceed immediately.
## Examples
**Example flow:**
1. Coordinator detects no team.md → Init Mode
2. Runs `git config user.name` → "Brady"
3. Asks: *"Hey Brady, what are you building?"*
4. User: *"TypeScript CLI tool with GitHub API integration"*
> Determines which LLM model to use for each agent spawn.
## SCOPE
✅ THIS SKILL PRODUCES:
- A resolved `model` parameter for every `task` tool call
- Persistent model preferences in `.squad/config.json`
- Spawn acknowledgments that include the resolved model
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Model performance benchmarks
- Cost reports or billing artifacts
## Context
Squad supports 18+ models across three tiers (premium, standard, fast). The coordinator must select the right model for each agent spawn. Users can set persistent preferences that survive across sessions.
## 5-Layer Model Resolution Hierarchy
Resolution is **first-match-wins** — the highest layer with a value wins.
**Key principle:** Layer 0 (persistent config) beats everything. If the user said "always use opus" and it was saved to config.json, every agent gets opus regardless of role or task type. This is intentional — the user explicitly chose quality over cost.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `defaultModel` field. If present, this is the Layer 0b override for all spawns
3. CHECK for `agentModelOverrides` field — if present, these are per-agent Layer 0a overrides
4. STORE both values in session context for the duration
### On Every Agent Spawn
1. CHECK Layer 0a: Is there an `agentModelOverrides.{agentName}` in config.json? → Use it.
2. CHECK Layer 0b: Is there a `defaultModel` in config.json? → Use it.
3. CHECK Layer 1: Did the user give a session directive? → Use it.
4. CHECK Layer 2: Does the agent's charter have a `## Model` section? → Use it.
**Never fall UP in tier.** A fast task won't land on a premium model via fallback.
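As a rough illustration of first-match-wins, the four checks listed under "On Every Agent Spawn" can be written as an ordered scan. This sketch covers only Layers 0a, 0b, 1, and 2 as shown above; the function name `resolve_model` and the `fallback` parameter (standing in for the layers not shown here) are assumptions, and the model names in the usage note are placeholders.

```python
from typing import Mapping, Optional


def resolve_model(
    agent_name: str,
    config: Mapping[str, object],
    *,
    fallback: str,
    session_directive: Optional[str] = None,
    charter_model: Optional[str] = None,
) -> str:
    """Return the first model found, scanning layers in priority order.

    `config` is the parsed .squad/config.json. `fallback` is a hypothetical
    stand-in for the remaining layers, which this sketch does not model.
    """
    overrides = config.get("agentModelOverrides")
    per_agent = overrides.get(agent_name) if isinstance(overrides, dict) else None

    candidates = [
        per_agent,                   # Layer 0a: per-agent persistent override
        config.get("defaultModel"),  # Layer 0b: persistent default
        session_directive,           # Layer 1: session directive from the user
        charter_model,               # Layer 2: `## Model` section in the charter
    ]
    for candidate in candidates:
        if candidate:
            return str(candidate)
    return fallback
```

With `{"defaultModel": "model-b"}` in the config, a session directive of `"model-c"` is ignored and `"model-b"` is returned, which matches the key principle that persistent config beats everything.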
# Model Selection
> Determines which LLM model to use for each agent spawn.
## SCOPE
✅ THIS SKILL PRODUCES:
- A resolved `model` parameter for every `task` tool call
- Persistent model preferences in `.squad/config.json`
- Spawn acknowledgments that include the resolved model
❌ THIS SKILL DOES NOT PRODUCE:
- Code, tests, or documentation
- Model performance benchmarks
- Cost reports or billing artifacts
## Context
Squad supports 18+ models across three tiers (premium, standard, fast). The coordinator must select the right model for each agent spawn. Users can set persistent preferences that survive across sessions.
## 5-Layer Model Resolution Hierarchy
Resolution is **first-match-wins** — the highest layer with a value wins.
**Key principle:** Layer 0 (persistent config) beats everything. If the user said "always use opus" and it was saved to config.json, every agent gets opus regardless of role or task type. This is intentional — the user explicitly chose quality over cost.
## AGENT WORKFLOW
### On Session Start
1. READ `.squad/config.json`
2. CHECK for `defaultModel` field. If present, this is the Layer 0b override for all spawns
3. CHECK for `agentModelOverrides` field — if present, these are per-agent Layer 0a overrides
4. STORE both values in session context for the duration
### On Every Agent Spawn
1. CHECK Layer 0a: Is there an `agentModelOverrides.{agentName}` in config.json? → Use it.
2. CHECK Layer 0b: Is there a `defaultModel` in config.json? → Use it.
3. CHECK Layer 1: Did the user give a session directive? → Use it.
4. CHECK Layer 2: Does the agent's charter have a `## Model` section? → Use it.
A personal squad is a user-level collection of AI agents that travel with you across projects. Unlike project agents (defined in a project's `.squad/` directory), personal agents live in your global config directory and are automatically discovered when you start a squad session.
## Directory Structure
```
~/.config/squad/personal-squad/    # Linux/macOS
%APPDATA%/squad/personal-squad/    # Windows
├── agents/
│   ├── {agent-name}/
│   │   ├── charter.md
│   │   └── history.md
│   └── ...
└── config.json                    # Optional: personal squad config
```
## How It Works
1. **Ambient Discovery:** When Squad starts a session, it checks for a personal squad directory
2. **Merge:** Personal agents are merged into the session cast alongside project agents
3. **Ghost Protocol:** Personal agents can read project state but not write to it
4. **Kill Switch:** Set `SQUAD_NO_PERSONAL=1` to disable ambient discovery
## Commands
- `squad personal init`: Bootstrap a personal squad directory
- `squad personal list`: List your personal agents
- `squad personal add {name} --role {role}`: Add a personal agent
- `squad personal remove {name}`: Remove a personal agent
- `squad cast`: Show the current session cast (project + personal)
## Ghost Protocol
See `templates/ghost-protocol.md` for the full rules. Key points:
- Personal agents advise; project agents execute
- No writes to project `.squad/` state
- Transparent origin tagging in logs
- Project agents take precedence on conflicts
## Configuration
Optional `config.json` in the personal squad directory:
```json
{
  "defaultModel": "auto",
  "ghostProtocol": true,
  "agents": {}
}
```
## Environment Variables
- `SQUAD_NO_PERSONAL`: Set to any value to disable personal squad discovery
- `SQUAD_PERSONAL_DIR`: Override the default personal squad directory path
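The two environment variables and the default paths from the directory listing can be combined into one lookup. The sketch below is an assumption-laden illustration rather than Squad's real discovery code: the function name `personal_squad_dir` is hypothetical, "any value" is read as any non-empty value, and the kill switch is assumed to take priority over the path override.

```python
import sys
from pathlib import Path
from typing import Mapping, Optional


def personal_squad_dir(
    env: Mapping[str, str], platform: str = sys.platform
) -> Optional[Path]:
    """Return the personal squad directory, or None when discovery is disabled."""
    # Kill switch (assumed: any non-empty value disables discovery).
    if env.get("SQUAD_NO_PERSONAL"):
        return None

    # Explicit override of the default location.
    override = env.get("SQUAD_PERSONAL_DIR")
    if override:
        return Path(override)

    # Windows default: %APPDATA%/squad/personal-squad/
    if platform.startswith("win"):
        appdata = env.get("APPDATA")
        if not appdata:
            return None
        return Path(appdata) / "squad" / "personal-squad"

    # Linux/macOS default: ~/.config/squad/personal-squad/
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".config" / "squad" / "personal-squad"
```

In practice this would be called as `personal_squad_dir(os.environ)`; passing the environment in as a mapping keeps the function easy to exercise with different settings.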
# Personal Squad — Skill Document
## What is a Personal Squad?
A personal squad is a user-level collection of AI agents that travel with you across projects. Unlike project agents (defined in a project's `.squad/` directory), personal agents live in your global config directory and are automatically discovered when you start a squad session.
## Directory Structure
```
~/.config/squad/personal-squad/    # Linux/macOS
%APPDATA%/squad/personal-squad/    # Windows
├── agents/
│   ├── {agent-name}/
│   │   ├── charter.md
│   │   └── history.md
│   └── ...
└── config.json                    # Optional: personal squad config
```
## How It Works
1. **Ambient Discovery:** When Squad starts a session, it checks for a personal squad directory
2. **Merge:** Personal agents are merged into the session cast alongside project agents
3. **Ghost Protocol:** Personal agents can read project state but not write to it
4. **Kill Switch:** Set `SQUAD_NO_PERSONAL=1` to disable ambient discovery
## Commands
- `squad personal init`: Bootstrap a personal squad directory
- `squad personal list`: List your personal agents
- `squad personal add {name} --role {role}`: Add a personal agent
- `squad personal remove {name}`: Remove a personal agent
- `squad cast`: Show the current session cast (project + personal)
## Ghost Protocol
See `templates/ghost-protocol.md` for the full rules. Key points:
- Personal agents advise; project agents execute
- No writes to project `.squad/` state
- Transparent origin tagging in logs
- Project agents take precedence on conflicts
## Configuration
Optional `config.json` in the personal squad directory:
```json
{
  "defaultModel": "auto",
  "ghostProtocol": true,
  "agents": {}
}
```
## Environment Variables
- `SQUAD_NO_PERSONAL`: Set to any value to disable personal squad discovery
- `SQUAD_PERSONAL_DIR`: Override the default personal squad directory path
description: "Core conventions and patterns for this codebase"
domain: "project-conventions"
confidence: "medium"
source: "template"
---
## Context
> **This is a starter template.** Replace the placeholder patterns below with your actual project conventions. Skills train agents on codebase-specific practices — accurate documentation here improves agent output quality.
## Patterns
### [Pattern Name]
Describe a key convention or practice used in this codebase. Be specific about what to do and why.
### Error Handling
<!-- Example: How does your project handle errors? -->
<!-- - Use try/catch with specific error types? -->
<!-- - Log to a specific service? -->
<!-- - Return error objects vs throwing? -->
### Testing
<!-- Example: What test framework? Where do tests live? How to run them? -->
<!-- - Test framework: Jest/Vitest/node:test/etc. -->
<!-- - Test location: test/, __tests__/, *.test.ts, etc. -->
// Add code examples that demonstrate your conventions
```
## Anti-Patterns
<!-- List things to avoid in this codebase -->
- **[Anti-pattern]** — Explanation of what not to do and why.
---
name: "project-conventions"
description: "Core conventions and patterns for this codebase"
domain: "project-conventions"
confidence: "medium"
source: "template"
---
## Context
> **This is a starter template.** Replace the placeholder patterns below with your actual project conventions. Skills train agents on codebase-specific practices — accurate documentation here improves agent output quality.
## Patterns
### [Pattern Name]
Describe a key convention or practice used in this codebase. Be specific about what to do and why.
### Error Handling
<!-- Example: How does your project handle errors? -->
<!-- - Use try/catch with specific error types? -->
<!-- - Log to a specific service? -->
<!-- - Return error objects vs throwing? -->
### Testing
<!-- Example: What test framework? Where do tests live? How to run them? -->
<!-- - Test framework: Jest/Vitest/node:test/etc. -->
<!-- - Test location: test/, __tests__/, *.test.ts, etc. -->
description: "Step-by-step release checklist for Squad — prevents v0.8.22-style disasters"
domain: "release-management"
confidence: "high"
source: "team-decision"
---
## Context
This is the **definitive release runbook** for Squad. Born from the v0.8.22 release disaster (4-part semver mangled by npm, draft release never triggered publish, wrong NPM_TOKEN type, 6+ hours of broken `latest` dist-tag).
**Rule:** No agent releases Squad without following this checklist. No exceptions. No improvisation.
---
## Pre-Release Validation
Before starting ANY release work, validate the following:
### 1. Version Number Validation
**Rule:** Only 3-part semver (major.minor.patch) or prerelease (major.minor.patch-tag.N) are valid. 4-part versions (0.8.21.4) are NOT valid semver and npm will mangle them.
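A pre-flight check for this rule might look like the sketch below. It accepts only the two shapes the rule names, `major.minor.patch` and `major.minor.patch-tag.N`, so it is deliberately narrower than the full semver grammar. The function name `is_release_version` and the prerelease tag `preview` in the usage note are illustrative assumptions.

```python
import re

# Numeric parts follow semver: no leading zeros except for a bare "0".
_NUM = r"(?:0|[1-9]\d*)"

# major.minor.patch, optionally followed by -tag.N
_VERSION_RE = re.compile(
    rf"{_NUM}\.{_NUM}\.{_NUM}(?:-[A-Za-z][0-9A-Za-z-]*\.{_NUM})?"
)


def is_release_version(version: str) -> bool:
    """Return True for the two version shapes the runbook allows."""
    return _VERSION_RE.fullmatch(version) is not None
```

For example, `is_release_version("0.8.22")` and `is_release_version("0.8.23-preview.1")` pass, while the 4-part `is_release_version("0.8.21.4")` is rejected before it can reach npm.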
- [ ] dev branch has next preview version: `git show dev:package.json | grep version` shows next preview
---
## Post-Mortem Reference
This skill was created after the v0.8.22 release disaster. Full retrospective: `.squad/decisions/inbox/keaton-v0822-retrospective.md`
**Key learnings:**
1. No release without a runbook = improvisation = disaster
2. Semver validation is mandatory — 4-part versions break npm
3. NPM_TOKEN type matters — User tokens with 2FA fail in CI
4. Draft releases are a footgun — they don't trigger automation
5. Retry logic is essential — npm propagation takes time
**Never again.**
---
name: "release-process"
description: "Step-by-step release checklist for Squad — prevents v0.8.22-style disasters"
domain: "release-management"
confidence: "high"
source: "team-decision"
---
## Context
This is the **definitive release runbook** for Squad. Born from the v0.8.22 release disaster (4-part semver mangled by npm, draft release never triggered publish, wrong NPM_TOKEN type, 6+ hours of broken `latest` dist-tag).
**Rule:** No agent releases Squad without following this checklist. No exceptions. No improvisation.
---
## Pre-Release Validation
Before starting ANY release work, validate the following:
### 1. Version Number Validation
**Rule:** Only 3-part semver (major.minor.patch) or prerelease (major.minor.patch-tag.N) are valid. 4-part versions (0.8.21.4) are NOT valid semver and npm will mangle them.
description: "Team-wide charter and history optimization through skill extraction"
domain: "team-optimization"
confidence: "high"
source: "manual — Brady directive to reduce per-agent context overhead"
---
## Context
When the coordinator hears "team, reskill" (or similar: "optimize context", "slim down charters"), trigger a team-wide optimization pass. The goal: reduce per-agent context consumption by extracting shared patterns from charters and histories into reusable skills.
This is a periodic maintenance activity. Run whenever charter/history bloat is suspected.
## Process
### Step 1: Audit
Read all agent charters and histories. Measure byte sizes. Identify:
- **Boilerplate** — sections repeated across ≥3 charters with <10% variation (collaboration, model, boundaries template)
### Minimal Charter Template (target format after reskill)
```
# {Name} — {Role}
> {Tagline — one sentence capturing voice and philosophy}
## Identity
- **Name:** {Name}
- **Role:** {Role}
- **Expertise:** {comma-separated list}
## What I Own
- {bullet list of owned artifacts/domains}
## How I Work
- {unique patterns and principles — NOT boilerplate}
## Boundaries
**I handle:** {domain list}
**I don't handle:** {explicit exclusions}
## Model
Preferred: {model}
```
### Skill Extraction Threshold
- **1 charter** → leave in charter (unique to that agent)
- **2 charters** → consider extracting if >500 bytes of overlap
- **3+ charters** → always extract to a shared skill
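The threshold can be expressed as a small decision function. This is a sketch only: the name `extraction_decision` and the three return labels are hypothetical, and it assumes that two charters with 500 bytes of overlap or less are left alone, which the list implies but does not state.

```python
def extraction_decision(charter_count: int, overlap_bytes: int) -> str:
    """Map the skill extraction threshold to one of three outcomes.

    Returns "keep" (leave in charter), "consider", or "extract".
    """
    if charter_count < 1:
        raise ValueError("a pattern must appear in at least one charter")
    if charter_count >= 3:
        return "extract"   # 3+ charters: always extract to a shared skill
    if charter_count == 2 and overlap_bytes > 500:
        return "consider"  # 2 charters: consider only above 500 bytes of overlap
    return "keep"          # unique to one agent, or too small to be worth it
```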
## Anti-Patterns
- Don't delete unique per-agent identity or domain-specific knowledge
- Don't create skills for content only one agent uses
- Don't merge unrelated patterns into a single mega-skill
- Don't remove Model preference line (coordinator needs it for model selection)
- Don't touch `.squad/decisions.md` during reskill
- Don't remove the tagline blockquote — it's the charter's soul in one line
---
name: "reskill"
description: "Team-wide charter and history optimization through skill extraction"
domain: "team-optimization"
confidence: "high"
source: "manual — Brady directive to reduce per-agent context overhead"
---
## Context
When the coordinator hears "team, reskill" (or similar: "optimize context", "slim down charters"), trigger a team-wide optimization pass. The goal: reduce per-agent context consumption by extracting shared patterns from charters and histories into reusable skills.
This is a periodic maintenance activity. Run whenever charter/history bloat is suspected.
## Process
### Step 1: Audit
Read all agent charters and histories. Measure byte sizes. Identify:
- **Boilerplate** — sections repeated across ≥3 charters with <10% variation (collaboration, model, boundaries template)
### Minimal Charter Template (target format after reskill)
```
# {Name} — {Role}
> {Tagline — one sentence capturing voice and philosophy}
## Identity
- **Name:** {Name}
- **Role:** {Role}
- **Expertise:** {comma-separated list}
## What I Own
- {bullet list of owned artifacts/domains}
## How I Work
- {unique patterns and principles — NOT boilerplate}
## Boundaries
**I handle:** {domain list}
**I don't handle:** {explicit exclusions}
## Model
Preferred: {model}
```
### Skill Extraction Threshold
- **1 charter** → leave in charter (unique to that agent)
- **2 charters** → consider extracting if >500 bytes of overlap
- **3+ charters** → always extract to a shared skill
## Anti-Patterns
- Don't delete unique per-agent identity or domain-specific knowledge
- Don't create skills for content only one agent uses
- Don't merge unrelated patterns into a single mega-skill
- Don't remove Model preference line (coordinator needs it for model selection)
- Don't touch `.squad/decisions.md` during reskill
- Don't remove the tagline blockquote — it's the charter's soul in one line