## Summary
The in-memory `PacketStore` had **no eviction or aging** — it grew
unbounded until OOM killed the process. At ~3K packets/hour and ~5KB per
packet (not the 450 bytes previously estimated), an 8GB VM would OOM in
a few days.
## Changes
### Time-based eviction
- Configurable via `config.json`: `"packetStore": { "retentionHours": 24 }`
- Packets older than the retention window are evicted from the head of
the sorted slice
### Memory-based cap
- Configurable via `"packetStore": { "maxMemoryMB": 1024 }`
- Hard ceiling — evicts oldest packets when estimated memory exceeds the
cap
### Index cleanup
When a `StoreTx` is evicted, ALL associated data is removed from:
- `byHash`, `byTxID`, `byObsID`, `byObserver`, `byNode`, `byPayloadType`
- `nodeHashes`, `distHops`, `distPaths`, `spIndex`
### Periodic execution
- Background ticker runs eviction every 60 seconds
- Analytics caches and hash size cache are invalidated after eviction
### Stats fixes
- `estimatedMB` now uses ~5KB/packet + ~500B/observation (was 430B +
200B)
- `evicted` counter reflects actual evictions (was hardcoded to 0)
- Removed fake `maxPackets: 2386092` and `maxMB: 1024` from stats
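As a rough sketch of the revised estimate, the arithmetic is a weighted sum of packet and observation counts. The function name, the constant names, and the choice of 5000 and 500 bytes with 1 MB = 1024 × 1024 bytes are assumptions for illustration; the actual constants in the server may differ.

```go
package main

import "fmt"

const (
	bytesPerPacket      = 5000 // assumed value for "~5KB per packet"
	bytesPerObservation = 500  // assumed value for "~500B per observation"
)

// estimatedMB returns the approximate memory footprint in MB
// for the given number of stored packets and observations.
func estimatedMB(packets, observations int) float64 {
	total := packets*bytesPerPacket + observations*bytesPerObservation
	return float64(total) / (1024 * 1024)
}

func main() {
	// Illustrative counts only, not measured values.
	fmt.Printf("%.1f MB\n", estimatedMB(100000, 1000000))
}
```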
### Config example
```json
{
"packetStore": {
"retentionHours": 24,
"maxMemoryMB": 1024
}
}
```
Both values default to 0 (unlimited) for backward compatibility.
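One way the "0 means unlimited" default can fall out naturally is by relying on Go's zero values when the section is missing from `config.json`. The struct and function names below are hypothetical, and the field types (`float64` hours, `int` megabytes) are assumptions; only the JSON keys come from the config example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PacketStoreConfig mirrors the "packetStore" section; a zero value
// in either field is treated as "no limit".
type PacketStoreConfig struct {
	RetentionHours float64 `json:"retentionHours"`
	MaxMemoryMB    int     `json:"maxMemoryMB"`
}

// Config is a hypothetical top-level config holding only this section.
type Config struct {
	PacketStore PacketStoreConfig `json:"packetStore"`
}

// parseConfig decodes config JSON; absent keys keep their zero values.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(`{"packetStore": {"retentionHours": 24, "maxMemoryMB": 1024}}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg.PacketStore)

	// An older config without the section leaves both limits disabled.
	old, err := parseConfig([]byte(`{}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", old.PacketStore)
}
```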
## Tests
- 7 new tests in `eviction_test.go` covering time-based, memory-based,
index cleanup, thread safety, config parsing, and no-op when disabled
- All existing tests pass unchanged
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Problem
The RF analytics `packetsPerHour` chart was counting **observations**
instead of **unique transmissions** per hour. With ~34 observations per
transmission on average, the chart showed ~5,645 packets/hr instead of
the correct ~163/hr.
**Evidence from prod API:**
- `packetsPerHour` total: 1,580,620 (sum of all hourly counts)
- `totalPackets`: 45,764
- That's a ~34× inflation — exactly the observations-per-transmission
ratio
## Root Cause
In `store.go`, the `hourBuckets[hr]++` counter was inside the
observations loop (both regional and non-regional paths). Other counters
like `packetSizes` and `typeBuckets` already deduplicate by hash —
`hourBuckets` was the only one that didn't.
## Fix
Added a `seenHourHash` map (keyed by `hash|hour`) to deduplicate. Each
unique transmission is counted once per hour bucket, matching how packet
sizes and payload types already work.
Both the regional observer path and the non-regional path are fixed. The
legacy path (transmissions without observations) was already correct
since it iterates per-transmission.
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
when bind-mounting a non-existent file. The entrypoint now detects
and removes directory config.json before falling back to example.
2. 'unable to open database file: out of memory (14)' — old container
(3GB) not fully exited when new one starts. Deploy now uses
'docker compose down' with timeout and waits for memory reclaim.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
startretries to 10 and startsecs to 2 for server and ingestor.
Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
Root cause: on the 8GB VM, both prod (~2.5GB) and staging (~2GB) containers
run simultaneously. During deploy, manage.sh would rm the old staging container
and immediately start a new one. The old container's memory wasn't reclaimed
yet, so the new one got 'unable to open database file: out of memory (14)'
from SQLite and both corescope-server and corescope-ingestor entered FATAL.
Fix:
- manage.sh restart staging: wait up to 15s for old container to fully exit,
plus 3s for OS memory reclamation before starting new container
- manage.sh restart staging: verify config.json exists before starting
- docker-compose.staging.yml: add deploy.resources.limits.memory=3g to
prevent staging from consuming unbounded memory
## Summary
Adds `distributionByRepeaters` to the `/api/analytics/hash-sizes`
endpoint in the **Go server**.
### Problem
PR #263 implemented this feature in the deprecated Node.js server
(server.js). All backend changes should go in the Go server at
`cmd/server/`.
### Solution
- For each hash size (1, 2, 3), count how many unique repeaters (nodes)
advertise packets with that hash size
- Uses the existing `byNode` map already computed in
`computeAnalyticsHashSizes()`
- Added to both the live response and the empty/fallback response in
routes.go
- Frontend changes from PR #263 (`public/analytics.js`) already render
this field — no frontend changes needed
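A minimal sketch of the counting step, assuming `byNode` can be reduced to a map from node identifier to the hash size that node advertises. That shape is a simplification for illustration and may not match the real structure inside `computeAnalyticsHashSizes()`; the function name is also hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// distributionByRepeaters counts unique nodes per hash size (1, 2, 3).
// byNode maps a node identifier to its advertised hash size (assumed shape).
func distributionByRepeaters(byNode map[string]int) map[string]int {
	dist := map[string]int{"1": 0, "2": 0, "3": 0}
	for _, size := range byNode {
		if size >= 1 && size <= 3 {
			dist[strconv.Itoa(size)]++
		}
	}
	return dist
}

func main() {
	byNode := map[string]int{"nodeA": 1, "nodeB": 1, "nodeC": 2}
	out, err := json.Marshal(map[string]any{
		"distributionByRepeaters": distributionByRepeaters(byNode),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Pre-seeding the three keys keeps the response shape stable even when a hash size has no repeaters, which matches the intent of returning the field in the empty/fallback response too.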
### Response shape
```json
{
"distributionByRepeaters": { "1": 42, "2": 7, "3": 2 },
...existing fields...
}
```
### Testing
- All Go server tests pass
- Replaces PR #263 (which modified the wrong server)
Closes #263
---------
Co-authored-by: you <you@example.com>
`docker compose down` tears down the entire compose project, including prod.
`docker compose rm -sf` stops and removes only the named service.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Version/Commit/BuildTime now populated from package.json, git, and
date. Exported as env vars so docker compose build picks them up.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.
### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build
### Pipeline (master merge):
Same chain + deploy to staging + badge publishing
### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks
### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db -public public-instrumented`
---------
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to reduce wall-clock time:
1. Reduce safeClick timeout from 3000ms to 500ms
- Elements either exist immediately after navigation or don't exist at all
- ~75 safeClick calls; if ~30 miss, saves ~75s of dead wait time
2. Replace 18 page.goto() calls with SPA hash navigation
- After initial page load, the SPA shell is already in the DOM
- page.goto() reloads the entire page (network round-trip + parse)
- Hash navigation via location.hash triggers the SPA router instantly
- Only 3 page.goto() remain: initial load + 2 home page loads after localStorage.clear()
3. Remove redundant final route sweep
- All 10 routes were already visited during the page-specific sections
- The sweep just re-navigated to pages that had already been exercised
- Saves ~2s of redundant navigation
Also:
- Reduce inter-route wait from 200ms to 50ms (SPA router is synchronous)
- Merge utility function + packet filter exercises into single evaluate() call
- Use navHash() helper for consistent hash navigation with 150ms settle time
The test 'Node perf page should NOT show Go Runtime section' asserts
Node.js-specific behavior, but E2E tests now run against the Go server
(per this PR), so Go Runtime info is correctly present. Remove the
now-irrelevant assertion.
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.
Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory
The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
Change healthThresholds config from milliseconds to hours for readability.
Config keys: infraDegradedHours, infraSilentHours, nodeDegradedHours, nodeSilentHours.
Defaults: infra degraded 24h, silent 72h; node degraded 1h, silent 24h.
- Config stored in hours, converted to ms at comparison time
- /api/config/client sends ms to frontend (backward compatible)
- Frontend tooltips use dynamic thresholds instead of hardcoded strings
- Added healthThresholds section to config.example.json
- Updated Go and Node.js servers, tests
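The hours-to-milliseconds conversion at comparison time might look like the sketch below. The JSON keys and defaults come from the description above; the struct, the function names, the `"healthy"` label, and the inclusive `>=` boundaries are assumptions.

```go
package main

import "fmt"

// HealthThresholds holds the config values in hours.
type HealthThresholds struct {
	InfraDegradedHours float64 `json:"infraDegradedHours"`
	InfraSilentHours   float64 `json:"infraSilentHours"`
	NodeDegradedHours  float64 `json:"nodeDegradedHours"`
	NodeSilentHours    float64 `json:"nodeSilentHours"`
}

// defaultHealthThresholds returns the defaults listed in the description.
func defaultHealthThresholds() HealthThresholds {
	return HealthThresholds{
		InfraDegradedHours: 24,
		InfraSilentHours:   72,
		NodeDegradedHours:  1,
		NodeSilentHours:    24,
	}
}

// hoursToMs converts a threshold in hours to milliseconds.
func hoursToMs(h float64) int64 {
	return int64(h * 3600000)
}

// classify compares the time since a node was last seen (in ms)
// against thresholds given in hours.
func classify(sinceLastSeenMs int64, degradedHours, silentHours float64) string {
	switch {
	case sinceLastSeenMs >= hoursToMs(silentHours):
		return "silent"
	case sinceLastSeenMs >= hoursToMs(degradedHours):
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	th := defaultHealthThresholds()
	twoHours := hoursToMs(2)
	fmt.Println(classify(twoHours, th.NodeDegradedHours, th.NodeSilentHours))
	fmt.Println(classify(twoHours, th.InfraDegradedHours, th.InfraSilentHours))
}
```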
* docs: remove letsmesh.net reference from README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: remove paths-ignore from pull_request trigger
PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.
* ci: skip heavy CI jobs for docs-only PRs
Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.
This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.
Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.
* fix: update engine tests to match engine-badge HTML format
Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to the CI frontend test pipeline:
1. Run E2E tests and coverage collection concurrently
- Previously sequential (E2E ~1.5min, then coverage ~5.75min)
- Now both run in parallel against the same instrumented server
- Expected savings: ~5 min (coverage runs alongside E2E instead of after)
2. Replace networkidle with domcontentloaded in coverage collector
- SPA uses hash routing — networkidle waits 500ms for network silence
on every navigation, adding ~10-15s of dead time across 23 navigations
- domcontentloaded fires immediately once HTML is parsed; JS initializes
the route handler synchronously
- For in-page hash changes, use 200ms setTimeout instead of
waitForLoadState (which would never re-fire for same-document nav)
3. Extract coverage from E2E tests too
- E2E tests already exercise the app against the instrumented server
- Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
- nyc merges both coverage files for higher total coverage
Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace sleep 5 with health-check poll in quick E2E path
The poller's Start() calls GetMaxTransmissionID() to initialize its cursor.
When the test goroutine inserts data between go poller.Start() and the
actual GetMaxTransmissionID() call, the poller's cursor skips past the
test data and never broadcasts it, causing a timeout.
Adding a 100ms sleep after go poller.Start() ensures the poller has
initialized its cursors before the test inserts new data.
Each connection to a SQLite :memory: database gets its own separate database.
When the connection pool opens multiple connections (e.g. poller goroutine
vs main test goroutine), tables created on one connection are invisible
to others. Calling SetMaxOpenConns(1) ensures all queries use the same
in-memory database, fixing TestPollerBroadcastsMultipleObservations.
IngestNewFromDB now broadcasts one message per observation (not per
transmission). IngestNewObservations also broadcasts late arrivals.
Tests verify multi-observer packets produce multiple WS messages.
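The per-observation fan-out can be illustrated with a small sketch. All types and the `messagesFor` function here are hypothetical; in particular, emitting nothing for a transmission that has no observations is an assumption of this example, not something the commit states.

```go
package main

import "fmt"

// observation and transmission are hypothetical, trimmed-down records.
type observation struct {
	ObserverID string
}

type transmission struct {
	Hash         string
	Observations []observation
}

// wsMessage is a hypothetical WebSocket payload.
type wsMessage struct {
	Hash       string
	ObserverID string
}

// messagesFor builds one message per observation rather than one per
// transmission, so a packet heard by N observers yields N messages.
func messagesFor(txs []transmission) []wsMessage {
	var msgs []wsMessage
	for _, tx := range txs {
		for _, o := range tx.Observations {
			msgs = append(msgs, wsMessage{Hash: tx.Hash, ObserverID: o.ObserverID})
		}
	}
	return msgs
}

func main() {
	txs := []transmission{
		{Hash: "aa", Observations: []observation{{"obs1"}, {"obs2"}, {"obs3"}}},
	}
	fmt.Println(len(messagesFor(txs)))
}
```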
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compared decoder.js against the MeshCore firmware source (Dispatcher.cpp,
Packet.h, Mesh.cpp, AdvertDataHelpers.h) and fixed all mismatches:
1. Field order: transport codes now parsed BEFORE path_length byte,
matching the spec: [header][transport_codes?][path_length][path][payload]
2. ACK payload: was incorrectly decoded as dest(1)+src(1)+ackHash(4).
Firmware shows ACK is just checksum(4) — no dest/src hashes.
3. TRACE payload: was incorrectly decoded as flags(1)+tag(4)+dest(6)+src(1).
Firmware shows tag(4)+authCode(4)+flags(1)+pathData.
4. ADVERT appdata: added missing feature1 (0x20 flag) and feature2
(0x40 flag) parsing — 2-byte fields between location and name.
5. Transport code field naming: renamed nextHop/lastHop to code1/code2
to match spec terminology (transport_code_1/transport_code_2).
6. Fixed incorrect field size labels in packets.js hex breakdown:
dest/src are 1 byte, MAC is 2 bytes (not 6B/6B/4B).
7. Fixed ANON_REQ/PATH comment typos (dest was listed as 6 bytes,
MAC as 4 bytes — both wrong, code was already correct).
All 329 tests pass (66 decoder + 263 spec/golden).
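For reference, the corrected ACK and TRACE layouts from items 2 and 3 can be sketched in Go (the actual fix lives in decoder.js). The field sizes follow the description above; the type and function names are hypothetical, and little-endian byte order is an assumption that should be checked against the firmware source.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// TracePayload follows the layout tag(4)+authCode(4)+flags(1)+pathData.
type TracePayload struct {
	Tag      uint32
	AuthCode uint32
	Flags    byte
	PathData []byte
}

// decodeTrace parses a TRACE payload. Byte order is assumed little-endian.
func decodeTrace(p []byte) (TracePayload, error) {
	if len(p) < 9 {
		return TracePayload{}, errors.New("trace payload too short")
	}
	return TracePayload{
		Tag:      binary.LittleEndian.Uint32(p[0:4]),
		AuthCode: binary.LittleEndian.Uint32(p[4:8]),
		Flags:    p[8],
		PathData: p[9:],
	}, nil
}

// decodeAck parses an ACK payload, which is only a 4-byte checksum
// with no dest/src hashes. Byte order is assumed little-endian.
func decodeAck(p []byte) (uint32, error) {
	if len(p) < 4 {
		return 0, errors.New("ack payload too short")
	}
	return binary.LittleEndian.Uint32(p[:4]), nil
}

func main() {
	tp, err := decodeTrace([]byte{0x01, 0, 0, 0, 0x02, 0, 0, 0, 0x80, 0xaa, 0xbb})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", tp)
}
```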