Files
meshcore-analyzer/cmd/server
Kpa-clawbot b21badbcbd fix(#1225): paginate channel messages at SQL level — 30s → <500ms (#1226)
## Summary
Fixes #1225 — channel messages endpoint took ~30s on staging.

## Root cause
`(*DB).GetChannelMessages` SELECTed every observation row for the
channel (one row per observation, not per transmission),
JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender,
packetHash)`, then sliced the tail in Go for pagination.

On staging `#wardriving`:
- `transmissions` rows with `channel_hash='#wardriving' AND
payload_type=5`: **5,703**
- `observations` joined to those: **274,632** (~48× amplification)
- `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s /
31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop)

`EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being
used — the cost was entirely in fetching, unmarshalling, and folding the
full observation set per request even for `limit=50`.

Hypothesis #1 from the issue (full table scan on `messages/decoded`) is
rejected; #2 (missing index) is rejected; the actual cause was
**pagination in Go instead of SQL** — request cost was O(observations)
not O(limit).

## Fix
Move pagination into SQL on the `transmissions` table. Because
`transmissions.hash` is `UNIQUE` and the original dedup key was
`(sender, hash)`, each transmission collapses to exactly one logical
message — paginating on transmissions is semantically equivalent to the
prior in-Go dedup + tail slice.

New shape:
1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`).
2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ?
OFFSET ?` to pick the page of newest transmissions.
3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` —
typically 50 ids → a few hundred observation rows.
4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API
contract.

Region filtering, observation-count-as-`repeats`, and "first observation
wins for hops/snr/observer" semantics are preserved (observations are
scanned `ORDER BY o.id ASC`).

## Perf measurements
**Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop):
30.04s, 31.41s, 31.48s, 35.33s, 34.05s.
**Synthetic regression test**
(`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs.
- Broken impl: ~4.5s (test fails the 500ms budget — the RED commit).
- Fixed impl: well under 500ms (test passes).
**After (staging)**: will measure post-deploy and post-comment on issue
with numbers. Synthetic scaling: staging is ~2× the test's transmission
count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows
on index) — expect <100ms p99.

## TDD
- RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at
~4.5s.
- GREEN: `3f1f82d3` — fix; full suite green, perf test passes.

## Hypotheses status
| # | Hypothesis | Verdict |
|---|---|---|
| 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different
mechanism — see root cause) |
| 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash`
exists & used) |
| 3 | Frontend re-render storm | Not investigated (backend was clearly
the bottleneck) |
| 4 | Decode in request path | Rejected (decode is at ingest time; JSON
unmarshal of cached `decoded_json` is the cost, addressed by reducing
row count) |
| 5 | WS subscription failure | Rejected |
| 6 | Staging artifact | Rejected (reproducible) |

## Out of scope
- The in-memory `(*PacketStore).GetChannelMessages` path (used when
`s.db == nil`) has the same shape but operates on bounded in-memory
data; not touched. If we ever fall back to it in production we'll
revisit.

---------

Co-authored-by: clawbot <bot@corescope>
2026-05-16 17:28:40 +00:00
..