mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-05-24 14:15:20 +00:00
d69d9fbf8e
## Summary Surgical fix for #1247: analytics endpoints regressed 3-9× between prod `d818527` and master. pprof against staging traced the regression to `resolveWithContext` tier-1 affinity loop running on every analytics `resolveHop` call (post-#1198 plumbing) with redundant per-(cand, ctx) work. **Result: 4.6× speedup on the synthetic hot-shape benchmark (202µs → 44µs / op).** ## Root cause - PR #1198 (`353c5264`) lit up `resolveWithContext` tier 1 from every analytics resolveHop closure (previously they passed `contextPubkeys=nil` and short-circuited the entire tier-1 block). - The inner loop did `N_cand × N_ctx` iterations where each one did: - `graph.Neighbors(strings.ToLower(ctxPK))` — graph RLock + ToLower allocation **per candidate**, redundantly - `strings.ToLower(cand.PublicKey)` per `ctxPK` - `strings.EqualFold(otherPK, ctxPK)` + `EqualFold(otherPK, candPK)` — both sides were already lowercased (`NeighborEdge.NodeA/B` via `makeEdgeKey`; `contextPubkeys` via `buildHopContextPubkeys`) - At staging scale (5k+ contextPubkeys × 30k+ resolveHop calls) this dominated `computeAnalyticsTopology` (37% of its CPU) and `computeAnalyticsRF` (55%). ## pprof attribution (staging, region-keyed queries bypassing #1240 cache) ``` computeAnalyticsTopology cum: 19.24% (5.45s / 28.32s sampled) └─ resolveWithContext 37% ├─ strings.ToLower 41% ├─ strings.EqualFold 28% └─ graph.Neighbors 24% computeAnalyticsRF cum: 10.38% ``` ## Fix (~80 LoC in `cmd/server/store.go`) 1. Lowercase `contextPubkeys` **once per call**, skipped entirely when already lowercased (the analytics fast path). 2. Lowercase candidate pubkeys **once per call**. 3. Invert the loop nesting: outer-ctx / inner-edge / candidate-map lookup. `graph.Neighbors` is called once per context pubkey instead of `N_cand` times. 4. Raw `==` instead of `strings.EqualFold` for pubkey comparisons (both sides lowercased by step 1/2). 5. Added a tiny `hasUpperASCII` byte-loop helper next to `isHexLower` for the fast-path check. Behavior preserved: same `Score × Confidence` formula, same tier-1 ratio + min-observations gate, same per-candidate "best edge wins" semantics. No change to tiers 2/3/4. ## TDD evidence - Red commit (`5f8d1564`): `TestResolveWithContextTier1Floor` asserts `<100 µs/call` on the hot shape. **199 µs/call on regressed master → FAIL.** - Green commit (`e3bdbc65`): surgical fix lands. **44 µs/call → PASS.** - Reverification: locally stashed the fix, ran the test → 199.5 µs FAIL; popped fix → 44 µs PASS. `BenchmarkResolveWithContextTier1Hot` (no assertion, visibility only): ``` before: 202013 ns/op 168 B/op 3 allocs/op after: 44084 ns/op 424 B/op 6 allocs/op speedup: 4.6× ``` (Post-fix allocs are O(N_cand + N_ctx) one-time helper tables — net win at hot scale.) ## Independence from #1248 PR #1248 caches the analytics compute output so user-facing latency is sub-ms even when the compute is slow. That's correct for UX but it masks the regression. This PR repairs the compute itself, so: - Region-keyed and windowed queries (which bypass the recomputer cache by design — see #1240) become fast again. - Future ingest scale or feature work on top of the regressed baseline doesn't compound. ## Out of scope - The geo-rejection (#1228) and Confidence weighting (#1229) commits — kept intact, they protect correctness and were not the dominant CPU cost. - Reverting any suspect commit — surgical only. ## Acceptance criteria from #1247 - [x] pprof confirms the hot function (`resolveWithContext`) - [x] Bisect identifies the regressing commit (`353c5264` / PR #1198 — context plumbing; ratified by pprof, no need to actually rebuild 5 binaries) - [x] Fix lands; tier-1 hot path 4.6× faster - [x] No regression in disambiguator correctness — full `go test ./...` green, all existing `ResolveWithContext` / `HopDisambig` / `NeighborGraph` / `Affinity` tests pass Fixes #1247 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>