meshcore-analyzer/cmd/server/neighbor_graph_cache.go
Kpa-clawbot 13bdee57d4 perf: P0 hot-path fixes (observers, neighbor-graph, observer-analytics) (#1481) (#1483)
## What

Three of the four P0s from #1481's scale-test findings. Each cuts a
distinct hot path; together they target /api/observers,
/api/analytics/neighbor-graph, and /api/observers/{id}/analytics, the
top three live offenders.

### P0-1: 5-min atomic-pointer cache for default neighbor-graph response
- Live p95 10.8s on the most-trafficked organic endpoint.
- Background recomputer (5-min cadence per operator directive) builds
  the default-filter (`minCount=5 minScore=0.1`, no region, no role)
  `NeighborGraphResponse` and stores it via `atomic.Pointer`.
- `handleNeighborGraph` short-circuits on the default shape; non-default
  filters take the extracted `computeNeighborGraphResponse` path
  (identical semantics to the previous inline build).
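
The cache pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `graphResponse`, `graphCache`, `rebuild`, `load`, and `isDefaultShape` are hypothetical names, and the real handler falls back to `computeNeighborGraphResponse` on a miss.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"
	"time"
)

// graphResponse is a hypothetical stand-in for NeighborGraphResponse.
type graphResponse struct {
	Nodes []string `json:"nodes"`
}

// cacheEntry pairs the pre-marshaled bytes with the time they were built.
type cacheEntry struct {
	json []byte
	at   time.Time
}

// graphCache publishes entries through an atomic pointer, so readers
// never take a lock.
type graphCache struct {
	ptr atomic.Pointer[cacheEntry]
}

// rebuild marshals the response once and swaps it in atomically.
// In the PR this work happens in a background goroutine.
func (c *graphCache) rebuild(resp graphResponse) error {
	b, err := json.Marshal(resp)
	if err != nil {
		return err
	}
	c.ptr.Store(&cacheEntry{json: b, at: time.Now()})
	return nil
}

// load returns the cached bytes, or false while the cache is still cold.
func (c *graphCache) load() ([]byte, bool) {
	e := c.ptr.Load()
	if e == nil {
		return nil, false
	}
	return e.json, true
}

// isDefaultShape reports whether a request matches the cached filter
// (minCount=5, minScore=0.1, no region, no role).
func isDefaultShape(minCount int, minScore float64, region, role string) bool {
	return minCount == 5 && minScore == 0.1 && region == "" && role == ""
}

func main() {
	var c graphCache
	if _, ok := c.load(); !ok {
		fmt.Println("cold: the handler would compute inline")
	}
	if err := c.rebuild(graphResponse{Nodes: []string{"a", "b"}}); err != nil {
		panic(err)
	}
	if b, ok := c.load(); ok && isDefaultShape(5, 0.1, "", "") {
		fmt.Printf("warm: %s\n", b)
	}
}
```

Because readers only call `Load`, a rebuild never blocks a request; a request sees either the old entry or the new one, never a partial write.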

### P0-2: cache parsed `StoreObs.Timestamp` + drop RLock window
- `handleObserverAnalytics` re-parsed the RFC3339 timestamp three times
  per observation, for 60k+ observations per active observer, under
  `s.store.mu.RLock` — blocking writers for the full scan.
- `StoreObs.ParsedTime()` parses once via `sync.Once` (mirrors
  `StoreTx.ParsedDecoded`).
- Handler snapshots the `byObserver[id]` pointer slice, releases the
  RLock immediately, then iterates locally.
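
A rough sketch of the two techniques above, parse-once via `sync.Once` and snapshot-then-release. The `obs` and `store` types and the `snapshot` and `countParsable` helpers are assumptions for illustration; the sketch handles only RFC3339, whereas the PR's tests cover three timestamp shapes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// obs is a hypothetical stand-in for StoreObs.
type obs struct {
	Timestamp string

	once   sync.Once
	parsed time.Time
	ok     bool
}

// ParsedTime parses Timestamp at most once; later calls return the
// cached result without re-parsing.
func (o *obs) ParsedTime() (time.Time, bool) {
	o.once.Do(func() {
		if t, err := time.Parse(time.RFC3339, o.Timestamp); err == nil {
			o.parsed, o.ok = t, true
		}
	})
	return o.parsed, o.ok
}

// store is a hypothetical stand-in for the server's observation store.
type store struct {
	mu         sync.RWMutex
	byObserver map[string][]*obs
}

// snapshot copies the pointer slice under the read lock and releases the
// lock before the caller iterates, so writers are not blocked by the scan.
func (s *store) snapshot(id string) []*obs {
	s.mu.RLock()
	defer s.mu.RUnlock()
	src := s.byObserver[id]
	out := make([]*obs, len(src))
	copy(out, src)
	return out
}

// countParsable scans a snapshot without holding any store lock.
func countParsable(list []*obs) int {
	n := 0
	for _, o := range list {
		if _, ok := o.ParsedTime(); ok {
			n++
		}
	}
	return n
}

func main() {
	s := &store{byObserver: map[string][]*obs{
		"obs-1": {
			{Timestamp: "2026-05-29T02:42:21-07:00"},
			{Timestamp: "not a timestamp"},
		},
	}}
	list := s.snapshot("obs-1")
	fmt.Println("parsable observations:", countParsable(list))
}
```

Copying the slice here is a conservative choice; whether the real handler copies or only captures the slice header depends on how writers mutate `byObserver`, which this sketch does not model.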

### P0-3: 30s cache for `/api/observers` + sargable `IN` + covering index
- Three SQL queries on every request → ~1.7s p50 at 50-concurrent.
- Atomic-pointer 30s cache for the default (no-filter) query.
- `GetNodeLocationsByKeys` drops `LOWER(public_key) IN (...)`
  (non-sargable); callers pre-lowercase in Go and the plain `IN` matches
  the existing `public_key` index.
- New ingestor migration `obs_observer_ts_idx_v1` adds composite index
  `idx_observations_observer_idx_timestamp(observer_idx, timestamp)` so
  `GetObserverPacketCounts` can resolve its GROUP-BY + range filter from
  the index without scanning the 1.9M-row observations table.
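
The sargable `IN` change can be illustrated with a small query builder. This is only a sketch under assumptions: `buildLocationQuery`, the `nodes` table, and the `lat`/`lon` columns are hypothetical; only the `public_key` column and the lowercase-in-Go idea come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildLocationQuery lowercases keys in Go and emits a plain IN (...)
// predicate, so the database can use an index on public_key instead of
// evaluating LOWER(public_key) for every row.
// Table and column names other than public_key are assumptions.
func buildLocationQuery(keys []string) (string, []any) {
	if len(keys) == 0 {
		return "", nil // nothing to look up; the caller skips the query
	}
	placeholders := make([]string, len(keys))
	args := make([]any, len(keys))
	for i, k := range keys {
		placeholders[i] = "?"
		args[i] = strings.ToLower(k)
	}
	q := "SELECT public_key, lat, lon FROM nodes WHERE public_key IN (" +
		strings.Join(placeholders, ",") + ")"
	return q, args
}

func main() {
	q, args := buildLocationQuery([]string{"ABCD", "ef01"})
	fmt.Println(q)
	fmt.Println(args...)
}
```

This only matches the same rows as the old query if stored `public_key` values are already lowercase, which the plain `IN` in the PR implies.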

### P0-4: deferred
`perfMiddleware`'s global mutex was claimed to serialize every API
request. A direct test (`50 concurrent requests through the middleware,
handler sleeps 20ms each`) shows total elapsed ≈ 25ms, not 1s: the lock
is held only for the post-handler bookkeeping (a few µs). Real impact is
below measurement noise. Skipping to avoid invasive churn on PerfStats
consumers without a demonstrable win.
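
A measurement of the kind described above might look like the following sketch. `perfStats`, `wrap`, and `measure` are hypothetical names, not the real `perfMiddleware`; the sketch only assumes, as the text states, that the mutex is taken after the handler returns.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// perfStats is a hypothetical stand-in for the middleware's shared counters.
type perfStats struct {
	mu       sync.Mutex
	requests int
}

// wrap runs the handler outside the lock and takes the mutex only for
// the short post-handler bookkeeping.
func (p *perfStats) wrap(handler func()) {
	handler()
	p.mu.Lock()
	p.requests++
	p.mu.Unlock()
}

// measure fires n concurrent calls whose handler sleeps sleepMs and
// returns the wall-clock milliseconds plus the recorded request count.
func measure(n, sleepMs int) (int64, int) {
	var p perfStats
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.wrap(func() {
				time.Sleep(time.Duration(sleepMs) * time.Millisecond)
			})
		}()
	}
	wg.Wait()
	elapsed := time.Since(start).Milliseconds()
	p.mu.Lock()
	defer p.mu.Unlock()
	return elapsed, p.requests
}

func main() {
	ms, n := measure(50, 20)
	fmt.Printf("%d requests in %dms\n", n, ms)
}
```

If the lock were held across the handler, 50 requests at 20ms each would take about 1s in total; an elapsed time near a single sleep interval indicates the handlers overlap.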

## Test plan

Red → green per P0:
- `observers_cache_test.go`: handler reads `s.observersCache` before
  SQL, TTL boundary, atomic.Pointer (no mutex contention).
- `storeobs_parsedtime_test.go`: parses three timestamp shapes, caches
  result, no race under concurrent readers.
- `neighbor_graph_cache_test.go`: handler serves from atomic pointer
  when set, bypasses cache when `?region=` (or any non-default filter)
  is passed.

Full server + ingestor suites pass: `go test -count=1 ./...`.

## Perf proof

Before/after p50/p95/p99 (50 requests × 50 concurrent) against prod
(before) and staging once CI deploys (after) will be posted as a PR
comment per the operator's "no merge without proof of improvement" gate.

Closes #1481


## TDD exemption — P0-1 and P0-2 (net-new surfaces, AGENTS.md)

Per CoreScope `AGENTS.md` § "Exemptions": **net-new code surfaces with
no prior tests to break** may land tests in the same PR without a strict
test-first → impl commit split.

- **P0-1 (neighbor-graph atomic-pointer cache)** — `neighborGraphCache`,
  `recomputeNeighborGraphCache`, `loadNeighborGraphCacheBytes`,
  `startNeighborGraphRecomputer` and the default-shape short-circuit in
  `handleNeighborGraph` were brand-new code with no pre-existing
  assertions covering them. There was no green test to first turn red.
- **P0-2 (cached `StoreObs.Timestamp` + RLock window drop)** —
  `StoreObs.ParsedTime()` and the snapshot+release pattern in
  `handleObserverAnalytics` were new surfaces; the prior code did the
  parse inline per call with no behavioural test to break.

P0-3 was authored properly red-then-green (commit `6e63ec6a` red, then
`83ae129b` green) and does NOT use this exemption.

## Default-filter detection vs frontend reality (#1483 follow-up)

The Neighbor Graph analytics tab in `public/analytics.js` fetches
`/analytics/neighbor-graph?min_count=1&min_score=0` because the
client-side sliders need the full edge set to filter from. That shape
did NOT match the `(5, 0.1)` cached default, so the UI tab still paid
the cold compute cost despite #1481 P0-1.

The #1483 follow-up commit caches BOTH shapes in the same recomputer
pass:
- `(minCount=5, minScore=0.1, no region, no role)` — `live.js`
  affinity-scoring consumer.
- `(minCount=1, minScore=0, no region, no role)` — analytics tab.

Both are served from `atomic.Pointer` with an `X-Cache-Age-Seconds`
header. The per-shape cost in the background goroutine is roughly
linear in edge count; total recompute time stays well under the
5-minute cadence on prod-scale graphs.
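
The two-shape lookup and the age header can be sketched as below. `shapeCache`, `entry`, `pick`, `get`, and `newDemoCache` are hypothetical; the assumption that omitted `min_count` and `min_score` parameters mean the `(5, 0.1)` default is mine, and the inline fallback is reduced to a placeholder body.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync/atomic"
	"time"
)

type entry struct {
	json []byte
	at   time.Time
}

// shapeCache holds one atomic pointer per cached filter shape.
type shapeCache struct {
	filtered   atomic.Pointer[entry] // (min_count=5, min_score=0.1)
	unfiltered atomic.Pointer[entry] // (min_count=1, min_score=0)
}

// pick returns the cached entry for the request shape, or nil on a miss.
func (c *shapeCache) pick(minCount int, minScore float64, region, role string) *entry {
	if region != "" || role != "" {
		return nil
	}
	switch {
	case minCount == 5 && minScore == 0.1:
		return c.filtered.Load()
	case minCount == 1 && minScore == 0:
		return c.unfiltered.Load()
	}
	return nil
}

func (c *shapeCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	minCount, minScore := 5, 0.1 // assumed defaults when params are omitted
	if v := q.Get("min_count"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			minCount = n
		}
	}
	if v := q.Get("min_score"); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			minScore = f
		}
	}
	w.Header().Set("Content-Type", "application/json")
	e := c.pick(minCount, minScore, q.Get("region"), q.Get("role"))
	if e == nil {
		// Placeholder: the real handler computes the response inline here.
		w.Write([]byte(`{"shape":"computed"}`))
		return
	}
	age := int64(time.Since(e.at) / time.Second)
	w.Header().Set("X-Cache-Age-Seconds", strconv.FormatInt(age, 10))
	w.Write(e.json)
}

// get issues an in-memory request and returns status, age header, and body.
func get(h http.Handler, target string) (int, string, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Header().Get("X-Cache-Age-Seconds"), rec.Body.String()
}

func newDemoCache() *shapeCache {
	c := &shapeCache{}
	c.filtered.Store(&entry{json: []byte(`{"shape":"default"}`), at: time.Now()})
	c.unfiltered.Store(&entry{json: []byte(`{"shape":"unfiltered"}`), at: time.Now()})
	return c
}

func main() {
	c := newDemoCache()
	for _, u := range []string{"/x", "/x?min_count=1&min_score=0", "/x?region=us"} {
		code, age, body := get(c, u)
		fmt.Println(u, code, age, body)
	}
}
```

A client can use `X-Cache-Age-Seconds` to tell a cached response from an inline one, since in this sketch the header is absent on the computed path.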

---------

Co-authored-by: openclaw-bot <bot@openclaw.dev>
Co-authored-by: mc-bot <mc-bot@users.noreply.github.com>
2026-05-29 02:42:21 -07:00

156 lines
5.0 KiB
Go

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"runtime/debug"
	"strconv"
	"sync/atomic"
	"time"
)

// #1481 P0-1: cached default-filter neighbor-graph response.
//
// The /api/analytics/neighbor-graph handler does graph build + per-edge
// score + filter + ~900KB JSON marshal on every request. The default
// (no-region, no-role, minCount=5, minScore=0.1) shape covers the
// overwhelming majority of organic traffic; cache the fully-built AND
// pre-marshaled response so warm reads are a single Write. Recomputed
// every 5 minutes in the background — never on the hot path.
const neighborGraphCacheInterval = 5 * time.Minute
// neighborGraphCacheEntry holds both the response struct (kept for
// tests / structured access) and the pre-marshaled bytes that the
// handler writes verbatim.
type neighborGraphCacheEntry struct {
	resp NeighborGraphResponse
	json []byte
	at   time.Time
}

type neighborGraphCacheField struct {
	ptr atomic.Pointer[neighborGraphCacheEntry]
	// unfiltered = the (minCount=1, minScore=0, no region/role) shape
	// the analytics tab actually hits. Cached separately so the UI
	// tab also benefits from the warm path; client-side sliders then
	// filter from full data. #1483 follow-up to perf claim.
	unfilteredPtr atomic.Pointer[neighborGraphCacheEntry]
}

// startNeighborGraphRecomputer launches a background goroutine that
// rebuilds the cached responses once immediately and then every
// interval. The call itself returns right away; the goroutine exits
// when the stop channel is closed. A non-positive interval falls back
// to neighborGraphCacheInterval.
func (s *Server) startNeighborGraphRecomputer(interval time.Duration, stop <-chan struct{}) {
	if interval <= 0 {
		interval = neighborGraphCacheInterval
	}
	go func() {
		s.recomputeNeighborGraphCache()
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				s.recomputeNeighborGraphCache()
			case <-stop:
				return
			}
		}
	}()
}

// recomputeNeighborGraphCache builds and pre-marshals the default-shape
// response and then the unfiltered (analytics-tab) shape, atomically
// swapping each in. Panic-defensive so a single bad rebuild doesn't kill
// the background goroutine; the panic is logged and a counter is
// incremented so operators see the failure (#1483 follow-up).
// Note: a marshal error on the default shape returns early, so the
// unfiltered shape is not rebuilt in that pass.
func (s *Server) recomputeNeighborGraphCache() {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("[neighbor-graph-cache] rebuild panic: %v\n%s", r, debug.Stack())
			atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		}
	}()
	start := time.Now()
	resp := s.buildDefaultNeighborGraphResponse()
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(resp); err != nil {
		log.Printf("[neighbor-graph-cache] marshal error: %v", err)
		atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		return
	}
	s.neighborGraphCache.ptr.Store(&neighborGraphCacheEntry{
		resp: resp,
		json: buf.Bytes(),
		at:   time.Now(),
	})
	log.Printf("[neighbor-graph-cache] rebuild ok in %v, nodes=%d", time.Since(start), len(resp.Nodes))

	// Build + cache the analytics-tab shape (minCount=1, minScore=0).
	// This is what the UI actually fetches so its sliders can filter
	// client-side. Rebuilt in the same pass so its age stays aligned
	// with the default cache.
	uStart := time.Now()
	uResp := s.computeNeighborGraphResponseDispatch(1, 0, "", "")
	var uBuf bytes.Buffer
	if err := json.NewEncoder(&uBuf).Encode(uResp); err != nil {
		log.Printf("[neighbor-graph-cache] unfiltered marshal error: %v", err)
		atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		return
	}
	s.neighborGraphCache.unfilteredPtr.Store(&neighborGraphCacheEntry{
		resp: uResp,
		json: uBuf.Bytes(),
		at:   time.Now(),
	})
	log.Printf("[neighbor-graph-cache] unfiltered rebuild ok in %v, nodes=%d", time.Since(uStart), len(uResp.Nodes))
}

// loadNeighborGraphCache returns the cached default response if present.
func (s *Server) loadNeighborGraphCache() (NeighborGraphResponse, bool) {
	e := s.neighborGraphCache.ptr.Load()
	if e == nil {
		return NeighborGraphResponse{}, false
	}
	return e.resp, true
}

// loadNeighborGraphCacheBytes returns the pre-marshaled JSON for the
// cached default response if present, along with the age of the
// snapshot (zero when no entry is present).
func (s *Server) loadNeighborGraphCacheBytes() ([]byte, time.Duration, bool) {
	e := s.neighborGraphCache.ptr.Load()
	if e == nil || len(e.json) == 0 {
		return nil, 0, false
	}
	age := time.Duration(0)
	if !e.at.IsZero() {
		age = time.Since(e.at)
	}
	return e.json, age, true
}

// loadNeighborGraphCacheBytesUnfiltered returns the pre-marshaled JSON
// for the (minCount=1, minScore=0) cache shape used by the analytics
// tab. #1483 follow-up.
func (s *Server) loadNeighborGraphCacheBytesUnfiltered() ([]byte, time.Duration, bool) {
	e := s.neighborGraphCache.unfilteredPtr.Load()
	if e == nil || len(e.json) == 0 {
		return nil, 0, false
	}
	age := time.Duration(0)
	if !e.at.IsZero() {
		age = time.Since(e.at)
	}
	return e.json, age, true
}

// cacheAgeSecondsHeader formats a time.Duration as integer seconds for
// the X-Cache-Age-Seconds response header.
func cacheAgeSecondsHeader(d time.Duration) string {
	if d < 0 {
		d = 0
	}
	return strconv.FormatInt(int64(d/time.Second), 10)
}