meshcore-analyzer/cmd/server/observers_cache.go
Kpa-clawbot 13bdee57d4 perf: P0 hot-path fixes (observers, neighbor-graph, observer-analytics) (#1481) (#1483)
## What

Three of the four P0s from #1481's scale-test findings. Each cuts a distinct
hot path; together they target /api/observers, /api/analytics/neighbor-graph,
and /api/observers/{id}/analytics, the top three live offenders.

### P0-1: 5-min atomic-pointer cache for default neighbor-graph response
- Live p95 10.8s on the most-trafficked organic endpoint.
- Background recomputer (5-min cadence per operator directive) builds the
  default-filter (`minCount=5 minScore=0.1`, no region, no role)
  `NeighborGraphResponse` and stores it via `atomic.Pointer`.
- `handleNeighborGraph` short-circuits on the default shape; non-default
  filters take the extracted `computeNeighborGraphResponse` path (identical
  semantics to the previous inline build).
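
The pattern described in P0-1 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `graphResponse`, `graphCache`, `isDefaultShape`, and the `build`/`compute` callbacks are hypothetical stand-ins for `NeighborGraphResponse`, the real cache type, and `computeNeighborGraphResponse`.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// graphResponse is a hypothetical stand-in for NeighborGraphResponse.
type graphResponse struct {
	Edges   int
	BuiltAt time.Time
}

// graphCache holds the latest precomputed default-filter response.
type graphCache struct {
	ptr atomic.Pointer[graphResponse]
}

// recompute builds a fresh response and publishes it with one atomic store.
func (c *graphCache) recompute(build func() graphResponse) {
	r := build()
	c.ptr.Store(&r)
}

// start fills the cache once, then refreshes it on every tick until stop closes.
func (c *graphCache) start(every time.Duration, build func() graphResponse, stop <-chan struct{}) {
	c.recompute(build)
	go func() {
		t := time.NewTicker(every)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				c.recompute(build)
			case <-stop:
				return
			}
		}
	}()
}

// isDefaultShape reports whether a request matches the cached filter shape.
func isDefaultShape(minCount int, minScore float64, region, role string) bool {
	return minCount == 5 && minScore == 0.1 && region == "" && role == ""
}

// serve returns the cached response for the default shape (hit == true);
// any other shape, or an empty cache, falls through to the inline compute.
func (c *graphCache) serve(minCount int, minScore float64, region, role string,
	compute func() graphResponse) (resp graphResponse, hit bool) {
	if isDefaultShape(minCount, minScore, region, role) {
		if r := c.ptr.Load(); r != nil {
			return *r, true
		}
	}
	return compute(), false
}

func main() {
	cache := &graphCache{}
	stop := make(chan struct{})
	defer close(stop)

	cache.start(5*time.Minute, func() graphResponse {
		return graphResponse{Edges: 42, BuiltAt: time.Now()}
	}, stop)

	inline := func() graphResponse { return graphResponse{Edges: 3, BuiltAt: time.Now()} }

	r, hit := cache.serve(5, 0.1, "", "", inline)
	fmt.Println("default shape:", r.Edges, hit)

	r, hit = cache.serve(5, 0.1, "eu", "", inline)
	fmt.Println("region filter:", r.Edges, hit)
}
```

Readers take no lock at all in this sketch: the handler does one pointer load, and the background goroutine swaps in a complete, immutable value.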

### P0-2: cache parsed `StoreObs.Timestamp` + drop RLock window
- `handleObserverAnalytics` re-parsed the RFC3339 timestamp three times
  per observation, for 60k+ observations per active observer, under
  `s.store.mu.RLock` — blocking writers for the full scan.
- `StoreObs.ParsedTime()` parses once via `sync.Once` (mirrors
  `StoreTx.ParsedDecoded`).
- Handler snapshots the `byObserver[id]` pointer slice, releases the
  RLock immediately, then iterates locally.

### P0-3: 30s cache for `/api/observers` + sargable `IN` + covering index
- Three SQL queries on every request → ~1.7s p50 at 50-concurrent.
- Atomic-pointer 30s cache for the default (no-filter) query.
- `GetNodeLocationsByKeys` drops `LOWER(public_key) IN (...)` (non-sargable);
  callers pre-lowercase in Go and the plain `IN` matches the existing
  `public_key` index.
- New ingestor migration `obs_observer_ts_idx_v1` adds composite index
  `idx_observations_observer_idx_timestamp(observer_idx, timestamp)` so
  `GetObserverPacketCounts` can resolve its GROUP-BY + range filter from
  the index without scanning the 1.9M-row observations table.

### P0-4: deferred
`perfMiddleware`'s global mutex was claimed to serialize every API request.
A direct test (50 concurrent requests through the middleware, handler sleeps
20ms each) shows total elapsed ≈ 25ms, not the 1s (50 × 20ms) that full
serialization would produce: the lock is held only for the post-handler
bookkeeping (a few µs). Real impact is below measurement noise. Skipping to
avoid invasive churn on PerfStats consumers without a demonstrable win.

## Test plan

Red → green per P0:
- `observers_cache_test.go`: handler reads `s.observersCache` before SQL,
  TTL boundary, atomic.Pointer (no mutex contention).
- `storeobs_parsedtime_test.go`: parses three timestamp shapes, caches
  result, no race under concurrent readers.
- `neighbor_graph_cache_test.go`: handler serves from atomic pointer
  when set, bypasses cache when `?region=` (or any non-default filter)
  is passed.

Full server + ingestor suites pass: `go test -count=1 ./...`.

## Perf proof

Before/after p50/p95/p99 (50 requests × 50 concurrent) against prod (before)
and staging once CI deploys (after) will be posted as a PR comment per the
operator's "no merge without proof of improvement" gate.

Closes #1481


## TDD exemption — P0-1 and P0-2 (net-new surfaces, AGENTS.md)

Per CoreScope `AGENTS.md` § "Exemptions": **net-new code surfaces with no
prior tests to break** may land tests in the same PR without a strict
test-first → impl commit split.

- **P0-1 (neighbor-graph atomic-pointer cache)** — `neighborGraphCache`,
  `recomputeNeighborGraphCache`, `loadNeighborGraphCacheBytes`,
  `startNeighborGraphRecomputer` and the default-shape short-circuit in
  `handleNeighborGraph` were brand-new code with no pre-existing
  assertions covering them. There was no green test to first turn red.
- **P0-2 (cached `StoreObs.Timestamp` + RLock window drop)** —
  `StoreObs.ParsedTime()` and the snapshot+release pattern in
  `handleObserverAnalytics` were new surfaces; the prior code did the
  parse inline per call with no behavioural test to break.

P0-3 was authored properly red-then-green (commit `6e63ec6a` red, then
`83ae129b` green) and does NOT use this exemption.

## Default-filter detection vs frontend reality (#1483 follow-up)

The Neighbor Graph analytics tab in `public/analytics.js` fetches
`/analytics/neighbor-graph?min_count=1&min_score=0` because the
client-side sliders need the full edge set to filter from. That shape
did NOT match the `(5, 0.1)` cached default, so the UI tab still paid
the cold compute cost despite #1481 P0-1.

The #1483 follow-up commit caches BOTH shapes in the same recomputer pass:
- `(minCount=5, minScore=0.1, no region, no role)` — `live.js`
  affinity-scoring consumer.
- `(minCount=1, minScore=0, no region, no role)` — analytics tab.

Both are served from `atomic.Pointer` with an `X-Cache-Age-Seconds`
header. The per-shape cost in the background goroutine is roughly
linear in edge count; total recompute time stays well under the
5-minute cadence on prod-scale graphs.

---------

Co-authored-by: openclaw-bot <bot@openclaw.dev>
Co-authored-by: mc-bot <mc-bot@users.noreply.github.com>
2026-05-29 02:42:21 -07:00


package main

// observers cache for /api/observers default (no-filter) response.
// Issue #1481 P0-3 + #1483 follow-up.
//
// Design:
//   - Atomic pointer holds the immutable cached response.
//   - Wall-clock TTL replaced with monotonic time.Time (#1483: NTP
//     step-backward must not extend the cache).
//   - singleflight collapses TTL-boundary thundering herd into one
//     SQL fill, regardless of incoming concurrency.

import (
	"sync/atomic"
	"time"

	"golang.org/x/sync/singleflight"
)

// observersCacheTTL is the default freshness window for the cached
// default (no-filter) /api/observers response when no per-server
// override is configured. Configurable via ObserversCache.TTLSeconds
// (#1483).
const observersCacheTTL = 30 * time.Second

// effectiveObserversCacheTTL returns the cfg-overridden TTL or the
// default. Falls back to the default on nil cfg / non-positive value.
func (s *Server) effectiveObserversCacheTTL() time.Duration {
	if s.cfg != nil && s.cfg.ObserversCache != nil && s.cfg.ObserversCache.TTLSeconds > 0 {
		return time.Duration(s.cfg.ObserversCache.TTLSeconds) * time.Second
	}
	return observersCacheTTL
}

// singleflight key for the default-shape cache fill.
const observersCacheFlightKey = "observers:default"

// observersCacheEntry pairs the response with the monotonic timestamp
// of when it was built. atomic.Pointer guarantees the read is a single
// load; the entry is immutable once stored.
type observersCacheEntry struct {
	resp ObserverListResponse
	at   time.Time
}

// observersCacheField bundles the atomic pointer with the singleflight
// group that gates concurrent refills.
type observersCacheField struct {
	ptr atomic.Pointer[observersCacheEntry]
	sf  singleflight.Group
	// fillCount increments once per actual SQL fill (i.e., per
	// singleflight winner). Tests use this to assert the herd was
	// collapsed; production code never reads it.
	fillCount atomic.Int64
}

// observersCacheExpired reports whether an entry built at t is absent
// (zero time) or at least as old as the effective TTL (the configured
// override if set, otherwise observersCacheTTL).
func (s *Server) observersCacheExpired(t time.Time) bool {
	if t.IsZero() {
		return true
	}
	return time.Since(t) >= s.effectiveObserversCacheTTL()
}

// loadObserversCache returns the cached entry and true, or (nil, false)
// when nothing has been stored yet. It does not check freshness.
func (s *Server) loadObserversCache() (*observersCacheEntry, bool) {
	e := s.observersCacheV2.ptr.Load()
	if e == nil {
		return nil, false
	}
	return e, true
}