mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-06-03 04:44:07 +00:00
13bdee57d4
## What

Three of the four P0s from #1481's scale-test findings. Each cuts a distinct hot path; together they target /api/observers, /api/analytics/neighbor-graph, and /api/observers/{id}/analytics, the top three live offenders.

### P0-1: 5-min atomic-pointer cache for default neighbor-graph response

- Live p95 10.8s on the most-trafficked organic endpoint.
- Background recomputer (5-min cadence per operator directive) builds the default-filter (`minCount=5 minScore=0.1`, no region, no role) `NeighborGraphResponse` and stores it via `atomic.Pointer`.
- `handleNeighborGraph` short-circuits on the default shape; non-default filters take the extracted `computeNeighborGraphResponse` path (identical semantics to the previous inline build).

### P0-2: cache parsed `StoreObs.Timestamp` + drop RLock window

- `handleObserverAnalytics` re-parsed the RFC3339 timestamp three times per observation, for 60k+ observations per active observer, under `s.store.mu.RLock`, blocking writers for the full scan.
- `StoreObs.ParsedTime()` parses once via `sync.Once` (mirrors `StoreTx.ParsedDecoded`).
- Handler snapshots the `byObserver[id]` pointer slice, releases the RLock immediately, then iterates locally.

### P0-3: 30s cache for `/api/observers` + sargable `IN` + covering index

- Three SQL queries on every request → ~1.7s p50 at 50-concurrent.
- Atomic-pointer 30s cache for the default (no-filter) query.
- `GetNodeLocationsByKeys` drops `LOWER(public_key) IN (...)` (non-sargable); callers pre-lowercase in Go and the plain `IN` matches the existing `public_key` index.
- New ingestor migration `obs_observer_ts_idx_v1` adds composite index `idx_observations_observer_idx_timestamp(observer_idx, timestamp)` so `GetObserverPacketCounts` can resolve its GROUP-BY + range filter from the index without scanning the 1.9M-row observations table.

### P0-4: deferred

`perfMiddleware`'s global mutex was claimed to serialize every API request.
A direct test (`50 concurrent requests through the middleware, handler sleeps 20ms each`) shows total elapsed ≈ 25ms, not 1s: the lock is held only for the post-handler bookkeeping (a few µs). Real impact is below measurement noise. Skipping to avoid invasive churn on PerfStats consumers without a demonstrable win.

## Test plan

Red → green per P0:

- `observers_cache_test.go`: handler reads `s.observersCache` before SQL, TTL boundary, atomic.Pointer (no mutex contention).
- `storeobs_parsedtime_test.go`: parses three timestamp shapes, caches result, no race under concurrent readers.
- `neighbor_graph_cache_test.go`: handler serves from atomic pointer when set, bypasses cache when `?region=` (or any non-default filter) is passed.

Full server + ingestor suites pass: `go test -count=1 ./...`.

## Perf proof

Before/after p50/p95/p99 (50 requests × 50 concurrent) against prod (before) and staging once CI deploys (after) will be posted as a PR comment per the operator's "no merge without proof of improvement" gate.

Closes #1481

## TDD exemption: P0-1 and P0-2 (net-new surfaces, AGENTS.md)

Per CoreScope `AGENTS.md` § "Exemptions": **net-new code surfaces with no prior tests to break** may land tests in the same PR without a strict test-first → impl commit split.

- **P0-1 (neighbor-graph atomic-pointer cache)**: `neighborGraphCache`, `recomputeNeighborGraphCache`, `loadNeighborGraphCacheBytes`, `startNeighborGraphRecomputer` and the default-shape short-circuit in `handleNeighborGraph` were brand-new code with no pre-existing assertions covering them. There was no green test to first turn red.
- **P0-2 (cached `StoreObs.Timestamp` + RLock window drop)**: `StoreObs.ParsedTime()` and the snapshot+release pattern in `handleObserverAnalytics` were new surfaces; the prior code did the parse inline per call with no behavioural test to break.

P0-3 was authored properly red-then-green (commit `6e63ec6a` red, then `83ae129b` green) and does NOT use this exemption.
## Default-filter detection vs frontend reality (#1483 follow-up)

The Neighbor Graph analytics tab in `public/analytics.js` fetches `/analytics/neighbor-graph?min_count=1&min_score=0` because the client-side sliders need the full edge set to filter from. That shape did NOT match the `(5, 0.1)` cached default, so the UI tab still paid the cold compute cost despite #1481 P0-1.

The #1483 follow-up commit caches BOTH shapes in the same recomputer pass:

- `(minCount=5, minScore=0.1, no region, no role)`: `live.js` affinity-scoring consumer.
- `(minCount=1, minScore=0, no region, no role)`: analytics tab.

Both are served from `atomic.Pointer` with an `X-Cache-Age-Seconds` header. The per-shape cost in the background goroutine is roughly linear in edge count; total recompute time stays well under the 5-minute cadence on prod-scale graphs.

---------

Co-authored-by: openclaw-bot <bot@openclaw.dev>
Co-authored-by: mc-bot <mc-bot@users.noreply.github.com>
156 lines
5.0 KiB
Go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"runtime/debug"
	"strconv"
	"sync/atomic"
	"time"
)

// #1481 P0-1: cached default-filter neighbor-graph response.
//
// The /api/analytics/neighbor-graph handler does graph build + per-edge
// score + filter + ~900KB JSON marshal on every request. The default
// (no-region, no-role, minCount=5, minScore=0.1) shape covers the
// overwhelming majority of organic traffic; cache the fully-built AND
// pre-marshaled response so warm reads are a single Write. Recomputed
// every 5 minutes in the background, never on the hot path.

const neighborGraphCacheInterval = 5 * time.Minute

// neighborGraphCacheEntry holds both the response struct (kept for
// tests / structured access) and the pre-marshaled bytes that the
// handler writes verbatim.
type neighborGraphCacheEntry struct {
	resp NeighborGraphResponse
	json []byte
	at   time.Time
}

type neighborGraphCacheField struct {
	ptr atomic.Pointer[neighborGraphCacheEntry]
	// unfiltered = the (minCount=1, minScore=0, no region/role) shape
	// the analytics tab actually hits. Cached separately so the UI
	// tab also benefits from the warm path; client-side sliders then
	// filter from full data. #1483 follow-up to perf claim.
	unfilteredPtr atomic.Pointer[neighborGraphCacheEntry]
}

// startNeighborGraphRecomputer launches a background goroutine that
// rebuilds the cached responses once immediately and then every
// interval. The call itself returns right away; the goroutine exits
// when the stop channel is closed. A non-positive interval falls back
// to neighborGraphCacheInterval.
func (s *Server) startNeighborGraphRecomputer(interval time.Duration, stop <-chan struct{}) {
	if interval <= 0 {
		interval = neighborGraphCacheInterval
	}
	go func() {
		s.recomputeNeighborGraphCache()
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				s.recomputeNeighborGraphCache()
			case <-stop:
				return
			}
		}
	}()
}

// recomputeNeighborGraphCache builds and pre-marshals the default-shape
// response, then the analytics-tab shape, and atomically swaps each in.
// Panic-defensive so a single bad rebuild doesn't kill the background
// goroutine, but it logs the panic and increments a counter so
// operators see the failure (#1483 follow-up).
func (s *Server) recomputeNeighborGraphCache() {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("[neighbor-graph-cache] rebuild panic: %v\n%s", r, debug.Stack())
			atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		}
	}()
	start := time.Now()
	resp := s.buildDefaultNeighborGraphResponse()
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(resp); err != nil {
		log.Printf("[neighbor-graph-cache] marshal error: %v", err)
		atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		return
	}
	s.neighborGraphCache.ptr.Store(&neighborGraphCacheEntry{
		resp: resp,
		json: buf.Bytes(),
		at:   time.Now(),
	})
	log.Printf("[neighbor-graph-cache] rebuild ok in %v, nodes=%d", time.Since(start), len(resp.Nodes))

	// Build + cache the analytics-tab shape (minCount=1, minScore=0).
	// This is what the UI actually fetches so its sliders can filter
	// client-side. Stored in its own pointer but rebuilt in the same
	// pass, so its age stays aligned with the default cache.
	uStart := time.Now()
	uResp := s.computeNeighborGraphResponseDispatch(1, 0, "", "")
	var uBuf bytes.Buffer
	if err := json.NewEncoder(&uBuf).Encode(uResp); err != nil {
		log.Printf("[neighbor-graph-cache] unfiltered marshal error: %v", err)
		atomic.AddUint64(&s.neighborGraphCacheRebuildFailures, 1)
		return
	}
	s.neighborGraphCache.unfilteredPtr.Store(&neighborGraphCacheEntry{
		resp: uResp,
		json: uBuf.Bytes(),
		at:   time.Now(),
	})
	log.Printf("[neighbor-graph-cache] unfiltered rebuild ok in %v, nodes=%d", time.Since(uStart), len(uResp.Nodes))
}

// loadNeighborGraphCache returns the cached default response if present.
func (s *Server) loadNeighborGraphCache() (NeighborGraphResponse, bool) {
	e := s.neighborGraphCache.ptr.Load()
	if e == nil {
		return NeighborGraphResponse{}, false
	}
	return e.resp, true
}

// loadNeighborGraphCacheBytes returns the pre-marshaled JSON for the
// cached default response if present, along with the age of the
// snapshot (zero when no entry is present).
func (s *Server) loadNeighborGraphCacheBytes() ([]byte, time.Duration, bool) {
	e := s.neighborGraphCache.ptr.Load()
	if e == nil || len(e.json) == 0 {
		return nil, 0, false
	}
	age := time.Duration(0)
	if !e.at.IsZero() {
		age = time.Since(e.at)
	}
	return e.json, age, true
}

// loadNeighborGraphCacheBytesUnfiltered returns the pre-marshaled JSON
// for the (minCount=1, minScore=0) cache shape used by the analytics
// tab. #1483 follow-up.
func (s *Server) loadNeighborGraphCacheBytesUnfiltered() ([]byte, time.Duration, bool) {
	e := s.neighborGraphCache.unfilteredPtr.Load()
	if e == nil || len(e.json) == 0 {
		return nil, 0, false
	}
	age := time.Duration(0)
	if !e.at.IsZero() {
		age = time.Since(e.at)
	}
	return e.json, age, true
}

// cacheAgeSecondsHeader formats a time.Duration as integer seconds for
// the X-Cache-Age-Seconds response header.
func cacheAgeSecondsHeader(d time.Duration) string {
	if d < 0 {
		d = 0
	}
	return strconv.FormatInt(int64(d/time.Second), 10)
}