From f0c69d5fe760b1844d90ffecebc0dd2bbf528ee6 Mon Sep 17 00:00:00 2001 From: efiten Date: Wed, 27 May 2026 10:28:46 +0200 Subject: [PATCH] perf(server): fix repeaterEnrichTTL mismatch causing 18s /api/nodes latency (#1425) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Root cause `repeaterEnrichTTL` was **15 seconds**, but the background recomputer (`StartRepeaterEnrichmentRecomputer`) runs every **5 minutes**. After each recomputer tick, the relay/usefulness caches were valid for 15 seconds. For the remaining 4m45s, every `/api/nodes` request hit a stale TTL gate in `GetRepeaterRelayInfoMap` / `GetRepeaterUsefulnessScoreMap` and fell through to `computeRepeaterRelayInfoMap` **on the request goroutine**. On production (16k+ transmissions, 240k hop records) that rebuild takes ~18 seconds, making `/api/nodes?limit=5000` freeze on virtually every page load. The pattern was: ``` recomputer runs at T=0 → cache valid T=15s → TTL expires T=15s … T=5min → every request rebuilds on-thread (18s each) T=5min → recomputer runs again → 15s valid window repeat ``` ## Fix One line in `repeater_enrich_bulk.go`: ```go // Before const repeaterEnrichTTL = 15 * time.Second // After const repeaterEnrichTTL = 10 * time.Minute ``` The TTL now exceeds the recomputer interval so the cache is always warm between background ticks. The TTL remains as a safety net for cases where the recomputer isn't running (tests, early startup edge cases) — it just no longer expires between ticks. ## Production results (analyzer.on8ar.eu) Tested with binary injection on the live server before opening this PR. | Metric | Before | After | |--------|--------|-------| | TTFB (`/api/nodes?limit=5000`) | 18.6 s | 0.47–0.54 s | | Total response time | 18.9 s | 1.55–1.73 s | | Improvement | — | **34–39×** | Confirmed still fast at t+60s (well past the old 15s window). ## Test results ``` TestHandleNodesPerfLargeFleet elapsed=1.9ms budget=2s PASS TestHandleNodesLimit2000ColdMiss elapsed=5.3ms budget=2s PASS ``` Both existing perf regression tests pass unchanged — the TTL change doesn't affect their behavior (they test the cold-prewarm path, not TTL expiry). ## Why this wasn't caught by tests `TestHandleNodesLimit2000ColdMiss` only tests the cold-startup path (cache nil → on-thread build → cache hit). It doesn't test the TTL-expiry path (cache exists but stale → on-thread rebuild). A test covering the latter would need to fast-forward time past the TTL, which the existing fixture doesn't do. Co-authored-by: Claude Sonnet 4.6 --- cmd/server/repeater_enrich_bulk.go | 19 ++++++++++--------- cmd/server/repeater_enrich_recomputer.go | 6 +++--- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/cmd/server/repeater_enrich_bulk.go b/cmd/server/repeater_enrich_bulk.go index e53ee542..d671db0e 100644 --- a/cmd/server/repeater_enrich_bulk.go +++ b/cmd/server/repeater_enrich_bulk.go @@ -5,12 +5,13 @@ import ( "time" ) -// repeaterEnrichTTL bounds how stale the per-page bulk enrichment caches -// for handleNodes may be. Same 15s budget as GetNodeHashSizeInfo — the -// numbers feed an at-a-glance status column, not an alerting path, so -// up-to-15s freshness is fine and keeps the request path O(page) instead -// of O(page × byPathHop[pk] × parsed timestamps). -const repeaterEnrichTTL = 15 * time.Second +// repeaterEnrichTTL is the safety-net TTL for the bulk enrichment caches. +// Derived as 2× the recomputer's default tick so the cache is always valid +// between background refreshes. Without a recomputer (tests, edge cases) +// the cache rebuilds on-thread at most once per TTL window. +// Note: if analytics.defaultIntervalSeconds is configured above 600 the +// TTL will expire before the recomputer runs; keep that value < TTL/2. +const repeaterEnrichTTL = 2 * repeaterEnrichmentRecomputerDefaultInterval // GetRepeaterRelayInfoMap returns a cached pubkey → RepeaterRelayInfo // map covering EVERY pubkey that currently appears as a path hop in any @@ -29,9 +30,9 @@ const repeaterEnrichTTL = 15 * time.Second // The cached map is keyed by lowercase pubkey/hop key (same shape as // byPathHop). Lookups should use strings.ToLower(pk). // -// The cache is invalidated by TTL only — never by ingest. With a 15s -// budget that's acceptable for a status column; if a fresher signal is -// ever needed for a non-status caller, expose a non-cached path. +// The cache is invalidated by TTL only — never by ingest. Up-to-10min +// freshness is fine for an at-a-glance status column; if a fresher +// signal is ever needed for a non-status caller, expose a non-cached path. func (s *PacketStore) GetRepeaterRelayInfoMap(windowHours float64) map[string]RepeaterRelayInfo { s.repeaterEnrichMu.Lock() if s.repeaterRelayCache != nil && diff --git a/cmd/server/repeater_enrich_recomputer.go b/cmd/server/repeater_enrich_recomputer.go index 0400f853..88a5e137 100644 --- a/cmd/server/repeater_enrich_recomputer.go +++ b/cmd/server/repeater_enrich_recomputer.go @@ -7,9 +7,9 @@ import ( // repeaterEnrichmentRecomputerInterval is the default tick interval // for the steady-state recompute of the repeater enrichment bulk -// caches. The on-request 15s-TTL fallback in repeater_enrich_bulk.go -// is kept as a safety net — the recomputer just makes sure the cache -// is populated before any request arrives. +// caches. The on-request TTL fallback in repeater_enrich_bulk.go is +// kept as a safety net — the recomputer just makes sure the cache is +// populated before any request arrives. // // 5min mirrors the analytics_recomputer default from #1240 and is // plenty fresh for an at-a-glance status column.