perf(server): fix repeaterEnrichTTL mismatch causing 18s /api/nodes latency (#1425)

## Root cause

`repeaterEnrichTTL` was **15 seconds**, but the background recomputer
(`StartRepeaterEnrichmentRecomputer`) runs every **5 minutes**.

After each recomputer tick, the relay/usefulness caches were valid for
15 seconds. For the remaining 4m45s, every `/api/nodes` request hit a
stale TTL gate in `GetRepeaterRelayInfoMap` /
`GetRepeaterUsefulnessScoreMap` and fell through to
`computeRepeaterRelayInfoMap` **on the request goroutine**. On
production (16k+ transmissions, 240k hop records) that rebuild takes ~18
seconds, making `/api/nodes?limit=5000` freeze on virtually every page
load.

The pattern was:
```
recomputer runs at T=0  → cache valid
T=15s                   → TTL expires
T=15s … T=5min          → every request rebuilds on-thread (18s each)
T=5min                  → recomputer runs again → 15s valid window
repeat
```

## Fix

One line in `repeater_enrich_bulk.go`:

```go
// Before
const repeaterEnrichTTL = 15 * time.Second

// After
const repeaterEnrichTTL = 10 * time.Minute
```

The TTL now exceeds the recomputer interval so the cache is always warm
between background ticks. The TTL remains as a safety net for cases
where the recomputer isn't running (tests, early startup edge cases) —
it just no longer expires between ticks.

## Production results (analyzer.on8ar.eu)

Tested with binary injection on the live server before opening this PR.

| Metric | Before | After |
|--------|--------|-------|
| TTFB (`/api/nodes?limit=5000`) | 18.6 s | 0.47–0.54 s |
| Total response time | 18.9 s | 1.55–1.73 s |
| Improvement | — | **34–39×** |

Confirmed still fast at t+60s (well past the old 15s window).

## Test results

```
TestHandleNodesPerfLargeFleet      elapsed=1.9ms   budget=2s  PASS
TestHandleNodesLimit2000ColdMiss   elapsed=5.3ms   budget=2s  PASS
```

Both existing perf regression tests pass unchanged — the TTL change
doesn't affect their behavior (they test the cold-prewarm path, not TTL
expiry).

## Why this wasn't caught by tests

`TestHandleNodesLimit2000ColdMiss` only tests the cold-startup path
(cache nil → on-thread build → cache hit). It doesn't test the
TTL-expiry path (cache exists but stale → on-thread rebuild). A test
covering the latter would need to fast-forward time past the TTL, which
the existing fixture doesn't do.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
efiten
2026-05-27 10:28:46 +02:00
committed by GitHub
parent 48717aaccb
commit f0c69d5fe7
2 changed files with 13 additions and 12 deletions
+10 -9
View File
@@ -5,12 +5,13 @@ import (
"time"
)
// repeaterEnrichTTL bounds how stale the per-page bulk enrichment caches
// for handleNodes may be. Same 15s budget as GetNodeHashSizeInfo — the
// numbers feed an at-a-glance status column, not an alerting path, so
// up-to-15s freshness is fine and keeps the request path O(page) instead
// of O(page × byPathHop[pk] × parsed timestamps).
const repeaterEnrichTTL = 15 * time.Second
// repeaterEnrichTTL is the safety-net TTL for the bulk enrichment caches.
// Derived as 2× the recomputer's default tick so the cache is always valid
// between background refreshes. Without a recomputer (tests, edge cases)
// the cache rebuilds on-thread at most once per TTL window.
// Note: if analytics.defaultIntervalSeconds is configured above 600 the
// TTL will expire before the recomputer runs; keep that value < TTL/2.
const repeaterEnrichTTL = 2 * repeaterEnrichmentRecomputerDefaultInterval
// GetRepeaterRelayInfoMap returns a cached pubkey → RepeaterRelayInfo
// map covering EVERY pubkey that currently appears as a path hop in any
@@ -29,9 +30,9 @@ const repeaterEnrichTTL = 15 * time.Second
// The cached map is keyed by lowercase pubkey/hop key (same shape as
// byPathHop). Lookups should use strings.ToLower(pk).
//
// The cache is invalidated by TTL only — never by ingest. With a 15s
// budget that's acceptable for a status column; if a fresher signal is
// ever needed for a non-status caller, expose a non-cached path.
// The cache is invalidated by TTL only — never by ingest. Up-to-10min
// freshness is fine for an at-a-glance status column; if a fresher
// signal is ever needed for a non-status caller, expose a non-cached path.
func (s *PacketStore) GetRepeaterRelayInfoMap(windowHours float64) map[string]RepeaterRelayInfo {
s.repeaterEnrichMu.Lock()
if s.repeaterRelayCache != nil &&
+3 -3
View File
@@ -7,9 +7,9 @@ import (
// repeaterEnrichmentRecomputerInterval is the default tick interval
// for the steady-state recompute of the repeater enrichment bulk
// caches. The on-request 15s-TTL fallback in repeater_enrich_bulk.go
// is kept as a safety net — the recomputer just makes sure the cache
// is populated before any request arrives.
// caches. The on-request TTL fallback in repeater_enrich_bulk.go is
// kept as a safety net — the recomputer just makes sure the cache is
// populated before any request arrives.
//
// 5min mirrors the analytics_recomputer default from #1240 and is
// plenty fresh for an at-a-glance status column.