RED: 97f49a0c · CI:
https://github.com/Kpa-clawbot/CoreScope/actions/runs/26046530920Fixes#1265.
## Problem
On staging two clock-skew endpoints serve compute-on-request:
- `/api/observers/clock-skew` — 3.3s
- `/api/nodes/clock-skew` — 8.9s
Both drive a full `clockSkew.Recompute` over 100k+ adverts while holding
`s.mu.RLock`, blocking under concurrent reader load.
## Fix
Wire both endpoints into the established `analytics_recomputer.go`
pattern (PRs #1248 / #1259 / #1263). Two new slots:
- `recompObserversClockSkew` — wraps `computeObserverCalibrations()`
- `recompNodesClockSkew` — wraps `computeFleetClockSkew()`
Accessors `GetObserverCalibrations` / `GetFleetClockSkew` now prefer the
atomic-pointer snapshot; on-request compute is fallback-only for the
brief window before initial sync compute lands (and for tests that skip
the recomputer).
Default interval **300s**, overridable via:
```json
"analytics": {
"recomputeIntervalSeconds": {
"observersClockSkew": 300,
"nodesClockSkew": 300
}
}
```
`config.example.json` + the `_comment_analytics` doc updated.
## TDD
- RED `97f49a0c` — `TestClockSkewRecomputersRegistered` +
`TestClockSkewHandlersSteadyStateLatency` (8 concurrent readers × 25
reqs per endpoint, p99 < 100ms gate). Fails on master: recomputer slots
nil.
- GREEN `19599375` — wire + accessor switch. p99 well under 5ms on the
test fixture.
## Verification
```
cd cmd/server && go test ./... -count=1 # ok 42s
bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master # all gates pass
```
---------
Co-authored-by: CoreScope Bot <bot@corescope.local>
RED commit: `0190466d` — failing CI:
https://github.com/Kpa-clawbot/CoreScope/actions (will populate after PR
creation)
## Problem
On staging (commit `d69d9fb`, 78k tx, 2.3M obs), `curl
http://localhost/api/analytics/roles` times out at 60s with 0 bytes —
the Roles tab is unusable. Issue #1256.
PR #1248's steady-state recomputer fan-out (topology / rf / distance /
channels / hash-collisions / hash-sizes) **didn't include roles**. The
legacy handler:
1. Holds `s.mu.RLock` for the entire compute.
2. Calls `GetFleetClockSkew()`, which drives `clockSkew.Recompute(s)`
over all ADVERT transmissions — O(78k) per request.
3. Concurrent ingest writers compound the latency through
writer-starvation.
Result: every request hits the cold path; the response never comes back
inside the 60 s HTTP budget.
## Fix
Add `roles` as the 7th endpoint in the recomputer fan-out — same pattern
as #1248:
- `PacketStore.recompRoles` slot, registered in
`StartAnalyticsRecomputers` with default 5-min interval.
- `PacketStore.GetAnalyticsRoles()` → atomic-pointer load from the
snapshot (sub-ms), with a `computeAnalyticsRoles()` fallback only for
the brief startup window before the initial sync compute completes.
- Handler is now a thin wrapper — no lock-held work on the request path.
- New optional `roles` key under `analytics.recomputeIntervalSeconds` in
config; `config.example.json` and `_comment_analytics` updated.
## Latency (unit-scope benchmark)
- Worst-of-50 handler latency: **<100 ms** (test budget; well under the
2 s p99 acceptance).
- Compute itself is bounded by the existing 5-min recompute window — it
runs once in the background, never on the request path.
## Tests
- RED `0190466d`: asserts `recompRoles` is registered and the handler
returns under the latency budget. Fails on master with `recompRoles not
registered`.
- GREEN `d7784f76`: registers the recomputer + snapshot accessor — both
tests pass.
Fixes#1256
---------
Co-authored-by: openclaw-bot <bot@openclaw.local>
RED commit: `27630f6a` — adds latency test that fails on master
(p99=225ms > 50ms budget) and a stub `StartAnalyticsRecomputers` that
returns a no-op so the assertion (not a build error) gates the change.
GREEN commit: `20fbbceb` — wires real background recompute
infrastructure. Test passes at p99=~1µs.
## What changed
Replaces the on-request "compute-then-cache" pattern for the
default-shape analytics queries with a steady-state background recompute
loop. Reads always hit an `atomic.Value` snapshot in <1µs regardless of
compute cost or writer contention. Operator principle: serving slightly
stale data quickly beats real-time data slowly.
## Endpoints converted (default 5min interval each)
| Endpoint | Cold compute | Recomputer interval |
|---|---|---|
| `/api/analytics/topology` | ~5s | 5 min |
| `/api/analytics/rf` | ~4s | 5 min |
| `/api/analytics/distance` | ~3s | 5 min |
| `/api/analytics/channels` | ~0.5s | 5 min |
| `/api/analytics/hash-collisions` | ~0.5s | 5 min |
| `/api/analytics/hash-sizes` | ~22ms | 5 min |
All intervals configurable per-endpoint via
`analytics.recomputeIntervalSeconds.<name>` in `config.json`; documented
in `config.example.json`. Default override via
`analytics.defaultIntervalSeconds`.
## Scope: default query only
Only the canonical shape `(region="", window=zero)` is precomputed.
Region- or window-filtered requests fall back to the legacy TTL cache +
on-request compute — keeps recomputer count bounded (6, not 6×N×M).
## Latency
Test `TestAnalyticsRecomputerSteadyStateLatency`: 100 concurrent readers
+ 4 writers churning `s.mu.Lock` on 20k distHops.
- Before: p50=188ms p99=225ms (assertion failed)
- After: p50=240ns p99=1.1µs (atomic load + map return)
## Shutdown integration
`StartAnalyticsRecomputers` returns a stop closure invoked from
`main.go`'s SIGTERM handler BEFORE `dbClose()` so any in-flight SQLite
compute drains cleanly. `TestAnalyticsRecomputerShutdownNoLeak` confirms
all 6 goroutines are reaped (Δ=6 within 2s).
## Safety details
- Initial compute is synchronous in `Start()` — first read after startup
never sees nil.
- `recover()` inside `runOnce` keeps a compute panic from killing the
goroutine; previous snapshot remains valid.
- `analyticsRecomputerMu` is a sync.RWMutex; recomputer pointers are
read-locked in the hot path. The atomic.Value swap inside `runOnce` is
lock-free.
Fixes#1240.
---------
Co-authored-by: OpenClaw Bot <bot@openclaw.local>