Files
meshcore-analyzer/docs/specs/startup-performance.md
T

213 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Startup Performance: Serve HTTP Within 2 Minutes on Any Database Size
## Problem
CoreScope takes 3045 minutes to start on large databases (325K transmissions, 7.3M observations, 1.4GB SQLite). The HTTP server is completely unavailable during this time. Operators cannot restart without 30+ minutes of downtime.
### Where time goes (7.3M observation benchmark)
| Phase | Time | Blocking? |
|---|---|---|
| `Load()` — read SQLite → memory | ~90s | Yes |
| Build subpath index | ~20s | Yes |
| Build distance index | ~15s | Yes |
| Build path-hop index | <1s | Yes |
| Load neighbor edges from SQLite | <1s | Yes |
| **Backfill `resolved_path` for NULL observations** | **2030+ min** | **Yes — the killer** |
| Re-pick best observations | ~10s | Yes |
The backfill calls `resolvePathForObs` for every observation with `resolved_path IS NULL`, then writes results back to SQLite and updates in-memory state. On first run (or after schema migration), this means resolving all 7.3M observations.
### Root cause
`backfillResolvedPaths()` in `neighbor_persist.go` runs synchronously in `main()` before `httpServer.ListenAndServe()`. It:
1. Collects all observations with `ResolvedPath == nil` under a read lock
2. Resolves paths (CPU-bound, ~millions of calls to `resolvePathForObs`)
3. Writes results to SQLite in a single transaction
4. Updates in-memory state under a write lock
Steps 24 block the main goroutine for 2030 minutes.
## Solution: Async Chunked Backfill
### Design
Move `backfillResolvedPaths` out of the startup critical path. Start the HTTP server immediately after loading data and building indexes. Run backfill in a background goroutine with chunked processing that yields between batches.
### Startup sequence (new)
```
1. OpenDB, verify tables (~1s)
2. store.Load() (~90s)
3. ensureNeighborEdgesTable (<1s)
4. ensureResolvedPathColumn (<1s)
5. Load/build neighbor graph (<1s)
6. Build subpath/distance/path-hop indexes (~35s)
7. pickBestObservation (with whatever (~10s)
resolved_path data exists)
8. *** START HTTP SERVER *** — serving at ~2min mark
9. Background: backfillResolvedPaths (20-30 min, non-blocking)
→ chunked, yields between batches
→ updates in-memory + SQLite incrementally
→ re-picks best obs for affected txs
```
Total time to first HTTP response: **~2 minutes** regardless of database size.
### Implementation details
#### 1. Background backfill goroutine
```go
// In main(), after starting HTTP server:
go func() {
backfillResolvedPathsAsync(store, dbPath, 5000, 100*time.Millisecond)
}()
```
The async backfill processes observations in chunks of N (e.g., 5,000):
```go
func backfillResolvedPathsAsync(store *PacketStore, dbPath string, chunkSize int, yieldDuration time.Duration) {
for {
n := backfillResolvedPathsChunk(store, dbPath, chunkSize)
if n == 0 {
break // done
}
log.Printf("[store] backfilled resolved_path for %d observations (async)", n)
time.Sleep(yieldDuration) // yield to HTTP handlers
}
log.Printf("[store] async resolved_path backfill complete")
}
```
Each chunk:
1. Takes a read lock, collects up to `chunkSize` pending observations, releases lock
2. Resolves paths (no lock held — `resolvePathForObs` only reads immutable data)
3. Opens a separate RW SQLite connection, writes results in a transaction
4. Takes a write lock, updates in-memory `obs.ResolvedPath` and re-picks best obs for affected transmissions, releases lock
5. Sleeps briefly to yield CPU/lock time to HTTP handlers
#### 2. Readiness flag and API degraded-mode header
Add a boolean to `PacketStore`:
```go
type PacketStore struct {
// ...
backfillComplete atomic.Bool
}
```
API responses include a header during backfill:
```
X-CoreScope-Status: backfilling
X-CoreScope-Backfill-Remaining: 4523000
```
After backfill completes:
```
X-CoreScope-Status: ready
```
The frontend can read this header and show a subtle banner: *"Resolving hop paths… some paths may show abbreviated pubkeys."*
#### 3. Index rebuilds
The subpath, distance, and path-hop indexes are built during startup from whatever data exists. During backfill, newly resolved paths need to update these indexes incrementally.
Options (in order of preference):
**Option A: Defer index updates to end of backfill.** Indexes work fine with unresolved paths — they just produce slightly less precise results. After backfill completes, rebuild indexes once. Simple, correct, low risk.
**Option B: Incremental index updates per chunk.** After each chunk, update affected index entries. More complex, better real-time accuracy. Only worth it if index accuracy during backfill matters for production use.
**Recommendation: Option A.** The indexes are usable with unresolved paths. A single rebuild at the end (~35s) is cheap compared to the backfill duration. The API works throughout — results just improve after backfill finishes.
#### 4. SQLite contention
The backfill opens a separate RW connection for writes. The main server uses a read-only connection for polling. SQLite WAL mode (already in use) allows concurrent readers and one writer. Contention risk is minimal:
- Write transactions are small (5,000 UPDATEs per chunk, batched in a single tx)
- Read queries from HTTP handlers are unaffected by WAL writes
- The 100ms yield between chunks prevents sustained write pressure
#### 5. Lock contention
The write lock is held only during the in-memory update phase of each chunk (~5,000 pointer assignments + re-picks). This takes microseconds. HTTP handlers acquire read locks for API responses — they will not be blocked for any perceptible duration.
#### 6. Frontend handling
The `hop-resolver.js` module already handles unresolved (prefix) hops gracefully — it shows abbreviated pubkeys. No frontend changes are required for correctness.
Optional enhancement: read the `X-CoreScope-Status` header and show a transient info banner during backfill. This is cosmetic and can be done in a follow-up.
### What about first-run specifically?
On first run with a pre-existing database (e.g., migrating from a version without `resolved_path`), ALL 7.3M observations need backfill. The async approach handles this identically — it just takes longer in the background while HTTP is already serving.
On subsequent restarts, `resolved_path` is already persisted in SQLite and loaded by `store.Load()`. The backfill loop finds zero pending observations and exits immediately.
### What about new observations during backfill?
The poller ingests new packets continuously. New observations written by the ingestor already have `resolved_path` set at ingest time (this is already implemented). The backfill only processes observations with `ResolvedPath == nil`, so there's no conflict with new data.
## Alternatives considered
### Lazy resolution (resolve on API access)
Resolve `resolved_path` only when an observation is accessed via API, cache the result.
**Rejected because:**
- Adds latency to every API call that touches unresolved observations
- Cache invalidation complexity (when does a cached resolution become stale?)
- Doesn't help with index accuracy — indexes still need full data
- The backfill is a one-time cost; lazy resolution makes it a recurring cost
### Progressive loading (recent data first)
Load only the last 24h into memory, start serving, load historical data in background.
**Rejected because:**
- Significantly more complex — all store operations need "is this data loaded yet?" checks
- Memory implications: need to track which time ranges are loaded
- Historical queries return wrong results during loading (not just degraded — wrong)
- The actual bottleneck is backfill, not `Load()`. Even loading all 7.3M observations takes only ~90s.
### Chunked blocking backfill (yield to HTTP between chunks, but keep in main startup)
Process N observations per tick with `runtime.Gosched()` between chunks, but still in `main()` before `ListenAndServe`.
**Rejected because:**
- HTTP still isn't available until all chunks complete
- Adds complexity without solving the core problem
## Carmack Review (Performance)
**The approach is sound.** Moving a 2030 minute blocking operation to a background goroutine is the right call. Some notes:
1. **Chunk size tuning.** 5,000 is a reasonable starting point. Monitor: if write lock contention shows up in pprof (unlikely with microsecond hold times), reduce chunk size. If backfill is too slow, increase it or reduce yield time.
2. **Memory is not a concern.** The observations are already fully loaded in memory by `Load()`. The backfill only mutates the `ResolvedPath` field on existing objects — no additional memory allocation beyond temporary slices for the chunk.
3. **No hidden costs in `resolvePathForObs`.** It reads `nodePM` (a `PrefixMatcher`, immutable after startup) and `graph` (neighbor graph, immutable after startup). No locks needed during resolution. This is embarrassingly parallelizable if needed, but single-goroutine processing with chunking is sufficient.
4. **The index rebuild at the end is O(n) and takes ~35s.** This is a one-time cost after the first backfill. Not worth optimizing further unless the profile shows otherwise.
5. **Risk: `pickBestObservation` during backfill.** API responses may flip their "best" observation as resolved paths become available. This is cosmetically noisy but functionally correct. Document this as expected behavior.
6. **Future optimization if needed:** The backfill loop could be parallelized across multiple goroutines (partition observations by transmission hash). The resolution step is CPU-bound and read-only. This would reduce backfill wall time from 30 min to ~5 min on 8 cores. Not needed for MVP — the goal is HTTP availability, not backfill speed.
## Implementation plan
1. **Refactor `backfillResolvedPaths` into chunked async version** — new function `backfillResolvedPathsAsync` that processes in chunks and yields
2. **Move backfill call in `main.go` to after `ListenAndServe`** — wrap in goroutine
3. **Add `backfillComplete` atomic flag to `PacketStore`** — set after backfill finishes
4. **Add `X-CoreScope-Status` response header** — middleware reads the flag
5. **Rebuild indexes after backfill completes** — single call to rebuild subpath/distance/path-hop
6. **Tests:** unit test for chunked backfill (mock store with N unresolved obs, verify chunks process correctly)
7. **Frontend (follow-up):** optional banner during backfill state
Estimated effort: 12 hours for steps 15, plus tests.