# Startup Performance: Serve HTTP Within 2 Minutes on Any Database Size

## Problem

CoreScope takes 30–45 minutes to start on large databases (325K transmissions, 7.3M observations, 1.4 GB SQLite). The HTTP server is completely unavailable during this time. Operators cannot restart without 30+ minutes of downtime.
### Where time goes (7.3M observation benchmark)

| Phase | Time | Blocking? |
|---|---|---|
| `Load()` — read SQLite → memory | ~90s | Yes |
| Build subpath index | ~20s | Yes |
| Build distance index | ~15s | Yes |
| Build path-hop index | <1s | Yes |
| Load neighbor edges from SQLite | <1s | Yes |
| **Backfill `resolved_path` for NULL observations** | **20–30+ min** | **Yes — the killer** |
| Re-pick best observations | ~10s | Yes |
The backfill calls `resolvePathForObs` for every observation with `resolved_path IS NULL`, then writes results back to SQLite and updates in-memory state. On first run (or after schema migration), this means resolving all 7.3M observations.

### Root cause

`backfillResolvedPaths()` in `neighbor_persist.go` runs synchronously in `main()` before `httpServer.ListenAndServe()`. It:

1. Collects all observations with `ResolvedPath == nil` under a read lock
2. Resolves paths (CPU-bound, ~millions of calls to `resolvePathForObs`)
3. Writes results to SQLite in a single transaction
4. Updates in-memory state under a write lock

Steps 2–4 block the main goroutine for 20–30 minutes.
## Solution: Async Chunked Backfill

### Design

Move `backfillResolvedPaths` out of the startup critical path. Start the HTTP server immediately after loading data and building indexes. Run backfill in a background goroutine with chunked processing that yields between batches.

### Startup sequence (new)
```
1. OpenDB, verify tables (~1s)
2. store.Load() (~90s)
3. ensureNeighborEdgesTable (<1s)
4. ensureResolvedPathColumn (<1s)
5. Load/build neighbor graph (<1s)
6. Build subpath/distance/path-hop indexes (~35s)
7. pickBestObservation (with whatever resolved_path data exists) (~10s)
8. *** START HTTP SERVER *** — serving at the ~2 min mark
9. Background: backfillResolvedPaths (20–30 min, non-blocking)
   → chunked, yields between batches
   → updates in-memory + SQLite incrementally
   → re-picks best obs for affected txs
```
Total time to first HTTP response: **~2 minutes** regardless of database size.

### Implementation details

#### 1. Background backfill goroutine
```go
// In main(), after starting HTTP server:
go func() {
    backfillResolvedPathsAsync(store, dbPath, 5000, 100*time.Millisecond)
}()
```
The async backfill processes observations in chunks of N (e.g., 5,000):

```go
func backfillResolvedPathsAsync(store *PacketStore, dbPath string, chunkSize int, yieldDuration time.Duration) {
    for {
        n := backfillResolvedPathsChunk(store, dbPath, chunkSize)
        if n == 0 {
            break // done
        }
        log.Printf("[store] backfilled resolved_path for %d observations (async)", n)
        time.Sleep(yieldDuration) // yield to HTTP handlers
    }
    log.Printf("[store] async resolved_path backfill complete")
}
```
Each chunk (sketched in the code below):

1. Takes a read lock, collects up to `chunkSize` pending observations, releases the lock
2. Resolves paths (no lock held — `resolvePathForObs` only reads immutable data)
3. Opens a separate RW SQLite connection, writes results in a transaction
4. Takes a write lock, updates in-memory `obs.ResolvedPath` and re-picks best obs for affected transmissions, releases the lock
5. Sleeps briefly to yield CPU/lock time to HTTP handlers
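A minimal sketch of such a chunk follows. The helper names (`collectPendingObs`, `applyResolvedPaths`, `writeResolvedPaths`), the `store.mu` lock name, the `obs.ID` field, and the assumption that `resolvePathForObs` takes the store and yields the string stored in `resolved_path` are all placeholders, not existing code:

```go
// Hypothetical sketch of one backfill chunk; helper and field names are
// placeholders, not existing code.
func backfillResolvedPathsChunk(store *PacketStore, dbPath string, chunkSize int) int {
    // 1. Collect pending work under a read lock, then release it.
    store.mu.RLock()
    pending := store.collectPendingObs(chunkSize) // obs with ResolvedPath == nil
    store.mu.RUnlock()
    if len(pending) == 0 {
        return 0
    }

    // 2. Resolve with no lock held — resolvePathForObs only reads data that is
    //    immutable after startup (prefix matcher, neighbor graph).
    resolved := make(map[int64]string, len(pending)) // obs ID → resolved_path value
    for _, obs := range pending {
        resolved[obs.ID] = resolvePathForObs(store, obs)
    }

    // 3. Persist on a separate RW SQLite connection (see the sketch in #4).
    if err := writeResolvedPaths(dbPath, resolved); err != nil {
        log.Printf("[store] backfill chunk write failed: %v", err)
        return 0 // stop the loop; a real implementation might retry instead
    }

    // 4. Apply in-memory state and re-pick best obs under a short write lock.
    store.mu.Lock()
    store.applyResolvedPaths(resolved)
    store.mu.Unlock()

    return len(pending)
}
```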
#### 2. Readiness flag and API degraded-mode header

Add a boolean to `PacketStore`:

```go
type PacketStore struct {
    // ...
    backfillComplete atomic.Bool
}
```

API responses include a header during backfill:

```
X-CoreScope-Status: backfilling
X-CoreScope-Backfill-Remaining: 4523000
```

After backfill completes:

```
X-CoreScope-Status: ready
```
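A minimal middleware sketch for stamping these headers, assuming a `backfillRemaining` atomic counter that the backfill loop keeps up to date (both the wrapper and the counter field are illustrative, not existing code):

```go
// withBackfillStatus wraps an http.Handler and adds the status headers.
// Requires "net/http" and "strconv"; backfillRemaining is an assumed field.
func withBackfillStatus(store *PacketStore, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if store.backfillComplete.Load() {
            w.Header().Set("X-CoreScope-Status", "ready")
        } else {
            w.Header().Set("X-CoreScope-Status", "backfilling")
            w.Header().Set("X-CoreScope-Backfill-Remaining",
                strconv.FormatInt(store.backfillRemaining.Load(), 10))
        }
        next.ServeHTTP(w, r)
    })
}
```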
The frontend can read this header and show a subtle banner: *"Resolving hop paths… some paths may show abbreviated pubkeys."*
#### 3. Index rebuilds

The subpath, distance, and path-hop indexes are built during startup from whatever data exists. During backfill, newly resolved paths need to update these indexes incrementally.

Options (in order of preference):

**Option A: Defer index updates to end of backfill.** Indexes work fine with unresolved paths — they just produce slightly less precise results. After backfill completes, rebuild indexes once. Simple, correct, low risk.

**Option B: Incremental index updates per chunk.** After each chunk, update affected index entries. More complex, better real-time accuracy. Only worth it if index accuracy during backfill matters for production use.

**Recommendation: Option A.** The indexes are usable with unresolved paths. A single rebuild at the end (~35s) is cheap compared to the backfill duration. The API works throughout — results just improve after backfill finishes.
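Under Option A, the tail of the async backfill could look roughly like this; the `rebuild*` calls stand in for whatever builders the startup path already uses and are named here only for illustration:

```go
// At the end of backfillResolvedPathsAsync, after the chunk loop exits.
// The rebuild* methods are placeholders for the existing startup index builders.
store.mu.Lock()
store.rebuildSubpathIndex()
store.rebuildDistanceIndex()
store.rebuildPathHopIndex()
store.mu.Unlock()
store.backfillComplete.Store(true)
log.Printf("[store] indexes rebuilt after async backfill")
```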
#### 4. SQLite contention

The backfill opens a separate RW connection for writes. The main server uses a read-only connection for polling. SQLite WAL mode (already in use) allows concurrent readers and one writer. Contention risk is minimal:

- Write transactions are small (5,000 UPDATEs per chunk, batched in a single tx; see the sketch below)
- Read queries from HTTP handlers are unaffected by WAL writes
- The 100ms yield between chunks prevents sustained write pressure
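For illustration, the per-chunk write could be a single transaction of prepared UPDATEs on a dedicated RW connection. The table and column names are assumptions based on the `resolved_path` column described above, and the driver name should match whichever SQLite driver the project imports:

```go
// writeResolvedPaths persists one chunk in one transaction on a dedicated RW
// connection. Sketch only; requires "database/sql" plus a blank import of the
// project's SQLite driver. Schema names are assumptions.
func writeResolvedPaths(dbPath string, paths map[int64]string) error {
    db, err := sql.Open("sqlite3", dbPath) // driver name: project-dependent
    if err != nil {
        return err
    }
    defer db.Close()

    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    stmt, err := tx.Prepare(`UPDATE observations SET resolved_path = ? WHERE id = ?`)
    if err != nil {
        return err
    }
    defer stmt.Close()

    for id, path := range paths {
        if _, err := stmt.Exec(path, id); err != nil {
            return err
        }
    }
    return tx.Commit()
}
```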
#### 5. Lock contention

The write lock is held only during the in-memory update phase of each chunk (~5,000 pointer assignments + re-picks). This takes microseconds. HTTP handlers acquire read locks for API responses — they will not be blocked for any perceptible duration.
#### 6. Frontend handling

The `hop-resolver.js` module already handles unresolved (prefix) hops gracefully — it shows abbreviated pubkeys. No frontend changes are required for correctness.

Optional enhancement: read the `X-CoreScope-Status` header and show a transient info banner during backfill. This is cosmetic and can be done in a follow-up.
### What about first-run specifically?

On first run with a pre-existing database (e.g., migrating from a version without `resolved_path`), ALL 7.3M observations need backfill. The async approach handles this identically — it just takes longer in the background while HTTP is already serving.

On subsequent restarts, `resolved_path` is already persisted in SQLite and loaded by `store.Load()`. The backfill loop finds zero pending observations and exits immediately.
### What about new observations during backfill?

The poller ingests new packets continuously. New observations written by the ingestor already have `resolved_path` set at ingest time (this is already implemented). The backfill only processes observations with `ResolvedPath == nil`, so there's no conflict with new data.
## Alternatives considered

### Lazy resolution (resolve on API access)

Resolve `resolved_path` only when an observation is accessed via the API, and cache the result.

**Rejected because:**

- Adds latency to every API call that touches unresolved observations
- Cache invalidation complexity (when does a cached resolution become stale?)
- Doesn't help with index accuracy — indexes still need full data
- The backfill is a one-time cost; lazy resolution makes it a recurring cost
### Progressive loading (recent data first)

Load only the last 24h into memory, start serving, load historical data in background.

**Rejected because:**

- Significantly more complex — all store operations need "is this data loaded yet?" checks
- Memory implications: need to track which time ranges are loaded
- Historical queries return wrong results during loading (not just degraded — wrong)
- The actual bottleneck is backfill, not `Load()`. Even loading all 7.3M observations takes only ~90s.
### Chunked blocking backfill (yield to HTTP between chunks, but keep in main startup)

Process N observations per tick with `runtime.Gosched()` between chunks, but still in `main()` before `ListenAndServe`.

**Rejected because:**

- HTTP still isn't available until all chunks complete
- Adds complexity without solving the core problem
## Carmack Review (Performance)

**The approach is sound.** Moving a 20–30 minute blocking operation to a background goroutine is the right call. Some notes:

1. **Chunk size tuning.** 5,000 is a reasonable starting point. Monitor: if write lock contention shows up in pprof (unlikely with microsecond hold times), reduce the chunk size. If backfill is too slow, increase it or reduce the yield time.

2. **Memory is not a concern.** The observations are already fully loaded in memory by `Load()`. The backfill only mutates the `ResolvedPath` field on existing objects — no additional memory allocation beyond temporary slices for the chunk.

3. **No hidden costs in `resolvePathForObs`.** It reads `nodePM` (a `PrefixMatcher`, immutable after startup) and `graph` (the neighbor graph, immutable after startup). No locks are needed during resolution. This is embarrassingly parallelizable if needed, but single-goroutine processing with chunking is sufficient.

4. **The index rebuild at the end is O(n) and takes ~35s.** This is a one-time cost after the first backfill. Not worth optimizing further unless the profile shows otherwise.

5. **Risk: `pickBestObservation` during backfill.** API responses may flip their "best" observation as resolved paths become available. This is cosmetically noisy but functionally correct. Document this as expected behavior.

6. **Future optimization if needed:** The backfill loop could be parallelized across multiple goroutines (partition observations by transmission hash). The resolution step is CPU-bound and read-only. This would reduce backfill wall time from 30 min to ~5 min on 8 cores. Not needed for MVP — the goal is HTTP availability, not backfill speed. A rough sketch follows below.
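If backfill wall time ever matters, the resolution phase could be fanned out along these lines. This is a rough sketch only, not part of this plan; the strided partition stands in for the by-transmission-hash partitioning mentioned above, and it reuses the assumed `obs.ID` field and string-valued `resolvePathForObs` from the earlier chunk sketch:

```go
// resolveParallel spreads the CPU-bound, read-only resolution step across
// workers; SQLite writes and the in-memory update stay single-threaded.
func resolveParallel(store *PacketStore, pending []*Observation, workers int) map[int64]string {
    var mu sync.Mutex
    out := make(map[int64]string, len(pending))
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func(offset int) {
            defer wg.Done()
            for i := offset; i < len(pending); i += workers { // strided partition
                obs := pending[i]
                path := resolvePathForObs(store, obs) // no lock: read-only data
                mu.Lock()
                out[obs.ID] = path
                mu.Unlock()
            }
        }(w)
    }
    wg.Wait()
    return out
}
```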
## Implementation plan

1. **Refactor `backfillResolvedPaths` into a chunked async version** — new function `backfillResolvedPathsAsync` that processes in chunks and yields
2. **Move the backfill call in `main.go` to after `ListenAndServe`** — wrap it in a goroutine
3. **Add a `backfillComplete` atomic flag to `PacketStore`** — set after backfill finishes
4. **Add the `X-CoreScope-Status` response header** — middleware reads the flag
5. **Rebuild indexes after backfill completes** — single call to rebuild subpath/distance/path-hop
6. **Tests:** unit test for chunked backfill (mock store with N unresolved obs, verify chunks process correctly; see the test sketch below)
7. **Frontend (follow-up):** optional banner during the backfill state

Estimated effort: 1–2 hours for steps 1–5, plus tests.
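A starting point for the chunked-backfill unit test, assuming a test helper that seeds a store (and a backing SQLite file) with a given number of unresolved observations (`newTestStoreWithUnresolved` is a placeholder name):

```go
func TestBackfillChunksDrainAllPending(t *testing.T) {
    // newTestStoreWithUnresolved is an assumed helper that returns a store and
    // a temp SQLite path seeded with n observations whose ResolvedPath is nil.
    store, dbPath := newTestStoreWithUnresolved(t, 12345)

    const chunkSize = 5000
    total := 0
    for {
        n := backfillResolvedPathsChunk(store, dbPath, chunkSize)
        if n == 0 {
            break
        }
        if n > chunkSize {
            t.Fatalf("chunk processed %d observations, want at most %d", n, chunkSize)
        }
        total += n
    }
    if total != 12345 {
        t.Fatalf("backfilled %d observations, want 12345", total)
    }
}
```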