dandri/meshcore-analyzer

Fork 0

mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-04-21 19:45:43 +00:00

Files

you 8a94c43334 docs: startup performance spec — serve HTTP within 2 minutes on any DB size

2026-04-05 07:09:55 +00:00

10 KiB

Raw Blame History

Startup Performance: Serve HTTP Within 2 Minutes on Any Database Size

Problem

CoreScope takes 30–45 minutes to start on large databases (325K transmissions, 7.3M observations, 1.4GB SQLite). The HTTP server is completely unavailable during this time. Operators cannot restart without 30+ minutes of downtime.

Where time goes (7.3M observation benchmark)

Phase	Time	Blocking?
`Load()` — read SQLite → memory	~90s	Yes
Build subpath index	~20s	Yes
Build distance index	~15s	Yes
Build path-hop index	<1s	Yes
Load neighbor edges from SQLite	<1s	Yes
Backfill `resolved_path` for NULL observations	20–30+ min	Yes — the killer
Re-pick best observations	~10s	Yes

The backfill calls resolvePathForObs for every observation with resolved_path IS NULL, then writes results back to SQLite and updates in-memory state. On first run (or after schema migration), this means resolving all 7.3M observations.

Root cause

backfillResolvedPaths() in neighbor_persist.go runs synchronously in main() before httpServer.ListenAndServe(). It:

Collects all observations with ResolvedPath == nil under a read lock
Resolves paths (CPU-bound, ~millions of calls to resolvePathForObs)
Writes results to SQLite in a single transaction
Updates in-memory state under a write lock

Steps 2–4 block the main goroutine for 20–30 minutes.

Solution: Async Chunked Backfill

Design

Move backfillResolvedPaths out of the startup critical path. Start the HTTP server immediately after loading data and building indexes. Run backfill in a background goroutine with chunked processing that yields between batches.

Startup sequence (new)

1. OpenDB, verify tables                    (~1s)
2. store.Load()                             (~90s)
3. ensureNeighborEdgesTable                  (<1s)
4. ensureResolvedPathColumn                  (<1s)
5. Load/build neighbor graph                 (<1s)
6. Build subpath/distance/path-hop indexes   (~35s)
7. pickBestObservation (with whatever        (~10s)
   resolved_path data exists)
8. *** START HTTP SERVER ***                 — serving at ~2min mark
9. Background: backfillResolvedPaths         (20-30 min, non-blocking)
   → chunked, yields between batches
   → updates in-memory + SQLite incrementally
   → re-picks best obs for affected txs

Total time to first HTTP response: ~2 minutes regardless of database size.

Implementation details

1. Background backfill goroutine

// In main(), after starting HTTP server:
go func() {
    backfillResolvedPathsAsync(store, dbPath, 5000, 100*time.Millisecond)
}()

The async backfill processes observations in chunks of N (e.g., 5,000):

func backfillResolvedPathsAsync(store *PacketStore, dbPath string, chunkSize int, yieldDuration time.Duration) {
    for {
        n := backfillResolvedPathsChunk(store, dbPath, chunkSize)
        if n == 0 {
            break // done
        }
        log.Printf("[store] backfilled resolved_path for %d observations (async)", n)
        time.Sleep(yieldDuration) // yield to HTTP handlers
    }
    log.Printf("[store] async resolved_path backfill complete")
}

Each chunk:

Takes a read lock, collects up to chunkSize pending observations, releases lock
Resolves paths (no lock held — resolvePathForObs only reads immutable data)
Opens a separate RW SQLite connection, writes results in a transaction
Takes a write lock, updates in-memory obs.ResolvedPath and re-picks best obs for affected transmissions, releases lock
Sleeps briefly to yield CPU/lock time to HTTP handlers

2. Readiness flag and API degraded-mode header

Add a boolean to PacketStore:

type PacketStore struct {
    // ...
    backfillComplete atomic.Bool
}

API responses include a header during backfill:

X-CoreScope-Status: backfilling
X-CoreScope-Backfill-Remaining: 4523000

After backfill completes:

X-CoreScope-Status: ready

The frontend can read this header and show a subtle banner: "Resolving hop paths… some paths may show abbreviated pubkeys."

3. Index rebuilds

The subpath, distance, and path-hop indexes are built during startup from whatever data exists. During backfill, newly resolved paths need to update these indexes incrementally.

Options (in order of preference):

Option A: Defer index updates to end of backfill. Indexes work fine with unresolved paths — they just produce slightly less precise results. After backfill completes, rebuild indexes once. Simple, correct, low risk.

Option B: Incremental index updates per chunk. After each chunk, update affected index entries. More complex, better real-time accuracy. Only worth it if index accuracy during backfill matters for production use.

Recommendation: Option A. The indexes are usable with unresolved paths. A single rebuild at the end (~35s) is cheap compared to the backfill duration. The API works throughout — results just improve after backfill finishes.

4. SQLite contention

The backfill opens a separate RW connection for writes. The main server uses a read-only connection for polling. SQLite WAL mode (already in use) allows concurrent readers and one writer. Contention risk is minimal:

Write transactions are small (5,000 UPDATEs per chunk, batched in a single tx)
Read queries from HTTP handlers are unaffected by WAL writes
The 100ms yield between chunks prevents sustained write pressure

5. Lock contention

The write lock is held only during the in-memory update phase of each chunk (~5,000 pointer assignments + re-picks). This takes microseconds. HTTP handlers acquire read locks for API responses — they will not be blocked for any perceptible duration.

6. Frontend handling

The hop-resolver.js module already handles unresolved (prefix) hops gracefully — it shows abbreviated pubkeys. No frontend changes are required for correctness.

Optional enhancement: read the X-CoreScope-Status header and show a transient info banner during backfill. This is cosmetic and can be done in a follow-up.

What about first-run specifically?

On first run with a pre-existing database (e.g., migrating from a version without resolved_path), ALL 7.3M observations need backfill. The async approach handles this identically — it just takes longer in the background while HTTP is already serving.

On subsequent restarts, resolved_path is already persisted in SQLite and loaded by store.Load(). The backfill loop finds zero pending observations and exits immediately.

What about new observations during backfill?

The poller ingests new packets continuously. New observations written by the ingestor already have resolved_path set at ingest time (this is already implemented). The backfill only processes observations with ResolvedPath == nil, so there's no conflict with new data.

Alternatives considered

Lazy resolution (resolve on API access)

Resolve resolved_path only when an observation is accessed via API, cache the result.

Rejected because:

Adds latency to every API call that touches unresolved observations
Cache invalidation complexity (when does a cached resolution become stale?)
Doesn't help with index accuracy — indexes still need full data
The backfill is a one-time cost; lazy resolution makes it a recurring cost

Progressive loading (recent data first)

Load only the last 24h into memory, start serving, load historical data in background.

Rejected because:

Significantly more complex — all store operations need "is this data loaded yet?" checks
Memory implications: need to track which time ranges are loaded
Historical queries return wrong results during loading (not just degraded — wrong)
The actual bottleneck is backfill, not Load(). Even loading all 7.3M observations takes only ~90s.

Chunked blocking backfill (yield to HTTP between chunks, but keep in main startup)

Process N observations per tick with runtime.Gosched() between chunks, but still in main() before ListenAndServe.

Rejected because:

HTTP still isn't available until all chunks complete
Adds complexity without solving the core problem

Carmack Review (Performance)

The approach is sound. Moving a 20–30 minute blocking operation to a background goroutine is the right call. Some notes:

Chunk size tuning. 5,000 is a reasonable starting point. Monitor: if write lock contention shows up in pprof (unlikely with microsecond hold times), reduce chunk size. If backfill is too slow, increase it or reduce yield time.
Memory is not a concern. The observations are already fully loaded in memory by Load(). The backfill only mutates the ResolvedPath field on existing objects — no additional memory allocation beyond temporary slices for the chunk.
No hidden costs in resolvePathForObs. It reads nodePM (a PrefixMatcher, immutable after startup) and graph (neighbor graph, immutable after startup). No locks needed during resolution. This is embarrassingly parallelizable if needed, but single-goroutine processing with chunking is sufficient.
The index rebuild at the end is O(n) and takes ~35s. This is a one-time cost after the first backfill. Not worth optimizing further unless the profile shows otherwise.
Risk: pickBestObservation during backfill. API responses may flip their "best" observation as resolved paths become available. This is cosmetically noisy but functionally correct. Document this as expected behavior.
Future optimization if needed: The backfill loop could be parallelized across multiple goroutines (partition observations by transmission hash). The resolution step is CPU-bound and read-only. This would reduce backfill wall time from 30 min to ~5 min on 8 cores. Not needed for MVP — the goal is HTTP availability, not backfill speed.

Implementation plan

Refactor backfillResolvedPaths into chunked async version — new function backfillResolvedPathsAsync that processes in chunks and yields
Move backfill call in main.go to after ListenAndServe — wrap in goroutine
Add backfillComplete atomic flag to PacketStore — set after backfill finishes
Add X-CoreScope-Status response header — middleware reads the flag
Rebuild indexes after backfill completes — single call to rebuild subpath/distance/path-hop
Tests: unit test for chunked backfill (mock store with N unresolved obs, verify chunks process correctly)
Frontend (follow-up): optional banner during backfill state

Estimated effort: 1–2 hours for steps 1–5, plus tests.

10 KiB Raw Blame History Unescape Escape

Startup Performance: Serve HTTP Within 2 Minutes on Any Database Size

Problem

Where time goes (7.3M observation benchmark)

Root cause

Solution: Async Chunked Backfill

Design

Startup sequence (new)

Implementation details

1. Background backfill goroutine

2. Readiness flag and API degraded-mode header

3. Index rebuilds

4. SQLite contention

5. Lock contention

6. Frontend handling

What about first-run specifically?

What about new observations during backfill?

Alternatives considered

Lazy resolution (resolve on API access)

Progressive loading (recent data first)

Chunked blocking backfill (yield to HTTP between chunks, but keep in main startup)

Carmack Review (Performance)

Implementation plan

10 KiB

Raw Blame History