Mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git
Synced 2026-05-11 16:54:58 +00:00, commit 74dffa2fb7
## Summary

Implements per-component disk I/O + write-source metrics on the Perf page so operators can self-diagnose write-volume anomalies (cf. the BackfillPathJSON loop debugged in #1119) without SSHing in to run iotop/fatrace.

Partial fix for #1120.

## What's done (4/6 ACs)

- ✅ `/api/perf/io` — server-process `/proc/self/io` delta rates (read/write bytes per sec, syscalls)
- ✅ `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate
- ✅ `/api/perf/write-sources` — per-component counters from the ingestor (tx/obs/upserts/backfill_*)
- ✅ Frontend Perf page — three new sections with anomaly thresholds + per-second rate columns

## What's NOT done (deferred to follow-up)

- ❌ `cancelledWriteBytesPerSec` field — issue #1120 lists this under server-process I/O ("writes the kernel discarded — interesting signal"); not exposed in this PR.
- ❌ Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and server"; only server-process I/O lands here. Adding ingestor I/O requires either a unix socket back to the server, or surfacing the ingestor pid through the stats file. Doable without changing the existing API shape.
- ❌ Adaptive baselining — anomaly thresholds remain static (10×, 100 MB, 90%); steady-state baselining can come once we have enough deployed Perf-page telemetry.

Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than "Fixes #1120" so the issue stays open until the remaining ACs land.

## Backend

**Server (`cmd/server/perf_io.go`)**

- `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate `{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since the last call (in-memory tracker, no allocation per sample).
- `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount, pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the in-process row cache (the closest available signal under the modernc sqlite driver).
- `GET /api/perf/write-sources` — reads the ingestor's stats JSON file and returns a flat `{sources: {...}, sampleAt}` payload.

**Ingestor (`cmd/ingestor/`)**

- `DBStats` gains `WALCommits atomic.Int64` (incremented on every successful `tx.Commit()` and on every auto-commit `InsertTransmission` write) and `BackfillUpdates sync.Map` keyed by backfill name, with `IncBackfill(name)` / `SnapshotBackfills()` helpers.
- `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]` per row write — a BackfillPathJSON-style infinite loop becomes immediately visible as `backfill_path_json` in the Write Sources table.
- New `StartStatsFileWriter` publishes a JSON snapshot to `/tmp/corescope-ingestor-stats.json` (override via `CORESCOPE_INGESTOR_STATS`) every second, using atomic tmp+rename. The tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode `0o600`, so a pre-planted symlink in a world-writable `/tmp` cannot redirect the write to an arbitrary file.

## Frontend (`public/perf.js`)

Three new sections on the Perf page, all auto-refreshed via the existing 5s interval:

- **Disk I/O (server process)** — read/write rates (formatted B/KB/MB per sec) + syscall counts. Write rate >10 MB/s flags ⚠️.
- **Write Sources** — sorted table of per-component counters with a per-second rate column derived from snapshot deltas. Backfill rows show ⚠️ only when `tx_inserted >= 100` (a meaningful baseline) AND the backfill's per-second rate exceeds 10× the live tx rate. This avoids the spurious startup alarm where comparing cumulative counters against cumulative counters was a tautology.
- **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count, page size, cache hit rate (⚠️ when <90%).

## Tests

- **Backend** (`cmd/server/perf_io_test.go`) — `TestPerfIOEndpoint_ReturnsValidJSON`, `TestPerfSqliteEndpoint_ReturnsValidJSON`, `TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new endpoints. Skips the `/proc/self/io` non-zero-rate assertion when `/proc` is unavailable.
- **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js` with a stubbed `fetch` and asserts that the three new sections render with their headings + values. E2E assertion added at test-perf-disk-io-1120.js:91.

## TDD

1. Red commit (`21abd22`) — added the three handlers as no-op stubs returning empty values; tests fail on assertion mismatches (non-zero rate, `pageSize > 0`, headings present).
2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser, the PRAGMA queries, the ingestor stats writer, and the Perf page rendering.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>
201 lines
5.8 KiB
Go
```go
package main

import (
	"bufio"
	"encoding/json"
	"net/http"
	"os"
	"strconv"
	"strings"
	"sync"
	"time"
)

// PerfIOResponse holds per-process disk I/O metrics derived from /proc/self/io.
type PerfIOResponse struct {
	ReadBytesPerSec  float64 `json:"readBytesPerSec"`
	WriteBytesPerSec float64 `json:"writeBytesPerSec"`
	SyscallsRead     float64 `json:"syscallsRead"`
	SyscallsWrite    float64 `json:"syscallsWrite"`
}

// PerfSqliteResponse holds SQLite-specific perf metrics.
type PerfSqliteResponse struct {
	WalSizeMB    float64 `json:"walSizeMB"`
	WalSize      int64   `json:"walSize"`
	PageCount    int64   `json:"pageCount"`
	PageSize     int64   `json:"pageSize"`
	CacheSize    int64   `json:"cacheSize"`
	CacheHitRate float64 `json:"cacheHitRate"`
}

// procIOSample is a snapshot of /proc/self/io counters.
type procIOSample struct {
	at         time.Time
	readBytes  int64
	writeBytes int64
	syscR      int64
	syscW      int64
}

// perfIOTracker keeps the previous sample so handlePerfIO can compute deltas.
var (
	perfIOMu         sync.Mutex
	perfIOLastSample procIOSample
)

// readProcIO parses /proc/self/io. Returns a zero sample on non-Linux or read failure.
func readProcIO() procIOSample {
	s := procIOSample{at: time.Now()}
	f, err := os.Open("/proc/self/io")
	if err != nil {
		return s
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		val, err := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64)
		if err != nil {
			continue
		}
		switch key {
		case "read_bytes":
			s.readBytes = val
		case "write_bytes":
			s.writeBytes = val
		case "syscr":
			s.syscR = val
		case "syscw":
			s.syscW = val
		}
	}
	return s
}

// handlePerfIO returns delta-rate disk I/O for the server process (per-second).
// On the first call (no prior sample), rates are zero; subsequent calls
// report the delta divided by elapsed seconds.
func (s *Server) handlePerfIO(w http.ResponseWriter, r *http.Request) {
	cur := readProcIO()
	resp := PerfIOResponse{}

	perfIOMu.Lock()
	prev := perfIOLastSample
	perfIOLastSample = cur
	perfIOMu.Unlock()

	if !prev.at.IsZero() {
		dt := cur.at.Sub(prev.at).Seconds()
		if dt < 0.001 {
			dt = 0.001
		}
		resp.ReadBytesPerSec = float64(cur.readBytes-prev.readBytes) / dt
		resp.WriteBytesPerSec = float64(cur.writeBytes-prev.writeBytes) / dt
		resp.SyscallsRead = float64(cur.syscR-prev.syscR) / dt
		resp.SyscallsWrite = float64(cur.syscW-prev.syscW) / dt
	}
	writeJSON(w, resp)
}

// handlePerfSqlite returns SQLite WAL size + cache hit-rate stats.
func (s *Server) handlePerfSqlite(w http.ResponseWriter, r *http.Request) {
	resp := PerfSqliteResponse{}
	if s.db != nil && s.db.conn != nil {
		var pageCount, pageSize int64
		_ = s.db.conn.QueryRow("PRAGMA page_count").Scan(&pageCount)
		_ = s.db.conn.QueryRow("PRAGMA page_size").Scan(&pageSize)
		var cacheSize int64
		_ = s.db.conn.QueryRow("PRAGMA cache_size").Scan(&cacheSize)
		resp.PageCount = pageCount
		resp.PageSize = pageSize
		resp.CacheSize = cacheSize

		// Cache hit rate: derived from PacketStore cache (rw_cache). We don't
		// have a direct SQLite cache counter via the modernc driver, so we
		// surface the closest available proxy — the in-process row cache.
		if s.store != nil {
			cs := s.store.GetCacheStatsTyped()
			total := cs.Hits + cs.Misses
			if total > 0 {
				resp.CacheHitRate = float64(cs.Hits) / float64(total)
			}
		}

		if s.db.path != "" && s.db.path != ":memory:" {
			if info, err := os.Stat(s.db.path + "-wal"); err == nil {
				resp.WalSize = info.Size()
				resp.WalSizeMB = float64(info.Size()) / 1048576
			}
		}
	}
	writeJSON(w, resp)
}

// IngestorStats is the on-disk JSON shape the ingestor writes periodically
// for the server to expose via /api/perf/write-sources.
type IngestorStats struct {
	SampledAt          string           `json:"sampledAt"`
	TxInserted         int64            `json:"tx_inserted"`
	ObsInserted        int64            `json:"obs_inserted"`
	DuplicateTx        int64            `json:"tx_dupes"`
	NodeUpserts        int64            `json:"node_upserts"`
	ObserverUpserts    int64            `json:"observer_upserts"`
	WriteErrors        int64            `json:"write_errors"`
	SignatureDrops     int64            `json:"sig_drops"`
	WALCommits         int64            `json:"walCommits"`
	GroupCommitFlushes int64            `json:"groupCommitFlushes"`
	BackfillUpdates    map[string]int64 `json:"backfillUpdates"`
}

// IngestorStatsPath is the well-known location where the ingestor writes its
// rolling stats snapshot. Overridable by env CORESCOPE_INGESTOR_STATS for tests.
func IngestorStatsPath() string {
	if p := os.Getenv("CORESCOPE_INGESTOR_STATS"); p != "" {
		return p
	}
	return "/tmp/corescope-ingestor-stats.json"
}

// handlePerfWriteSources reads the ingestor's stats file and returns a flat
// map of source-name -> counter, plus the sample timestamp.
func (s *Server) handlePerfWriteSources(w http.ResponseWriter, r *http.Request) {
	out := map[string]interface{}{
		"sources":  map[string]int64{},
		"sampleAt": "",
	}

	data, err := os.ReadFile(IngestorStatsPath())
	if err != nil {
		writeJSON(w, out)
		return
	}
	var st IngestorStats
	if err := json.Unmarshal(data, &st); err != nil {
		writeJSON(w, out)
		return
	}
	sources := map[string]int64{
		"tx_inserted":        st.TxInserted,
		"tx_dupes":           st.DuplicateTx,
		"obs_inserted":       st.ObsInserted,
		"node_upserts":       st.NodeUpserts,
		"observer_upserts":   st.ObserverUpserts,
		"write_errors":       st.WriteErrors,
		"sig_drops":          st.SignatureDrops,
		"walCommits":         st.WALCommits,
		"groupCommitFlushes": st.GroupCommitFlushes,
	}
	for name, v := range st.BackfillUpdates {
		sources["backfill_"+name] = v
	}
	out["sources"] = sources
	out["sampleAt"] = st.SampledAt
	writeJSON(w, out)
}
```