meshcore-analyzer/cmd/server/perf_io.go
Kpa-clawbot 74dffa2fb7 feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123)
## Summary

Implements per-component disk I/O + write source metrics on the Perf
page so operators can self-diagnose write-volume anomalies (cf. the
BackfillPathJSON loop debugged in #1119) without SSHing in to run
iotop/fatrace.

Partial fix for #1120

## What's done (4/6 ACs)
-  `/api/perf/io` — server-process `/proc/self/io` delta rates
(read/write bytes per sec, syscalls)
-  `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate
-  `/api/perf/write-sources` — per-component counters from ingestor
(tx/obs/upserts/backfill_*)
-  Frontend Perf page — three new sections with anomaly thresholds +
per-second rate columns

## What's NOT done (deferred to follow-up)
-  `cancelledWriteBytesPerSec` field — issue #1120 lists this under
server-process I/O ("writes the kernel discarded — interesting signal");
not exposed in this PR
-  Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and
server"; only server-process I/O lands here. Adding ingestor I/O
requires either a unix socket back to the server, or surfacing the
ingestor pid through the stats file. Doable without changing the
existing API shape.
-  Adaptive baselining — anomaly thresholds remain static (10×, 100 MB,
90%); steady-state baselining can come once we have enough deployed
Perf-page telemetry
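For the deferred ingestor half, the existing `/proc/self/io` parser generalizes directly to `/proc/<pid>/io` once a pid is surfaced. A minimal sketch, assuming a hypothetical pid plumbed through the stats file; the key:value format is the one the kernel documents for `/proc/<pid>/io`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procIOPath builds the /proc path for an arbitrary pid's I/O counters.
func procIOPath(pid int) string {
	return "/proc/" + strconv.Itoa(pid) + "/io"
}

// parseProcIO extracts the four counters the Perf page uses from the
// "key: value" lines of a /proc/<pid>/io snapshot.
func parseProcIO(content string) (readBytes, writeBytes, syscR, syscW int64) {
	for _, line := range strings.Split(content, "\n") {
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		val, err := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64)
		if err != nil {
			continue
		}
		switch strings.TrimSpace(parts[0]) {
		case "read_bytes":
			readBytes = val
		case "write_bytes":
			writeBytes = val
		case "syscr":
			syscR = val
		case "syscw":
			syscW = val
		}
	}
	return
}

func main() {
	// Hypothetical: pid would come from the ingestor stats file.
	sample := "syscr: 10\nsyscw: 20\nread_bytes: 4096\nwrite_bytes: 8192\ncancelled_write_bytes: 0"
	r, w, sr, sw := parseProcIO(sample)
	fmt.Println(procIOPath(1234), r, w, sr, sw) // prints "/proc/1234/io 4096 8192 10 20"
}
```

Because the parser takes a string rather than opening the file itself, the same code serves both processes and stays testable without `/proc`.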

Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than
"Fixes #1120" so the issue stays open until the remaining ACs land.

## Backend

**Server (`cmd/server/perf_io.go`)**
- `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate
`{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since
last call (in-memory tracker, no allocation per sample).
- `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount,
pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the
in-process row cache (closest available signal under the modernc sqlite
driver).
- `GET /api/perf/write-sources` — reads the ingestor's stats JSON file
and returns a flat `{sources: {...}, sampleAt}` payload.

**Ingestor (`cmd/ingestor/`)**
- `DBStats` gains `WALCommits atomic.Int64` (incremented on every
successful `tx.Commit()` and on every auto-commit `InsertTransmission`
write) and `BackfillUpdates sync.Map` keyed by backfill name with
`IncBackfill(name)` / `SnapshotBackfills()` helpers.
- `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]`
per row write — the BackfillPathJSON-style infinite loop becomes
immediately visible at `backfill_path_json` in the Write Sources table.
- New `StartStatsFileWriter` publishes a JSON snapshot to
`/tmp/corescope-ingestor-stats.json` (override via
`CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The
tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode
`0o600` so a pre-planted symlink in a world-writable `/tmp` cannot
redirect the write to an arbitrary file.

## Frontend (`public/perf.js`)

Three new sections on the Perf page, all auto-refreshed via the existing
5s interval:

- **Disk I/O (server process)** — read/write rates (formatted
B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️.
- **Write Sources** — sorted table of per-component counters with a
per-second rate column derived from snapshot deltas. Backfill rows show
⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the
backfill's per-second rate exceeds 10× the live tx rate. This avoids the
spurious startup alarm that comparing two cumulative counters would
always trigger.
- **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count,
page size, cache hit rate (⚠️ when <90%).
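The backfill rule itself lives in `perf.js`, but the decision is language-agnostic; a sketch in Go with the thresholds from the description above (function names are hypothetical, not the actual perf.js code):

```go
package main

import "fmt"

// backfillAnomalous applies the Write Sources rule: flag a backfill only
// once there is a meaningful tx baseline (tx_inserted >= 100) AND its
// per-second rate exceeds 10x the live tx rate.
func backfillAnomalous(txInserted int64, backfillRate, txRate float64) bool {
	return txInserted >= 100 && backfillRate > 10*txRate
}

// ratePerSec derives a per-second rate from two cumulative snapshots taken
// dtSec seconds apart.
func ratePerSec(prev, cur int64, dtSec float64) float64 {
	if dtSec <= 0 {
		return 0
	}
	return float64(cur-prev) / dtSec
}

func main() {
	txRate := ratePerSec(1000, 1050, 5)    // 10 tx/s over a 5s window
	backfillRate := ratePerSec(0, 2500, 5) // 500 rows/s
	fmt.Println(backfillAnomalous(1050, backfillRate, txRate)) // true: 500 > 10*10
	fmt.Println(backfillAnomalous(50, backfillRate, txRate))   // false: no baseline yet
}
```

Comparing two *rates* rather than two cumulative totals is what keeps the check meaningful at startup, when every cumulative counter is near zero.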

## Tests

- **Backend** (`cmd/server/perf_io_test.go`) —
`TestPerfIOEndpoint_ReturnsValidJSON`,
`TestPerfSqliteEndpoint_ReturnsValidJSON`,
`TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new
endpoints. Skips the `/proc/self/io` non-zero-rate assertion when
`/proc` is unavailable.
- **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js`
with stubbed `fetch`, asserts the three new sections render with their
headings + values.

E2E assertion added: test-perf-disk-io-1120.js:91

## TDD

1. Red commit (`21abd22`) — added the three handlers as no-op stubs
returning empty values; tests fail on assertion mismatches (non-zero
rate, `pageSize > 0`, headings present).
2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser,
PRAGMA queries, ingestor stats writer, and Perf page rendering.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>
2026-05-05 17:56:56 -07:00


```go
package main

import (
	"bufio"
	"encoding/json"
	"net/http"
	"os"
	"strconv"
	"strings"
	"sync"
	"time"
)

// PerfIOResponse holds per-process disk I/O metrics derived from /proc/self/io.
type PerfIOResponse struct {
	ReadBytesPerSec  float64 `json:"readBytesPerSec"`
	WriteBytesPerSec float64 `json:"writeBytesPerSec"`
	SyscallsRead     float64 `json:"syscallsRead"`
	SyscallsWrite    float64 `json:"syscallsWrite"`
}

// PerfSqliteResponse holds SQLite-specific perf metrics.
type PerfSqliteResponse struct {
	WalSizeMB    float64 `json:"walSizeMB"`
	WalSize      int64   `json:"walSize"`
	PageCount    int64   `json:"pageCount"`
	PageSize     int64   `json:"pageSize"`
	CacheSize    int64   `json:"cacheSize"`
	CacheHitRate float64 `json:"cacheHitRate"`
}

// procIOSample is a snapshot of /proc/self/io counters.
type procIOSample struct {
	at         time.Time
	readBytes  int64
	writeBytes int64
	syscR      int64
	syscW      int64
}

// perfIOTracker keeps the previous sample so handlePerfIO can compute deltas.
var (
	perfIOMu         sync.Mutex
	perfIOLastSample procIOSample
)

// readProcIO parses /proc/self/io. Returns zero sample on non-Linux or read failure.
func readProcIO() procIOSample {
	s := procIOSample{at: time.Now()}
	f, err := os.Open("/proc/self/io")
	if err != nil {
		return s
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		val, err := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64)
		if err != nil {
			continue
		}
		switch key {
		case "read_bytes":
			s.readBytes = val
		case "write_bytes":
			s.writeBytes = val
		case "syscr":
			s.syscR = val
		case "syscw":
			s.syscW = val
		}
	}
	return s
}

// handlePerfIO returns delta-rate disk I/O for the server process (per-second).
// On the first call (no prior sample), rates are zero; subsequent calls
// report the delta divided by elapsed seconds.
func (s *Server) handlePerfIO(w http.ResponseWriter, r *http.Request) {
	cur := readProcIO()
	resp := PerfIOResponse{}
	perfIOMu.Lock()
	prev := perfIOLastSample
	perfIOLastSample = cur
	perfIOMu.Unlock()
	if !prev.at.IsZero() {
		dt := cur.at.Sub(prev.at).Seconds()
		if dt < 0.001 {
			dt = 0.001
		}
		resp.ReadBytesPerSec = float64(cur.readBytes-prev.readBytes) / dt
		resp.WriteBytesPerSec = float64(cur.writeBytes-prev.writeBytes) / dt
		resp.SyscallsRead = float64(cur.syscR-prev.syscR) / dt
		resp.SyscallsWrite = float64(cur.syscW-prev.syscW) / dt
	}
	writeJSON(w, resp)
}

// handlePerfSqlite returns SQLite WAL size + cache hit-rate stats.
func (s *Server) handlePerfSqlite(w http.ResponseWriter, r *http.Request) {
	resp := PerfSqliteResponse{}
	if s.db != nil && s.db.conn != nil {
		var pageCount, pageSize int64
		_ = s.db.conn.QueryRow("PRAGMA page_count").Scan(&pageCount)
		_ = s.db.conn.QueryRow("PRAGMA page_size").Scan(&pageSize)
		var cacheSize int64
		_ = s.db.conn.QueryRow("PRAGMA cache_size").Scan(&cacheSize)
		resp.PageCount = pageCount
		resp.PageSize = pageSize
		resp.CacheSize = cacheSize
		// Cache hit rate: derived from PacketStore cache (rw_cache). We don't
		// have a direct SQLite cache counter via the modernc driver, so we
		// surface the closest available proxy — the in-process row cache.
		if s.store != nil {
			cs := s.store.GetCacheStatsTyped()
			total := cs.Hits + cs.Misses
			if total > 0 {
				resp.CacheHitRate = float64(cs.Hits) / float64(total)
			}
		}
		if s.db.path != "" && s.db.path != ":memory:" {
			if info, err := os.Stat(s.db.path + "-wal"); err == nil {
				resp.WalSize = info.Size()
				resp.WalSizeMB = float64(info.Size()) / 1048576
			}
		}
	}
	writeJSON(w, resp)
}

// IngestorStats is the on-disk JSON shape the ingestor writes periodically
// for the server to expose via /api/perf/write-sources.
type IngestorStats struct {
	SampledAt          string           `json:"sampledAt"`
	TxInserted         int64            `json:"tx_inserted"`
	ObsInserted        int64            `json:"obs_inserted"`
	DuplicateTx        int64            `json:"tx_dupes"`
	NodeUpserts        int64            `json:"node_upserts"`
	ObserverUpserts    int64            `json:"observer_upserts"`
	WriteErrors        int64            `json:"write_errors"`
	SignatureDrops     int64            `json:"sig_drops"`
	WALCommits         int64            `json:"walCommits"`
	GroupCommitFlushes int64            `json:"groupCommitFlushes"`
	BackfillUpdates    map[string]int64 `json:"backfillUpdates"`
}

// IngestorStatsPath is the well-known location where the ingestor writes its
// rolling stats snapshot. Overridable by env CORESCOPE_INGESTOR_STATS for tests.
func IngestorStatsPath() string {
	if p := os.Getenv("CORESCOPE_INGESTOR_STATS"); p != "" {
		return p
	}
	return "/tmp/corescope-ingestor-stats.json"
}

// handlePerfWriteSources reads the ingestor's stats file and returns a flat
// map of source-name -> counter, plus the sample timestamp.
func (s *Server) handlePerfWriteSources(w http.ResponseWriter, r *http.Request) {
	out := map[string]interface{}{
		"sources":  map[string]int64{},
		"sampleAt": "",
	}
	data, err := os.ReadFile(IngestorStatsPath())
	if err != nil {
		writeJSON(w, out)
		return
	}
	var st IngestorStats
	if err := json.Unmarshal(data, &st); err != nil {
		writeJSON(w, out)
		return
	}
	sources := map[string]int64{
		"tx_inserted":        st.TxInserted,
		"tx_dupes":           st.DuplicateTx,
		"obs_inserted":       st.ObsInserted,
		"node_upserts":       st.NodeUpserts,
		"observer_upserts":   st.ObserverUpserts,
		"write_errors":       st.WriteErrors,
		"sig_drops":          st.SignatureDrops,
		"walCommits":         st.WALCommits,
		"groupCommitFlushes": st.GroupCommitFlushes,
	}
	for name, v := range st.BackfillUpdates {
		sources["backfill_"+name] = v
	}
	out["sources"] = sources
	out["sampleAt"] = st.SampledAt
	writeJSON(w, out)
}
```