meshcore-analyzer

mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-06-06 10:41:36 +00:00

Author	SHA1	Message	Date
Kpa-clawbot	467b01a1b3	fix(#1285): exclude RTC-reset outliers from clock-skew hash median + recent bad count (#1288)	2026-05-19 01:17:12 -07:00

Red commit: `97c9a22a55` (CI: https://github.com/Kpa-clawbot/CoreScope/commit/97c9a22a55b07d1576c579aa9d23b290dad33eb6/checks)

Fixes #1285.

## What was broken

Bug A: outlier-dominated hash-evidence median. On the per-hash evidence panel, a single observer reporting an RTC-reset advert (firmware emitting a factory timestamp, ~700d off) dragged the displayed median to "median corrected: -704d 18h" even when every other observer of that hash saw a normal value.

Bug B: false "N of last K had nonsense timestamps" warning. `recentBadSampleCount` lumped RTC-reset adverts in with "bimodal-bad" samples. On the repro node every recent skew was -16…-22s (healthy), but the lone RTC-reset advert that landed inside the recent window was counted as bad → "3 of last 5 adverts had nonsense timestamps" fired and the node was misclassified `bimodal_clock`.

Root cause of B: the recent-window split (`cmd/server/clock_skew.go` ~L575) classified anything `|corrected skew| > 1h` as "bad". That conflates true bimodal RTC oscillation (1h…24h) with factory-timestamp resets (>24h, already surfaced via the RTC-reset badge).

## Fix

- New `rtcResetOutlierThresholdSec = 24h`. Rationale: real µC drift is sub-second/advert; real bimodal RTC misbehaves in the hours range; anything >1d is not a drift signal.
- Recent-window split puts `|skew| > 24h` in a third bucket excluded from both `recentSampleCount` and `recentBadCount`.
- New `hashEvidenceMedian()` filters outliers before computing the per-hash median. UI labels the hash "insufficient data (N RTC-reset outliers excluded)" when every observer saw a reset-shaped advert.
- Three pre-existing #845 tests used -50M-sec "bad" samples (RTC-reset range); they were re-pointed to -7200s (true bimodal range), which is what `bimodal_clock` actually models.

## Preflight overrides

- check-branch-clean: cross-stack: justified. Backend computes counts/median; frontend renders the new label.

## Browser verification

Confirmed staging node `c0dedad…` repro matches the test fixture. No new CSS vars.

## E2E assertion added

`cmd/server/clock_skew_issue1285_test.go:81` and `:103`.

---------

Co-authored-by: corescope-bot <bot@corescope.local>
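The three-bucket split and the outlier-filtered median described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `cmd/server/clock_skew.go`: the constant names and `hashEvidenceMedian` come from the commit message, while `recentBuckets`, `splitRecent`, and every signature here are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const (
	bimodalSkewThresholdSec     = 3600.0  // 1h "bad" cutoff described in #845
	rtcResetOutlierThresholdSec = 86400.0 // 24h outlier cutoff described in #1285
)

// recentBuckets is a hypothetical holder for the three-way split.
type recentBuckets struct {
	Good, Bad, Outliers int
}

// splitRecent sorts corrected skews (seconds) into good, bimodal-bad, and
// RTC-reset outlier buckets. Outliers count toward neither good nor bad.
func splitRecent(correctedSkews []float64) recentBuckets {
	var b recentBuckets
	for _, s := range correctedSkews {
		abs := math.Abs(s)
		switch {
		case abs > rtcResetOutlierThresholdSec:
			b.Outliers++
		case abs > bimodalSkewThresholdSec:
			b.Bad++
		default:
			b.Good++
		}
	}
	return b
}

// hashEvidenceMedian (signature assumed) drops outliers, then takes the
// median of what is left. ok is false when every sample was an outlier.
func hashEvidenceMedian(correctedSkews []float64) (median float64, excluded int, ok bool) {
	kept := make([]float64, 0, len(correctedSkews))
	for _, s := range correctedSkews {
		if math.Abs(s) > rtcResetOutlierThresholdSec {
			excluded++
			continue
		}
		kept = append(kept, s)
	}
	if len(kept) == 0 {
		return 0, excluded, false
	}
	sort.Float64s(kept)
	mid := len(kept) / 2
	if len(kept)%2 == 1 {
		return kept[mid], excluded, true
	}
	return (kept[mid-1] + kept[mid]) / 2, excluded, true
}

func main() {
	// Four healthy skews plus one factory-timestamp advert (-704d 18h).
	b := splitRecent([]float64{-16, -22, -18, -60_890_400, -19})
	fmt.Printf("recentSampleCount=%d recentBadCount=%d outliers=%d\n",
		b.Good+b.Bad, b.Bad, b.Outliers)

	if med, excluded, ok := hashEvidenceMedian([]float64{-17, -19, -60_890_400}); ok {
		fmt.Printf("median corrected: %.1fs (%d RTC-reset outliers excluded)\n", med, excluded)
	}
}
```

With this shape, a lone reset-range advert no longer moves the per-hash median or the bad-sample count, which appears to be the behaviour the fix is after.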
Kpa-clawbot	b47587f031	feat(#690): expose observer skew + per-hash evidence in clock UI (#906)	2026-05-02 10:30:54 -07:00

## Summary

UI completion of #690: surfaces observer clock skew and per-hash evidence that the backend already computes but wasn't exposed in the frontend.

Not related to #845/PR #894 (bimodal detection). This is the UI surface for the original #690 scope.

## Changes

### Backend: per-hash evidence in node clock-skew API (commit 1)

- Extended `GET /api/nodes/{pubkey}/clock-skew` to return `recentHashEvidence` (most recent 10 hashes with per-observer raw/corrected skew and observer offset) and `calibrationSummary` (total/calibrated/uncalibrated counts).
- Evidence is cached during `ClockSkewEngine.Recompute()`, so the route handler is cheap.
- Fleet endpoint omits evidence to keep payload small.

### Frontend: observer list page, clock offset column (commit 2)

- Added "Clock Offset" column to observers table.
- Fetches `/api/observers/clock-skew` once on page load, joins by ObserverID.
- Color-coded severity badge + sample count tooltip.
- Singleton observers show "—" not "0".

### Frontend: observer-detail clock card (commit 3)

- Added clock offset card mirroring node clock card style.
- Shows: offset value, sample count, severity badge.
- Inline explainer describing how offset is computed from multi-observer packets.

### Frontend: node clock card evidence panel (commit 4)

- Collapsible "Evidence" section in existing node clock skew card.
- Per-hash breakdown: observer count, median corrected skew, per-observer raw/corrected/offset.
- Calibration summary line and plain-English severity reason at top.

## Test Results

```
go test ./... (cmd/server) — PASS (19.3s)
go test ./... (cmd/ingestor) — PASS (31.6s)
Frontend helpers: 610 passed, 0 failed
```

New test: `TestNodeClockSkew_EvidencePayload`, a 3-observer scenario verifying per-hash array shape, corrected = raw + offset math, and median.

No frontend JS smoke test was added because there is no existing test harness for clock/observer rendering. Noted for future.

## Screenshots

Screenshots TBD

## Perf justification

Evidence is computed inside the existing `Recompute()` cycle (already O(n) on samples). The `hashEvidence` map adds ~32 bytes per sample of memory. Evidence is stripped from fleet responses. Per-node endpoint returns at most 10 evidence entries, so the payload is bounded.

---------

Co-authored-by: you <you@example.com>
Kpa-clawbot	441409203e	feat(#845): bimodal_clock severity — surface flaky-RTC nodes instead of hiding as 'No Clock' (#850)	2026-04-21 09:11:14 -07:00

## Problem

Nodes with flaky RTC (firmware emitting interleaved good and nonsense timestamps) were classified as `no_clock` because the broken samples poisoned the recent median. Operators lost visibility into these nodes: they showed "No Clock" even though ~60% of their adverts had valid timestamps.

Observed on staging: a node with 31K samples where recent adverts interleave good skew (-6.8s, -13.6s) with firmware nonsense (-56M, -60M seconds). Under the old logic, median of the mixed window → `no_clock`.

## Solution

New `bimodal_clock` severity tier that surfaces flaky-RTC nodes with their real (good-sample) skew value.

### Classification order (first match wins)

| Severity | Good Fraction | Description |
|----------|---------------|-------------|
| `no_clock` | < 10% | Essentially no real clock |
| `bimodal_clock` | 10–80% (and bad > 0) | Mixed good/bad (flaky RTC) |
| `ok`/`warn`/`critical`/`absurd` | ≥ 80% | Normal classification |

"Good" = `|skew| <= 1 hour`; "bad" = likely uninitialized RTC nonsense.

When `bimodal_clock`, `recentMedianSkewSec` is computed from good samples only, so the dashboard shows the real working-clock value (e.g. -7s) instead of the broken median.

### Backend changes

- New constant `BimodalSkewThresholdSec = 3600`
- New severity `bimodal_clock` in classification logic
- New API fields: `goodFraction`, `recentBadSampleCount`, `recentSampleCount`

### Frontend changes

- Amber `Bimodal` badge with tooltip showing bad-sample percentage
- Bimodal nodes render skew value like ok/warn/severe (not the "No Clock" path)
- Warning line below sparkline: "⚠️ X of last Y adverts had nonsense timestamps (likely RTC reset)"

### Tests

- 3 new Go unit tests: bimodal (60% good → bimodal_clock), all-bad (→ no_clock), 90%-good (→ ok)
- 1 new frontend test: bimodal badge rendering with tooltip
- Existing `TestReporterScenario_789` passes unchanged

Builds on #789 (recent-window severity).

Closes #845

---------

Co-authored-by: you <you@example.com>
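The classification order in this commit (first match wins) might look something like the sketch below. Only `BimodalSkewThresholdSec` and the severity strings are taken from the commit message; `classifyRecent` is a hypothetical helper, the treatment of an empty window is an assumption, and the boundary at exactly 80% is resolved in favour of normal classification because the table leaves it ambiguous.

```go
package main

import (
	"fmt"
	"math"
)

// BimodalSkewThresholdSec is the 1h good/bad cutoff named in the commit.
const BimodalSkewThresholdSec = 3600.0

// classifyRecent is a hypothetical helper: it returns "no_clock",
// "bimodal_clock", or "normal" (meaning: fall through to the usual
// ok/warn/critical/absurd thresholds) plus the good fraction.
func classifyRecent(recentSkews []float64) (severity string, goodFraction float64) {
	if len(recentSkews) == 0 {
		return "no_clock", 0 // assumption: an empty window has no usable clock
	}
	good := 0
	for _, s := range recentSkews {
		if math.Abs(s) <= BimodalSkewThresholdSec {
			good++
		}
	}
	bad := len(recentSkews) - good
	goodFraction = float64(good) / float64(len(recentSkews))

	switch {
	case goodFraction < 0.10:
		return "no_clock", goodFraction
	case goodFraction < 0.80 && bad > 0:
		return "bimodal_clock", goodFraction
	default:
		return "normal", goodFraction
	}
}

func main() {
	// Interleaved good skews and firmware nonsense, as in the staging node.
	sev, frac := classifyRecent([]float64{-6.8, -56e6, -13.6, -60e6, -7})
	fmt.Printf("%s (good fraction %.0f%%)\n", sev, frac*100)
}
```

Running the first-match order this way means a node with three good adverts out of five lands in `bimodal_clock` rather than `no_clock`, matching the 60%-good unit test the commit lists.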
Kpa-clawbot	a0fddb50aa	fix(#789): severity from recent samples; Theil-Sen drift with outlier rejection (#828)	2026-04-20 22:47:10 -07:00

Closes #789.

## The two bugs

1. Severity from stale median. `classifySkew(absMedian)` used the all-time `MedianSkewSec` over every advert ever recorded for the node. A repeater that was off for hours and then GPS-corrected stayed pinned to `absurd` because hundreds of historical bad samples poisoned the median. Reporter's case: `medianSkewSec: -59,063,561.8` while `lastSkewSec: -0.8`. Current health was perfect, but the dashboard said catastrophic.
2. Drift from a single correction jump. Drift used OLS over every `(ts, skew)` pair, with no outlier rejection. A single GPS-correction event (skew jumps millions of seconds in ~30s) dominated the regression and produced `+1,793,549.9 s/day`, which is physically nonsense; the existing `maxReasonableDriftPerDay` cap then zeroed it (better than absurd, but still useless).

## The two fixes

1. Recent-window severity. New field `recentMedianSkewSec` = median over the last `N=5` samples or last `1h`, whichever is narrower (more current view). Severity now derives from `abs(recentMedianSkewSec)`. `MeanSkewSec`, `MedianSkewSec`, `LastSkewSec` are preserved unchanged so the frontend, fleet view, and any external consumers continue to work.
2. Theil-Sen drift with outlier filter. Drift now uses the Theil-Sen estimator (median of all pairwise slopes; textbook robust regression, ~29% breakdown point) on a series pre-filtered to drop samples whose skew jumps more than `maxPlausibleSkewJumpSec = 60s` from the previous accepted point. Real µC drift is fractions of a second per advert; clock corrections fall well outside. Capped at `theilSenMaxPoints = 200` (most-recent) so O(n²) stays bounded for chatty nodes.

## What stays the same

- Epoch-0 / out-of-range advert filter (PR #769).
- `minDriftSamples = 5` floor.
- `maxReasonableDriftPerDay = 86400` hard backstop.
- API shape: only additions (`recentMedianSkewSec`); no fields removed or renamed.

## Tests

All in `cmd/server/clock_skew_test.go`:

- `TestSeverityUsesRecentNotMedian`: 100 bad samples (-60s) + 5 good (-1s) → severity = `ok`, historical median still huge.
- `TestDriftRejectsCorrectionJump`: 30 min of clean linear drift + one 1000s jump → drift small (~12 s/day).
- `TestTheilSenMatchesOLSWhenClean`: clean linear data, Theil-Sen within ~1% of OLS.
- `TestReporterScenario_789`: exact reproducer: 1662 samples, 1657 @ -683 days then 5 @ -1s → severity `ok`, `recentMedianSkewSec ≈ 0`, drift bounded; legacy `medianSkewSec` preserved as historical context.

`go test ./... -count=1` (cmd/server) and `node test-frontend-helpers.js` both pass.

---------

Co-authored-by: clawbot <bot@corescope.local>
Co-authored-by: you <you@example.com>
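As an illustration of the second fix, here is a small sketch of a jump filter followed by a Theil-Sen slope estimate. The constant names and values are taken from the commit message; the `sample` type, `filterJumps`, `theilSenDriftPerDay`, and `demoSeries` are hypothetical names for this example, and the real implementation may differ in details such as how the first accepted point is chosen.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const (
	maxPlausibleSkewJumpSec = 60.0
	theilSenMaxPoints       = 200
	minDriftSamples         = 5
)

// sample is a hypothetical (timestamp, skew) pair, both in seconds.
type sample struct{ T, Skew float64 }

// filterJumps drops samples whose skew differs by more than the plausible
// jump from the previously accepted point. The first sample is always kept.
func filterJumps(in []sample) []sample {
	if len(in) == 0 {
		return nil
	}
	out := []sample{in[0]}
	for _, s := range in[1:] {
		if math.Abs(s.Skew-out[len(out)-1].Skew) > maxPlausibleSkewJumpSec {
			continue
		}
		out = append(out, s)
	}
	return out
}

// theilSenDriftPerDay returns the median of all pairwise slopes, scaled to
// seconds per day. ok is false when too few points survive filtering.
func theilSenDriftPerDay(in []sample) (drift float64, ok bool) {
	pts := filterJumps(in)
	if len(pts) > theilSenMaxPoints {
		pts = pts[len(pts)-theilSenMaxPoints:] // keep the most recent points
	}
	if len(pts) < minDriftSamples {
		return 0, false
	}
	slopes := make([]float64, 0, len(pts)*(len(pts)-1)/2)
	for i := range pts {
		for j := i + 1; j < len(pts); j++ {
			if dt := pts[j].T - pts[i].T; dt > 0 {
				slopes = append(slopes, (pts[j].Skew-pts[i].Skew)/dt)
			}
		}
	}
	if len(slopes) == 0 {
		return 0, false
	}
	sort.Float64s(slopes)
	mid := len(slopes) / 2
	med := slopes[mid]
	if len(slopes)%2 == 0 {
		med = (slopes[mid-1] + slopes[mid]) / 2
	}
	return med * 86400, true
}

// demoSeries builds 30 minutes of linear drift at 12 s/day with one 1000s jump.
func demoSeries() []sample {
	const driftPerDay = 12.0
	var out []sample
	for i := 0; i < 30; i++ {
		t := float64(i * 60)
		out = append(out, sample{T: t, Skew: t * driftPerDay / 86400})
	}
	out[15].Skew += 1000
	return out
}

func main() {
	if d, ok := theilSenDriftPerDay(demoSeries()); ok {
		fmt.Printf("drift: %.2f s/day\n", d)
	}
}
```

One caveat of this simple filter is that it trusts the first sample unconditionally, so a series that starts on an anomalous point would reject everything after it; the sketch does not try to handle that case.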
Kpa-clawbot	ba7cd0fba7	fix: clock skew sanity checks — filter epoch-0, cap drift, min samples (#769)	2026-04-16 08:10:47 -07:00

Nodes with dead RTCs show -690d skew and -3 billion s/day drift. Fix:

1. No Clock severity: |skew| > 365d → `no_clock`, skip drift
2. Drift cap: |drift| > 86400 s/day → nil (physically impossible)
3. Min samples: < 5 samples → no drift regression
4. Frontend: 'No Clock' badge, '–' for unreliable drift

Fixes the crazy stats on the Clock Health fleet view.

---------

Co-authored-by: you <you@example.com>
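The three backend sanity checks listed in this commit could be combined along these lines. The thresholds (365d, 86400 s/day, 5 samples) come from the commit message; `noClockSkewSec`, `sanitize`, and the choice to return a nil pointer for "no drift" are assumptions for the sake of the example.

```go
package main

import (
	"fmt"
	"math"
)

const (
	noClockSkewSec           = 365 * 86400.0 // hypothetical name for the 365d cutoff
	maxReasonableDriftPerDay = 86400.0
	minDriftSamples          = 5
)

// sanitize is a hypothetical helper. It reports whether the node should be
// shown as "no_clock" and returns a drift value only when it is trustworthy.
func sanitize(skewSec, rawDriftPerDay float64, samples int) (noClock bool, drift *float64) {
	if math.Abs(skewSec) > noClockSkewSec {
		return true, nil // dead RTC: skip drift entirely
	}
	if samples < minDriftSamples || math.Abs(rawDriftPerDay) > maxReasonableDriftPerDay {
		return false, nil // too little data, or physically impossible drift
	}
	return false, &rawDriftPerDay
}

func main() {
	noClock, drift := sanitize(-690*86400, -3e9, 100)
	fmt.Println("no_clock:", noClock, "drift reported:", drift != nil)

	_, drift = sanitize(-4.2, 1.5, 40)
	if drift != nil {
		fmt.Printf("drift: %.1f s/day\n", *drift)
	}
}
```

A nil drift maps naturally onto the '–' placeholder the frontend shows for unreliable values.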
Kpa-clawbot	a815e70975	feat: Clock skew detection — backend computation (M1) (#746)	2026-04-14 23:22:35 -07:00

## Summary

Implements Milestone 1 of #690: backend clock skew computation for nodes and observers.

## What's New

### Clock Skew Engine (`clock_skew.go`)

Phase 1 (Raw Skew Calculation): For every ADVERT observation: `raw_skew = advert_timestamp - observation_timestamp`

Phase 2 (Observer Calibration): Same packet seen by multiple observers → compute each observer's clock offset as the median deviation from the per-packet median observation timestamp. This identifies observers with their own clock drift.

Phase 3 (Corrected Node Skew): `corrected_skew = raw_skew + observer_offset`, which compensates for observer clock error.

Phase 4 (Trend Analysis): Linear regression over time-ordered skew samples estimates drift rate in seconds/day. Detects crystal drift vs stable offset vs sudden jumps.

### Severity Classification

| Level | Threshold | Meaning |
|-------|-----------|---------|
| ✅ OK | < 5 min | Normal |
| ⚠️ Warning | 5 min – 1 hour | Clock drifting |
| 🔴 Critical | 1 hour – 30 days | Likely no time source |
| 🟣 Absurd | > 30 days | Firmware default or epoch 0 |

### New API Endpoints

- `GET /api/nodes/{pubkey}/clock-skew`: per-node skew data (mean, median, last, drift, severity)
- `GET /api/observers/clock-skew`: observer calibration offsets
- Clock skew also included in `GET /api/nodes/{pubkey}/analytics` response as `clockSkew` field

### Performance

- 30-second compute cache avoids reprocessing on every request
- Operates on in-memory `byPayloadType[ADVERT]` index, so no DB queries
- O(n) in total ADVERT observations, O(m log m) for median calculations

## Tests

15 unit tests covering:

- Severity classification at all thresholds
- Median/mean math helpers
- ISO timestamp parsing
- Timestamp extraction from decoded JSON (nested and top-level)
- Observer calibration with single and multi-observer scenarios
- Observer offset correction direction (verified the sign is `+obsOffset`)
- Drift estimation: stable, linear, insufficient data, short time span
- JSON number extraction edge cases

## What's NOT in This PR

- No UI changes (M2–M4)
- No customizer integration (M5)
- Thresholds are hardcoded constants (will be configurable in M5)

Implements #690 M1.

---------

Co-authored-by: you <you@example.com>
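Phases 1 to 3 and the severity table can be illustrated with a short sketch. The formulas and thresholds follow the commit message, and `classifySkew` is a name that appears in a later commit; the `observation` type, `observerOffsets`, `correctedSkew`, `demoObservations`, the skipping of single-observer packets, and the handling of values that sit exactly on a threshold are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// observation is a hypothetical record of one observer seeing one ADVERT.
type observation struct {
	Packet, Observer string
	ObsTS, AdvertTS  float64 // unix seconds
}

// median works on a copy so callers' slices are left untouched.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// observerOffsets: per observer, the median deviation of its observation
// timestamp from the per-packet median observation timestamp.
func observerOffsets(obs []observation) map[string]float64 {
	byPacket := map[string][]float64{}
	for _, o := range obs {
		byPacket[o.Packet] = append(byPacket[o.Packet], o.ObsTS)
	}
	devs := map[string][]float64{}
	for _, o := range obs {
		ts := byPacket[o.Packet]
		if len(ts) < 2 {
			continue // a single-observer packet carries no calibration signal
		}
		devs[o.Observer] = append(devs[o.Observer], o.ObsTS-median(ts))
	}
	offsets := map[string]float64{}
	for id, d := range devs {
		offsets[id] = median(d)
	}
	return offsets
}

// correctedSkew applies corrected_skew = raw_skew + observer_offset.
func correctedSkew(o observation, offsets map[string]float64) float64 {
	return (o.AdvertTS - o.ObsTS) + offsets[o.Observer]
}

// classifySkew maps |skew| in seconds onto the severity table.
func classifySkew(absSkewSec float64) string {
	switch {
	case absSkewSec < 5*60:
		return "ok"
	case absSkewSec < 3600:
		return "warn"
	case absSkewSec <= 30*86400:
		return "critical"
	default:
		return "absurd"
	}
}

// demoObservations: one packet, three observers, observer C runs 10s fast.
func demoObservations() []observation {
	return []observation{
		{"p1", "A", 1000, 1002},
		{"p1", "B", 1000, 1002},
		{"p1", "C", 1010, 1002},
	}
}

func main() {
	obs := demoObservations()
	offsets := observerOffsets(obs)
	for _, o := range obs {
		c := correctedSkew(o, offsets)
		fmt.Printf("%s raw=%+.0fs corrected=%+.0fs\n", o.Observer, o.AdvertTS-o.ObsTS, c)
	}
}
```

In the demo, observer C's raw skew of -8s is pulled back to +2s once its +10s offset is added, which is the `+obsOffset` sign convention the tests in this commit verify.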

6 Commits