mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-06-03 23:51:22 +00:00
dc6c79cff8
RED `f06887`, GREEN `8f53c1`. CI: (will populate on PR open)

`Fixes #1335`

## Problem

PR #1216 added per-source stall **detection** (`LivenessStalled`) but only **logged** it. Staging's `lincomatic` source has been silently losing ~14k pkts/hr behind a half-open TCP socket that the Azure NAT abandons: paho reports `IsConnected==true`, no messages arrive for 1h+, and a container restart is the only known recovery. Prod (MikroTik networking) doesn't see it.

## Fix

Make the watchdog actually recover.

- **`SourceLivenessState.ForceReconnectFn`**: per-source closure wired in `main.go` next to `IsConnectedFn`; wraps `client.Disconnect(250)` followed by `client.Connect()`.
- **`processLivenessTransition`**: on the `LivenessStalled` edge AND on every heartbeat re-emit while still Stalled, invoke `maybeForceReconnect`. `LivenessNeverReceived` (cold-start ACL deny / wrong hash) is **deliberately not** force-reconnected: a new TCP socket won't fix an ACL deny and would just churn the broker.
- **`maybeForceReconnect`**: throttled at `forceReconnectThrottle = 60s` per source so a stall→reconnect→re-stall loop self-recovers without hammering the broker. The Disconnect+Connect runs in a goroutine so a single slow source can't stall the watchdog tick.
- **`buildMQTTOpts`**: explicit `SetKeepAlive(30 * time.Second)`. paho's default happens to be 30s, but the #1335 RCA called this out, so it is now explicit: it can't drift, and operators reading the code know it's intentional.
- **Telemetry**: `WATCHDOG forcing reconnect` (intent), `WATCHDOG reconnect attempt issued` (post-goroutine), `WATCHDOG suppressing forced reconnect` (throttle window).

## TDD

- **RED** `f06887`: `mqtt_watchdog_force_reconnect_test.go`. Stub field + constant added so the file compiles; assertions fail because `processLivenessTransition` never invokes `ForceReconnectFn`. Reverting just the `s.ForceReconnectFn()` call line from GREEN re-fails the same assertion (mutation verified).
- **GREEN** `8f53c1`: wiring + throttle + keepalive.

## Scope discipline

Additive only. No regression to currently-flowing sources: `LivenessOK`, `LivenessRecovered`, `LivenessDisconnected`, `LivenessHeartbeat`, and `LivenessNeverReceived` transitions are unchanged. Throttle bound: ≤1 reconnect/min/source, i.e. ≤60/hr per source worst-case, well within any broker rate limit.

Preflight: clean (all gates pass).

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>