mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-04-03 20:59:48 +00:00
## Summary

Fixes #450: staging deployment was flaky because the container did not shut down cleanly.

## Root Causes

1. **Server never closed DB on shutdown**: the SQLite WAL lock was held indefinitely, blocking new container startup
2. **`httpServer.Close()` instead of `Shutdown()`**: abruptly kills connections instead of draining them
3. **No `stop_grace_period` in compose configs**: Docker sends SIGTERM, then SIGKILL once the default 10s grace period expires, which is often not enough for a WAL checkpoint
4. **Supervisor didn't forward SIGTERM**: missing `stopsignal`/`stopwaitsecs` meant Go processes got SIGKILL instead of a graceful shutdown
5. **Deploy scripts used default `docker stop` timeout**: only 10s grace period

## Changes

### Go Server (`cmd/server/`)

- **Graceful HTTP shutdown**: `httpServer.Shutdown(ctx)` with a 15s context timeout; drains in-flight requests before closing
- **WebSocket cleanup**: new `Hub.Close()` method sends `CloseGoingAway` frames to all connected clients
- **DB close on shutdown**: explicitly closes the DB after the HTTP server stops (it was never closed before)
- **WAL checkpoint**: `PRAGMA wal_checkpoint(TRUNCATE)` before DB close; flushes the WAL into the main DB file and truncates it, so a clean close leaves no stale WAL/SHM files behind

### Go Ingestor (`cmd/ingestor/`)

- **WAL checkpoint on shutdown**: new `Store.Checkpoint()` method, called before `Close()`
- **Longer MQTT disconnect timeout**: 5s (was 1s) to allow in-flight messages to drain

### Docker Compose (all 4 variants)

- Added `stop_grace_period: 30s` and `stop_signal: SIGTERM`

### Supervisor Configs (both variants)

- Added `stopsignal=TERM` and `stopwaitsecs=20` to server and ingestor programs

### Deploy Scripts

- `deploy-staging.sh`: `docker stop -t 30` with explicit grace period
- `deploy-live.sh`: `docker stop -t 30` with explicit grace period

## Shutdown Sequence (after fix)

1. Docker sends SIGTERM to supervisord (PID 1)
2. Supervisord forwards SIGTERM to server + ingestor (waits up to 20s each)
3. Server: stops poller → drains HTTP (15s) → closes WS clients → checkpoints WAL → closes DB
4. Ingestor: stops tickers → disconnects MQTT (5s) → checkpoints WAL → closes DB
5. Docker waits up to 30s total before SIGKILL

## Tests

All existing tests pass:

- `cd cmd/server && go test ./...` ✅
- `cd cmd/ingestor && go test ./...` ✅

---------

Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
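The server-side ordering in the shutdown sequence can be sketched roughly as follows. This is a minimal illustration, not the code from `cmd/server/`: the `cleanup` struct, `newServer`, `drainTimeout`, and `gracefulShutdown` are hypothetical names, and the three hooks stand in for `Hub.Close()`, the WAL checkpoint, and the DB close.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// drainTimeout matches the 15s HTTP drain window from the description.
const drainTimeout = 15 * time.Second

// cleanup groups hypothetical hooks. In the real server they would wrap
// Hub.Close(), the WAL checkpoint, and the DB close.
type cleanup struct {
	CloseWebSockets func()
	CheckpointWAL   func() error
	CloseDB         func() error
}

// newServer builds a placeholder HTTP server for this sketch.
func newServer(addr string) *http.Server {
	return &http.Server{Addr: addr, Handler: http.NewServeMux()}
}

// gracefulShutdown drains HTTP, then runs the cleanup hooks in order.
// The DB close is always attempted, even if an earlier step fails.
func gracefulShutdown(srv *http.Server, drain time.Duration, c cleanup) error {
	ctx, cancel := context.WithTimeout(context.Background(), drain)
	defer cancel()

	// Shutdown stops accepting connections and waits for in-flight requests
	// until ctx expires. It does not wait for hijacked connections such as
	// WebSockets, which is why they get their own step.
	httpErr := srv.Shutdown(ctx)

	c.CloseWebSockets()
	checkpointErr := c.CheckpointWAL()
	closeErr := c.CloseDB()

	// errors.Join (Go 1.20+) returns nil when every argument is nil.
	return errors.Join(httpErr, checkpointErr, closeErr)
}
```

In a real `main`, a call like this would typically run once a context obtained from `signal.NotifyContext` with `syscall.SIGTERM` is cancelled. Joining the errors instead of returning early keeps the DB close from being skipped when the drain times out.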
41 lines
1.3 KiB
YAML
# All container config lives here. manage.sh is just a wrapper around docker compose.
# Override defaults via .env or environment variables.
# CRITICAL: All data mounts use bind mounts (~/path), NOT named volumes.
# This ensures the DB and theme are visible on the host filesystem for backup.

services:
  prod:
    build:
      context: .
      args:
        APP_VERSION: ${APP_VERSION:-unknown}
        GIT_COMMIT: ${GIT_COMMIT:-unknown}
        BUILD_TIME: ${BUILD_TIME:-unknown}
    image: corescope:latest
    container_name: corescope-prod
    restart: unless-stopped
    stop_grace_period: 30s
    stop_signal: SIGTERM
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "${PROD_HTTP_PORT:-80}:80"
      - "${PROD_HTTPS_PORT:-443}:443"
      - "${PROD_MQTT_PORT:-1883}:1883"
    volumes:
      - ./caddy-config/Caddyfile:/etc/caddy/Caddyfile:ro
      - ${PROD_DATA_DIR:-~/meshcore-data}:/app/data
      - caddy-data:/data/caddy
    environment:
      - NODE_ENV=production
      - DISABLE_MOSQUITTO=${DISABLE_MOSQUITTO:-false}
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/api/stats"]
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  # Named volumes for Caddy TLS certificates (not user data; managed by Caddy internally)
  caddy-data:
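
The `stop_grace_period: 30s` setting in the `prod` service only helps if every inner timeout fits inside the one that encloses it. The sketch below checks that nesting using the values from the change description. The `layer` type and the function names are hypothetical, and it assumes supervisord stops the server and ingestor concurrently, so a single `stopwaitsecs` window applies.

```go
package main

import (
	"fmt"
	"time"
)

// Timeouts taken from the change description.
const (
	httpDrain         = 15 * time.Second // httpServer.Shutdown context timeout
	stopWaitSecs      = 20 * time.Second // supervisord stopwaitsecs
	stopGracePeriod   = 30 * time.Second // compose stop_grace_period
	dockerDefaultStop = 10 * time.Second // Docker's default stop timeout
)

// layer is one level of the shutdown chain (hypothetical helper type).
type layer struct {
	name    string
	timeout time.Duration
}

// documentedLayers lists the server's chain from innermost to outermost.
// Assumption: both supervised programs are stopped concurrently.
func documentedLayers() []layer {
	return []layer{
		{"http drain", httpDrain},
		{"supervisord stopwaitsecs", stopWaitSecs},
		{"docker stop_grace_period", stopGracePeriod},
	}
}

// checkNesting returns an error if any layer does not finish strictly
// before the layer that wraps it would force-kill it.
func checkNesting(layers []layer) error {
	for i := 1; i < len(layers); i++ {
		inner, outer := layers[i-1], layers[i]
		if inner.timeout >= outer.timeout {
			return fmt.Errorf("%s (%v) does not fit inside %s (%v)",
				inner.name, inner.timeout, outer.name, outer.timeout)
		}
	}
	return nil
}
```

With the documented values the chain passes (15s < 20s < 30s). Swapping the outer layer for `dockerDefaultStop` makes the check fail, which matches the original problem: a 15s drain cannot finish inside a 10s grace period.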