Files
meshcore-analyzer/docker-compose.yml
Kpa-clawbot f87eb3601c fix: graceful container shutdown for reliable deployments (#453)
## Summary

Fixes #450 — staging deployment flaky due to container not shutting down
cleanly.

## Root Causes

1. **Server never closed DB on shutdown** — SQLite WAL lock held
indefinitely, blocking new container startup
2. **`httpServer.Close()` instead of `Shutdown()`** — abruptly kills
connections instead of draining them
3. **No `stop_grace_period` in compose configs** — Docker sends SIGTERM
then immediately SIGKILL (default 10s is often not enough for WAL
checkpoint)
4. **Supervisor didn't forward SIGTERM** — missing
`stopsignal`/`stopwaitsecs` meant Go processes got SIGKILL instead of
graceful shutdown
5. **Deploy scripts used default `docker stop` timeout** — only 10s
grace period

## Changes

### Go Server (`cmd/server/`)
- **Graceful HTTP shutdown**: `httpServer.Shutdown(ctx)` with 15s
context timeout — drains in-flight requests before closing
- **WebSocket cleanup**: New `Hub.Close()` method sends `CloseGoingAway`
frames to all connected clients
- **DB close on shutdown**: Explicitly closes DB after HTTP server stops
(was never closed before)
- **WAL checkpoint**: `PRAGMA wal_checkpoint(TRUNCATE)` before DB close
— flushes WAL to main DB file and removes WAL/SHM lock files

### Go Ingestor (`cmd/ingestor/`)
- **WAL checkpoint on shutdown**: New `Store.Checkpoint()` method,
called before `Close()`
- **Longer MQTT disconnect timeout**: 5s (was 1s) to allow in-flight
messages to drain

### Docker Compose (all 4 variants)
- Added `stop_grace_period: 30s` and `stop_signal: SIGTERM`

### Supervisor Configs (both variants)
- Added `stopsignal=TERM` and `stopwaitsecs=20` to server and ingestor
programs

### Deploy Scripts
- `deploy-staging.sh`: `docker stop -t 30` with explicit grace period
- `deploy-live.sh`: `docker stop -t 30` with explicit grace period

## Shutdown Sequence (after fix)

1. Docker sends SIGTERM to supervisord (PID 1)
2. Supervisord forwards SIGTERM to server + ingestor (waits up to 20s
each)
3. Server: stops poller → drains HTTP (15s) → closes WS clients →
checkpoints WAL → closes DB
4. Ingestor: stops tickers → disconnects MQTT (5s) → checkpoints WAL →
closes DB
5. Docker waits up to 30s total before SIGKILL

## Tests

All existing tests pass:
- `cd cmd/server && go test ./...` 
- `cd cmd/ingestor && go test ./...` 

---------

Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
2026-04-01 12:19:20 -07:00

41 lines
1.3 KiB
YAML

# All container config lives here. manage.sh is just a wrapper around docker compose.
# Override defaults via .env or environment variables.
# CRITICAL: All data mounts use bind mounts (~/path), NOT named volumes.
# This ensures the DB and theme are visible on the host filesystem for backup.
services:
prod:
build:
context: .
args:
APP_VERSION: ${APP_VERSION:-unknown}
GIT_COMMIT: ${GIT_COMMIT:-unknown}
BUILD_TIME: ${BUILD_TIME:-unknown}
image: corescope:latest
container_name: corescope-prod
restart: unless-stopped
stop_grace_period: 30s
stop_signal: SIGTERM
extra_hosts:
- "host.docker.internal:host-gateway"
ports:
- "${PROD_HTTP_PORT:-80}:80"
- "${PROD_HTTPS_PORT:-443}:443"
- "${PROD_MQTT_PORT:-1883}:1883"
volumes:
- ./caddy-config/Caddyfile:/etc/caddy/Caddyfile:ro
- ${PROD_DATA_DIR:-~/meshcore-data}:/app/data
- caddy-data:/data/caddy
environment:
- NODE_ENV=production
- DISABLE_MOSQUITTO=${DISABLE_MOSQUITTO:-false}
healthcheck:
test: ["CMD", "wget", "-qO-", "http://localhost:3000/api/stats"]
interval: 30s
timeout: 5s
retries: 3
volumes:
# Named volumes for Caddy TLS certificates (not user data — managed by Caddy internally)
caddy-data: