The deploy step used only 'docker compose down' which can't remove
containers created via 'docker run'. Now explicitly stops+removes
the named container first, then runs compose down as cleanup.
Permanent fix for the recurring CI deploy failure.
Removes linux/arm64 from multi-platform build and drops QEMU setup.
All infra (prod + staging) is x86. QEMU emulation was adding ~12min
to every CI run for an unused architecture.
## Consolidate CI Pipeline — Build + Publish to GHCR + Deploy Staging
### What
Merges the separate `publish.yml` workflow into `deploy.yml`, creating a
single CI/CD pipeline:
**`go-test → e2e-test → build-and-publish → deploy → publish-badges`**
### Why
- Two workflows doing overlapping builds was wasteful and error-prone
- `publish.yml` had a bug: `BUILD_TIME=$(date ...)` in a `with:` block
never executed (literal string)
- The old build job had duplicate/conflicting `APP_VERSION` assignments
### Changes
- **`build-and-publish` job** replaces old `build` job — builds locally
for staging, then does multi-arch GHCR push (gated to push events only,
PRs skip)
- **Build metadata** computed in a dedicated step, passed via
`GITHUB_OUTPUT` — no more shell expansion bugs
- **`APP_VERSION`** is `v1.2.3` on tag push, `edge` on master push
- **Deploy** now pulls the `edge` image from GHCR and tags for compose
compatibility, with fallback to local build
- **`publish.yml` deleted** — no duplicate workflow
- **Top-level `permissions`** block with `packages:write` for GHCR auth
- **Triggers** now include `tags: ['v*']` for release publishing
### Status
- ✅ Rebased onto master
- ✅ Self-reviewed (all checklist items pass)
- ✅ Ready for merge
Co-authored-by: you <you@example.com>
## Summary
Fixes#472
The Docker build job on the self-hosted runner fails with `no space left
on device` because Docker build cache and Go module downloads accumulate
between runs. The existing cleanup (line ~330) runs in the **deploy**
step *after* the build — too late to help.
## Changes
- Added a "Free disk space" step at the start of the build job,
**before** "Build Go Docker image":
- `docker system prune -af` — removes all unused images, containers,
networks
- `docker builder prune -af` — clears the build cache
- `df -h /` — logs available disk space for visibility
- Kept the existing post-deploy cleanup as belt-and-suspenders
---------
Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
## Problem
The self-hosted runner (`meshcore-runner-2`) filled its 29GB disk to
100%, blocking all CI runs:
```
Filesystem Size Used Avail Use%
/dev/root 29G 29G 2.3M 100%
Docker Images: 67 total, 2 active, 18.83GB reclaimable (99%)
```
Root cause: no Docker image cleanup after builds. Each CI run builds a
new image but never prunes old ones.
## Fix
### 1. Docker image cleanup after deploy (`deploy` job)
- Runs with `if: always()` so it executes even if deploy fails
- `docker image prune -af --filter "until=24h"` — removes images older
than 24h (safe: current build is minutes old)
- `docker builder prune -f --keep-storage=1GB` — caps build cache
- Logs before/after `docker system df` for visibility
### 2. Runner log cleanup at start of E2E job
- Prunes runner diagnostic logs older than 3 days (was 53MB and growing)
- Reports `df -h` for disk visibility in CI output
## Impact
After manual cleanup today, disk went from 100% → 35% (19GB free). This
PR prevents recurrence.
## Test plan
- [x] Manual cleanup verified on runner via `az vm run-command`
- [ ] Next CI run should show cleanup step output in deploy job logs
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.
- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
when bind-mounting a non-existent file. The entrypoint now detects
and removes directory config.json before falling back to example.
2. 'unable to open database file: out of memory (14)' — old container
(3GB) not fully exited when new one starts. Deploy now uses
'docker compose down' with timeout and waits for memory reclaim.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
startretries to 10 and startsecs to 2 for server and ingestor.
Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
## Summary
Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.
### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build
### Pipeline (master merge):
Same chain + deploy to staging + badge publishing
### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks
### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db
-public public-instrumented`
---------
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.
Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory
The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
* docs: remove letsmesh.net reference from README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: remove paths-ignore from pull_request trigger
PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.
* ci: skip heavy CI jobs for docs-only PRs
Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.
This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.
Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.
* fix: update engine tests to match engine-badge HTML format
Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to the CI frontend test pipeline:
1. Run E2E tests and coverage collection concurrently
- Previously sequential (E2E ~1.5min, then coverage ~5.75min)
- Now both run in parallel against the same instrumented server
- Expected savings: ~5 min (coverage runs alongside E2E instead of after)
2. Replace networkidle with domcontentloaded in coverage collector
- SPA uses hash routing — networkidle waits 500ms for network silence
on every navigation, adding ~10-15s of dead time across 23 navigations
- domcontentloaded fires immediately once HTML is parsed; JS initializes
the route handler synchronously
- For in-page hash changes, use 200ms setTimeout instead of
waitForLoadState (which would never re-fire for same-document nav)
3. Extract coverage from E2E tests too
- E2E tests already exercise the app against the instrumented server
- Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
- nyc merges both coverage files for higher total coverage
Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace sleep 5 with health-check poll in quick E2E path
The Windows self-hosted runner picks up jobs and fails because bash
scripts run in PowerShell. Node.js tests need Chromium/Playwright
(Linux-only), and build/deploy/publish use Docker (Linux-only).
Changes:
- node-test: runs-on: [self-hosted, Linux]
- build: runs-on: [self-hosted, Linux]
- deploy: runs-on: [self-hosted, Linux]
- publish: runs-on: [self-hosted, Linux]
- go-test: unchanged (ubuntu-latest)
When go-test or node-test fails, the workflow run is now cancelled
via the GitHub API so the sibling job doesn't sit queued/running.
Also fixed build job to need both go-test AND node-test (was only
waiting on go-test despite the pipeline comment saying both gate it).
The flat 'deploy' concurrency group caused ALL PRs to share one queue,
so pushing to any PR would cancel CI runs on other PRs.
Changed to deploy-${{ github.event.pull_request.number || github.ref }}
so each PR gets its own concurrency group while re-pushes to the same
PR still cancel the previous run.
- Add pull_request trigger for PRs against master
- Add 'if: github.event_name == push' to build/deploy/publish jobs
- Test jobs (go-test, node-test) now run on both push and PRs
- Build/deploy/publish only run on push to master
This fixes the chicken-and-egg problem where branch protection requires
CI checks but CI doesn't run on PRs. Now PRs get test validation before
merge while keeping production deployments only on master pushes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>