Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
when bind-mounting a non-existent file. The entrypoint now detects a
directory at config.json and removes it before falling back to the example.
2. 'unable to open database file: out of memory (14)' — old container
(3GB) had not fully exited when the new one started. Deploy now uses
'docker compose down' with a timeout and waits for memory to be reclaimed.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
startretries to 10 and startsecs to 2 for server and ingestor.
Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
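A minimal sketch of the config.json recovery described in root cause 1, written in Python for illustration only (the real entrypoint is presumably a shell script). The function name and the example file path are hypothetical, and the sketch assumes a plain filesystem path rather than a live mount point.

```python
import shutil
from pathlib import Path


def ensure_config_file(config: Path, example: Path) -> str:
    """Make sure `config` is a regular file and report what was done."""
    if config.is_dir():
        # Docker created a directory because the bind-mount source was missing.
        shutil.rmtree(config)
    if config.is_file():
        # A real config file is already in place; leave it untouched.
        return "kept"
    # Nothing usable at the path: fall back to the example config.
    shutil.copyfile(example, config)
    return "copied-example"
```

Calling it a second time returns "kept", so the step is safe to repeat on every container start.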
## Summary
Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.
### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build
### Pipeline (master merge):
Same chain + deploy to staging + badge publishing
### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks
### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db
-public public-instrumented`
---------
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.
Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory
The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
* docs: remove letsmesh.net reference from README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: remove paths-ignore from pull_request trigger
PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.
* ci: skip heavy CI jobs for docs-only PRs
Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.
This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.
Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.
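The docs-only check can be sketched as below. This is an illustration under assumptions: the commit does not say which extensions count as docs, so `DOC_SUFFIXES` contains only `.md` (the case PR #233 hit), and `is_docs_only` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Assumption: only Markdown files count as documentation.
DOC_SUFFIXES = {".md"}


def is_docs_only(changed_files):
    """Return True when every changed path is a documentation file."""
    if not changed_files:
        # An empty diff is not treated as docs-only, so heavy steps still run.
        return False
    return all(
        PurePosixPath(path).suffix.lower() in DOC_SUFFIXES
        for path in changed_files
    )
```

Each job would evaluate this once at the start and skip its heavy steps when it returns True, while still finishing successfully so required status checks are reported.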
* fix: update engine tests to match engine-badge HTML format
Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.
---------
Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
Three optimizations to the CI frontend test pipeline:
1. Run E2E tests and coverage collection concurrently
- Previously sequential (E2E ~1.5min, then coverage ~5.75min)
- Now both run in parallel against the same instrumented server
- Expected savings: ~5 min (coverage runs alongside E2E instead of after)
2. Replace networkidle with domcontentloaded in coverage collector
- SPA uses hash routing — networkidle waits 500ms for network silence
on every navigation, adding ~10-15s of dead time across 23 navigations
- domcontentloaded fires immediately once HTML is parsed; JS initializes
the route handler synchronously
- For in-page hash changes, use 200ms setTimeout instead of
waitForLoadState (which would never re-fire for same-document nav)
3. Extract coverage from E2E tests too
- E2E tests already exercise the app against the instrumented server
- Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
- nyc merges both coverage files for higher total coverage
Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace sleep 5 with health-check poll in quick E2E path
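A hedged sketch of the health-check poll that replaces the fixed sleep, in Python for illustration (the workflow itself most likely uses a shell loop). The function names, the default timeout and interval, and the `/api/health` URL in the usage note are assumptions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not healthy yet.
        return False


def wait_until_healthy(probe, timeout=30.0, interval=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Usage would look like `wait_until_healthy(lambda: http_ok("http://localhost:13581/api/health"))`. Unlike a fixed sleep, the poll returns as soon as the server answers and fails explicitly if it never does.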
The Windows self-hosted runner picks up jobs and fails because bash
scripts run in PowerShell. Node.js tests need Chromium/Playwright
(Linux-only), and build/deploy/publish use Docker (Linux-only).
Changes:
- node-test: runs-on: [self-hosted, Linux]
- build: runs-on: [self-hosted, Linux]
- deploy: runs-on: [self-hosted, Linux]
- publish: runs-on: [self-hosted, Linux]
- go-test: unchanged (ubuntu-latest)
When go-test or node-test fails, the workflow run is now cancelled
via the GitHub API so the sibling job doesn't sit queued/running.
Also fixed build job to need both go-test AND node-test (was only
waiting on go-test despite the pipeline comment saying both gate it).
The flat 'deploy' concurrency group caused ALL PRs to share one queue,
so pushing to any PR would cancel CI runs on other PRs.
Changed to deploy-${{ github.event.pull_request.number || github.ref }}
so each PR gets its own concurrency group while re-pushes to the same
PR still cancel the previous run.
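To illustrate how the new group key resolves, here is a small Python model of the expression. It is a sketch, not the workflow code: `concurrency_group` is a hypothetical helper that mirrors the `||` fallback, where a missing PR number falls through to the ref.

```python
def concurrency_group(pr_number, ref):
    """Model deploy-${{ github.event.pull_request.number || github.ref }}."""
    # `a || b` yields b when a is falsy, which is the case on push events
    # because there is no pull_request payload.
    return f"deploy-{pr_number or ref}"
```

Two different PRs therefore land in different groups, while two pushes to the same PR share one group and the newer run replaces the older one.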
- Add pull_request trigger for PRs against master
- Add "if: github.event_name == 'push'" to build/deploy/publish jobs
- Test jobs (go-test, node-test) now run on both push and PRs
- Build/deploy/publish only run on push to master
This fixes the chicken-and-egg problem where branch protection requires
CI checks but CI doesn't run on PRs. Now PRs get test validation before
merge while keeping production deployments only on master pushes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem 1 (Go staging timeout): Increased the healthcheck timeout from 60s to 120s to allow 50K+ packets to load into memory.
Problem 2 (Node staging timeout): Added forced cleanup of stale containers, volumes, and ports before starting staging containers to prevent conflicts.
Problem 3 (Proto validation WS timeout): Made WebSocket message capture non-blocking using timeout command. If no live packets are available, it now skips with a warning instead of failing the entire proto validation pipeline.
Problem 4 (Playwright E2E failures): Added forced cleanup of stale server on port 13581 before starting test server, plus better diagnostics on failure.
All health checks now include better logging (tail 50 instead of 30 lines) for debugging.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added 'set -e -o pipefail' to both Go test steps. Without pipefail, the exit code from 'go test' was being lost when piped to tee, causing test failures to appear as successes.
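The effect can be reproduced with a short Python script that shells out to bash. This is an illustrative sketch: `false` stands in for a failing `go test`, `pipeline_exit_code` is a hypothetical helper, and it assumes `bash` is on the PATH.

```python
import subprocess


def pipeline_exit_code(script):
    """Run `script` in bash and return the shell's exit status."""
    result = subprocess.run(["bash", "-c", script], stdout=subprocess.DEVNULL)
    return result.returncode


# Without pipefail the pipeline reports tee's status, hiding the failure.
masked = pipeline_exit_code("false | tee /dev/null")

# With pipefail the failing left-hand command decides the status.
surfaced = pipeline_exit_code("set -e -o pipefail; false | tee /dev/null")

print(masked, surfaced)  # prints "0 1"
```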
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add per-payload-type packet detail fixtures captured from production:
- packet-type-advert.json (payload_type=4, ADVERT)
- packet-type-grptxt-decrypted.json (payload_type=5, decrypted GRP_TXT)
- packet-type-grptxt-undecrypted.json (payload_type=5, decryption_failed GRP_TXT)
- packet-type-txtmsg.json (payload_type=1, TXT_MSG)
- packet-type-req.json (payload_type=0, REQ)
Update validate-protos.py to validate all 5 new fixtures against
PacketDetailResponse proto message.
Update CI deploy workflow to automatically capture per-type fixtures
on each deploy, including both decrypted and undecrypted GRP_TXT.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The proto validation infrastructure was added in commit e70ba44 but used
an invalid --syntax_check flag. Changed to use --descriptor_set_out=/dev/null
which validates syntax without generating files.
Proto validation flow (now complete):
1. go-test job: verify .proto files compile (syntax check) ✅
2. deploy-node job: validate protos match prod API responses ✅
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The build-node job was failing with 'node: not found' because it
runs scripts/validate.sh (which uses 'node -c' for syntax checking)
but didn't have the actions/setup-node@v4 step.
Added Node.js 22 setup before the validate step to match the pattern
used in other jobs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously only captured 19 simple endpoints. Now captures all 33:
- 19 simple endpoints (stats, health, nodes, etc.)
- 14 dynamic ID endpoints (node-detail, packet-detail, etc.)
Dynamic ID resolution:
- Extracts real pubkey from /api/nodes for node detail endpoints
- Extracts real hash from /api/packets for packet-detail
- Extracts real observer ID from /api/observers for observer endpoints
- Gracefully skips fixtures if DB is empty (no data yet)
WebSocket capture:
- Uses node -e with ws module to capture one live WS message
- Falls back gracefully if no live packets available
The validator already handles missing fixtures without failing, so this
will work even when staging container has no data yet.
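A rough sketch of the dynamic ID resolution with graceful skipping, in Python. The response shape (`{"nodes": [{"public_key": ...}]}`), the field name, and the detail URL pattern are all assumptions made for illustration; the real capture script and API may differ.

```python
def first_value(payload, list_key, field):
    """Return the first non-empty `field` found in payload[list_key], else None."""
    for item in payload.get(list_key) or []:
        value = item.get(field)
        if value:
            return value
    return None


def plan_node_captures(nodes_payload):
    """List the node-detail URLs to capture, or nothing if the DB is empty."""
    pubkey = first_value(nodes_payload, "nodes", "public_key")
    if pubkey is None:
        # No data yet: skip these fixtures instead of failing the step.
        return []
    return [f"/api/nodes/{pubkey}"]
```

Returning an empty list rather than raising matches the behavior described above: missing fixtures are simply not produced, and the validator tolerates their absence.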
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added a CI step that:
- Refreshes Node fixtures from the staging container after deployment
- Runs tools/validate-protos.py to validate proto definitions match actual API responses
- Fails the pipeline if proto drift is detected
This ensures nobody can merge a Node change that breaks the Go proto contract
without updating the .proto definitions.
The step runs after the Node staging healthcheck, capturing fresh responses
from 19 API endpoints (stats, health, nodes, analytics/*, config/*, etc.).
Endpoints requiring parameters (node-detail, packet-detail) use existing
fixtures and aren't auto-refreshed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- build-node depends only on node-test
- build-go depends only on go-test
- deploy-node depends only on build-node
- deploy-go depends only on build-go
- publish job waits for both deploy-node and deploy-go to complete
- Badges and deployment summary moved to final publish step
Result: Go staging no longer waits for Node tests to complete.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Go staging now deploys immediately after build completes, in parallel
with Node staging. Both test suites still gate the build job.
Before:
  go-test + node-test → build → deploy-node → deploy-go
After:
  go-test + node-test → build → deploy-node (parallel)
                              → deploy-go   (parallel)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 env var for Node.js 20 deprecation
- Add cache-dependency-path for go.sum files in cmd/server and cmd/ingestor
- Add if-no-files-found: ignore to go-badges upload-artifact step
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split the monolithic 3-job pipeline (go-build, test, deploy) into 5
focused jobs that each do ONE thing:
- go-test: Go Build & Test (coverage badges, runs on ubuntu-latest)
- node-test: Node.js Tests (backend + Playwright E2E, coverage)
- build: Build Docker Images (Node + Go, badge publishing)
- deploy-node: Deploy Node Staging (port 81, healthcheck, smoke test)
- deploy-go: Deploy Go Staging (port 82, healthcheck, smoke test)
Dependency chain: go-test + node-test (parallel) -> build -> deploy-node -> deploy-go
Every step now has a human-readable name describing exactly what it does.
Job names include emoji for visual scanning on GitHub Actions.
All existing functionality preserved - just reorganized for clarity.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Go server and ingestor tests now run with -coverprofile
- Coverage percentages parsed and printed in CI output
- Badge JSON files generated (.badges/go-server-coverage.json,
.badges/go-ingestor-coverage.json) matching existing format
- Badges uploaded as artifacts from go-build job, downloaded
in test job, and published alongside existing Node.js badges
- Coverage summary table added to GitHub Step Summary
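The parse-and-badge step might look like the following Python sketch. It assumes the percentage is read from the `total:` line of `go tool cover -func` output and that the badge files follow the shields.io endpoint schema; the commit only says "matching existing format", so the schema, the color thresholds, and the function names are assumptions.

```python
import json
import re


def parse_total_coverage(cover_func_output):
    """Extract the total percentage from `go tool cover -func` output."""
    match = re.search(
        r"^total:\s+\(statements\)\s+([\d.]+)%",
        cover_func_output,
        re.MULTILINE,
    )
    return float(match.group(1)) if match else None


def make_badge(label, percent):
    """Build a badge dict; the color thresholds are illustrative assumptions."""
    if percent >= 80:
        color = "brightgreen"
    elif percent >= 60:
        color = "yellow"
    else:
        color = "red"
    return {
        "schemaVersion": 1,
        "label": label,
        "message": f"{percent:.1f}%",
        "color": color,
    }


def badge_json(label, percent):
    """Serialize the badge for writing to a .badges/*.json file."""
    return json.dumps(make_badge(label, percent))
```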
Fixes #141
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Node.js: reads version from package.json, commit from .git-commit file
or git rev-parse --short HEAD at runtime, with unknown fallback.
Go: uses -ldflags build-time variables (Version, Commit) with fallback
to .git-commit file and git command at runtime.
Dockerfile: copies .git-commit if present (CI bakes it before build).
Dockerfile.go: passes APP_VERSION and GIT_COMMIT as build args to ldflags.
deploy.yml: writes GITHUB_SHA to .git-commit before docker build steps.
docker-compose.yml: passes build args to Go staging build.
Tests updated to verify version and commit fields in both endpoints.
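The runtime fallback order for the commit hash (.git-commit file, then git, then "unknown") can be sketched in Python as follows. This is an illustration of the described order, not the Node.js or Go code itself; `resolve_commit` is a hypothetical name.

```python
import subprocess
from pathlib import Path


def resolve_commit(repo_dir: Path) -> str:
    """Resolve the commit hash: .git-commit file, then git, then 'unknown'."""
    marker = repo_dir / ".git-commit"
    if marker.is_file():
        value = marker.read_text().strip()
        if value:
            # CI baked the commit into the image before the build.
            return value
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        value = result.stdout.strip()
        if value:
            return value
    except (OSError, subprocess.CalledProcessError):
        # git is missing or this is not a repository (typical in a container).
        pass
    return "unknown"
```

The file is checked first because a container image normally ships without a .git directory, so the git command is only useful in a local checkout.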
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build and deploy the Go staging container (port 82) after Node staging
is healthy. Uses continue-on-error so Go staging failures don't block
the Node.js deploy. Health-checks the Go container for up to 60s and
verifies /api/stats returns the engine field.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>