Commit Graph

107 Commits

Author SHA1 Message Date
Kpa-clawbot 78dabd5bda feat(filter): timestamp predicates (after/before/between/age) — #289 (#1070)
Fixes #289.

Adds Wireshark-style timestamp predicates to the client-side packet filter engine (`public/packet-filter.js`).

## New syntax

| Form | Meaning |
| --- | --- |
| `time after "2024-01-01"` | packets with timestamp strictly after the given datetime |
| `time before "2024-12-31T23:59:59Z"` | packets strictly before |
| `time between "2024-01-01" "2024-02-01"` | inclusive range (order-insensitive) |
| `age < 1h` | packets newer than 1 hour |
| `age > 24h` | packets older than 24 hours |
| `age < 7d && type == ADVERT` | composes with existing predicates |

Duration units: `s` / `m` / `h` / `d` / `w`. Datetime values are parsed with `Date.parse` (ISO 8601, plus bare `YYYY-MM-DD`). `timestamp` is accepted as an alias for `time`.
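
The lex-time duration conversion can be sketched as follows; `UNIT_SECONDS` and `lexDuration` are illustrative names, not the engine's actual internals.

```javascript
// Hypothetical sketch of the lex-time duration conversion; names are
// illustrative, not the actual packet-filter.js internals.
const UNIT_SECONDS = { s: 1, m: 60, h: 3600, d: 86400, w: 604800 };

function lexDuration(text) {
  const m = /^(\d+)([smhdw])$/.exec(text);
  if (!m) throw new Error('invalid duration: ' + text); // parse-time rejection
  // Convert once at lex time so each evaluation is a plain numeric compare.
  return { type: 'DURATION', seconds: Number(m[1]) * UNIT_SECONDS[m[2]] };
}
```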

## Implementation

- `OP_WORDS` extended with `after`, `before`, `between`.
- New `TK.DURATION` token: lexer recognises `<number><unit>` and pre-converts to seconds at lex time (no per-evaluation parsing cost).
- `between` is a two-value op handled in `parseComparison`.
- Field resolver:
  - `time` / `timestamp` → epoch-ms; falls back to `first_seen` then `latest` so grouped rows from `/api/packets?groupByHash=true` work.
  - `age` → seconds since `Date.now()`.
- Parse-time validation rejects invalid datetimes and unknown duration units (silent-fail would have been a footgun — every packet would just disappear).
- Null/missing timestamps → predicate returns `false`, consistent with the existing null-field behaviour for `snr` / `rssi`.
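
The resolver fallback and null handling described above can be sketched like this, assuming rows carry ISO-8601 timestamp strings; the function names are hypothetical, not the actual module internals.

```javascript
// Hypothetical sketch of the field resolver and null handling; assumes rows
// carry ISO-8601 timestamp strings.
function resolveTime(row) {
  const v = row.time ?? row.first_seen ?? row.latest; // grouped-row fallback chain
  return v == null ? null : Date.parse(v);
}

function evalAfter(row, cutoffMs) {
  const t = resolveTime(row);
  if (t == null || Number.isNaN(t)) return false; // missing timestamp never matches
  return t > cutoffMs; // strictly after
}
```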

## Open questions from the issue

- **UTC vs local**: defaults to whatever `Date.parse` returns. Bare dates like `"2024-01-01"` are interpreted as UTC midnight by the spec. Tying this to the #286 timestamp display setting can be a follow-up.
- **URL query string**: out of scope for this PR.

## Tests

- New `test-packet-filter-time.js`: 20 tests covering `after`/`before`/`between`, ISO datetimes, all duration units, composition with `&&`, null-timestamp safety, invalid-datetime / invalid-unit errors, and the `first_seen` fallback.
- Wired into the `.github/workflows/deploy.yml` JS unit-test step.
- Existing `test-packet-filter.js` (69 tests) and inline self-tests still pass.

## Commits

- Red: `5ccfad3` — failing tests + lexer-only stub (compiles, asserts fail)
- Green: `976d50f` — implementation

---------

Co-authored-by: OpenClaw Bot <bot@openclaw.local>
2026-05-05 01:13:48 -07:00
Kpa-clawbot 3aaa21bbc0 fix(channel-decrypt): pure-JS SHA-256/HMAC fallback for HTTP context (P0 follow-up to #1021) (#1027)
## P0: PSK channel decryption silently failed on HTTP origins

User reported PSK key `372a9c93260507adcbf36a84bec0f33d` "still doesn't
work" after PRs #1021 (AES-ECB pure-JS) and #1024 (PSK UX) merged.
Reproduced end-to-end and found the actual remaining bug.

### Root cause

PR #1021 fixed the AES-ECB path by vendoring a pure-JS core, but
**SHA-256 and HMAC-SHA256 in `public/channel-decrypt.js` are still
pinned to `crypto.subtle`**. `SubtleCrypto` is exposed **only in secure
contexts** (HTTPS / localhost); when CoreScope is served over plain HTTP
— common for self-hosted instances — `crypto.subtle` is `undefined`,
and:

- `computeChannelHash(key)` → `Cannot read properties of undefined (reading 'digest')`
- `verifyMAC(...)` → `Cannot read properties of undefined (reading 'importKey')`

Both throws are swallowed by `addUserChannel`'s `try/catch`, so the only
user-visible signal is the toast `"Failed to decrypt"` with no
console-friendly explanation. Verdict: PR #1021 only fixed half of the
crypto-in-insecure-context problem.

### Reproduction (no browser required)

`test-channel-decrypt-insecure-context.js` loads the production
`public/channel-decrypt.js` in a `vm` sandbox where `crypto.subtle` is
undefined (mirrors HTTP browser). Pre-fix it failed 8/8 with the exact
error above; post-fix it passes 8/8.

### Fix

- New `public/vendor/sha256-hmac.js`: minimal pure-JS SHA-256 + HMAC-SHA256 (FIPS-180-4 + RFC 2104, ~120 LOC, MIT). Verified against Node `crypto` for SHA-256 (empty / "abc" / 1000 bytes) and RFC 4231 HMAC-SHA256 TC1.
- `public/channel-decrypt.js`: `hasSubtle()` guard. `deriveKey`, `computeChannelHash`, and `verifyMAC` use `crypto.subtle` when available and fall back to `window.PureCrypto` otherwise. Same API, same return types, same async signatures.
- `public/index.html`: load `vendor/sha256-hmac.js` immediately before `channel-decrypt.js` (mirrors the `vendor/aes-ecb.js` wiring from #1021).

### TDD

- **Red** (`8075b55`): `test-channel-decrypt-insecure-context.js` — runs the **unmodified** prod module in a no-`subtle` sandbox, asserts on the known PSK key (hash byte `0xb7`) and synthetic encrypted packet round-trip. Compiles, runs, **fails 8/8 on assertions** (not on import errors).
- **Green** (`232add6`): vendor + delegate. Test passes 8/8.
- Wired into `test-all.sh` and `.github/workflows/deploy.yml` so CI gates the regression.

### Validation (all green post-fix)

| Test | Result |
|---|---|
| `test-channel-decrypt-insecure-context.js` | 8/8 |
| `test-channel-decrypt-ecb.js` (#1021 KAT) | 7/7 |
| `test-channel-decrypt-m345.js` (existing) | 24/24 |
| `test-channel-psk-ux.js` (#1024) | 19/19 |
| `test-packet-filter.js` | 69/69 |

### Files changed

- `public/vendor/sha256-hmac.js` — **new** (~150 LOC, MIT, decrypt-side only)
- `public/channel-decrypt.js` — `hasSubtle()` guard + fallback in `deriveKey`/`computeChannelHash`/`verifyMAC`
- `public/index.html` — script tag for `vendor/sha256-hmac.js`
- `test-channel-decrypt-insecure-context.js` — **new** (8 assertions, pure Node, no browser)
- `test-all.sh` + `.github/workflows/deploy.yml` — wire the test

### Risk / scope

- Frontend-only, decrypt-side only. No server, schema, or config changes (Config Documentation Rule N/A).
- Secure-context behaviour unchanged (still uses Web Crypto when present).
- HMAC `secret` building, MAC truncation (2 bytes), and AES-ECB delegation untouched.
- Hash vector for the user's PSK key matches: `SHA-256(372a9c93260507adcbf36a84bec0f33d) = b7ce04…`, channel hash byte `0xb7` (183) — confirmed against Node `crypto` and against the new pure-JS path.

### Note on the PSK test data in the new test

The PSK `372a9c93260507adcbf36a84bec0f33d` is shared test data from the
bug report, not a real channel secret.

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
2026-05-03 21:06:59 -07:00
Kpa-clawbot f229e15869 feat(packet-filter): transport boolean + T_FLOOD/T_DIRECT route aliases (#339) (#1014)
## Summary

Adds Wireshark-style filter support for transport route type to the
packets-page filter engine, per #339.

## New filter syntax

| Filter | Matches |
|---|---|
| `transport == true` | route_type 0 (TRANSPORT_FLOOD) or 3 (TRANSPORT_DIRECT) |
| `transport == false` | route_type 1 (FLOOD) or 2 (DIRECT) |
| `transport` | bare truthy — same as `transport == true` |
| `route == T_FLOOD` | alias for `route == TRANSPORT_FLOOD` |
| `route == T_DIRECT` | alias for `route == TRANSPORT_DIRECT` |
| `route == TRANSPORT_FLOOD` / `TRANSPORT_DIRECT` | already worked — canonical names |

Aliases are case-insensitive (`route == t_flood` works).

## Implementation

- `public/packet-filter.js`: new `transport` virtual boolean field driven by `isTransportRouteType(rt)`, which returns `rt === 0 || rt === 3`, mirroring `isTransportRoute()` in `cmd/server/decoder.go`.
- `ROUTE_ALIASES = { t_flood: 'TRANSPORT_FLOOD', t_direct: 'TRANSPORT_DIRECT' }` resolved in the equality comparator, same pattern as the existing `TYPE_ALIASES`.
- All client-side; no backend changes (issue noted this).
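
The implementation points above can be sketched as follows; the constants come from the PR description, while the function shapes around them are illustrative.

```javascript
// Constants from the PR description; function shapes are illustrative.
const ROUTE_ALIASES = { t_flood: 'TRANSPORT_FLOOD', t_direct: 'TRANSPORT_DIRECT' };

// route_type 0 (TRANSPORT_FLOOD) or 3 (TRANSPORT_DIRECT), mirroring decoder.go.
function isTransportRouteType(rt) {
  return rt === 0 || rt === 3;
}

// Case-insensitive alias lookup; canonical names pass through unchanged.
function resolveRouteName(value) {
  return ROUTE_ALIASES[String(value).toLowerCase()] || value;
}
```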

## Tests / TDD

Red commit: `9d8fdf0` — five new assertion-failing test cases + wires `test-packet-filter.js` into CI (it existed but wasn't being executed).
Green commit: `c67612b` — implementation makes all 69 tests pass.

The CI wiring is part of the red commit on purpose: previously `test-packet-filter.js` was never run by CI, so a frontend filter regression couldn't fail the build. Now it can.

## CI gating proof

Run `git revert c67612b` locally → `node test-packet-filter.js` reports
5 assertion failures (not build/import errors). Re-applying the green
commit returns all tests to passing.

Fixes #339

---------

Co-authored-by: openclaw-bot <bot@openclaw.local>
2026-05-03 17:40:12 -07:00
Kpa-clawbot 7aef3c355c fix(ci): freshen fixture timestamps before E2E to avoid time-based filter exclusion (#955) (#957)
## Problem

The E2E fixture DB (`test-fixtures/e2e-fixture.db`) has static
timestamps from March 29, 2026. The map page applies a default
`lastHeard=30d` filter, so once the fixture ages past 30 days all nodes
are excluded from `/api/nodes?lastHeard=30d` — causing the "Map page
loads with markers" test to fail deterministically.

This started blocking all CI on ~April 28, 2026 (30 days after March
29).

Closes #955 (RCA #1: time-based fixture rot)

## Fix

Added `tools/freshen-fixture.sh` — a small script that shifts all
`last_seen`/`first_seen` timestamps forward so the newest is near
`now()`, preserving relative ordering between nodes. Runs in CI before
the Go server starts. Does **not** modify the checked-in fixture (no
binary blob churn).
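
The shift arithmetic can be sketched in JavaScript (the real tool is a shell script that updates the SQLite file in place; `freshenTimestamps` is a hypothetical name):

```javascript
// Hypothetical sketch of the timestamp shift; the real tool is a shell
// script operating on the SQLite file directly.
function freshenTimestamps(rows, nowSec) {
  const newest = Math.max(...rows.map((r) => r.last_seen));
  const offset = nowSec - newest; // shift so the newest row lands at "now"
  // Adding the same offset everywhere preserves relative ordering.
  return rows.map((r) => ({
    first_seen: r.first_seen + offset,
    last_seen: r.last_seen + offset,
  }));
}
```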

## Verification

```
$ cp test-fixtures/e2e-fixture.db /tmp/fix4.db
$ bash tools/freshen-fixture.sh /tmp/fix4.db
Fixture timestamps freshened in /tmp/fix4.db
nodes: min=2026-05-01T07:10:00Z max=2026-05-01T14:51:33Z

$ ./corescope-server -port 13585 -db /tmp/fix4.db -public public &
$ curl -s "http://localhost:13585/api/nodes?limit=200&lastHeard=30d" | jq '{total, count: (.nodes | length)}'
{
  "total": 200,
  "count": 200
}
```

All 200 nodes returned with the 30-day filter after freshening (vs 0
without the fix).

Co-authored-by: you <you@example.com>
2026-05-01 08:06:19 -07:00
Kpa-clawbot 57e272494d feat(server): /api/healthz readiness endpoint gated on store load (#955) (#956)
## Summary

Fixes RCA #2 from #955: the HTTP listener and `/api/stats` go live
before background goroutines (pickBestObservation, neighbor graph build)
finish, causing CI readiness checks to pass prematurely.

## Changes

1. **`cmd/server/healthz.go`** — New `GET /api/healthz` endpoint:
   - Returns `503 {"ready":false,"reason":"loading"}` while background init is running
   - Returns `200 {"ready":true,"loadedTx":N,"loadedObs":N}` once ready

2. **`cmd/server/main.go`** — Added `sync.WaitGroup` tracking pickBestObservation and neighbor graph build goroutines. A coordinator goroutine sets `readiness.Store(1)` when all complete. `backfillResolvedPathsAsync` is NOT gated (async by design, can take 20+ min).

3. **`cmd/server/routes.go`** — Wired `/api/healthz` before system endpoints.

4. **`.github/workflows/deploy.yml`** — CI wait-for-ready loop now polls `/api/healthz` instead of `/api/stats`.

5. **`cmd/server/healthz_test.go`** — Tests for 503-before-ready, 200-after-ready, JSON shape, and anti-tautology gate.
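
The gating pattern, sketched here in JavaScript for brevity (the real implementation is Go, using a `sync.WaitGroup` and an atomic readiness flag; `makeHealthz` is a hypothetical name):

```javascript
// Language-agnostic sketch of the readiness gate; Promise.all plays the
// role of the Go sync.WaitGroup coordinator.
function makeHealthz(backgroundTasks) {
  let ready = false;
  // Coordinator: flip the flag only once every background init task completes.
  Promise.all(backgroundTasks).then(() => { ready = true; });
  return function healthz() {
    return ready
      ? { status: 200, body: { ready: true } }
      : { status: 503, body: { ready: false, reason: 'loading' } };
  };
}
```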

## Rule 18 Verification

Built and ran against `test-fixtures/e2e-fixture.db` (499 tx):
- With the small fixture DB, init completes in <300ms so both immediate
and delayed curls return 200
- Unit tests confirm 503 behavior when `readiness=0` (simulating slow
init)
- On production DBs with 100K+ txs, the 503 window would be 5-15s
(pickBestObservation processes in 5000-tx chunks with 10ms yields)

## Test Results

```
=== RUN   TestHealthzNotReady    --- PASS
=== RUN   TestHealthzReady       --- PASS  
=== RUN   TestHealthzAntiTautology --- PASS
ok  github.com/corescope/server  19.662s (full suite)
```

Co-authored-by: you <you@example.com>
2026-05-01 07:55:57 -07:00
Kpa-clawbot d81852736d ci: re-enable staging deploy now that VM is back (#932)
Reverts the `if: false` guard from #908.

## Why
- Azure subscription was blocked, staging VM `meshcore-runner-2`
deallocated.
- Subscription unblocked, VM started, runner online, smoke CI [run
#25117292530](https://github.com/Kpa-clawbot/CoreScope/actions/runs/25117292530)
passed.
- Time to resume automatic staging deploys on master pushes.

## Changes
- `deploy` job: `if: false` → `if: github.event_name == 'push'`
(original condition from before #908).
- `publish` job: `needs: [build-and-publish]` → `needs: [deploy]`
(original wiring restored).

## Verify after merge
- Next master push triggers the full chain: go-test → e2e-test →
build-and-publish → deploy → publish.
- `docker ps` on staging VM shows `corescope-staging-go` updated to the
new commit.

Co-authored-by: you <you@example.com>
2026-04-30 19:40:51 -07:00
Kpa-clawbot f4484adb52 ci: move to GitHub-hosted runners, disable staging deploy (#908)
## Why

The Azure staging VM (`meshcore-vm`) is offline. Self-hosted runners are
unavailable, blocking all CI.

## What changed (per job)

| Job | Change | Revert |
|-----|--------|--------|
| `e2e-test` | `runs-on: [self-hosted, Linux]` → `ubuntu-latest`; removed self-hosted-specific "Free disk space" step | Change `runs-on` back to `[self-hosted, Linux]`, restore disk cleanup step |
| `build-and-publish` | `runs-on: [self-hosted, meshcore-runner-2]` → `ubuntu-latest`; removed "Free disk space" prune step (noop on fresh GH-hosted runners) | Change `runs-on` back, restore prune step |
| `deploy` | `if: false # disabled` (was `github.event_name == 'push'`); `runs-on` kept as-is | Change `if:` back to `github.event_name == 'push'` |
| `publish` | `runs-on: [self-hosted, Linux]` → `ubuntu-latest`; `needs: [deploy]` → `needs: [build-and-publish]` | Change both back |

## Notes

- `go-test` and `release-artifacts` were already on `ubuntu-latest` —
untouched.
- The `deploy` job is disabled via `if: false` for trivial one-line
revert when the VM returns.
- No new `setup-*` actions were needed — `setup-node`, `setup-go`,
`docker/setup-buildx-action`, and `docker/login-action` were already
present.

Co-authored-by: you <you@example.com>
2026-04-24 17:25:53 -07:00
Kpa-clawbot 99029e41aa ci(#768): publish multi-arch (amd64+arm64) Docker image (#869)
## Problem

`docker pull` on ARM devices fails because the published image is
amd64-only.

## Fix

Enable multi-arch Docker builds via `docker buildx`. **Builder stage
uses native Go cross-compilation; only the runtime-stage `RUN` steps use
QEMU emulation.**

### Changes

| File | Change |
|------|--------|
| `Dockerfile` | Pin builder stage to `--platform=$BUILDPLATFORM` (always native), accept `ARG TARGETOS`/`ARG TARGETARCH` from buildx, set `GOOS=$TARGETOS GOARCH=$TARGETARCH CGO_ENABLED=0` on every `go build` |
| `.github/workflows/deploy.yml` | Add `docker/setup-buildx-action@v3` + `docker/setup-qemu-action@v3` (latter needed only for runtime-stage RUNs), set `platforms: linux/amd64,linux/arm64` |

### Build architecture

- **Builder stage** (`FROM --platform=$BUILDPLATFORM
golang:1.22-alpine`) — runs natively on amd64. Go toolchain
cross-compiles the binaries to `$TARGETARCH` via `GOOS/GOARCH`. No
emulation, ~10× faster than emulated builds. Works because
`modernc.org/sqlite` is pure Go (no CGO).
- **Runtime stage** (`FROM alpine:3.20`) — buildx pulls the per-arch
base. RUN steps (`apk add`, `mkdir/chown`, `chmod`) execute inside the
target-arch image, so QEMU is required to interpret arm64 binaries on
the amd64 host. Only a handful of short shell commands run under
emulation, so the QEMU cost is small.

### Verify

After merge, on an ARM device:
```bash
docker pull ghcr.io/kpa-clawbot/corescope:edge
docker inspect ghcr.io/kpa-clawbot/corescope:edge --format '{{.Architecture}}'
# → arm64
```

> First arm64 image appears on the next push to master after this merges.

Closes #768

---------

Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <agent@corescope.local>
2026-04-21 10:32:02 -07:00
Kpa-clawbot ff05db7367 ci: fix staging smoke test port — read STAGING_GO_HTTP_PORT, not hardcoded 82 (#854)
## Problem
The "Deploy Staging" job's Smoke Test always fails with `Staging
/api/stats did not return engine field`.

Root cause: the step hardcodes `http://localhost:82/api/stats`, but
`docker-compose.staging.yml:21` publishes the container on
`${STAGING_GO_HTTP_PORT:-80}:80`. Default is port 80, not 82. curl gets
ECONNREFUSED, `-sf` swallows the error, `grep -q engine` sees empty
input → failure.

Verified on staging VM: `ss -lntp` shows only `:80` listening; `docker
ps` confirms `0.0.0.0:80->80/tcp`. A `curl http://localhost:82` returns
connection-refused.

## Fix
Read `STAGING_GO_HTTP_PORT` (same default as compose) so the smoke test
tracks the port the container was actually launched on. Failure message
now includes the resolved port to make future port mismatches
self-diagnosing.

## Tested
Logic only — the curl + grep pattern is unchanged. If any CI env
override sets `STAGING_GO_HTTP_PORT`, the smoke test now follows it.

Co-authored-by: Kpa-clawbot <agent@corescope.local>
2026-04-21 16:23:50 +00:00
you fa348efe2a fix: force-remove staging container before deploy — handles both compose and docker-run containers
The deploy step used only 'docker compose down' which can't remove
containers created via 'docker run'. Now explicitly stops+removes
the named container first, then runs compose down as cleanup.

Permanent fix for the recurring CI deploy failure.
2026-04-17 05:08:32 +00:00
Kpa-clawbot c233c14156 feat: CLI tool to decrypt and export hashtag channel messages (#724)
## Summary

Adds `corescope-decrypt` — a standalone CLI tool that decrypts and
exports MeshCore hashtag channel messages from a CoreScope SQLite
database.

### What it does

MeshCore hashtag channels use symmetric encryption with keys derived
from the channel name. The CoreScope ingestor stores **all** GRP_TXT
packets, even those it can't decrypt. This tool enables retroactive
decryption — decrypt historical messages for any channel whose name you
learn after the fact.

### Architecture

- **`internal/channel/`** — Shared crypto package extracted from
ingestor logic:
  - `DeriveKey()` — `SHA-256("#name")[:16]`
  - `ChannelHash()` — 1-byte packet filter (`SHA-256(key)[0]`)
  - `Decrypt()` — HMAC-SHA256 MAC verify + AES-128-ECB
  - `ParsePlaintext()` — timestamp + flags + "sender: message" parsing

- **`cmd/decrypt/`** — CLI binary with three output formats:
  - `--format json` — Full metadata (observers, path, raw hex)
  - `--format html` — Self-contained interactive viewer with search/sort
  - `--format irc` (or `log`) — Plain-text IRC-style log, greppable

### Usage

```bash
# JSON export
corescope-decrypt --channel "#wardriving" --db meshcore.db

# Interactive HTML viewer
corescope-decrypt --channel wardriving --db meshcore.db --format html --output wardriving.html

# Greppable log
corescope-decrypt --channel "#wardriving" --db meshcore.db --format irc | grep "KE6QR"

# From Docker
docker exec corescope-prod /app/corescope-decrypt --channel "#wardriving" --db /app/data/meshcore.db
```

### Build & deployment

- Statically linked (`CGO_ENABLED=0`) — zero dependencies
- Added to Dockerfile (available at `/app/corescope-decrypt` in
container)
- CI: builds and tests in go-test job
- CI: attaches linux/amd64 and linux/arm64 binaries to GitHub Releases
on tags

### Testing

- `internal/channel/` — 9 tests: key derivation, encrypt/decrypt
round-trip, MAC rejection, wrong-channel rejection, plaintext parsing
- `cmd/decrypt/` — 7 tests: payload extraction, channel hash
consistency, all 3 output formats, JSON parseability, fixture DB
integration
- Verified against real fixture DB: successfully decrypts 17
`#wardriving` messages

### Limitations

- Hashtag channels only (name-derived keys). Custom PSK channels not
supported.
- No DM decryption (asymmetric, per-peer keys).
- Read-only database access.

Fixes #723

---------

Co-authored-by: you <you@example.com>
2026-04-12 22:07:41 -07:00
you 00953207fb ci: remove arm64 build + QEMU — amd64 only
Removes linux/arm64 from multi-platform build and drops QEMU setup.
All infra (prod + staging) is x86. QEMU emulation was adding ~12min
to every CI run for an unused architecture.
2026-04-08 05:23:41 +00:00
Kpa-clawbot 243de9fba1 fix: consolidate CI pipeline — build, publish to GHCR, then deploy staging (#636)
## Consolidate CI Pipeline — Build + Publish to GHCR + Deploy Staging

### What
Merges the separate `publish.yml` workflow into `deploy.yml`, creating a
single CI/CD pipeline:

**`go-test → e2e-test → build-and-publish → deploy → publish-badges`**

### Why
- Two workflows doing overlapping builds was wasteful and error-prone
- `publish.yml` had a bug: `BUILD_TIME=$(date ...)` in a `with:` block
never executed (literal string)
- The old build job had duplicate/conflicting `APP_VERSION` assignments

### Changes
- **`build-and-publish` job** replaces old `build` job — builds locally
for staging, then does multi-arch GHCR push (gated to push events only,
PRs skip)
- **Build metadata** computed in a dedicated step, passed via
`GITHUB_OUTPUT` — no more shell expansion bugs
- **`APP_VERSION`** is `v1.2.3` on tag push, `edge` on master push
- **Deploy** now pulls the `edge` image from GHCR and tags for compose
compatibility, with fallback to local build
- **`publish.yml` deleted** — no duplicate workflow
- **Top-level `permissions`** block with `packages:write` for GHCR auth
- **Triggers** now include `tags: ['v*']` for release publishing

### Status
- Rebased onto master
- Self-reviewed (all checklist items pass)
- Ready for merge

Co-authored-by: you <you@example.com>
2026-04-05 18:09:20 -07:00
you af9754dbea ci: move staging build+deploy to meshcore-runner-2
Prod VM (meshcore-vm) is now prod-only. Staging builds and
deploys on the secondary runner.
2026-04-05 17:33:15 +00:00
you ddce26ff2d ci: pin build and deploy jobs to meshcore-vm runner 2026-04-04 04:21:48 +00:00
Kpa-clawbot 9f14c74b3e ci: add Docker cleanup before build to prevent disk space exhaustion (#473)
## Summary

Fixes #472

The Docker build job on the self-hosted runner fails with `no space left
on device` because Docker build cache and Go module downloads accumulate
between runs. The existing cleanup (line ~330) runs in the **deploy**
step *after* the build — too late to help.

## Changes

- Added a "Free disk space" step at the start of the build job, **before** "Build Go Docker image":
  - `docker system prune -af` — removes all unused images, containers, networks
  - `docker builder prune -af` — clears the build cache
  - `df -h /` — logs available disk space for visibility
- Kept the existing post-deploy cleanup as belt-and-suspenders

---------

Co-authored-by: you <you@example.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
2026-04-01 22:27:12 -07:00
Kpa-clawbot 38e5f02a00 ci: add Docker image cleanup to prevent runner disk exhaustion (#333)
## Problem

The self-hosted runner (`meshcore-runner-2`) filled its 29GB disk to
100%, blocking all CI runs:

```
Filesystem  Size  Used Avail Use%
/dev/root    29G   29G  2.3M 100%

Docker Images: 67 total, 2 active, 18.83GB reclaimable (99%)
```

Root cause: no Docker image cleanup after builds. Each CI run builds a
new image but never prunes old ones.

## Fix

### 1. Docker image cleanup after deploy (`deploy` job)
- Runs with `if: always()` so it executes even if deploy fails
- `docker image prune -af --filter "until=24h"` — removes images older
than 24h (safe: current build is minutes old)
- `docker builder prune -f --keep-storage=1GB` — caps build cache
- Logs before/after `docker system df` for visibility

### 2. Runner log cleanup at start of E2E job
- Prunes runner diagnostic logs older than 3 days (was 53MB and growing)
- Reports `df -h` for disk visibility in CI output

## Impact

After manual cleanup today, disk went from 100% → 35% (19GB free). This
PR prevents recurrence.

## Test plan
- [x] Manual cleanup verified on runner via `az vm run-command`
- [ ] Next CI run should show cleanup step output in deploy job logs

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-31 16:00:31 -07:00
Kpa-clawbot 5aa4fbb600 chore: normalize all files to LF line endings 2026-03-30 22:52:46 -07:00
Kpa-clawbot 92188e8c12 ci: add manual workflow_dispatch trigger (#302)
Adds `workflow_dispatch` trigger to the CI/CD pipeline so it can be
manually triggered from the Actions tab.

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 19:28:46 -07:00
Kpa-clawbot 16a99159cc fix: config.json lives in data dir, not bind-mounted as file (#282)
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.

- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 18:55:44 +00:00
Kpa-clawbot 61ff72fc80 Revert "fix: config.json lives in data dir, not bind-mounted as file"
This reverts commit 57ebd76070.
2026-03-30 10:00:17 -07:00
Kpa-clawbot 57ebd76070 fix: config.json lives in data dir, not bind-mounted as file
Removes the separate config.json file bind mount from both compose
files. The data directory mount already covers it, and the Go server
searches /app/data/config.json via LoadConfig.

- Entrypoint symlinks /app/data/config.json for ingestor compatibility
- manage.sh setup creates config in data dir, prompts admin if missing
- manage.sh start checks config exists before starting, offers to create
- deploy.yml simplified — no more sudo rm or directory cleanup
- Backup/restore updated to use data dir path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 09:58:22 -07:00
Kpa-clawbot 726b041740 fix: staging config.json directory mount and wrong source file
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 09:16:07 -07:00
Kpa-clawbot 4cbb66d8e9 ci: fix badge publish — use admin PAT via Contents API to bypass branch protection [skip ci] 2026-03-29 20:03:59 -07:00
Kpa-clawbot 5777780fc8 refactor: parallel coverage collector (~30-60s vs 8min) (#272)
## Summary

Redesigned frontend coverage collector with 7 parallel browser contexts.
Coverage collector runs on master pushes only (skipped on PRs).

### Architecture
7 groups run simultaneously via `Promise.allSettled()`:
- G1: Home + Customizer
- G2: Nodes + Node Detail
- G3: Packets + Packet Detail
- G4: Map
- G5: Analytics + Channels + Observers
- G6: Live + Perf + Traces + Globals
- G7: Utility functions (page.evaluate)

### Speed gains
- `safeClick` 500ms → 100ms
- `navHash` 150ms → 50ms
- Removed redundant page visits and E2E-duplicate interactions
- Wall time = slowest group (~30-60s estimated)

### 821 lines → ~450 lines
Each group writes its own coverage JSON, nyc merges automatically.

### CI behavior
- **PRs:** Coverage collector skipped (fast CI)
- **Master:** Coverage collector runs (full synthetic user validation)

Co-authored-by: you <you@example.com>
2026-03-29 19:46:01 -07:00
Kpa-clawbot ada53ff899 ci: fix badge artifacts not uploading (include-hidden-files for .badges/) 2026-03-30 01:38:31 +00:00
you 3dd68d4418 fix: staging deploy failures — OOM + config.json directory mount
Root causes from CI logs:
1. 'read /app/config.json: is a directory' — Docker creates a directory
   when bind-mounting a non-existent file. The entrypoint now detects
   and removes directory config.json before falling back to example.
2. 'unable to open database file: out of memory (14)' — old container
   (3GB) not fully exited when new one starts. Deploy now uses
   'docker compose down' with timeout and waits for memory reclaim.
3. Supervisor gave up after 3 fast retries (FATAL in ~6s). Increased
   startretries to 10 and startsecs to 2 for server and ingestor.

Additional:
- Deploy step ensures staging config.json exists before starting
- Healthcheck: added start_period=60s, increased timeout and retries
- No longer uses manage.sh (CI working dir != repo checkout dir)
2026-03-29 23:16:46 +00:00
Kpa-clawbot 900cbf6392 fix: deploy uses manage.sh restart staging instead of raw compose
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 14:06:37 -07:00
Kpa-clawbot 067b101e14 fix: split prod/staging compose and harden deploy/manage staging control
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 14:01:29 -07:00
Kpa-clawbot c271093795 fix: use docker compose down (not stop) to properly tear down staging
stop leaves the container/network in place, blocking port rebind.
down removes everything cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 12:53:18 -07:00
Kpa-clawbot 424e4675ae ci: restrict staging deploy container cleanup
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 12:42:31 -07:00
Kpa-clawbot fd162a9354 fix: CI kills legacy meshcore-* containers before deploy (#261)
Old meshcore-analyzer container still running from pre-rename era. Freed
2.2GB by killing it. CI now cleans up both old and new container names.

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 12:30:13 -07:00
Kpa-clawbot 075dcaed4d fix: CI staging OOM — wait for old container before starting new (#259)
Old staging container wasn't fully stopped before new one started. Both
loaded 300MB stores simultaneously → OOM. Now properly waits and
verifies. Ref:
https://github.com/Kpa-clawbot/CoreScope/actions/runs/23716535123/job/69084603590

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 12:08:56 -07:00
you 2817877380 ci: pass BUILD_TIME to Docker build 2026-03-29 18:55:37 +00:00
you 251b7fa5c2 ci: rename frontend-tests badge to e2e-tests in README, remove copy hack 2026-03-29 18:49:01 +00:00
you f31e0b42a0 ci: clean up stale badges, add Go coverage placeholders, fix frontend-tests.json name 2026-03-29 18:48:04 +00:00
you 78e0347055 ci: fix staging deploy — only stop staging container, don't nuke prod 2026-03-29 18:46:33 +00:00
you 8ab195b45f ci: fix Go cache warnings on E2E step + fix staging deploy OOM (proper container cleanup) 2026-03-29 18:45:50 +00:00
you 6c7a3c1614 ci: clean Go module cache before setup to prevent tar extraction warnings 2026-03-29 18:37:59 +00:00
you a5a3a85fc0 ci: disable coverage collector — E2E extracts window.__coverage__ directly 2026-03-29 18:33:46 +00:00
Kpa-clawbot ec7ae19bb5 ci: restructure pipeline — sequential fail-fast, Go server E2E, remove deprecated JS tests (#256)
## Summary

Complete CI pipeline restructure. Sequential fail-fast chain, E2E tests
against Go server with real staging data, all deprecated Node.js server
tests removed.

### Pipeline (PR):
1. **Go unit tests** — fail-fast, coverage + badges
2. **Playwright E2E** — against Go server with fixture DB, frontend
coverage, fail-fast on first failure
3. **Docker build** — verify containers build

### Pipeline (master merge):
Same chain + deploy to staging + badge publishing

### Removed:
- All Node.js server-side unit tests (deprecated JS server)
- `npm ci` / `npm run test` steps
- JS server coverage collection (`COVERAGE=1 node server.js`)
- Changed-files detection logic
- Docs-only CI skip logic
- Cancel-workflow API hacks

### Added:
- `test-fixtures/e2e-fixture.db` — real data from staging (200 nodes, 31
observers, 500 packets)
- `scripts/capture-fixture.sh` — refresh fixture from staging API
- Go server launches with `-port 13581 -db test-fixtures/e2e-fixture.db
-public public-instrumented`

---------

Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
2026-03-29 11:24:22 -07:00
you 75637afcc8 ci: upgrade upload/download-artifact to v6 (Node.js 24) 2026-03-29 18:05:03 +00:00
you 97486cfa21 ci: temporarily disable node-test job (CI restructure in progress) 2026-03-29 17:32:07 +00:00
you bb43b5696c ci: use Go server instead of Node.js for E2E tests
The Playwright E2E tests were starting `node server.js` (the deprecated
JS server) instead of the Go server, meaning E2E tests weren't testing
the production backend at all.

Changes:
- Add Go 1.22 setup and build steps to the node-test job
- Build the Go server binary before E2E tests run
- Replace `node server.js` with `./corescope-server` in both the
  instrumented (coverage) and quick (no-coverage) E2E server starts
- Use `-port 13581` and `-public` flags to configure the Go server
- For coverage runs, serve from `public-instrumented/` directory

The Go server serves the same static files and exposes compatible
/api/* routes (stats, packets, health, perf) that the E2E tests hit.
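As a workflow-step sketch, the instrumented server start described above could look roughly like this (the step name and YAML placement are assumptions; the binary name, port, and flags are the commit's own):

```yaml
# Hedged sketch — not the repository's actual workflow file.
- name: Start Go server for E2E (instrumented)
  run: |
    go build -o corescope-server .
    ./corescope-server -port 13581 -public public-instrumented &
```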
2026-03-29 10:22:26 -07:00
Kpa-clawbot 5bb9bc146e docs: remove letsmesh.net reference from README (#233)
* docs: remove letsmesh.net reference from README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: remove paths-ignore from pull_request trigger

PR #233 only touches .md files, which were excluded by paths-ignore,
causing CI to be skipped entirely. Remove paths-ignore from the
pull_request trigger so all PRs get validated. Keep paths-ignore on
push to avoid unnecessary deploys for docs-only changes to master.

* ci: skip heavy CI jobs for docs-only PRs

Instead of using paths-ignore (which skips the entire workflow and
blocks required status checks), detect docs-only changes at the start
of each job and skip heavy steps while still reporting success.

This allows doc-only PRs to merge without waiting for Go builds,
Node.js tests, or Playwright E2E runs.

Reverts the approach from 7546ece (removing paths-ignore entirely)
in favor of a proper conditional skip within the jobs themselves.

* fix: update engine tests to match engine-badge HTML format

Tests expected [go]/[node] text but formatVersionBadge now renders
<span class="engine-badge">go</span>. Updated 6 assertions to
check for engine-badge class and engine name in HTML output.

---------

Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>
Co-authored-by: you <you@example.com>
2026-03-29 16:25:51 +00:00
you 12d1174e39 perf: speed up frontend coverage tests (~3x faster)
Three optimizations to the CI frontend test pipeline:

1. Run E2E tests and coverage collection concurrently
   - Previously sequential (E2E ~1.5min, then coverage ~5.75min)
   - Now both run in parallel against the same instrumented server
   - Expected savings: ~1.5 min wall-clock (coverage, the longer leg, now overlaps E2E instead of following it)

2. Replace networkidle with domcontentloaded in coverage collector
   - SPA uses hash routing — networkidle waits 500ms for network silence
     on every navigation, adding ~10-15s of dead time across 23 navigations
   - domcontentloaded fires immediately once HTML is parsed; JS initializes
     the route handler synchronously
   - For in-page hash changes, use 200ms setTimeout instead of
     waitForLoadState (which would never re-fire for same-document nav)

3. Extract coverage from E2E tests too
   - E2E tests already exercise the app against the instrumented server
   - Now writes window.__coverage__ to .nyc_output/e2e-coverage.json
   - nyc merges both coverage files for higher total coverage

Also:
- Split Playwright install into browser + deps steps (deps skip if present)
- Replace `sleep 5` with a health-check poll in the quick E2E path
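The `sleep 5` replacement could be a poll along these lines (a sketch — the retry budget, interval, and endpoint URL are assumptions, not the exact CI script):

```shell
# Poll a health URL until it answers, instead of sleeping a fixed 5s.
wait_for_server() {
  url="$1"
  tries="${2:-25}"   # ~5s worst case at 0.2s per attempt
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0       # server answered; stop waiting immediately
    fi
    i=$((i + 1))
    sleep 0.2
  done
  echo "server did not become healthy at $url" >&2
  return 1
}
```

Invoked as e.g. `wait_for_server http://localhost:13581/api/health` before launching Playwright, this returns as soon as the server is up rather than always burning the full five seconds.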
2026-03-29 09:12:23 -07:00
you 1b09c733f5 ci: restrict self-hosted jobs to Linux runners
The Windows self-hosted runner picks up jobs and fails because bash
scripts run in PowerShell. Node.js tests need Chromium/Playwright
(Linux-only), and build/deploy/publish use Docker (Linux-only).

Changes:
- node-test: runs-on: [self-hosted, Linux]
- build: runs-on: [self-hosted, Linux]
- deploy: runs-on: [self-hosted, Linux]
- publish: runs-on: [self-hosted, Linux]
- go-test: unchanged (ubuntu-latest)
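In workflow YAML, the change above amounts to something like this sketch (job bodies elided; only the labels come from the commit, and the prior label set is an assumption):

```yaml
jobs:
  node-test:
    runs-on: [self-hosted, Linux]   # previously lacked the Linux label (assumption)
  build:
    runs-on: [self-hosted, Linux]
  deploy:
    runs-on: [self-hosted, Linux]
  publish:
    runs-on: [self-hosted, Linux]
  go-test:
    runs-on: ubuntu-latest          # unchanged
```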
2026-03-29 14:58:15 +00:00
Kpa-clawbot 553c0e4963 ci: bump GitHub Actions to Node 24 compatible versions
checkout v4→v5, setup-go v5→v6, setup-node v4→v5,
upload-artifact v4→v5, download-artifact v4→v5

Fixes the Node.js 20 deprecation warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-29 07:51:48 -07:00
you 074f3d3760 ci: cancel workflow run immediately when any test job fails
When go-test or node-test fails, the workflow run is now cancelled
via the GitHub API so the sibling job doesn't sit queued/running.

Also fixed build job to need both go-test AND node-test (was only
waiting on go-test despite the pipeline comment saying both gate it).
2026-03-29 14:20:22 +00:00
you 206d9bd64a fix: use per-PR concurrency group to prevent cross-PR cancellation
The flat 'deploy' concurrency group caused ALL PRs to share one queue,
so pushing to any PR would cancel CI runs on other PRs.

Changed to `deploy-${{ github.event.pull_request.number || github.ref }}`
so each PR gets its own concurrency group while re-pushes to the same
PR still cancel the previous run.
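As workflow YAML, the fix sketches out roughly like this (top-level placement and `cancel-in-progress` are assumptions; the group expression is the commit's own):

```yaml
concurrency:
  group: deploy-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true   # re-pushes to the same PR cancel the prior run
```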
2026-03-29 14:14:57 +00:00