2 Commits

Author SHA1 Message Date
efiten 7c40e24a35 feat(server): warn at startup when GOMEMLIMIT < 50% of container memory limit (#1264) (#1429)
## Summary

- Adds `readCgroupMemoryMB()` to detect container memory ceiling from
cgroup v2 (`/sys/fs/cgroup/memory.max`) and v1
(`/sys/fs/cgroup/memory.limit_in_bytes`)
- Adds `warnIfMemlimitUnderprovisioned()` called once from `main()`
after the existing memlimit block — logs a `[memlimit] WARN` at startup
if the effective GOMEMLIMIT is below 50% of the container limit
- Works whether the limit was set via `GOMEMLIMIT` env var or derived
from `packetStore.maxMemoryMB`
- Adds `readCgroupMemoryMBFn` package-level hook for test injection
(same pattern as `readProcSelfIOFn` in the ingestor)
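
The cgroup lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper `parseCgroupLimitMB`, the return shape `(int64, bool)`, and the "unlimited" threshold are assumptions; only the function name `readCgroupMemoryMB` and the two file paths come from the summary.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCgroupLimitMB (hypothetical helper) converts the contents of a cgroup
// memory-limit file to MiB. It returns ok=false for "max" (cgroup v2 with no
// limit), for unparsable input, and for values so large that they effectively
// mean "unlimited" (cgroup v1 reports a huge number instead of "max").
func parseCgroupLimitMB(raw string) (mb int64, ok bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	// Assumption: treat anything at or above 1 EiB as "no limit".
	const unlimitedThreshold = int64(1) << 60
	if n >= unlimitedThreshold {
		return 0, false
	}
	return n / (1024 * 1024), true
}

// readCgroupMemoryMB tries the cgroup v2 file first, then the v1 file.
// The paths are the ones listed in the summary; on many cgroup v1 hosts the
// memory controller is mounted at /sys/fs/cgroup/memory/ instead.
func readCgroupMemoryMB() (int64, bool) {
	for _, path := range []string{
		"/sys/fs/cgroup/memory.max",
		"/sys/fs/cgroup/memory.limit_in_bytes",
	} {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // file absent: not this cgroup version, or not a container
		}
		if mb, ok := parseCgroupLimitMB(string(data)); ok {
			return mb, true
		}
	}
	return 0, false
}

func main() {
	if mb, ok := readCgroupMemoryMB(); ok {
		fmt.Printf("container memory limit: %d MiB\n", mb)
	} else {
		fmt.Println("no container memory limit detected")
	}
}
```

Keeping the parsing separate from the file reads makes the edge cases (`max`, the v1 sentinel value) checkable without a container.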

Fixes #1264. In the reported incident, GOMEMLIMIT was 1536 MiB on a 7.7
GB container; GC consumed 82% of CPU and all endpoints were 3–100×
slower. This warning fires at startup so operators catch the
misconfiguration before it causes an incident.

## Test plan

- [ ] `TestWarnIfMemlimitUnderprovisioned_EmitsWarning` — warning fires
when effective < 50% of cgroup
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoWarnWhenAdequate` — no
warning when the limit is adequate (effective = 1024 MiB, cgroup = 1536
MiB, about 67% of the container limit)
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoCgroupNoLog` — silent on
non-container hosts
- [ ] `TestWarnIfMemlimitUnderprovisioned_NoneSource` — no warning when
`source="none"` (no limit configured, runtime returns math.MaxInt64)
- [ ] `TestMemlimitUnderprovisioned` — boundary table for the comparison
helper
- [ ] All existing `TestApplyMemoryLimit_*` still pass
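
The hook-injection pattern these tests rely on might look roughly like the sketch below. The signature of `warnIfMemlimitUnderprovisioned`, the log wording after `[memlimit] WARN`, the `"env"` source label, and the `warns` helper are all assumptions; only the function names, the `"none"` source, and the `math.MaxInt64` behavior come from the description.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"math"
	"strings"
)

// readCgroupMemoryMBFn is a package-level hook. Production code would point
// it at the real cgroup reader; tests swap in a stub.
var readCgroupMemoryMBFn = func() (int64, bool) { return 0, false }

// warnIfMemlimitUnderprovisioned (signature assumed) logs a warning when the
// effective limit is below 50% of the container limit.
func warnIfMemlimitUnderprovisioned(effectiveBytes int64, source string) {
	if source == "none" || effectiveBytes == math.MaxInt64 {
		return // no limit configured, nothing to compare
	}
	cgroupMB, ok := readCgroupMemoryMBFn()
	if !ok || cgroupMB <= 0 {
		return // not in a container, or no container limit
	}
	effectiveMB := effectiveBytes / (1024 * 1024)
	if effectiveMB*2 < cgroupMB {
		log.Printf("[memlimit] WARN effective limit %d MiB is below 50%% of container limit %d MiB (source=%s)",
			effectiveMB, cgroupMB, source)
	}
}

// warns (hypothetical test helper) stubs the cgroup reader, captures log
// output, and reports whether a warning was emitted.
func warns(effectiveMB, cgroupMB int64, cgroupOK bool, source string) bool {
	var buf bytes.Buffer
	oldOut, oldFn := log.Writer(), readCgroupMemoryMBFn
	log.SetOutput(&buf)
	readCgroupMemoryMBFn = func() (int64, bool) { return cgroupMB, cgroupOK }
	defer func() {
		log.SetOutput(oldOut)
		readCgroupMemoryMBFn = oldFn
	}()
	warnIfMemlimitUnderprovisioned(effectiveMB*1024*1024, source)
	return strings.Contains(buf.String(), "[memlimit] WARN")
}

func main() {
	fmt.Println(warns(512, 2048, true, "env"))   // true: 512 is below half of 2048
	fmt.Println(warns(1024, 1536, true, "env"))  // false: limit is adequate
	fmt.Println(warns(512, 0, false, "env"))     // false: no cgroup limit found
	fmt.Println(warns(512, 2048, true, "none"))  // false: no limit configured
}
```

Restoring the hook and the log writer in a `defer` keeps one case from leaking state into the next.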

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 15:06:30 -07:00
Kpa-clawbot d05e468598 feat(memlimit): GOMEMLIMIT support, derive from packetStore.maxMemoryMB (#836) (#1077)
## Summary

Implements **part 1** of #836 — `GOMEMLIMIT` support so the Go runtime
self-throttles GC under cgroup memory pressure instead of getting
SIGKILLed.

(Parts 2 & 3 — bounded cold-load batching + README ops docs — land in
follow-up PRs.)

## Behavior

On startup `cmd/server/main.go` now calls `applyMemoryLimit(maxMemoryMB,
envSet)`:

| Condition | Action | Log |
|---|---|---|
| `GOMEMLIMIT` env set | Honor the runtime's parse, do nothing | `[memlimit] using GOMEMLIMIT from environment (...)` |
| env unset, `packetStore.maxMemoryMB > 0` | `debug.SetMemoryLimit(maxMB * 1.5 MiB)` | `[memlimit] derived from packetStore.maxMemoryMB=512 → 768 MiB (1.5x headroom)` |
| env unset, `maxMemoryMB == 0` | No-op | `[memlimit] no soft memory limit set ... recommend setting one to avoid container OOM-kill` |
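
The three cases in the table can be sketched as a single function. This is an illustration under assumptions, not the merged code: the return values, the `"env"` and `"derived"` source labels, and the `currentLimit` helper are hypothetical, and logging is left out. The argument order follows the call shown above.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// currentLimit (hypothetical helper) reads the runtime's soft memory limit
// without changing it.
func currentLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

// applyMemoryLimit mirrors the table: the environment wins, otherwise the
// limit is derived as 1.5x maxMemoryMB, otherwise nothing is set.
// Return values (limit in bytes, source label) are assumed.
func applyMemoryLimit(maxMemoryMB int, envSet bool) (int64, string) {
	switch {
	case envSet:
		// The runtime already parsed GOMEMLIMIT at startup; do not override.
		return currentLimit(), "env"
	case maxMemoryMB > 0:
		// Multiply before dividing so odd values keep their half MiB.
		limit := int64(maxMemoryMB) * (1 << 20) * 3 / 2
		debug.SetMemoryLimit(limit)
		return limit, "derived"
	default:
		// No-op: the runtime keeps its default (math.MaxInt64).
		return currentLimit(), "none"
	}
}

func main() {
	_, envSet := os.LookupEnv("GOMEMLIMIT")
	limit, source := applyMemoryLimit(512, envSet)
	fmt.Printf("source=%s limit=%d MiB\n", source, limit>>20)
}
```

With `GOMEMLIMIT` unset and `maxMemoryMB` at 512, the derived limit is 768 MiB, matching the example log line in the table.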

The 1.5x headroom covers Go's NextGC trigger at ~2× live heap (per #836
heap profile: 680 MB live → 1.38 GB NextGC).
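
The ratio quoted from the heap profile can be reproduced, and compared against a running process, with a short sketch like the one below. The helper `nextGCRatio` is hypothetical, and `HeapAlloc` is only a rough stand-in for live heap because it also counts objects that have not been swept yet.

```go
package main

import (
	"fmt"
	"runtime"
)

// nextGCRatio (hypothetical helper) returns the NextGC target divided by the
// live heap size. Both arguments must use the same unit.
func nextGCRatio(live, nextGC float64) float64 {
	if live <= 0 {
		return 0
	}
	return nextGC / live
}

func main() {
	// Numbers quoted from the #836 heap profile: 680 MB live, 1.38 GB NextGC.
	fmt.Printf("profile ratio: %.2fx\n", nextGCRatio(680, 1380))

	// The same ratio for the current process, read from runtime.MemStats.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("this process: HeapAlloc=%d NextGC=%d ratio=%.2fx\n",
		m.HeapAlloc, m.NextGC,
		nextGCRatio(float64(m.HeapAlloc), float64(m.NextGC)))
}
```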

## Tests (TDD red→green visible in commit history)

- `TestApplyMemoryLimit_FromEnv` — env wins, function does not override
- `TestApplyMemoryLimit_DerivedFromMaxMemoryMB` — verifies bytes
computation + `debug.SetMemoryLimit` actually applied at runtime
- `TestApplyMemoryLimit_None` — no env, no config → reports `"none"`, no
side effect

Red commit: `7de3c62` (assertion failures, builds clean)
Green commit: `454516d`

## Config docs

`config.example.json` `packetStore._comment_gomemlimit` documents
env/derived/override behavior.

## Out of scope

- Cold-load transient bounding (item 2 in #836)
- README container-size table (item 3)
- QA §1.1 rewrite

Closes part 1 of #836.

---------

Co-authored-by: corescope-bot <bot@corescope>
2026-05-05 01:33:23 -07:00