Files
livekit/scripts/fec/README.md
2026-06-10 14:34:06 -07:00

136 lines
6.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# FlexFEC test harness (publisher → SFU)
Validates that FlexFEC-03 sent by a publisher recovers lost packets at the SFU
under cellular-like loss (continuous base loss plus periodic bursts), the kind
of uplink a robot on LTE/5G sees. The harness runs everything on loopback:
```
publisher (rust-sdks local_video, --test-pattern --flex-fec)
│ UDP → 127.0.0.1:7882 ← traffic shaper drops packets here
livekit-server (enable_flexfec: true, prometheus on :6789)
subscriber (rust-sdks local_video, --headless --log-frames)
```
The publisher attaches a wall-clock timestamp and a frame id to every frame
via the packet trailer feature; the subscriber logs them per received frame.
That gives ground truth for end-to-end frame latency and frame loss, while
the SFU's prometheus counters (`livekit_flexfec_packet_total{state=...}`,
`livekit_nack_total`, `livekit_packet_loss_total`) show what FEC did.
## Requirements
- `go`, `cargo`, `curl`, `python3` with `matplotlib` (`pip3 install matplotlib`)
- a `rust-sdks` checkout with the local_video example
(default `../rust-sdks` relative to this repo, override with `RUST_SDKS_DIR`)
- `sudo` for traffic shaping:
- **macOS**: dummynet (`dnctl` + `pfctl`). If dummynet is unavailable on
your macOS build, run with `--no-shaping` and use Network Link Conditioner
manually, or test on Linux.
- **Linux**: `tc` with the `netem` qdisc (`iproute2`).
## Usage
```bash
# A/B comparison: baseline (NACK only) vs FlexFEC, 2 minutes each
./run_fec_test.sh --mode ab --duration 120
# single FEC run with a harsher profile
./run_fec_test.sh --mode fec --duration 60 --base-loss 0.05 --burst-loss 0.4
# sanity check without shaping (expect ~0 loss, ~0 recoveries)
./run_fec_test.sh --mode fec --duration 30 --no-shaping
# no sudo available: drop 4% of received packets inside the SFU instead of
# OS shaping (uniform loss only, also useful for CI)
./run_fec_test.sh --mode ab --duration 60 --debug-drop 4
```
Outputs land in `scripts/fec/out/<timestamp>/`:
- `fec_report.png` — stacked time series: per-frame latency with frame-gap
markers and burst shading, FEC received/recovered/failed rates, NACK and
packet-loss rates, delivered fps
- `summary.txt` — per-run table (frames lost, latency percentiles, FEC
counters, NACK totals) and the A/B comparison
- per run (`baseline/`, `fec/`): `server.log`, `publisher.log`,
`subscriber.log`, `frames.csv`, `prom.tsv`, `events.csv`, `meta.env`
## Loss profile
Defaults simulate a robot uplink over cellular: 2% continuous loss
(Gilbert-Elliott on Linux for realistic correlation, uniform on macOS) with a
3 s burst of 25% loss every 15 s. Tune via `--base-loss`, `--burst-loss`,
`--burst-every`, `--burst-len`. The shaper matches **UDP destined to port
7882** on loopback, which is the publisher→SFU media leg; the subscriber's
upstream RTCP shares that port and is shaped too, which is acceptable for A/B
comparisons since both runs see identical conditions.
The publisher defaults to a fixed 30% FEC protection rate with the bursty
mask (`--fec-rate`, `--fec-mask-type`); pass `--fec-rate 0` to let libwebrtc
adapt the rate to its loss estimate instead.
`--debug-drop PCT` is an unprivileged alternative to OS shaping: the SFU
drops PCT% of received packets before processing (uniform, all SSRCs,
enabled via the `LIVEKIT_DEBUG_RX_DROP_PCT` env var). Use the OS shapers for
the cellular burst profile, this knob for quick checks and CI. Note that with
uniform loss and a fixed protection rate, blocks losing two or more packets
are not FEC-recoverable and fall back to NACK/RTX, so expect partial
recovery; bursty loss with the bursty mask is the scenario FlexFEC targets.
## What to expect
- Sanity run (no shaping): `state="received"` grows, `recovered` ≈ 0, no
frame gaps.
- FEC run under bursts: `recovered` spikes inside burst windows, the
subscriber sees few or no frame-id gaps, latency stays near baseline.
- Baseline under the same bursts: frame gaps and latency spikes during
bursts (NACK/RTX needs a round trip per loss; FEC repairs immediately).
- Publisher bitrate should not collapse in the FEC run — FEC packets carry
transport-wide CC sequence numbers and the SFU reports them, so the
publisher's bandwidth estimate stays intact.
## Parameter sweep
`sweep_fec.sh` runs `run_fec_test.sh` across a matrix of loss levels and FEC
configurations (one baseline plus one FEC run per `(rate, mask)` at each loss
level), builds the binaries once, then `aggregate_fec.py` combines all cells
into a single comparison: `sweep_report.png` (frame loss, p99 latency, FEC
recovery rate, and recovered-packet count, each vs loss level, baseline vs
every FEC config), `sweep_summary.csv`, and a markdown table.
```bash
# uniform-loss sweep, no sudo: 4 loss levels x 2 FEC rates x 1 mask
./sweep_fec.sh --loss-mode debug --loss-list "2 5 10 15" --fec-rate-list "20 50"
# cellular-burst sweep: vary burst intensity, compare the two masks at 30%
sudo -v && ./sweep_fec.sh --loss-mode shaped --loss-list "0.15 0.30 0.50" \
--fec-rate-list "30" --mask-list "random bursty" --duration 90
# re-aggregate an existing sweep without re-running it
python3 aggregate_fec.py --sweep out/sweep_<ts>
```
Runtime ≈ cells × (~20s setup + `--duration`). Each cell is a directory under
the sweep output; `manifest.tsv` maps cell → parameters. In `debug` mode
`--loss-list` is packet-drop percentages; in `shaped` mode it is the burst
loss fraction (base loss and cadence fixed by `--base-loss`/`--burst-every`/
`--burst-len`). Note the caveat above: uniform `debug` loss with a fixed FEC
rate only recovers blocks that lose a single packet, so `shaped` bursts with
the `bursty` mask show FlexFEC at its best.
## Direct script use
```bash
sudo ./shape_macos.sh start --port 7882 --base-loss 0.02 --burst-loss 0.25 \
--burst-every 15 --burst-len 3 --events /tmp/events.csv
sudo ./shape_macos.sh stop # force cleanup (also: shape_linux.sh)
./prom_poll.sh 6789 /tmp/prom.tsv
# single run / A/B pair
python3 plot_fec.py --run out/<ts>/baseline --run out/<ts>/fec --out out/<ts>
```