924 Commits

Author SHA1 Message Date
David Chen 743d9c8b3a add support for client capabilities (#4461)
* update protocol version

* only check for client capability to strip packet trailer
2026-04-27 17:58:36 -07:00
Raja Subramanian fc47e47866 Close peer connection unconditionally to unblock set local/remote (#4485)
* Close peer connection unconditionally to unblock set local/remote
description operations.

Have been chasing a leak where participants have a lot of connectivity
issues and analysed a goref with Claude. Output below.

Jo Turk quickly patched sctp for reported issue -
https://github.com/pion/sctp/pull/465.

This PR moves the peer connection close to before waiting for the event
queue to be drained, as the event queue could be blocked on a hanging
`SetLocal/RemoteDescription`.

The scenario is a bit far-fetched as a lot of things have to happen, but
it does point to a scenario where things could hang. Remains to be seen
if this helps. Note that closing the peer connection early could mean
the contained objects (like data channels) could all be closed as part
of the peer connection close. But, still keeping the explicit clean up
path (which should effectively become no-op) to minimise changes.

------------------------------------------------------------------

The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:

  1. SCTPTransport.Start is synchronous in the SetRemoteDescription op

  The stuck stack:
  PeerConnection.SetRemoteDescription.func2  (peerconnection.go:1363)
    → startRTP → startSCTP
      → SCTPTransport.Start         (sctptransport.go:141)
        → DataChannel.open          (datachannel.go:178)
          → datachannel.Dial → Client → Stream.WriteSCTP
            → Association.sendPayloadData    (association.go:3141)  ← blocks here
  SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.

  2. The wait has no deadline

  Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline, never armed because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can never fire.

  3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate

  At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:
  // association.go:3138-3148
  if a.blockWrite {
      for a.writePending {
          a.lock.Unlock()
          select {
          case <-ctx.Done():        // never (no deadline)
          case <-a.writeNotify:     // only fires when writeLoop fully drains pendingQueue
          }
          a.lock.Lock()
      }
      a.writePending = true
  }

  4. writeNotify only fires after the writeLoop drains everything

  The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0, i.e., the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.

  5. There is no association-level abort timer for data writes

  At association.go:764:
  assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)
  noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association never transitions out of established on its own.

  What it takes to wake it up

  Only an external close: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). But, and this is the kicker, readLoop's defer at association.go:976-996 closes everything except that it does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).
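
  The missing wake-up can be illustrated with a small Go sketch. It is not pion/sctp code: the waiter type, its closed channel, and the errClosed value are hypothetical names, assumed here only to show how a teardown path that closes a channel releases every parked writer.

```go
package main

import (
	"errors"
	"sync"
)

// errClosed is a hypothetical error returned to writers woken by teardown.
var errClosed = errors.New("association closed")

// waiter is a hypothetical stand-in for the part of an association that
// parks writers; it is not the pion/sctp Association type.
type waiter struct {
	writeNotify chan struct{} // signalled when the pending queue drains
	closed      chan struct{} // closed exactly once by the teardown path
	closeOnce   sync.Once
}

func newWaiter() *waiter {
	return &waiter{
		writeNotify: make(chan struct{}),
		closed:      make(chan struct{}),
	}
}

// wait parks until the queue drains or the association is torn down.
func (w *waiter) wait() error {
	select {
	case <-w.writeNotify:
		return nil
	case <-w.closed:
		return errClosed
	}
}

// cleanup models the read loop's deferred teardown. Closing the channel
// wakes all parked writers at once, and sync.Once makes repeat calls safe.
func (w *waiter) cleanup() {
	w.closeOnce.Do(func() { close(w.closed) })
}
```

  With a closed channel in the select, a writer no longer depends on a deadline being armed to get out of the wait.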

  So the trigger sequence on this pod was almost certainly:
  1. Peer establishes ICE+DTLS+SCTP, association goes established.
  2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
  3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
  4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
  5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.

  Realistic fixes (in order of how clean they are)

  1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
  2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close() — Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
  3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
  4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription — open them lazily after PC reaches connected so a stuck open never blocks signaling.

  Without (1), even (2) leaves the writer goroutine itself parked forever — but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.

* revert probe stop change

* handle nil offer
2026-04-27 21:38:46 +05:30
renovate[bot] 639406eb96 Update module github.com/pion/ice/v4 to v4.2.3 (#4481)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-26 22:33:26 -07:00
Anunay Maheshwari 9ee06635d6 feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules (#4466)
* feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules

* update deps
2026-04-22 12:49:36 +05:30
renovate[bot] ea7b9c6fe1 Update module github.com/livekit/protocol to v1.45.3 (#4435)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-11 16:17:10 -07:00
renovate[bot] 97378368dd Update go deps (major) (#3179)
* Update go deps

Generated by renovateBot

* update api usage

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: David Zhao <dz@livekit.io>
2026-04-11 14:28:33 -07:00
renovate[bot] d6aef547ce Update go deps (#3862)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-11 13:59:25 -07:00
cnderrauber 5dc4e90d82 Apply IPFilter when get local ip (#4440)
Fix #4437
See https://github.com/livekit/mediatransportutil/pull/80
2026-04-10 08:59:53 +08:00
Anunay Maheshwari ff7fd7ed56 feat(agent-dispatch): add job restart policy (#4401)
* feat(agent-dispatch): add job restart policy

* deps
2026-03-27 21:32:04 +05:30
cnderrauber 9474c807c0 route participant reads through PSRPC instead of Redis (#4387)
rel: #4373
2026-03-24 16:25:11 +08:00
Raja Subramanian bc3aeaf3f8 Update grpc to address CVE-2026-33186 (#4381)
That updates the Go version.
2026-03-23 22:01:17 +05:30
Théo Monnom 89410df74c handle AGENT_ERROR disconnect reason (#4339) 2026-03-17 23:00:16 -07:00
Denys Smirnov 4a9e004555 Update protocol. (#4367) 2026-03-16 17:26:56 +02:00
cnderrauber e963953817 Refine ipv6 support (#4352)
* Refine ipv6 support

* go mod

* check ipv4 is set in turn
2026-03-09 20:43:00 +08:00
He Chen cb7dc2d02a TEL-405: support originating calls from custom domains (#4349) 2026-03-06 12:25:40 -08:00
Raja Subramanian f51b27328f Update pion/webrtc and deps to update dtls (#4326)
DTLS v3.1.0 was retracted - https://github.com/pion/dtls/pull/800
2026-02-19 13:43:20 +05:30
Benjamin Pracht 03e90dd762 Fix for some CodeQL reported issues (#4314)
One of the fixes is in the updated protocol (logging of a request message that includes the TURN server settings, including the password).
2026-02-11 10:15:12 -08:00
Raja Subramanian 195b17f62f Populate client_protocol field in ParticipantInfo (#4293) 2026-02-06 00:34:56 +05:30
Anunay Maheshwari 0c33b8c671 chore: move codecs/mime stuff to protocol (#4255) 2026-01-20 20:54:32 +05:30
Ryan Gaus 165c17358a Update livekit protocol to v1.44.0 (#4254) 2026-01-19 14:12:37 +11:00
Raja Subramanian a35a6ae751 Add participant option for data track auto-subscribe. (#4240)
* Add participant option for data track auto-subscribe.

Default disabled.

* protocol update to use data track auto subscribe setting

* deps
2026-01-14 13:22:43 +05:30
Denys Smirnov 843d8c3ea1 Update Pion transport package. (#4237)
* Update Pion transport package.

* Update mediatransportutil package.
2026-01-13 19:56:41 +02:00
Denys Smirnov 4ec0f8f4ce Support OpenTelemetry tracing. Add Jaeger support. (#4222) 2026-01-06 17:22:21 +02:00
Raja Subramanian ed8e6afcd7 Handle repair SSRC of simulcast tracks during migration. (#4193)
* Handle repair SSRC of simulcast tracks during migration.

* fix

* fix comment
2025-12-25 14:45:48 +05:30
Raja Subramanian 3cb9abb615 Update pion/webrtc to v4.2.1 (#4191) 2025-12-24 17:19:15 +05:30
Raja Subramanian 7c8ea11505 Refactor receiver and buffer into Base and higher layer. (#4185)
* Refactor receiver and buffer into Base and higher layer.

To be able to share code/functionality with relay.

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* clean up

* deps

* fix test

* fix test
2025-12-23 21:35:48 +05:30
cnderrauber 4104b8270b update protocol (#4183)
* update protocol

* fix test
2025-12-22 12:54:11 +08:00
Anunay Maheshwari c28e5e450f fix(kindToProto): update protocol (#4171) 2025-12-17 23:32:02 +05:30
Paul Wells 898ebe058c clean up manual roomservice log redaction (#4165)
* clean up manual roomservice log redaction

* deps
2025-12-16 23:02:45 -08:00
Raja Subramanian 97aba5e77b Consistently undo update to sequence number and timestamp when the (#4156)
incoming packet cannot be sequenced.
2025-12-13 15:46:04 +05:30
Raja Subramanian 35b0e2bc21 update webrtc to 4.1.8 to pick up DTLS fingerprint check during handshake (#4140) 2025-12-10 00:03:32 +05:30
Raja Subramanian ea9b217738 protocol deps to get inactive file adjusted memory usage. (#4137) 2025-12-08 22:07:24 +05:30
Raja Subramanian 3eef869a68 Do not pause rid in SDP (#4129) 2025-12-05 15:57:31 +05:30
cnderrauber fa0633aa3e move utils.WrapAround to mediatransportutil (#4124) 2025-12-04 17:45:11 +08:00
Raja Subramanian f8706cd470 Update pion/ice to stop gather first on close (#4123)
* Update pion/ice to stop gather first on close

* fix data race in test
2025-12-04 12:22:52 +05:30
Raja Subramanian 7954748d7a Data tracks (#4089)
* WIP

* WIP

* Starting to add some signalling integration testing.

* Working tests.

* fix tests

* Forward data packets (#4096)

* WIP commit

* WIP

* WIP

* fix forwarding

* address PR comments

* move some methods from LocalParticipant to Participant interface

* handle subscription update

* add extensions and tests

* more packet tests

* add test for replace extension and fix a bug

* update protocol and add config
2025-12-04 10:44:34 +05:30
Raja Subramanian ebdcead511 Update mediatransportutil to get bucket packet size limit. (#4120) 2025-12-01 11:31:37 +05:30
Raja Subramanian 411b09f6ca Release v1.9.5 (#4119) 2025-12-01 10:51:07 +05:30
Raja Subramanian 8dcf235a02 Update pion/ice - attempt to address tcp packet conn close hang (#4116) 2025-11-30 20:06:19 +05:30
Raja Subramanian 64c651431e Update mediatransportutil (#4115)
- New bucket API to pass in max packet size and sequence number offset
  and sequence number size generic type
- Move OWD estimator to mediatransportutil.
2025-11-28 21:51:53 +05:30
Raja Subramanian 9c483a693a Use released version v1.8.41 of pion/sctp (#4113) 2025-11-27 21:01:00 +05:30
Raja Subramanian 35c79a57d7 Update SCTP hash, had the wrong one in previous PR (#4111) 2025-11-27 15:52:45 +05:30
Raja Subramanian e0fbbef1cd Update pion/sctp with RFC9260 revert (#4110) 2025-11-27 15:32:10 +05:30
Raja Subramanian f3c8091797 Try SCTP with read deadline to unblock abort. (#4109) 2025-11-27 13:18:06 +05:30
Raja Subramanian 6d4154b8a7 Update pion/ice. (#4107)
Hopefully this solves the peer connection close hang.
2025-11-27 00:54:49 +05:30
Raja Subramanian 7f10e18bac Record join/publish/subscribe cancellations. (#4102)
To get better picture of success/failure rate.
2025-11-25 14:06:02 +05:30
Raja Subramanian 37a06821e2 logger proto redaction. (#4090)
Unfortunately, this could not be used for twirp/analytics redaction.

Probably worth writing a proto clone utility which will filter out based
on tags.
2025-11-18 14:15:17 +05:30
Raja Subramanian 5175c1afa1 Lock x/tools at 0.37.0 (#4085) 2025-11-15 19:14:03 +05:30
Raja Subramanian d510fff1e7 Downgrade x/tools to be able to make a release (#4084) 2025-11-15 18:56:22 +05:30
Alex 3a128e61c1 protocol bump for SIP error mapping and validation (#4081) 2025-11-14 10:54:26 -08:00