Raja Subramanian fc47e47866 Close peer connection unconditionally to unblock set local/remote (#4485)
* Close peer connection unconditionally to unblock set local/remote
description operations.

Have been chasing a leak that shows up when participants have a lot of
connectivity issues, and analysed goref output with Claude. The analysis
is below.

Jo Turk quickly patched sctp for the reported issue:
https://github.com/pion/sctp/pull/465.

This PR moves the peer connection close to before the wait for the events
queue to drain, because the events queue could be blocked on a hanging
`SetLocal/RemoteDescription`.

The scenario is a bit far-fetched as a lot of things have to happen, but
it does point to a case where things could hang. Remains to be seen
if this helps. Note that closing the peer connection early means the
contained objects (like data channels) could all be closed as part of
the peer connection close. The explicit clean up path is kept anyway
(it should effectively become a no-op) to minimise changes.
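
The reordering can be illustrated with a small stand-alone model. This is only a sketch: `fakePC`, `shutdown` and `run` are hypothetical names, not LiveKit or Pion code, and the blocking operation stands in for a hung `SetLocal/RemoteDescription`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakePC is a hypothetical stand-in for a peer connection whose
// description-setting call blocks until the connection is closed.
type fakePC struct {
	closed    chan struct{}
	closeOnce sync.Once
}

func newFakePC() *fakePC { return &fakePC{closed: make(chan struct{})} }

// blockingOp models an operation that hangs until Close is called.
func (p *fakePC) blockingOp() { <-p.closed }

// Close is idempotent and releases every blockingOp caller.
func (p *fakePC) Close() { p.closeOnce.Do(func() { close(p.closed) }) }

// shutdown optionally closes the connection first, then waits for the
// event worker. It reports whether the worker drained within the timeout.
func shutdown(p *fakePC, drained <-chan struct{}, closeFirst bool, timeout time.Duration) bool {
	if closeFirst {
		p.Close()
	}
	select {
	case <-drained:
		return true
	case <-time.After(timeout):
		p.Close() // release the worker so the model itself does not leak
		return false
	}
}

// run starts an event worker stuck in blockingOp and shuts it down.
func run(closeFirst bool) bool {
	p := newFakePC()
	drained := make(chan struct{})
	go func() {
		defer close(drained)
		p.blockingOp()
	}()
	return shutdown(p, drained, closeFirst, 100*time.Millisecond)
}

func main() {
	fmt.Println("drain first:", run(false))
	fmt.Println("close first:", run(true))
}
```

In this model, waiting for the drain first can only end by timeout, while closing first lets the worker finish and the drain complete.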

------------------------------------------------------------------

The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:

  1. SCTPTransport.Start is synchronous in the SetRemoteDescription op

  The stuck stack:
  PeerConnection.SetRemoteDescription.func2  (peerconnection.go:1363)
    → startRTP → startSCTP
      → SCTPTransport.Start         (sctptransport.go:141)
        → DataChannel.open          (datachannel.go:178)
          → datachannel.Dial → Client → Stream.WriteSCTP
            → Association.sendPayloadData    (association.go:3141)  ← blocks here
  SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.

  2. The wait has no deadline

  Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline, which is never armed because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can never fire.

  3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate

  At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:
  // association.go:3138-3148
  if a.blockWrite {
      for a.writePending {
          a.lock.Unlock()
          select {
          case <-ctx.Done():        // never (no deadline)
          case <-a.writeNotify:     // only fires when writeLoop fully drains pendingQueue
          }
          a.lock.Lock()
      }
      a.writePending = true
  }

  4. writeNotify only fires after the writeLoop drains everything

  The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0, i.e., when the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.

  5. There is no association-level abort timer for data writes

  At association.go:764:
  assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)
  noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association never transitions out of established on its own.
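
  Points 2 to 4 can be reproduced with a simplified model of the gate. This is a sketch under assumptions, not the pion/sctp implementation: `gate`, `acquire` and `release` are hypothetical names, and `release` stands in for the write loop draining its queue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// gate is a hypothetical, simplified serialized-write gate.
type gate struct {
	mu           sync.Mutex
	writePending bool
	writeNotify  chan struct{}
}

func newGate() *gate { return &gate{writeNotify: make(chan struct{}, 1)} }

// acquire waits until no write is pending, then marks one as pending.
// The only ways out of the wait are the context or a notification.
func (g *gate) acquire(ctx context.Context) error {
	g.mu.Lock()
	for g.writePending {
		g.mu.Unlock()
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-g.writeNotify:
		}
		g.mu.Lock()
	}
	g.writePending = true
	g.mu.Unlock()
	return nil
}

// release models the write loop draining everything and waking a waiter.
func (g *gate) release() {
	g.mu.Lock()
	g.writePending = false
	g.mu.Unlock()
	select {
	case g.writeNotify <- struct{}{}:
	default:
	}
}

// secondWriteAfterRelease: the gate is drained, so the second writer proceeds.
func secondWriteAfterRelease() bool {
	g := newGate()
	_ = g.acquire(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		g.release()
	}()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return g.acquire(ctx) == nil
}

// secondWriteTimesOut: nothing ever drains the gate, so only the deadline
// gets the second writer out. Without a deadline it would park forever.
func secondWriteTimesOut() bool {
	g := newGate()
	_ = g.acquire(context.Background())
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return errors.Is(g.acquire(ctx), context.DeadlineExceeded)
}

func main() {
	fmt.Println("proceeds after release:", secondWriteAfterRelease())
	fmt.Println("times out without release:", secondWriteTimesOut())
}
```

  Replacing the timeout context in `secondWriteTimesOut` with `context.Background()` gives the wedge described above: no drain, no deadline, no exit.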

  What it takes to wake it up

  Only an external close can wake it up: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). The kicker is that readLoop's defer at association.go:976-996 closes everything but does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).
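
  A minimal sketch of the missing wake-up, assuming a dedicated `closed` channel that teardown closes. The `assoc` type and its methods are hypothetical and much simpler than the real association; the point is only that closing a channel wakes every goroutine selecting on it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// assoc is a hypothetical, heavily simplified association model.
type assoc struct {
	writeNotify chan struct{} // signalled when the pending queue drains
	closed      chan struct{} // closed exactly once during teardown
	closeOnce   sync.Once
}

func newAssoc() *assoc {
	return &assoc{
		writeNotify: make(chan struct{}, 1),
		closed:      make(chan struct{}),
	}
}

// waitWritable parks until the queue drains or the association is torn
// down. It returns false when it was woken by teardown.
func (a *assoc) waitWritable() bool {
	select {
	case <-a.writeNotify:
		return true
	case <-a.closed:
		return false
	}
}

// teardown models the read loop's deferred cleanup.
func (a *assoc) teardown() { a.closeOnce.Do(func() { close(a.closed) }) }

// parkedWritersReleased parks n writers, tears the association down, and
// counts how many writers return within one second each.
func parkedWritersReleased(n int) int {
	a := newAssoc()
	woke := make(chan bool, n)
	for i := 0; i < n; i++ {
		go func() { woke <- a.waitWritable() }()
	}
	a.teardown()
	released := 0
	for i := 0; i < n; i++ {
		select {
		case <-woke:
			released++
		case <-time.After(time.Second):
			return released
		}
	}
	return released
}

func main() {
	fmt.Println("writers released:", parkedWritersReleased(3))
}
```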

  So the trigger sequence on this pod was almost certainly:
  1. Peer establishes ICE+DTLS+SCTP, association goes established.
  2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
  3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
  4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
  5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.

  Realistic fixes (in order of how clean they are)

  1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
  2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close() — Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
  3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
  4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription — open them lazily after PC reaches connected so a stuck open never blocks signaling.

  Without (1), even (2) leaves the writer goroutine itself parked forever — but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.
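
  Fix (2) could look roughly like the wrapper below. It is a sketch with hypothetical names (`callWithTimeout`, `errTimedOut`); in real code `op` would wrap the `pc.SetRemoteDescription(...)` call and `onTimeout` would call `pc.Close()`. As noted above, the goroutine running `op` is not killed; it only returns if the close actually unblocks it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimedOut = errors.New("operation timed out")

// callWithTimeout runs op in its own goroutine. If op has not returned
// within d, onTimeout is invoked and errTimedOut is returned.
func callWithTimeout(d time.Duration, op func() error, onTimeout func()) error {
	done := make(chan error, 1) // buffered so a late op never blocks on send
	go func() { done <- op() }()

	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		onTimeout()
		return errTimedOut
	}
}

// demo simulates an operation that hangs until the connection is closed.
func demo() (timedOut bool, closed bool) {
	hung := make(chan struct{})
	err := callWithTimeout(50*time.Millisecond,
		func() error { <-hung; return nil },   // stands in for the hung call
		func() { closed = true; close(hung) }, // stands in for closing the PC
	)
	return errors.Is(err, errTimedOut), closed
}

func main() {
	timedOut, closed := demo()
	fmt.Println("timed out:", timedOut, "closed:", closed)
}
```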

* revert probe stop change

* handle nil offer
2026-04-27 21:38:46 +05:30


LiveKit: Real-time video, audio and data for developers

LiveKit is an open source project that provides scalable, multi-user conferencing based on WebRTC. It's designed to provide everything you need to build real-time video, audio, and data capabilities in your applications.

LiveKit's server is written in Go, using the awesome Pion WebRTC implementation.


Features

Documentation & Guides

https://docs.livekit.io

Live Demos

Ecosystem

  • Agents: build real-time multimodal AI applications with programmable backend participants
  • Egress: record or multi-stream rooms and export individual tracks
  • Ingress: ingest streams from external sources like RTMP, WHIP, HLS, or OBS Studio

SDKs & Tools

Client SDKs

Client SDKs enable your frontend to include interactive, multi-user experiences.

Language Repo Declarative UI Links
JavaScript (TypeScript) client-sdk-js React docs | JS example | React example
Swift (iOS / MacOS) client-sdk-swift Swift UI docs | example
Kotlin (Android) client-sdk-android Compose docs | example | Compose example
Flutter (all platforms) client-sdk-flutter native docs | example
Unity WebGL client-sdk-unity-web docs
React Native (beta) client-sdk-react-native native
Rust client-sdk-rust

Server SDKs

Server SDKs enable your backend to generate access tokens, call server APIs, and receive webhooks. In addition, the Go SDK includes client capabilities, enabling you to build automations that behave like end-users.

Language Repo Docs
Go server-sdk-go docs
JavaScript (TypeScript) server-sdk-js docs
Ruby server-sdk-ruby
Java (Kotlin) server-sdk-kotlin
Python (community) python-sdks
PHP (community) agence104/livekit-server-sdk-php

Tools

Install

Tip

We recommend installing LiveKit CLI along with the server. It lets you access server APIs, create tokens, and generate test traffic.

The following will install LiveKit's media server:

MacOS

brew install livekit

Linux

curl -sSL https://get.livekit.io | bash

Windows

Download the latest release here

Getting Started

Starting LiveKit

Start LiveKit in development mode by running livekit-server --dev. It'll use a placeholder API key/secret pair.

API Key: devkey
API Secret: secret

To customize your setup for production, refer to our deployment docs

Creating access token

A user connecting to a LiveKit room requires an access token. Access tokens (JWT) encode the user's identity and the room permissions they've been granted. You can generate a token with our CLI:

lk token create \
    --api-key devkey --api-secret secret \
    --join --room my-first-room --identity user1 \
    --valid-for 24h
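
To make the "identity plus room permissions" idea concrete, here is a stdlib-only sketch of an HS256-signed JWT. The claim layout (`iss` as the API key, `sub` as the identity, a `video` grant with `room` and `roomJoin`) is an assumption for illustration and is not taken from this README; `makeToken` and `verifyToken` are hypothetical helpers. For real use, prefer the CLI or a server SDK.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// videoGrant and claims follow an assumed claim layout (see the text above).
type videoGrant struct {
	Room     string `json:"room"`
	RoomJoin bool   `json:"roomJoin"`
}

type claims struct {
	Issuer  string     `json:"iss"` // API key
	Subject string     `json:"sub"` // participant identity
	Expires int64      `json:"exp"` // expiry, Unix seconds
	Video   videoGrant `json:"video"`
}

func b64(b []byte) string { return base64.RawURLEncoding.EncodeToString(b) }

// sign returns the base64url HMAC-SHA256 of input under secret.
func sign(input, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(input))
	return b64(mac.Sum(nil))
}

// makeToken builds header.payload.signature for a room-join grant.
func makeToken(apiKey, secret, room, identity string, validSeconds int64) (string, error) {
	header := b64([]byte(`{"alg":"HS256","typ":"JWT"}`))
	payload, err := json.Marshal(claims{
		Issuer:  apiKey,
		Subject: identity,
		Expires: time.Now().Unix() + validSeconds,
		Video:   videoGrant{Room: room, RoomJoin: true},
	})
	if err != nil {
		return "", err
	}
	input := header + "." + b64(payload)
	return input + "." + sign(input, secret), nil
}

// verifyToken checks only the signature; it does not check expiry.
func verifyToken(token, secret string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	return hmac.Equal([]byte(token[i+1:]), []byte(sign(token[:i], secret)))
}

func main() {
	tok, err := makeToken("devkey", "secret", "my-first-room", "user1", 24*60*60)
	if err != nil {
		panic(err)
	}
	fmt.Println(verifyToken(tok, "secret"), verifyToken(tok, "not-the-secret"))
}
```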

Test with example app

Head over to our example app and enter a generated token to connect to your LiveKit server. This app is built with our React SDK.

Once connected, your video and audio are published to your new LiveKit instance!

Simulating a test publisher

lk room join \
    --url ws://localhost:7880 \
    --api-key devkey --api-secret secret \
    --identity bot-user1 \
    --publish-demo \
    my-first-room

This command publishes a looped demo video to a room. Due to how the video clip was encoded (keyframes every 3s), there's a slight delay before the browser has sufficient data to begin rendering frames. This is an artifact of the simulation.

Deployment

Use LiveKit Cloud

LiveKit Cloud is the fastest and most reliable way to run LiveKit. Every project gets free monthly bandwidth and transcoding credits.

Sign up for LiveKit Cloud.

Self-host

Read our deployment docs for more information.

Building from source

Pre-requisites:

  • Go 1.23+ is installed
  • GOPATH/bin is in your PATH

Then run

git clone https://github.com/livekit/livekit
cd livekit
./bootstrap.sh
mage

Contributing

We welcome your contributions toward improving LiveKit! Please join us on Slack to discuss your ideas and/or PRs.

License

LiveKit server is licensed under Apache License v2.0.


LiveKit Ecosystem
Agents SDKs: Python · Node.js
LiveKit SDKs: Browser · Swift · Android · Flutter · React Native · Rust · Node.js · Python · Unity · Unity (WebGL) · ESP32 · C++
Starter Apps: Python Agent · TypeScript Agent · React App · SwiftUI App · Android App · Flutter App · React Native App · Web Embed
UI Components: React · Android Compose · SwiftUI · Flutter
Server APIs: Node.js · Golang · Ruby · Java/Kotlin · Python · Rust · PHP (community) · .NET (community)
Resources: Docs · Docs MCP Server · CLI · LiveKit Cloud
LiveKit Server OSS: LiveKit server · Egress · Ingress · SIP
Community: Developer Community · Slack · X · YouTube

Languages
Go 99.8%
Shell 0.1%