livekit

mirror of https://github.com/livekit/livekit.git synced 2026-04-29 08:15:29 +00:00

Author	SHA1	Message	Date
Paul Wells	d7c2daf1ac	report all simulcast layers (#4491 )	2026-04-28 10:45:32 -07:00
Jacob Gelman	19b9e8c00a	Additional data tracks logging (#4489 ) * Additional data track logging * Track total bytes published * Rename field	2026-04-28 21:26:07 +09:00
David Chen	743d9c8b3a	add support for client capabilities (#4461 ) * update protocol version * only check for client capability to strip packet trailer	2026-04-27 17:58:36 -07:00
Raja Subramanian	fc47e47866	Close peer connection unconditionally to unblock set local/remote (#4485 )

* Close peer connection unconditionally to unblock set local/remote description operations.

Have been chasing a leak where participants have a lot of connectivity issues and analysed a goref with Claude. Output below. Jo Turk quickly patched sctp for reported issue - https://github.com/pion/sctp/pull/465.

This PR moves the peer connection close to before waiting for events queue to be drained as event queue could be blocked on `SetLocal/RemoteDescription` hanging. The scenario is a bit far-fetched as a lot of things have to happen, but it does point to a scenario where things could hang. Remains to be seen if this helps.

Note that closing the peer connection early could mean the contained objects (like data channels) could all be closed as part of the peer connection close. But, still keeping the explicit clean up path (which should effectively become no-op) to minimise changes.

------------------------------------------------------------------

The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:

1. SCTPTransport.Start is synchronous in the SetRemoteDescription op

The stuck stack:

    PeerConnection.SetRemoteDescription.func2 (peerconnection.go:1363)
      → startRTP → startSCTP
      → SCTPTransport.Start (sctptransport.go:141)
      → DataChannel.open (datachannel.go:178)
      → datachannel.Dial → Client → Stream.WriteSCTP
      → Association.sendPayloadData (association.go:3141)   ← blocks here

SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.

2. The wait has no deadline

Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline - never armed, because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can never fire.

3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate

At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:

    // association.go:3138-3148
    if a.blockWrite {
        for a.writePending {
            a.lock.Unlock()
            select {
            case <-ctx.Done():      // never (no deadline)
            case <-a.writeNotify:   // only fires when writeLoop fully drains pendingQueue
            }
            a.lock.Lock()
        }
        a.writePending = true
    }

4. writeNotify only fires after the writeLoop drains everything

The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0 - i.e., the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.

5. There is no association-level abort timer for data writes

At association.go:764:

    assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)

noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association never transitions out of established on its own.

What it takes to wake it up

Only an external close: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). But, and this is the kicker, readLoop's defer at association.go:976-996 closes everything except it does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).

So the trigger sequence on this pod was almost certainly:

1. Peer establishes ICE+DTLS+SCTP, association goes established.
2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.

Realistic fixes (in order of how clean they are)

1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close(). Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription; open them lazily after PC reaches connected so a stuck open never blocks signaling.

Without (1), even (2) leaves the writer goroutine itself parked forever, but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.

* revert probe stop change
* handle nil offer	2026-04-27 21:38:46 +05:30
Raja Subramanian	3a7f2628b0	Turn off transceiver re-use on Safari. (#4474 ) There are issues with insertable streams + Safari which causes tracks to go missing mid-stream sometimes.	2026-04-23 19:04:10 +05:30
Raja Subramanian	701a37c2d1	Convert sort.Slice -> slices.SortFunc (#4472 ) * Convert sort.Slice -> slices.SortFunc * active speaker loudness in descending order	2026-04-23 15:12:24 +05:30
Raja Subramanian	31083307ec	do not log data track stats if not started (#4468 )	2026-04-23 10:46:33 +05:30
Anunay Maheshwari	9ee06635d6	feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules (#4466 ) * feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules * update deps	2026-04-22 12:49:36 +05:30
Raja Subramanian	dbf5cf6196	Store concrete ICE candidate for remote candidates. (#4458 )	2026-04-17 13:14:47 +05:30
Raja Subramanian	3cfb71e7ca	Use Muted in TrackInfo to propagate published track muted. (#4453 ) * Use Muted in TrackInfo to propagate published track muted. When the track is muted as a receiver is created, the receiver potentially was not getting the muted property. That would result in quality scorer expecting packets. Use TrackInfo consistently for mute and apply the mute on start up of a receiver. * update mute of subscriptions	2026-04-16 01:03:40 +05:30
Raja Subramanian	69aa94797b	Some drive-by clean up (#4452 )	2026-04-15 12:23:33 +05:30
Raja Subramanian	6c81f67858	Add subscriber stream start event notification (#4449 )	2026-04-14 22:08:31 +05:30
cnderrauber	ce1bf47b5c	Revert "fix: ensure num_participants is accurate in webhook events (#4265 ) (#…" (#4448 ) This reverts commit `cdb0769c38`.	2026-04-13 22:21:22 +08:00
Onyeka Obi	cdb0769c38	fix: ensure num_participants is accurate in webhook events (#4265 ) (#4422 ) * fix: ensure num_participants is accurate in webhook events (#4265) Three fixes for stale/incorrect num_participants in webhook payloads: 1. Move participant map insertion before MarkDirty in join path so updateProto() counts the new participant. 2. Use fresh room.ToProto() for participant_joined webhook instead of a stale snapshot captured at session start. 3. Remove direct NumParticipants-- in leave path (inconsistent with updateProto's IsDependent check), force immediate proto update, and wait for completion before triggering onClose callbacks. * fix: use ToProtoConsistent for webhook events instead of forcing immediate updates	2026-04-13 09:26:14 +08:00
Raja Subramanian	c91e79af35	Switch to stdlib maps, slices (#4445 ) * Switch to stdlib maps, slices * slices	2026-04-13 00:11:48 +05:30
David Zhao	4b3856125c	chore: pin GH commits and switch to golangci-lint (#4444 ) * chore: pin GH commits * switch to golangci-lint-action * fix lint issues	2026-04-11 13:04:22 -07:00
Raja Subramanian	2974ba879f	Unsubscribe from data track on close (#4443 ) * Unsubscribe from data track on close * clean up	2026-04-10 15:29:25 +05:30
Raja Subramanian	0a503a57f6	Add `Close` method for UpDataTrackManager and call it on participant (#4432 ) * Add `Close` method for UpDataTrackManager and call it on participant close. * include out-of-order packets in total packets	2026-04-04 17:09:02 +05:30
Raja Subramanian	55912dff7e	Add some simple data track stats (#4431 )	2026-04-04 15:23:49 +05:30
Raja Subramanian	050909e627	Enable data tracks by default. (#4429 )	2026-04-04 00:54:48 +05:30
Raja Subramanian	8a67dd1b9f	Do not close publisher peer connection to aid migration. (#4427 )	2026-04-03 21:50:59 +05:30
Raja Subramanian	91e90c1020	Add some more logging around migration. (#4426 ) Some e2e is failing due to subscriptions happening late and the expected order of m-lines is different. Not a hard failure, but logging more to make seeing this easier.	2026-04-03 13:07:32 +05:30
Raja Subramanian	7d06cfca8b	Keep subscription synchronous when publisher is expected to resume. (#4424 ) Subscription can switch between remote track and local track or vice-versa. When that happens, closing the subscribed track of one or the other asynchronously means the re-subscribe could race with subscribed track closing. Keeping the case of `isExpectedToResume` sync to prevent the race. Would be good to support multiple subscribed tracks per subscription. So, when subscribed track closes, subscription manager can check and close the correct subscribed track. But, it gets complex to clearly determine if a subscription is pending or not and other events. So, keeping it sync.	2026-04-02 19:54:14 +05:30
Raja Subramanian	934f8598e2	Clean up data track observers on unsubscribe. (#4421 ) Media track clean up fixed some leaks. There are more when the participants thrash. This is not the issue, but doing this to match media tracks.	2026-04-02 11:55:46 +05:30
Paul Wells	9ab8c1d522	clear track notifier observers on subscription teardown (#4413 ) When a subscriber disconnects, observer closures registered on the publisher's TrackChangedNotifier and TrackRemovedNotifier were never removed. These closures capture the SubscriptionManager, which holds the ParticipantImpl, preventing the entire participant object graph (PCTransport, SDPs, RTP stats, DownTracks) from being garbage collected. In rooms with many participants that disconnect and reconnect frequently, this causes unbounded memory growth proportional to the number of disconnect events. The leaked memory is not recoverable while the room remains open. Clear notifiers in both handleSubscribedTrackClose (individual subscription teardown) and SubscriptionManager.Close (full participant teardown), matching the existing cleanup in handleSourceTrackRemoved. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:48:48 -07:00
Anunay Maheshwari	ff7fd7ed56	feat(agent-dispatch): add job restart policy (#4401 ) * feat(agent-dispatch): add job restart policy * deps	2026-03-27 21:32:04 +05:30
Paul Wells	13d02ee9a8	add deadline to dtls connect context (#4395 )	2026-03-25 21:13:23 -07:00
Raja Subramanian	9e0a7e545f	Close both peer connections to aid migration. (#4382 ) * Close both peer connections to aid migration. In single peer connection case, that would close publisher peer connection. @cnderrauber I don't remember why we only closed subscriber peer connection. I am thinking it is okay to close both (or the publisher peer connection in single peer connection mode). Please let me know if I am missing something. * log change only	2026-03-24 14:19:46 +05:30
David Chen	a5333a86bb	add packet trailer stripping support (#4361 ) * bump protocol version to 17 to enable packet trailer stripping functionality * check subscriber protocol version for trailer stripping	2026-03-23 13:33:42 -07:00
Théo Monnom	89410df74c	handle AGENT_ERROR disconnect reason (#4339 )	2026-03-17 23:00:16 -07:00
Paul Wells	c8bb2578be	Rename log field "pID" to "participantID" for consistency (#4365 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 04:32:02 -07:00
Raja Subramanian	90a46fabb1	Do not kick off migration of closed participant (#4363 )	2026-03-15 10:39:55 +05:30
Raja Subramanian	5dc2e7b180	Switch data track extension to 1-byte ID/length. (#4362 ) And match design to RTP header extension, i.e. the padding for extensions is not at per extension level (which was the case before), but has been changed to padding the aggregate of all extensions in this PR.	2026-03-14 13:29:40 +05:30
Raja Subramanian	7323ad02b7	Sample data send error logging. (#4358 ) There are cases where data channel is not created potentially and logging on every one of those errors is verbose.	2026-03-12 12:02:18 +05:30
Raja Subramanian	0d34e45572	Add option to not re-use transceiver in e2ee. (#4356 )	2026-03-11 17:41:13 +05:30
cnderrauber	95225ff2e1	don't require media section for dual peerconnection mode (#4354 )	2026-03-09 20:56:55 +08:00
Milos Pesic	b34b047247	Add StopEgress function to the EgressLauncher interface (#4353 ) This allows for abstracting away how the stop is implemented - default implementation stays the same - the existing OSS egress launcher just calls the existing Stop method on the client.	2026-03-09 13:17:05 +01:00
Raja Subramanian	db1a804696	defensive check for peer connection instance (#4350 )	2026-03-08 08:34:53 +05:30
cnderrauber	caa47522fb	Add option to require media sections when participant joining (#4347 ) Negotiate media tracks at first sdp round	2026-03-06 03:59:00 +08:00
Raja Subramanian	516aeabf45	Use ParticipantTelemetryListener of LocalParticipant. (#4342 ) Had made a change in remote participant case to not have telemetry listener as telemetry does not apply to remote participant. But, that listener ended up getting used for subscriber and became a null listener. Use the listener of the subscriber participant for subscribed tracks.	2026-03-05 11:24:48 +05:30
cnderrauber	b35105656c	Exclude ice restart case from offer answer id mismatch warning (#4341 )	2026-03-05 13:05:37 +08:00
cnderrauber	9d418689c6	Send participant left event after track unpublished for moved (#4334 ) participant	2026-02-25 13:22:33 +08:00
Raja Subramanian	b81bac0ec3	Key telemetry stats worker using combination of roomID, participantID (#4323 ) * Key telemetry stats worker using combination of roomID, participantID With forwarded participant, the same participantID can exist in two rooms. NOTE: This does not yet allow a participant session to report its events/track stats into multiple rooms. That would require registering multiple listeners (from rooms a participant is forwarded to). * missed file * data channel stats * PR comments + pass in room name so that telemetry events have proper room name also	2026-02-16 13:56:13 +05:30
Raja Subramanian	a9b8d40de4	Publish is always on publisher peer connection. (#4307 )	2026-02-10 13:43:30 +05:30
Raja Subramanian	195b17f62f	Populate client_protocol field in ParticipantInfo (#4293 )	2026-02-06 00:34:56 +05:30
Raja Subramanian	f3e9b68854	Do not increase max expected layer on track info update. (#4285 ) * Do not increase max expected layer on track info update. When max expected layer increases, the corresponding trackers are reset so that first packets from those layers can trigger a layer detected change enabling quick detection of layer start. A track info update changing max to what is in track info could set the max expected to be higher without resetting the tracker. And that would cause dynacast induced max layer change to miss tracker reset too. Sequence - dynacast sets max expected to 0 - track info update sets it to 2 - dynacast sets it to 1 --> this should have reset tracker on layer 1, but because it is less than current max (2), it is skipped. * thank you CodeRabbit * force update on start	2026-02-04 12:19:41 +05:30
Raja Subramanian	d2bae34d53	refresh telemetry guard on participant move (#4280 )	2026-02-03 15:34:16 +05:30
Raja Subramanian	d1bab17b76	Add session duration and participant kind to closing log. (#4277 ) To allow using "participant closing" log entry for calculating things like session duration by participant kind or some other client SDK based attribute.	2026-02-03 07:07:24 +05:30
Raja Subramanian	1e689e1a24	Reducing some info level logs. (#4274 ) * Reducing some info level logs. Also, relaxing the check for runaway RTCP receiver report to allow for rollover to catch up if it is not too far away. * set logger	2026-02-02 10:54:03 +05:30
Paul Wells	700e1788f2	require participant broadcast when metadata/attributes are set in token (#4266 )	2026-01-26 14:22:30 -08:00

1503 Commits