- prevent some escape to heap
- avoid copying by using a ring buffer for receiver reports (probably
  should be removed, as this is for debugging only and data so far has
  only shown clients sending bad data and nothing more.)
* Close peer connection unconditionally to unblock set local/remote
description operations.
Have been chasing a leak where participants have a lot of connectivity
issues and analysed a goref with Claude. Output below.
Jo Turk quickly patched sctp for reported issue -
https://github.com/pion/sctp/pull/465.
This PR moves the peer connection close to before waiting for events
queue to be drained as event queue could be blocked on
`SetLocal/RemoteDescription` hanging.
The scenario is a bit far-fetched as a lot of things have to happen, but
it does point to a way things could hang. It remains to be seen if this
helps. Note that closing the peer connection early could mean the
contained objects (like data channels) are all closed as part of the
peer connection close. But the explicit clean-up path (which should
effectively become a no-op) is kept to minimise changes.
------------------------------------------------------------------
The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:
1. SCTPTransport.Start is synchronous in the SetRemoteDescription op
The stuck stack:
PeerConnection.SetRemoteDescription.func2 (peerconnection.go:1363)
→ startRTP → startSCTP
→ SCTPTransport.Start (sctptransport.go:141)
→ DataChannel.open (datachannel.go:178)
→ datachannel.Dial → Client → Stream.WriteSCTP
→ Association.sendPayloadData (association.go:3141) ← blocks here
SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.
2. The wait has no deadline
Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline — never armed, because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can
never fire.
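The mechanics of point 2 reduce to a property of Go selects: receiving from a nil channel blocks forever. A minimal standalone sketch (names are illustrative, not pion's actual types) of why the deadline arm can never rescue the writer:

```go
package main

import "fmt"

// waitWritable models the blocked-write gate: done is the deadline channel,
// writeNotify the wake-up from the write loop. When no deadline is armed,
// done is nil, and a receive from a nil channel blocks forever, so the only
// way out of the select is a writeNotify signal.
func waitWritable(done, writeNotify <-chan struct{}) string {
	select {
	case <-done:
		return "deadline"
	case <-writeNotify:
		return "writable"
	}
}

func main() {
	notify := make(chan struct{}, 1)
	notify <- struct{}{}
	// done is nil (no deadline): only writeNotify can wake the writer.
	fmt.Println(waitWritable(nil, notify)) // writable
}
```

If writeNotify is also never signalled (point 4), the goroutine parks for the lifetime of the process.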
3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate
At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:
// association.go:3138-3148
if a.blockWrite {
    for a.writePending {
        a.lock.Unlock()
        select {
        case <-ctx.Done():    // never (no deadline)
        case <-a.writeNotify: // only fires when writeLoop fully drains pendingQueue
        }
        a.lock.Lock()
    }
    a.writePending = true
}
4. writeNotify only fires after the writeLoop drains everything
The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0 — i.e., the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop
arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.
5. There is no association-level abort timer for data writes
At association.go:764:
assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)
noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association
never transitions out of established on its own.
What it takes to wake it up
Only an external close: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). But — and this is the kicker — readLoop's defer at association.go:976-996
closes everything except it does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).
So the trigger sequence on this pod was almost certainly:
1. Peer establishes ICE+DTLS+SCTP, association goes established.
2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.
Realistic fixes (in order of how clean they are)
1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close() — Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription — open them lazily after PC reaches connected so a stuck open never blocks signaling.
Without (1), even (2) leaves the writer goroutine itself parked forever — but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.
* revert probe stop change
* handle nil offer
`iceServersForParticipant` builds UDP TURN URLs by interpolating the
node IP directly into a format string:
fmt.Sprintf("turn:%s:%d?transport=udp", ip, port)
When `NodeIP.V6` is set, `ToStringSlice()` includes the bare IPv6
address, producing URLs like:
turn:2a05:d014:ee4:1201:7039:38c:f652:a252:443?transport=udp
RFC 3986 §3.2.2 requires IPv6 addresses in URIs to be enclosed in
square brackets. Without them the port is ambiguous and WebRTC clients
(e.g. libdatachannel) reject the URL with "Invalid ICE server port".
Use `net.JoinHostPort` which handles bracketing for IPv6 and is a
no-op for IPv4, producing well-formed URLs:
turn:[2a05:d014:ee4:1201:7039:38c:f652:a252]:443?transport=udp
turn:1.2.3.4:443?transport=udp
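The fix can be sketched as below; turnUDPURL is a hypothetical helper, not the actual iceServersForParticipant code. net.JoinHostPort brackets IPv6 literals and leaves IPv4 addresses untouched:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// turnUDPURL builds a TURN-over-UDP URL. net.JoinHostPort adds the square
// brackets RFC 3986 requires around IPv6 literals, and is a no-op wrapper
// for IPv4, so both families produce well-formed URLs.
func turnUDPURL(ip string, port int) string {
	return fmt.Sprintf("turn:%s?transport=udp", net.JoinHostPort(ip, strconv.Itoa(port)))
}

func main() {
	fmt.Println(turnUDPURL("1.2.3.4", 443))
	// turn:1.2.3.4:443?transport=udp
	fmt.Println(turnUDPURL("2a05:d014:ee4:1201:7039:38c:f652:a252", 443))
	// turn:[2a05:d014:ee4:1201:7039:38c:f652:a252]:443?transport=udp
}
```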
* Use Muted in TrackInfo to propagate published track mute state.
When a track is muted while its receiver is being created, the receiver
potentially was not getting the muted property. That would result in
the quality scorer expecting packets.
Use TrackInfo consistently for mute and apply the mute on start up of a
receiver.
* update mute of subscriptions
* fix: ensure num_participants is accurate in webhook events (#4265)
Three fixes for stale/incorrect num_participants in webhook payloads:
1. Move participant map insertion before MarkDirty in join path so
updateProto() counts the new participant.
2. Use fresh room.ToProto() for participant_joined webhook instead of
a stale snapshot captured at session start.
3. Remove direct NumParticipants-- in leave path (inconsistent with
updateProto's IsDependent check), force immediate proto update,
and wait for completion before triggering onClose callbacks.
* fix: use ToProtoConsistent for webhook events instead of forcing immediate updates
* Update go deps
Generated by renovateBot
* update api usage
---------
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: David Zhao <dz@livekit.io>
* Log join duration.
Also revert the "unresolved" init. It defeated the purpose of the log
resolver, as it was resolving with those values even when not forced.
Instead, set it to "unresolved" if not set when forced.
Join duration is not reset if resolver is reset as that happens on
moving a participant and there is no new join duration in that case.
* explode
Some e2e tests are failing due to subscriptions happening late, so the
expected order of m-lines is different. Not a hard failure, but logging
more to make seeing this easier.
For a participant migrating out, the track could be resumed on a
different node while ending on the migrating-out node. So, `flush`
should be used to indicate if the track is going to be resumed.
Subscription can switch between remote track and local track or
vice-versa. When that happens, closing the subscribed track of one or
the other asynchronously means the re-subscribe could race with
subscribed track closing.
Keeping the case of `isExpectedToResume` sync to prevent the race.
It would be good to support multiple subscribed tracks per subscription.
Then, when a subscribed track closes, the subscription manager can check
and close the correct subscribed track. But it gets complex to clearly
determine if a subscription is pending or not, and to handle other
events. So, keeping it sync.
Removing some logs which have not been useful in terms of insights other
than saying that there are a bunch of packets missing. Going to start
looking at gaps in terms of time if the inter-packet gap is too high.
Also, logging these events: the first 20, and then every 200th.
A bunch of edges to note here
RED packet does not have sequence number for redundant blocks. It only
has timestamp offset compared to the primary payload. The receivers are
supposed to use just timestamp to sequence the payload and decode.
But, when converting from RED -> Opus, the packets extracted from RED
packet should be assigned a sequence number before they can be
forwarded. The simple rule is, if packet N contains X redundant
payloads, they are assigned sequence number of N - X to N - 1.
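That numbering rule can be sketched as follows (an illustrative helper, not the server's actual conversion code); uint16 arithmetic handles RTP sequence-number wraparound for free:

```go
package main

import "fmt"

// redundantSeqNums assigns sequence numbers to payloads extracted from a RED
// packet: if the primary payload has sequence number n and carries x
// redundant payloads, they get n-x .. n-1, oldest first.
func redundantSeqNums(primary uint16, numRedundant int) []uint16 {
	seqs := make([]uint16, 0, numRedundant)
	for i := numRedundant; i >= 1; i-- {
		seqs = append(seqs, primary-uint16(i)) // uint16 subtraction wraps correctly
	}
	return seqs
}

func main() {
	fmt.Println(redundantSeqNums(12, 1)) // [11]
	fmt.Println(redundantSeqNums(2, 3))  // wraps: [65535 0 1]
}
```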
However there are cases like the following sequence (with 1 packet
redundancy)
- Seq num 10, timestamp 2000, forwarded
- Seq num 11 is lost
- Seq num 12 has a redundant payload. Seq num 12 has timestamp of 4000.
Ideally would expect the redundant payload to have a timestamp offset
of 1000, so the redundant payload can be mapped to sequence number 11
and timestamp 3000 (4000 - 1000). But, in the problematic case, it has
an offset of 3000 resulting in sequence number 11 and timestamp of
1000 causing an inversion with packet at sequence number 10.
Unclear if this is a publisher issue, i.e. packing RED wrong, or if this
is some expected behaviour with DTX, i.e. DTX packets not being included
in the redundant payload. For example, the sequence
- Seq num 10 -> DTX
- Seq num 11 -> DTX -> lost
- Seq num 12 -> Regular packet and include sequence num 9 as that is the
last regular packet.
Anyhow, detect this condition and drop the time-inverted packet.
Note however this handles only inversion against the highest sent packet
sequence number and timestamp. So, some old packet inverted with some
other old packet getting forwarded will get through. That has been the
case always though and detecting that would be expensive and
complicated.
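The check against the highest sent packet can be sketched like this (names and structure are illustrative, not the server's actual code), using wrap-aware RTP comparisons:

```go
package main

import "fmt"

// isTimeInverted reports whether a packet's timestamp moves backwards while
// its sequence number moves forwards, relative to the highest forwarded
// packet. Comparisons are modulo-wrap: a seq delta under 2^15 means "newer",
// a ts delta of 2^31 or more means "older".
func isTimeInverted(highestSeq, seq uint16, highestTS, ts uint32) bool {
	seqAhead := seq-highestSeq < 1<<15 // seq at or ahead, modulo uint16 wrap
	tsBehind := ts-highestTS >= 1<<31  // ts strictly behind, modulo uint32 wrap
	return seqAhead && tsBehind
}

func main() {
	// Seq moved +1 but timestamp went back 960 (20 ms of Opus at 48 kHz):
	fmt.Println(isTimeInverted(11, 12, 4000, 3040)) // true
	// Normal forward motion:
	fmt.Println(isTimeInverted(11, 12, 3000, 3960)) // false
}
```

As the text notes, this only catches inversion against the highest sent packet; two old packets inverted against each other still get through.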
At least for egress, will also look at adding a check for inversion so
that it can be caught before sending down the gstreamer pipeline. As
the egress uses a jitter buffer with ordered sequence number emits, it
will be simpler to detect the timestamp going back when the sequence
number is moving forward (of course the mute/DTX challenge is there).
* Log time inversion between incoming packets
Logging of timestamp inversion within a RED packet did not show
anything. Log across packets. Not dropping till there is more evidence
of the cause.
* save
* comment
* Guard against timestamp inversion in RED -> Opus conversion.
Seeing timestamp inversion (sequence number is +1, but timestamp is
-960, i.e. 20ms) in the RED -> Opus conversion path. Not able to spot
any bugs in code. So, logging details upon detection and also dropping
the packet. If not dropped, downstream components like Egress treat it
as a big timestamp jump (because sequence number is moving forward) and
try to adjust pts which ends up causing drops.
* do not log time reversal at the start
* typo
When a subscriber disconnects, observer closures registered on the
publisher's TrackChangedNotifier and TrackRemovedNotifier were never
removed. These closures capture the SubscriptionManager, which holds
the ParticipantImpl, preventing the entire participant object graph
(PCTransport, SDPs, RTP stats, DownTracks) from being garbage collected.
In rooms with many participants that disconnect and reconnect frequently,
this causes unbounded memory growth proportional to the number of
disconnect events. The leaked memory is not recoverable while the room
remains open.
Clear notifiers in both handleSubscribedTrackClose (individual
subscription teardown) and SubscriptionManager.Close (full participant
teardown), matching the existing cleanup in handleSourceTrackRemoved.
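The leak shape and the fix can be sketched with a minimal keyed notifier (hypothetical names, not the actual TrackChangedNotifier API): observers must be removable by key so teardown can drop the closure, otherwise everything it captures stays reachable from the publisher:

```go
package main

import (
	"fmt"
	"sync"
)

// Notifier is a stand-in for a track-change notifier: observers are keyed
// by subscriber ID so teardown can remove exactly its own closure.
type Notifier struct {
	mu        sync.Mutex
	observers map[string]func()
}

func NewNotifier() *Notifier {
	return &Notifier{observers: make(map[string]func())}
}

func (n *Notifier) AddObserver(id string, fn func()) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.observers[id] = fn
}

// RemoveObserver is the fix: called from subscription teardown, it makes the
// closure (and the SubscriptionManager graph it captures) unreachable.
func (n *Notifier) RemoveObserver(id string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.observers, id)
}

func (n *Notifier) NumObservers() int {
	n.mu.Lock()
	defer n.mu.Unlock()
	return len(n.observers)
}

func main() {
	n := NewNotifier()
	n.AddObserver("subscriber-1", func() {})
	n.RemoveObserver("subscriber-1") // without this, the closure leaks
	fmt.Println(n.NumObservers())    // 0
}
```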
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>