Commit Graph

3455 Commits

Author SHA1 Message Date
Raja Subramanian
e0fbbef1cd Update pion/sctp with RFC9260 revert (#4110) 2025-11-27 15:32:10 +05:30
Raja Subramanian
f3c8091797 Try SCTP with read deadline to unblock abort. (#4109) 2025-11-27 13:18:06 +05:30
Raja Subramanian
bd5382daaa Splitting transport close timeout logs. (#4108)
After adding more fields in
https://github.com/livekit/livekit/pull/4105/files, it was not even
logging. Access to one of the added fields must have ended up waiting on
a lock and blocked.

Unfortunately, the deadlock fix in https://github.com/pion/ice/pull/840
did not address the peer connection close hang.

Splitting the logs so that the base log still happens. The order was
chosen by reading the code and guessing what could still log, to see if
we get more of the logs and learn more about the state and which lock is
the first to block.
2025-11-27 10:02:01 +05:30
Raja Subramanian
6d4154b8a7 Update pion/ice. (#4107)
Hopefully this solves the peer connection close hang.
2025-11-27 00:54:49 +05:30
Raja Subramanian
a6418ae219 Log more peer connection state on close timeout. (#4105) 2025-11-26 19:58:31 +05:30
Raja Subramanian
06d999748f Check for cancel on unsubscription/source track going away. (#4104) 2025-11-25 21:32:21 +05:30
Raja Subramanian
7f10e18bac Record join/publish/subscribe cancellations. (#4102)
To get a better picture of success/failure rate.
2025-11-25 14:06:02 +05:30
Raja Subramanian
402936324c Clear stereo=1 if stereo is not enabled. (#4101) 2025-11-24 21:31:56 +05:30
Raja Subramanian
70f6def39d Add checks for participant and sub-components close. (#4100)
* Add checks for participant and sub-components close.

Looks like there might be some memory leak with participant sessions not
getting closed properly. Adding checks (to be cleaned up later) to see
if there is a consistent place where things might hang.

* init with right type

* Remove unnecessary goroutine, thank you @milos-lk

* clean up
2025-11-24 18:07:33 +05:30
Raja Subramanian
ffbabcc772 Switch forwarding latency log to Debugw (#4098) 2025-11-23 11:22:10 +05:30
aleb_the_flash
27d82a724e Fix "address" typo in transport logs (addddress → address) (#4097)
Correct triple-d spelling of "address" field in transport logs.

I’m not sure whether this was intentional, but I noticed it
while creating Grafana queries and filters. This matters because
anyone filtering logs using the correct spelling may
unintentionally miss relevant data, leading to incomplete or
misleading analysis.
2025-11-22 21:30:02 +05:30
Raja Subramanian
37a06821e2 logger proto redaction. (#4090)
Unfortunately, this could not be used for twirp/analytics redaction.

Probably worth writing a proto clone utility which will filter out based
on tags.
2025-11-18 14:15:17 +05:30
cnderrauber
54cf7d46c8 Control latency of lossy data channel (#4088)
* Control latency of lossy data channel

* remove log

* test
2025-11-18 16:30:16 +08:00
Raja Subramanian
5175c1afa1 Lock x/tools at 0.37.0 (#4085) 2025-11-15 19:14:03 +05:30
Raja Subramanian
d510fff1e7 Downgrade x/tools to be able to make a release (#4084) v1.9.4 2025-11-15 18:56:22 +05:30
Raja Subramanian
c3ea5890d5 Prepare release v1.9.4. (#4083)
- Removing x/tools replace for release script to work. Will add it after
  the release.
2025-11-15 17:08:17 +05:30
Alex
3a128e61c1 protocol bump for SIP error mapping and validation (#4081) 2025-11-14 10:54:26 -08:00
Raja Subramanian
c3964ba2eb Use sync.Pool for objects in packet path. (#4066)
* Use sync.Pool for objects in packet path.

Seeing cases of forwarding latency spikes that align with GC.

This might be a bit overkill, but using sync.Pool for small +
short-lived objects in packet path.

Before this, all these were increasing in alloc_space heap profile
samples over time. With these, there is no increase (actually the lines
corresponding to getting from pool do not even show up in heap
accounting when doing `list` in `pprof`)

* merge

* Paul feedback
2025-11-14 16:13:23 +05:30
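The pooling described in c3964ba2eb can be sketched as follows. This is a minimal illustration, not the code from the change: `packetMeta`, `getPacketMeta` and `putPacketMeta` are hypothetical names standing in for whatever small, short-lived objects the packet path allocates.

```go
package main

import "sync"

// packetMeta is a hypothetical small, short-lived per-packet object.
type packetMeta struct {
	sequenceNumber uint16
	arrivalNanos   int64
}

// packetMetaPool recycles packetMeta values so the packet path does not
// allocate a fresh object (and create GC work) for every packet.
var packetMetaPool = sync.Pool{
	New: func() any { return &packetMeta{} },
}

// getPacketMeta returns a zeroed packetMeta, reused from the pool if possible.
func getPacketMeta() *packetMeta {
	return packetMetaPool.Get().(*packetMeta)
}

// putPacketMeta clears the object and hands it back to the pool.
// The caller must not touch m after this call.
func putPacketMeta(m *packetMeta) {
	*m = packetMeta{}
	packetMetaPool.Put(m)
}
```

Resetting on put keeps stale fields from leaking into the next packet. The pool gives no guarantee that an object is reused, so correctness must not depend on it.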
Raja Subramanian
f8b994d491 Forwarding latency measurement tweaks. (#4080)
* Forwarding latency measurement tweaks.

- prom transmission type public
- do not measure short term values as it is not used and saves some lock
  contention time in packet path potentially. Adding a separate method
  for that.
- Change latency/jitter summary reporting to `ns` also to match the
  histogram.

* add GetShortStats
2025-11-13 18:39:49 +05:30
Raja Subramanian
f4929f099e Revert "Revert pion/transport to v3.0.8 (#4073)" (#4074)
This reverts commit a04d9c48a5.
2025-11-12 13:10:05 +05:30
Raja Subramanian
a04d9c48a5 Revert pion/transport to v3.0.8 (#4073)
Will put this back after propagating some other changes without this
change as this is core and needs more soak time.
2025-11-12 12:49:28 +05:30
cnderrauber
2d5054ad01 kind details for connector (#4072) 2025-11-11 21:50:48 +08:00
Raja Subramanian
a272e28ae0 Log reason for subscriber not being able to determine codec. (#4071) 2025-11-11 16:42:42 +05:30
Raja Subramanian
b9b4eec991 Update pion/transport to v3.1.1 (#4070)
To pick up ping-pong batch write.
2025-11-11 10:55:01 +05:30
Paul Wells
b23d093c2f update protocol (#4069)
* update protocol

* deps
2025-11-09 19:42:08 -08:00
Raja Subramanian
4ce07bedeb Higher resolution forwarding latency histogram. (#4067)
* Higher resolution forwarding latency histogram.

Was using the average latency/jitter of last second to populate
forwarding latency/jitter histogram. But it is too coarse, i.e. the
average value of latency/jitter is very low and those summarised samples
end up in the lowest bucket always.

A few things to address it
- record per packet forwarding latency in histogram
- adjust histogram bins to include smaller values
- Drop jitter histogram

This is a per packet call, but prometheus histogram is supposedly
fast/light weight. Would be good to get better resolution histograms.
Hence doing this. Please let me know if there are performance concerns.

* typo

* one more typo
2025-11-09 17:29:40 +05:30
renovate[bot]
858db7ab7a fix(deps): update module github.com/livekit/protocol to v1.43.0 (#4015)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2025-11-08 01:50:43 -08:00
Raja Subramanian
1dc9b8fc5c Use buffered indicator to exclude from forwarding latency. (#4062)
* Debug high forwarding latency missing.

* log highest

* log condition

* update log

* log

* log

* change log

* Track start up delay.

Digging into forwarding latency, there are a few things
1. Seems to be caused by forwarding packets queued before bind. They
   would be in the queue till bind. There are two ways it is showing up
   a. Bind itself is delayed and releasing queued packets causes the
      high forwarding latency.
   b. There is a significant gap between bind and first packet being
      pulled off the queue to be forwarded, in one example 100ms.

(a) is understandable if the signalling delays things. Can drop these
packets without forwarding or indicate in the packet that it is a queued
packet and drop it from forwarding latency calculation. Dropping is
probably better as down stream components like egress will see a burst
in these situations.

(b) looks like go scheduling latency? Unsure.

Logging more to understand this better.

* log start

* Use buffered indicator to exclude from forwarding latency.

Buffered packets live in the queue for a while before Bind releases
them. They have high(ish) queuing latency and are not a true
representation of forwarding latency.
2025-11-07 21:46:14 +05:30
Raja Subramanian
f117ee511f Track start up delay. (#4061)
* Debug high forwarding latency missing.

* log highest

* log condition

* update log

* log

* log

* change log

* Track start up delay.

Digging into forwarding latency, there are a few things
1. Seems to be caused by forwarding packets queued before bind. They
   would be in the queue till bind. There are two ways it is showing up
   a. Bind itself is delayed and releasing queued packets causes the
      high forwarding latency.
   b. There is a significant gap between bind and first packet being
      pulled off the queue to be forwarded, in one example 100ms.

(a) is understandable if the signalling delays things. Can drop these
packets without forwarding or indicate in the packet that it is a queued
packet and drop it from forwarding latency calculation. Dropping is
probably better as down stream components like egress will see a burst
in these situations.

(b) looks like go scheduling latency? Unsure.

Logging more to understand this better.

* log start
2025-11-07 16:55:18 +05:30
Raja Subramanian
4872f2051d Return write count from WriteRTP. (#4059)
* Log write count atomic.

* Return write count from WriteRTP.

Apologies for the frequent changes on this. With relays, the down track
could write to several targets. So, use count to have an accurate
indication of how many subscribers were written to.
2025-11-06 13:29:21 +05:30
Raja Subramanian
d0ba46b460 Log write count atomic. (#4057) 2025-11-06 13:00:08 +05:30
Raja Subramanian
ae5fb7e882 Add packet to forwarding stats only if packet is forwarded. (#4056)
Packets not being forwarded were getting included in forwarding stats
calculation and skewing the measurement towards a smaller number.

The latency measurement does not include the batch IO of packets on
send. With a 2ms batching, that will add an average latency of 1ms.
2025-11-06 12:31:49 +05:30
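The 1ms figure in ae5fb7e882 follows from a short expectation calculation. Assuming (this is not stated in the commit) that a packet arrives uniformly at random at time $t$ within a batching interval of length $T$ and is sent when the interval ends, its wait $W = T - t$ averages to:

```latex
\mathbb{E}[W] = \frac{1}{T}\int_{0}^{T} (T - t)\,dt = \frac{T}{2},
\qquad T = 2\,\text{ms} \;\Rightarrow\; \mathbb{E}[W] = 1\,\text{ms}.
```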
Raja Subramanian
f6909192bb Update PsRPC to get redis pipeliner implementation. (#4055)
* Update PsRPC to get redis pipeliner implementation.

* clean up
2025-11-05 22:42:21 +05:30
Raja Subramanian
ca3c507b3f Prevent invalid track access while peer connection is shutting down. (#4054) 2025-11-05 17:48:27 +05:30
Raja Subramanian
9ca6ee0077 Use replace so that x/tools does not get overridden (#4048) 2025-11-02 17:58:01 +05:30
Anunay Maheshwari
b9323eab39 chore(deps): downgrade x/tools for counterfeiter (#4047) v1.9.3 2025-11-02 17:16:06 +05:30
Raja Subramanian
2f1e6c363c Prep release v1.9.3 (#4046) 2025-11-02 16:01:41 +05:30
Raja Subramanian
9d5c351d36 Fix prom units for forwarding latency/jitter. (#4045) 2025-11-02 14:38:25 +05:30
Raja Subramanian
e183657cff Add prom histogram for forwarding latency and jitter. (#4044)
* Add prom histogram for forwarding latency and jitter.

Using short term stats for histogram.

An example setting is
1s - short term
1m - long term

Using the 1s (short term) data for histogram. In that 1 second, all
packet forwarding latencies are averaged for latency and std. dev. of
the collection is used as jitter.

* try different staticcheck
2025-11-01 23:25:03 +05:30
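The short-term summary described in e183657cff (mean as latency, standard deviation as jitter) can be sketched as below. `latencyAndJitter` is a hypothetical helper, and using the population standard deviation is an assumption; the commit does not say which variant is used.

```go
package main

import "math"

// latencyAndJitter (hypothetical helper) summarises one short-term window of
// per-packet forwarding latencies, given in nanoseconds.
// latency is the mean of the samples; jitter is their population
// standard deviation.
func latencyAndJitter(samplesNs []float64) (latency, jitter float64) {
	if len(samplesNs) == 0 {
		return 0, 0
	}
	n := float64(len(samplesNs))

	var sum float64
	for _, s := range samplesNs {
		sum += s
	}
	latency = sum / n

	var sumSq float64
	for _, s := range samplesNs {
		d := s - latency
		sumSq += d * d
	}
	return latency, math.Sqrt(sumSq / n)
}
```

With a 1s window, each call yields one latency value and one jitter value to feed into the histogram.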
Trey Hakanson
1eefeb3089 Enable AbsCaptureTimeURI in RTC configuration (#4043)
Enable absolute capture time RTP extension. This logic was added a while back, but was disabled.
2025-10-31 09:42:36 +05:30
cnderrauber
075a7576ed Use simulcast codec as default policy for audio track (#4040) 2025-10-29 21:39:20 +08:00
cnderrauber
c264b504c4 Don't warn 0 payload type for PCMU (#4039) 2025-10-28 23:11:51 +08:00
Raja Subramanian
32fc35254e Broadcast cond var on RTX write. (#4038)
* Broadcast cond var on RTX write.

High forwarding latency logs all show high queuing delay so far. From
code inspection, RTX writes were not signaling the cond var. Not sure if
that is the reason, but adding a signal there for further tests.

* Remove return values from writeRTX as they are not used
2025-10-28 11:27:02 +05:30
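The condition-variable issue in 32fc35254e can be pictured with a small queue: a consumer waiting on a cond var only wakes if every producer path signals it. This is a sketch under assumed names (`packetQueue`, `enqueueRTP`, `enqueueRTX`, `pop`), not the actual down track code.

```go
package main

import "sync"

// packetQueue is a hypothetical queue drained by a single forwarding loop.
type packetQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items [][]byte
}

func newPacketQueue() *packetQueue {
	q := &packetQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// push appends a packet and wakes any waiting consumer.
func (q *packetQueue) push(pkt []byte) {
	q.mu.Lock()
	q.items = append(q.items, pkt)
	q.mu.Unlock()
	q.cond.Broadcast()
}

// enqueueRTP and enqueueRTX both go through push, so a retransmission
// wakes the consumer just like a regular packet does.
func (q *packetQueue) enqueueRTP(pkt []byte) { q.push(pkt) }
func (q *packetQueue) enqueueRTX(pkt []byte) { q.push(pkt) }

// pop blocks until a packet is available and returns the oldest one.
func (q *packetQueue) pop() []byte {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	pkt := q.items[0]
	q.items = q.items[1:]
	return pkt
}
```

If one producer path appended without calling `Broadcast`, a packet could sit in the queue until some other write happened to wake the consumer, which would show up as queuing delay.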
Raja Subramanian
061eb8b4e8 AddDownTrack to regressed codec after restarting forwarder. (#4037)
Without that the new codec was skipping through with old selector and
not working correctly.
2025-10-27 20:14:33 +05:30
Artur Melanchyk
c87eb8ed11 fix: add missing Unlock() in AddReceiver (#4036)
Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
Co-authored-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
2025-10-27 18:45:44 +05:30
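A missing `Unlock()` on an early return, as fixed in c87eb8ed11, is commonly avoided with `defer`. The sketch below shows that pattern on a hypothetical `receiverRegistry`; it is not the actual `AddReceiver` code, and the commit does not say whether the fix used `defer`.

```go
package main

import (
	"errors"
	"sync"
)

var errDuplicateReceiver = errors.New("receiver already added")

// receiverRegistry is a hypothetical stand-in for the type owning AddReceiver.
type receiverRegistry struct {
	mu        sync.Mutex
	receivers map[string]struct{}
}

// addReceiver registers id. The deferred Unlock runs on every return path,
// including the early error return, so the mutex can never be left held.
func (r *receiverRegistry) addReceiver(id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	if _, ok := r.receivers[id]; ok {
		return errDuplicateReceiver
	}
	if r.receivers == nil {
		r.receivers = make(map[string]struct{})
	}
	r.receivers[id] = struct{}{}
	return nil
}
```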
Matthew Brown
704449247e if RingingTimeout is provided, deadline should be set to that timeout. (#4018)
* if RingingTimeout is provided, deadline should be set to that timeout.

This is because the SIP bridge will not return until RingingTimeout
which may be longer than the 30 second default deadline.

* handle Deadline being "before" timeout.
2025-10-27 15:03:03 +02:00
Raja Subramanian
ab906d710c Prevent leakage of previous codec after codec regression. (#4035)
* Prevent leakage of previous codec after codec regression.

In the window between forwarder restart and determining codec, the old
codec packet could leak through. Prevent that by doing the restart and
codec determination atomically on a codec regression.

* tidy

* use locked function
2025-10-27 17:40:39 +05:30
Raja Subramanian
79b03f97a2 Log queueing latency when encountering high forwarding latency (#4034) 2025-10-27 15:27:03 +05:30
Raja Subramanian
29117b1422 set max layer in allocation (#4033) 2025-10-26 17:51:35 +05:30
Raja Subramanian
15b19ccd26 Remove ~ from rid which indicates disabled layer to get the actual rid (#4032) 2025-10-26 15:44:32 +05:30
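The rid clean-up in 15b19ccd26 amounts to stripping a leading `~`. A minimal sketch, with `actualRid` as a hypothetical helper name:

```go
package main

import "strings"

// actualRid (hypothetical helper) strips the leading "~" that marks a
// disabled layer and returns the underlying rid. A rid without the marker
// is returned unchanged.
func actualRid(rid string) string {
	return strings.TrimPrefix(rid, "~")
}
```

For example, `actualRid("~q")` and `actualRid("q")` both yield `q`, so the two forms can be matched against the same layer.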