Commit Graph

1110 Commits

Author SHA1 Message Date
Raja Subramanian 64c651431e Update mediatransportutil (#4115)
- New bucket API to pass in max packet size and sequence number offset
  and seequence number size generic type
- Move OWD estimator to mediatransportutil.
2025-11-28 21:51:53 +05:30
Raja Subramanian ffbabcc772 Switch forwarding latency log to Debugw (#4098) 2025-11-23 11:22:10 +05:30
cnderrauber 54cf7d46c8 Control latency of lossy data channel (#4088)
* Control latency of lossy data channel

* remove log

* test
2025-11-18 16:30:16 +08:00
Raja Subramanian c3964ba2eb Use sync.Pool for objects in packet path. (#4066)
* Use sync.Pool for objects in packet path.

Seeing cases of forwarding latency spikes that aling with GC.

This might be a bit overkill, but using sync.Pool for small +
short-lived objects in packet path.

Before this, all these were increasing in alloc_space heap profile
samples over time. With these, there is no increase (actually the lines
corresponding to geting from pool does not even show up in heap
accounting when doing `list` in `pprof`)

* merge

* Paul feedback
2025-11-14 16:13:23 +05:30
Raja Subramanian f8b994d491 Forwarding latency measurement tweaks. (#4080)
* Forwarding latency measurement tweaks.

- prom transmission type public
- do not measure short term values as it is not used and saves some lock
  contention time in packet path potentially. Adding a separate method
  for that.
- Change latency/jitter summary reporting to `ns` also to match the
  histogram.

* add GetShortStats
2025-11-13 18:39:49 +05:30
Raja Subramanian 4ce07bedeb Higher resolution forwarding latency histogram. (#4067)
* Higher resolution forwarding latency histogram.

Was using the average latency/jitter of last second to populate
forwarding latency/jitter histogram. But, it is too coarse, i. e. the
average value of latency/jitter is very low and those summarised samples
end up in the lowest bucket always.

A few things to address it
- record per packet forwarding latency in histogram
- adjust histogram bins to include smaller values
- Drop jitter histogram

This is a per packet call, but prometheus histogram is supposedly
fast/light weight. Would be good to get better resolution histograms.
Hence doing this. Please let me know if there are performance concerns.

* typo

* one more typo
2025-11-09 17:29:40 +05:30
Raja Subramanian 1dc9b8fc5c Use buffered indicator to exclude from forwarding latency. (#4062)
* Debug high forwarding latency missing.

* log highest

* log condition

* update log

* log

* log

* change log

* Track start up delay.

Digging into forwarding latency, there are a few things
1. Seems to be caused due to forwarding packets queued before bind. They
   would be in the queue till bind. There are two ways it is showing up
   a. Bind itself is delayed and releasing queued packets causes the
      high forwarding latency.
   b. There is a significant gap between bind and first packet being
      pulled off the queue to be forwarded, in one example 100ms.

(a) is understandable if the signalling delays things. Can drop these
packets without forwarding or indicate in the packet that it is a queued
packet and drop it from forwarding latency calculation. Dropping is
probably better as down stream components like egress will see a burst
in these situations.

(b) looks like go scheduling latency? Unsure.

Logging more to understand this better.

* log start

* Use buffered indicator to exclude from forwarding latency.

Buffered packets live the queue for a while before Bind releases them.
They have high(ish) queuing latency and not true representation of
forwarding latency.
2025-11-07 21:46:14 +05:30
Raja Subramanian f117ee511f Track start up delay. (#4061)
* Debug high forwarding latency missing.

* log highest

* log condition

* update log

* log

* log

* change log

* Track start up delay.

Digging into forwarding latency, there are a few things
1. Seems to be caused due to forwarding packets queued before bind. They
   would be in the queue till bind. There are two ways it is showing up
   a. Bind itself is delayed and releasing queued packets causes the
      high forwarding latency.
   b. There is a significant gap between bind and first packet being
      pulled off the queue to be forwarded, in one example 100ms.

(a) is understandable if the signalling delays things. Can drop these
packets without forwarding or indicate in the packet that it is a queued
packet and drop it from forwarding latency calculation. Dropping is
probably better as down stream components like egress will see a burst
in these situations.

(b) looks like go scheduling latency? Unsure.

Logging more to understand this better.

* log start
2025-11-07 16:55:18 +05:30
Raja Subramanian 4872f2051d Return write count from WriteRTP. (#4059)
* Log write count atomic.

* Return write count from WriteRTP.

Apologies for the frequent changes on this. With relays, the down track
could write to several targets. So, use count to have an accurate
indication of how may subscribers were written to.
2025-11-06 13:29:21 +05:30
Raja Subramanian d0ba46b460 Log write count atomic. (#4057) 2025-11-06 13:00:08 +05:30
Raja Subramanian ae5fb7e882 Add packet to forwarding stats only if packet is forwarded. (#4056)
Packets not being forwarded were getting included in forwarding stats
calculation and skewing the measurement towards a smaller number.

The latency measurement does not include the batch IO of packets on
send. With a 2ms batching, that will add an average latency of 1ms.
2025-11-06 12:31:49 +05:30
cnderrauber c264b504c4 Don't warn 0 payload type for PCMU (#4039) 2025-10-28 23:11:51 +08:00
Raja Subramanian 32fc35254e Broadcast cond var on RTX write. (#4038)
* Broadcast cond var on RTX write.

High forwarding latency logs all show high queuing delay so far. From
code inspection, RTX writes were not signaling the cond var. Not sure if
that is the reason, but adding a signal there for further tests.

* Remove return values from writeRTX as they are not used
2025-10-28 11:27:02 +05:30
Raja Subramanian 061eb8b4e8 AddDownTrack to regressed codec after restarting forwarder. (#4037)
Without that the new codec was skipping through with old selector and
not working correctly.
2025-10-27 20:14:33 +05:30
Raja Subramanian ab906d710c Prevent leakage of previous codec after codec regression. (#4035)
* Prevent leakage of previous codec after codec regression.

In the window between forwarder restart and determining codec, the old
codec packet could leak through. Prevent tha by doing the restart and
codec determination atomically on a codec regression.

* tidy

* use locked function
2025-10-27 17:40:39 +05:30
Raja Subramanian 79b03f97a2 Log queueing latency when encountering high forwarding latency (#4034) 2025-10-27 15:27:03 +05:30
Raja Subramanian 29117b1422 set max layer in allocation (#4033) 2025-10-26 17:51:35 +05:30
Raja Subramanian 34e16a8709 Check more conditions for opportunistic alloc. (#4031) 2025-10-26 14:03:26 +05:30
Raja Subramanian 81fbd3551a Use the optimal allocation function for opportunistic allocation. (#4030)
* Use the optimal allocation function for opportunistic allocation.

Allocation functions set the `lastAllocation` state also.
This might have been causing an e2e failure with v1 client on migration.

* annotate args
2025-10-26 00:27:41 +05:30
Raja Subramanian a2ce73e0d0 Do not bind buffer if codec is invalid. (#4028)
Seeing cases of codec with zero clock rate. Do not bind to those.
2025-10-25 14:30:30 +05:30
Raja Subramanian 5042c06cb2 Use rtp converter from protocol/utils/rtputil (#4020)
* Use rtp converter from protocol/utils/rtputil

* lock x/tools as counterfeiter needs it
2025-10-22 15:15:46 +05:30
Raja Subramanian 5a426d15e1 Use rtp converter from protocol/utils (#4019) 2025-10-22 14:09:33 +05:30
Alexey Sokolov c039769607 Issue #1 only: Fix spatial layer initialization in Forwarder (#4003)
When SetMaxSpatialLayer() is called with target/current layers in
InvalidLayerSpatial state, opportunistically initialize the target
layer to avoid dropped packets during async stream allocator
initialization.

Guards:
- Only sets target if not congestion-throttled (isDeficientLocked)
- Does not set current layer (deferred to keyframe-based forwarder start)
- Logs at Debug level to avoid log noise

This prevents undefined layer state during manual subscription
with immediate quality upgrades (WithAutoSubscribe(false) +
SetVideoQuality(HIGH)).
2025-10-21 12:54:05 +05:30
Raja Subramanian 2afbf0e8ca Some golang modernisation bits. (#4016)
Mainly doing this to check CI static check failures.
2025-10-21 12:53:18 +05:30
Raja Subramanian dd62eb0072 Resort to full search for requested quality is not available. (#4000)
When doing code changes for dynamic rid, inadventently relied on
ordering of quality in track info layers to pick the highest layer if
the requested quality is higher than available qualities.
@cnderrauber addressed it in
https://github.com/livekit/livekit/pull/3998. Just adding some more
robustness behind that by doing a full search when requested quality is
not available.

Tested using JS SDK demo app and picking different qualities from
subscriber side with adaptive streaming turned off.
2025-10-14 10:05:33 +05:30
Raja Subramanian f6ca82d177 Revert to using silence packets for audio dummy start. (#3999)
Effectively reverts https://github.com/livekit/livekit/pull/3984.
Using padding only packets for audio dummy start introduces dependencies
on other services and is not a necessary change. Would have been good to
use padding only for audio also from t=0. We can re-visit this for
better compatbility down the line.
2025-10-14 10:05:16 +05:30
Raja Subramanian a20bbe34fa Log RPC details. (#3991)
Seeing cases of `ConnectionTimeout` and `ResponseTimeout`.
So, logging destination identity in RPC request and also logging ACK and
response. Will pare back logs/log level of these messages after gettnig
some data.

Also a small change I noticed and had sitting in my local tree to set
the previous RTP marker on a padding packet.
2025-10-09 00:16:56 +05:30
Raja Subramanian 158496bca1 Increment RTP timestamp on padding when using dummy start. (#3989)
* Increment RTP timestamp on padding when using dummy start.

This allows things like egress to have proper sequence to start
the pipeline.

* test
2025-10-07 23:39:51 +05:30
Raja Subramanian 4f6ed65d61 Limit check to red + opus when looking for primary codec match. (#3988)
For codec regression, even if track is encrypted, should be able to fall
back to a backup codec and trigger a regression.
2025-10-07 23:28:26 +05:30
Raja Subramanian bf06596fcb Support Opus mixed with RED when encrypted. (#3986)
Even when encrypted, can set up opus as the second codec to support the
case of RED interspersed with Opus packets when the RED packet is too
big to fit in one packet.

The change here is to not go through all up stream codecs when trying to
find a match in DownTrack.Bind when source is encrypted. When encrypted,
the down track codec should match the primary upstream codec, i. e. the
codec at index 0.
2025-10-07 16:23:28 +05:30
Raja Subramanian 2a6adbe80e Use padding only packets for dummy start of audio. (#3984)
If egress does not need silence packets to start audio, this will
simplify dummy start by using the same mechanism for video and audio.
2025-10-07 10:11:15 +05:30
Raja Subramanian 01337ba730 Do not start forawarding on out-of-order packet. (#3985)
It is posible that a subscriber joins when a publisher has reconnected
and has received a flood of retransmitted packets due to NACKing the
gap caused by the publisher reconnecting. Starting on that spurt means
the subscriber gets a burst of unpaced packets that could lead to issues
with calculating render time (especially obvious in cases like egress).
2025-10-06 13:16:48 +05:30
Raja Subramanian 3bd20ddb28 Revert unintentional change to not handle transport fallback on (#3970)
publisher peer connection.

While cleaning up during single peer connection changes, unintentionally
removed handler.

Also, another small change to log first packet time adjustment after
increment.
2025-09-30 10:24:26 +05:30
Raja Subramanian 735c663adc Update protocol for EventKey helper. (#3963) 2025-09-29 11:42:18 +05:30
Raja Subramanian 0bf7b178eb avoid logging on small values (#3958) 2025-09-28 10:46:41 +05:30
Raja Subramanian 00ff2ab941 Adjust for hold time when fowarding RTCP report. (#3956)
* Adjust for hold time when fowarding RTCP report.

When passing through RTCP sender report, holding it for some time before
sending means the remote receiver could see varying amount of
propagation delay if the remote uses something like local_clock -
ntp_sender_report_time and adapting to it.

Ideally, SFU should just forward RTCP Sender Report, but the current pull model to
group RTCP sender reports makes it a bigger change. So, adjust it by
hold time.

Also add a initial condition for one-way-delay estimator which can init
with a smaller value of latency if the first sample to measure
one-way-delay itself experienced higher delay than the prevailing
conditions.

* variable name

* log as duration
2025-09-26 18:57:21 +05:30
Raja Subramanian bfba6feed4 Adjust stream allocator ping interval based on state. (#3951)
* Adjust stream allocator ping interval based on state.

In steady state, does a 15 second ping.
While deficient, to be able to react to probes faster, it pings at 100ms
interval.

* clean up

* log ops queue not able to wake up
2025-09-24 14:45:57 +05:30
Raja Subramanian 49f9b9c8bd Flush stats when there are no packets. (#3947)
With no packets flowing through, the stat gets stuck.
Flush the pipe if there have been no packets in the report interval.
2025-09-23 16:57:41 +05:30
Raja Subramanian e6a3df1edc ForwarStats.GetStats needs to be public (#3946)
* ForwarStats.GetStats needs to be public

* prevent deadlock
2025-09-23 15:46:12 +05:30
Raja Subramanian 824d116bfe Tweaks tresholds for logging high forwarding latency/jitter. (#3945)
* Tweaks tresholds for logging high forwarding latency/jitter.

Previous attempt showed skewed jitter (i. e. more than 10x latency),
But, no large latency.

So, reducing the latency treshold to declare high latency.
And also keeping track of lowest/highest per reporting window and
logging those along with short term and long term measurements.

NOTE: previously short term and long term were separate calls with locks
acquired. Now, it is all in one lock. So, it does increase the lock
duration a bit, but hopefully not by too much as the welford merge for
short term would go over 20 samples (at 50 ms sampling interval and 1 s
reporting window).

* revert skew factor
2025-09-23 14:46:43 +05:30
Raja Subramanian 408492e030 Log some information around high forwarding latency. (#3944)
* Log some information around high forwarding latency.

Latency is not 0 after switching to microseconds resolution.
But, still seeing high jitter. Logging a bit more to understand under
what conditions it happens.

More notes inline.

* compact
2025-09-23 12:37:09 +05:30
Raja Subramanian 6a41fae548 Use microseconds for forwarding stats. (#3943)
Latency is always 0, but jitter is high.
Not sure how that happens as latency is the welford mean and jitter is
welford standard deviation. Feels like some mis-labeling.

Anyhow, switching to microseconds units to get better resolution.
2025-09-23 02:28:19 +05:30
Raja Subramanian b07e7a3828 Use difference in key frame counter to stop seeder. (#3936)
The key frame seeder could be started multiple times.
Use difference to detect stop condition.
2025-09-19 15:15:26 +05:30
Raja Subramanian 56fb28858a Do DD restart only if DD structure is present. (#3935) 2025-09-19 02:39:08 +05:30
Raja Subramanian 86facce9f4 More debugging of DD jump (#3934) 2025-09-19 01:29:28 +05:30
Raja Subramanian 6058a3f622 Add debugging from DD frame number wrap around. (#3933)
* Add debugging from DD frame number wrap around.

On a DD parser restart, the extended highest sequence number oes not
seem to be updated. Adding some debug to understand it better.

* more logs

* log incoming sequence number and frame number
2025-09-19 00:17:45 +05:30
Raja Subramanian 6489237e33 Simulcast audio fixes (#3925)
* Simulcast audio fixes

* clean up
2025-09-14 09:41:40 +05:30
Raja Subramanian eee2001a31 Set publisher codec preferences after setting remote description (#3913)
* Set publisher codec preferences after setting remote description

Munging SDP prior to setting remote description was becoming problematic
in single peer connection mode. In that mode, it is possible that a
subscribe track m-section is added which sets the fmtp of H.265 to a
value that is different from when that client publishes. That gets
locked in as negotiated codecs when pion processes remote description.
Later when the client publishes H.265, the H.265 does only partial
match. So, if we munge offer and send it to SetRemoteDescription, the
H.265 does only a partial match due to different fmtp line and that gets
put at the end of the list. So, the answer does not enforce the
preferred codec. Changing pion to put partial match up front is more
risky given other projects. So, switch codec preferences to after remote
description is set and directly operate on transceiver which is a better
place to make these changes without munging SDP.

This fixes the case of
- firefox joins first
- Chrome preferring H.265 joining next. This causes a subscribe track
  m-section (for firefox's tracks) to be created first. So, the
  preferred codec munging was not working. Works after this change.

* clean up

* mage generate

* test

* clean up
2025-09-10 18:28:36 +05:30
cnderrauber 76645fad5e Rpcs for ingress proxy WHIP (#3911)
See https://github.com/livekit/protocol/pull/1194
2025-09-09 22:49:42 +08:00
Raja Subramanian 991a4a4f53 Refactor subscribedTrack + mediaTrackSubscriptions. (#3908)
- Move downTrack instantiation to SubscribedTrack as it should own that
  DownTrack. Still more to do here as `DownTrack` is fetched from
  `SubscribedTrack` in a few places and used. Would like to avoid that,
  but doing this initially.
- Use an interface from sfu.Downtrack and replace a bunch of callbacks.
  SubscribedTrack is the implementation for DownTrackListener.
2025-09-08 18:20:19 +05:30