- New bucket API to pass in max packet size, sequence number offset,
and a sequence number size generic type
- Move OWD estimator to mediatransportutil.
* Use sync.Pool for objects in packet path.
Seeing cases of forwarding latency spikes that align with GC.
This might be a bit of overkill, but using sync.Pool for small,
short-lived objects in the packet path.
Before this, all these allocations were increasing in alloc_space heap
profile samples over time. With these changes, there is no increase
(the lines corresponding to getting from the pool do not even show up
in heap accounting when doing `list` in `pprof`).
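A minimal sketch of the pattern, assuming a small per-packet object; the type
and helper names here are illustrative, not the actual ones in the packet path:

```go
package sfu

import "sync"

// extPacket stands in for a small, short-lived object allocated per packet.
type extPacket struct {
	payload [1500]byte
	size    int
}

var extPacketPool = sync.Pool{
	New: func() any { return &extPacket{} },
}

func getExtPacket() *extPacket {
	return extPacketPool.Get().(*extPacket)
}

func putExtPacket(p *extPacket) {
	p.size = 0 // reset before returning to the pool
	extPacketPool.Put(p)
}
```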
* merge
* Paul feedback
* Forwarding latency measurement tweaks.
- make prom transmission type public
- do not measure short term values as they are not used; this potentially
saves some lock contention time in the packet path. Adding a separate
method for that.
- Change latency/jitter summary reporting to `ns` as well to match the
histogram.
* add GetShortStats
* Higher resolution forwarding latency histogram.
Was using the average latency/jitter of the last second to populate the
forwarding latency/jitter histogram. But that is too coarse, i.e. the
average value of latency/jitter is very low and those summarised samples
always end up in the lowest bucket.
A few things to address it
- record per packet forwarding latency in the histogram
- adjust histogram bins to include smaller values
- drop the jitter histogram
This is a per packet call, but the prometheus histogram is supposedly
fast/lightweight. It would be good to get better resolution histograms,
hence doing this. Please let me know if there are performance concerns.
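A sketch of the shape of the change, assuming the prometheus Go client; the
metric name and the exact bucket boundaries here are illustrative:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Buckets extended downwards so per-packet latencies (often well under a
// millisecond) do not all collapse into the lowest bucket. Values are in ns.
var forwardLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "livekit",
	Name:      "forward_latency_ns",
	Buckets:   []float64{1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8},
})

func init() {
	prometheus.MustRegister(forwardLatency)
}

// RecordForwardLatency is called once per forwarded packet.
func RecordForwardLatency(d time.Duration) {
	forwardLatency.Observe(float64(d.Nanoseconds()))
}
```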
* typo
* one more typo
* Debug high forwarding latency missing.
* log highest
* log condition
* update log
* log
* log
* change log
* Track start up delay.
Digging into forwarding latency, a few observations
1. It seems to be caused by forwarding packets queued before bind. They
   would sit in the queue till bind. There are two ways it shows up
   a. Bind itself is delayed and releasing queued packets causes the
      high forwarding latency.
   b. There is a significant gap between bind and the first packet being
      pulled off the queue to be forwarded, in one example 100ms.
(a) is understandable if signalling delays things. Can drop these
packets without forwarding, or mark the packet as a queued packet and
drop it from the forwarding latency calculation. Dropping is probably
better as downstream components like egress will see a burst in these
situations.
(b) looks like go scheduling latency? Unsure.
Logging more to understand this better.
* log start
* Use buffered indicator to exclude from forwarding latency.
Buffered packets live in the queue for a while before Bind releases
them. They have high(ish) queuing latency and are not a true
representation of forwarding latency.
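A minimal sketch of the idea; the field and function names are hypothetical:

```go
package sfu

import "time"

// packetMeta stands in for whatever per-packet state the queue carries.
// buffered marks packets that were queued before Bind.
type packetMeta struct {
	arrivalTime time.Time
	buffered    bool
}

// observeForwardingLatency excludes buffered packets: their queuing delay is
// dominated by how long Bind took, not by the forwarding path itself.
func observeForwardingLatency(meta packetMeta, observe func(time.Duration)) {
	if meta.buffered {
		return
	}
	observe(time.Since(meta.arrivalTime))
}
```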
* Log write count atomic.
* Return write count from WriteRTP.
Apologies for the frequent changes on this. With relays, the down track
could write to several targets. So, use a count to have an accurate
indication of how many subscribers were written to.
Packets not being forwarded were getting included in the forwarding
stats calculation and skewing the measurement towards a smaller number.
The latency measurement does not include the batched IO of packets on
send. With 2ms batching, that will add an average latency of 1ms.
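A sketch of the shape of the API; the types and the `writers` field are
illustrative stand-ins, not the actual livekit DownTrack:

```go
package sfu

import "github.com/pion/rtp"

// rtpWriter and DownTrackSketch are hypothetical stand-ins.
type rtpWriter interface {
	WriteRTP(pkt *rtp.Packet) error
}

type DownTrackSketch struct {
	writers []rtpWriter // with relays, there can be several write targets
}

// WriteRTP returns how many targets were actually written to, so forwarding
// stats only count packets that were really forwarded.
func (d *DownTrackSketch) WriteRTP(pkt *rtp.Packet) (int, error) {
	written := 0
	var lastErr error
	for _, w := range d.writers {
		if err := w.WriteRTP(pkt); err != nil {
			lastErr = err
			continue
		}
		written++
	}
	return written, lastErr
}
```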
* Broadcast cond var on RTX write.
High forwarding latency logs all show high queuing delay so far. From
code inspection, RTX writes were not signaling the cond var. Not sure if
that is the reason, but adding a signal there for further tests.
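A sketch of the fix with illustrative names (a queue whose consumer waits on a
sync.Cond); the real send queue differs:

```go
package sfu

import "sync"

type sendQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items [][]byte
}

func newSendQueue() *sendQueue {
	q := &sendQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// pushRTX enqueues a retransmission and wakes the consumer goroutine;
// the wake-up was previously missing on the RTX write path.
func (q *sendQueue) pushRTX(p []byte) {
	q.mu.Lock()
	q.items = append(q.items, p)
	q.cond.Broadcast()
	q.mu.Unlock()
}
```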
* Remove return values from writeRTX as they are not used
* Prevent leakage of previous codec after codec regression.
In the window between forwarder restart and determining the codec, an old
codec packet could leak through. Prevent that by doing the restart and
codec determination atomically on a codec regression.
* tidy
* use locked function
* Use the optimal allocation function for opportunistic allocation.
Allocation functions set the `lastAllocation` state also.
This might have been causing an e2e failure with v1 client on migration.
* annotate args
When SetMaxSpatialLayer() is called with target/current layers in
InvalidLayerSpatial state, opportunistically initialize the target
layer to avoid dropped packets during async stream allocator
initialization.
Guards:
- Only sets target if not congestion-throttled (isDeficientLocked)
- Does not set current layer (deferred to keyframe-based forwarder start)
- Logs at Debug level to avoid log noise
This prevents undefined layer state during manual subscription
with immediate quality upgrades (WithAutoSubscribe(false) +
SetVideoQuality(HIGH)).
When doing code changes for dynamic rid, inadvertently relied on the
ordering of qualities in track info layers to pick the highest layer if
the requested quality is higher than the available qualities.
@cnderrauber addressed it in
https://github.com/livekit/livekit/pull/3998. Just adding some more
robustness behind that by doing a full search when the requested quality
is not available.
Tested using JS SDK demo app and picking different qualities from
subscriber side with adaptive streaming turned off.
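A sketch of the fallback search; the quality representation here is an
illustrative int32 level (higher is better), not the actual track info types:

```go
package sfu

// selectQuality does a full search over the published layers instead of
// relying on their ordering in the track info. If the requested quality is
// not published, it falls back to the highest available layer below it.
func selectQuality(requested int32, available []int32) int32 {
	best := int32(-1) // invalid / not found
	for _, q := range available {
		if q == requested {
			return q
		}
		if q < requested && q > best {
			best = q // highest available layer below the request
		}
	}
	return best
}
```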
Effectively reverts https://github.com/livekit/livekit/pull/3984.
Using padding only packets for audio dummy start introduces dependencies
on other services and is not a necessary change. It would have been good
to use padding only packets for audio also from t=0. We can re-visit
this for better compatibility down the line.
Seeing cases of `ConnectionTimeout` and `ResponseTimeout`.
So, logging the destination identity in the RPC request and also logging
the ACK and response. Will pare back the logs/log level of these messages
after getting some data.
Also a small change I noticed and had sitting in my local tree to set
the previous RTP marker on a padding packet.
Even when encrypted, can set up opus as the second codec to support the
case of RED interspersed with Opus packets when the RED packet is too
big to fit in one packet.
The change here is to not go through all upstream codecs when trying to
find a match in DownTrack.Bind when the source is encrypted. When
encrypted, the down track codec should match the primary upstream codec,
i.e. the codec at index 0.
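A simplified sketch of that match, assuming pion codec parameter types; the
function and the mime-type-only comparison are illustrative, not the actual
Bind logic:

```go
package sfu

import (
	"strings"

	"github.com/pion/webrtc/v4"
)

// matchCodec skips the full scan when the source is encrypted: only the
// primary upstream codec (index 0) can be a valid match in that case.
func matchCodec(encrypted bool, upstream []webrtc.RTPCodecParameters, want webrtc.RTPCodecParameters) (webrtc.RTPCodecParameters, bool) {
	if encrypted {
		if len(upstream) > 0 && strings.EqualFold(upstream[0].MimeType, want.MimeType) {
			return upstream[0], true
		}
		return webrtc.RTPCodecParameters{}, false
	}
	for _, c := range upstream {
		if strings.EqualFold(c.MimeType, want.MimeType) {
			return c, true
		}
	}
	return webrtc.RTPCodecParameters{}, false
}
```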
It is possible that a subscriber joins when a publisher has reconnected
and has received a flood of retransmitted packets due to NACKing the
gap caused by the publisher reconnecting. Starting on that spurt means
the subscriber gets a burst of unpaced packets that could lead to issues
with calculating render time (especially obvious in cases like egress).
publisher peer connection.
While cleaning up during the single peer connection changes, the handler
was unintentionally removed.
Also, another small change to log first packet time adjustment after
increment.
* Adjust for hold time when forwarding RTCP report.
When passing through an RTCP sender report, holding it for some time
before sending means the remote receiver could see a varying amount of
propagation delay if the remote uses something like local_clock -
ntp_sender_report_time and adapts to it.
Ideally, the SFU should just forward the RTCP Sender Report, but the
current pull model to group RTCP sender reports makes that a bigger
change. So, adjust it by the hold time.
Also add an initial condition for the one-way-delay estimator which can
init with a smaller value of latency if the first sample used to measure
one-way-delay itself experienced higher delay than the prevailing
conditions.
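A sketch of the hold-time adjustment, assuming the standard 64-bit NTP
timestamp layout in RTCP sender reports (seconds in the upper 32 bits,
fraction in the lower 32); the function name is illustrative:

```go
package sfu

import "time"

// adjustNTPForHold advances a 64-bit NTP timestamp by the time the sender
// report was held before being forwarded, so the remote's
// local_clock - sr_ntp_time propagation estimate is not inflated by the hold.
func adjustNTPForHold(ntp uint64, hold time.Duration) uint64 {
	// Convert the hold duration to NTP fixed point: seconds in the upper
	// 32 bits, fractional seconds in the lower 32 bits.
	sec := uint64(hold / time.Second)
	frac := uint64(hold%time.Second) << 32 / uint64(time.Second)
	return ntp + sec<<32 + frac
}
```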
* variable name
* log as duration
* Adjust stream allocator ping interval based on state.
In steady state, it pings every 15 seconds.
While deficient, to be able to react to probes faster, it pings at a
100ms interval.
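A sketch of the interval selection; the constants mirror the description above,
but the actual state checks in the stream allocator differ:

```go
package streamallocator

import "time"

// pingInterval picks the stream allocator ping interval based on whether the
// allocation is currently deficient (congestion-throttled).
func pingInterval(deficient bool) time.Duration {
	if deficient {
		// react to probe results quickly while deficient
		return 100 * time.Millisecond
	}
	// steady state
	return 15 * time.Second
}
```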
* clean up
* log ops queue not able to wake up
* Tweak thresholds for logging high forwarding latency/jitter.
The previous attempt showed skewed jitter (i.e. more than 10x latency),
but no large latency.
So, reducing the latency threshold used to declare high latency.
Also keeping track of the lowest/highest per reporting window and
logging those along with short term and long term measurements.
NOTE: previously short term and long term were separate calls with locks
acquired. Now, it is all in one lock. So, it does increase the lock
duration a bit, but hopefully not by too much as the welford merge for
the short term would go over about 20 samples (at a 50 ms sampling
interval and 1 s reporting window).
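For reference, a minimal Welford accumulator with the parallel merge of the
kind described; this is illustrative, not the exact struct used:

```go
package stats

import "math"

// welford tracks a running mean and variance; Merge uses the Chan et al.
// parallel combination formula.
type welford struct {
	n    float64
	mean float64
	m2   float64
}

// Update folds in a single sample.
func (w *welford) Update(x float64) {
	w.n++
	d := x - w.mean
	w.mean += d / w.n
	w.m2 += d * (x - w.mean)
}

// Merge folds another accumulator (e.g. a short-term window of ~20 samples
// at a 50ms sampling interval over a 1s reporting window) into this one.
func (w *welford) Merge(o welford) {
	if o.n == 0 {
		return
	}
	n := w.n + o.n
	d := o.mean - w.mean
	w.mean += d * o.n / n
	w.m2 += o.m2 + d*d*w.n*o.n/n
	w.n = n
}

// StdDev is the jitter (standard deviation) of the accumulated samples.
func (w *welford) StdDev() float64 {
	if w.n < 2 {
		return 0
	}
	return math.Sqrt(w.m2 / w.n)
}
```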
* revert skew factor
* Log some information around high forwarding latency.
Latency is not 0 after switching to microseconds resolution.
But, still seeing high jitter. Logging a bit more to understand under
what conditions it happens.
More notes inline.
* compact
Latency is always 0, but jitter is high.
Not sure how that happens as latency is the welford mean and jitter is
the welford standard deviation. Feels like some mis-labeling.
Anyhow, switching to microsecond units to get better resolution.
* Add debugging for DD frame number wrap around.
On a DD parser restart, the extended highest sequence number does not
seem to be updated. Adding some debug to understand it better.
* more logs
* log incoming sequence number and frame number
* Set publisher codec preferences after setting remote description
Munging SDP prior to setting the remote description was becoming
problematic in single peer connection mode. In that mode, it is possible
that a subscribe track m-section is added which sets the fmtp of H.265
to a value that is different from the one the client uses when it
publishes. That gets locked in as the negotiated codecs when pion
processes the remote description. Later, when the client publishes
H.265, the munged offer sent to SetRemoteDescription produces only a
partial match for H.265 due to the different fmtp line, and the codec
gets put at the end of the list. So, the answer does not enforce the
preferred codec. Changing pion to put partial matches up front is more
risky given other projects. So, switch setting codec preferences to
after the remote description is set and operate directly on the
transceiver, which is a better place to make these changes without
munging SDP.
This fixes the case of
- Firefox joins first
- Chrome, preferring H.265, joins next. This causes a subscribe track
  m-section (for Firefox's tracks) to be created first. So, the
  preferred codec munging was not working. Works after this change.
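The shape of the change with pion, as a hedged sketch; error handling is
minimal, the function name is hypothetical, and the real code only touches the
relevant publisher transceiver(s) and builds the preference list itself:

```go
package rtc

import "github.com/pion/webrtc/v4"

// setPublisherCodecPreferences applies codec preferences on the transceivers
// after the remote description has been processed, instead of munging the SDP
// before SetRemoteDescription.
func setPublisherCodecPreferences(pc *webrtc.PeerConnection, offer webrtc.SessionDescription, preferred []webrtc.RTPCodecParameters) error {
	if err := pc.SetRemoteDescription(offer); err != nil {
		return err
	}
	for _, tr := range pc.GetTransceivers() {
		// In the real code, only the publisher's transceiver for the
		// relevant mid/track would be touched.
		if err := tr.SetCodecPreferences(preferred); err != nil {
			return err
		}
	}
	return nil
}
```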
* clean up
* mage generate
* test
* clean up
- Move downTrack instantiation to SubscribedTrack as it should own that
DownTrack. Still more to do here as `DownTrack` is fetched from
`SubscribedTrack` in a few places and used. Would like to avoid that,
but doing this initially.
- Use an interface from sfu.DownTrack and replace a bunch of callbacks.
SubscribedTrack is the implementation of DownTrackListener.