* Debug high forwarding latency missing.
* log highest
* log condition
* update log
* log
* log
* change log
* Track start up delay.
Digging into forwarding latency, there are a few things:
1. It seems to be caused by forwarding packets that were queued before bind. They
stay in the queue until bind. It shows up in two ways:
a. Bind itself is delayed, and releasing the queued packets causes the
high forwarding latency.
b. There is a significant gap between bind and the first packet being
pulled off the queue to be forwarded, 100ms in one example.
(a) is understandable if signalling delays things. We can either drop these
packets without forwarding them, or mark each one as a queued
packet and exclude it from the forwarding latency calculation. Dropping is
probably better, as downstream components like egress will see a burst
in these situations.
(b) looks like Go scheduling latency? Unsure.
Logging more to understand this better.
* log start
* Use buffered indicator to exclude from forwarding latency.
Buffered packets live in the queue for a while before Bind releases them.
They have high(ish) queuing latency and are not a true representation of
forwarding latency.
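
A minimal sketch of the idea, not the actual implementation: a latency accumulator that skips samples flagged as buffered. The `latencyStats` type and its methods are hypothetical names used only for illustration.

```go
package main

import "time"

// latencyStats accumulates forwarding latency samples.
// Hypothetical type, used only to illustrate the exclusion.
type latencyStats struct {
	count int
	total time.Duration
}

// observe records one latency sample unless the packet was buffered
// (queued while waiting for Bind). It reports whether the sample was recorded.
func (s *latencyStats) observe(latency time.Duration, buffered bool) bool {
	if buffered {
		// Queuing delay before Bind is not forwarding latency, so skip it.
		return false
	}
	s.count++
	s.total += latency
	return true
}

// mean returns the average recorded latency, or 0 when nothing was recorded.
func (s *latencyStats) mean() time.Duration {
	if s.count == 0 {
		return 0
	}
	return s.total / time.Duration(s.count)
}
```

With this shape, a 100ms sample from a packet released by Bind leaves the mean untouched, while regular packets keep contributing.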
* Log write count atomic.
* Return write count from WriteRTP.
Apologies for the frequent changes on this. With relays, the down track
could write to several targets. So, use count to have an accurate
indication of how many subscribers were written to.
Packets not being forwarded were getting included in forwarding stats
calculation and skewing the measurement towards a smaller number.
The latency measurement does not include the batch IO of packets on
send. With a 2ms batching, that will add an average latency of 1ms.
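
As a rough illustration of returning a write count and keeping unforwarded packets out of the stats, here is a sketch under stated assumptions. The `target` interface, `writeAll`, and `forwardingStats` are hypothetical and do not mirror the real `WriteRTP` signature.

```go
package main

import "errors"

// errClosed is a sample error a target may return (hypothetical).
var errClosed = errors.New("target closed")

// target is a hypothetical write destination, e.g. a subscriber or a relay.
type target interface {
	Write(payload []byte) error
}

// targetFunc adapts a plain function to the target interface.
type targetFunc func(payload []byte) error

func (f targetFunc) Write(payload []byte) error { return f(payload) }

// writeAll writes payload to every target and returns how many writes succeeded.
func writeAll(targets []target, payload []byte) int {
	written := 0
	for _, t := range targets {
		if err := t.Write(payload); err == nil {
			written++
		}
	}
	return written
}

// forwardingStats counts only packets that reached at least one target.
type forwardingStats struct {
	packets int // packets forwarded to one or more targets
	writes  int // total successful target writes
}

// record adds one packet's result; packets with zero writes are ignored
// so they do not skew the measurement.
func (s *forwardingStats) record(written int) {
	if written == 0 {
		return
	}
	s.packets++
	s.writes += written
}
```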
* Add prom histogram for forwarding latency and jitter.
Using short term stats for histogram.
An example setting is
1s - short term
1m - long term
Using the 1s (short term) data for histogram. In that 1 second, all
packet forwarding latencies are averaged for latency and std. dev. of
the collection is used as jitter.
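
The per-window computation can be sketched as below. This is an illustration only: the function name is hypothetical, and the use of the population (not sample) standard deviation is an assumption, since the text above does not say which is used.

```go
package main

import "math"

// shortTermStats returns the mean latency and the standard deviation
// (reported as jitter) of the samples collected in one short-term window.
// Assumption: population standard deviation; units are whatever the
// caller uses for samples (for example nanoseconds).
func shortTermStats(samples []float64) (mean, jitter float64) {
	if len(samples) == 0 {
		return 0, 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean = sum / float64(len(samples))

	var sqDiff float64
	for _, s := range samples {
		d := s - mean
		sqDiff += d * d
	}
	jitter = math.Sqrt(sqDiff / float64(len(samples)))
	return mean, jitter
}
```

Each window then yields one latency value and one jitter value, which are the two observations fed to the histograms.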
* try different staticcheck
* Broadcast cond var on RTX write.
High forwarding latency logs all show high queuing delay so far. From
code inspection, RTX writes were not signaling the cond var. Not sure if
that is the reason, but adding a signal there for further tests.
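
For context, a minimal sketch of the pattern in question, using `sync.Cond` from the standard library. The `queue` type is hypothetical; the point is that every write path (regular or RTX) must signal after enqueueing, or a waiting reader can sleep past available data.

```go
package main

import "sync"

// queue is a hypothetical packet queue drained by a forwarding goroutine.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// push appends an item and wakes all waiters. Routing both regular and
// RTX writes through one function avoids a path that forgets to signal.
func (q *queue) push(item string) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.cond.Broadcast()
}

// pop blocks until an item is available, then removes and returns it.
func (q *queue) pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait() // releases mu while waiting, reacquires on wake
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}
```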
* Remove return values from writeRTX as they are not used
* if RingingTimeout is provided, deadline should be set to that timeout.
This is because the SIP bridge will not return until RingingTimeout
which may be longer than the 30 second default deadline.
* handle Deadline being "before" timeout.
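
One way to read the two changes above, as a sketch only: the function name, the parameters, and the "extend when the deadline falls before the ringing timeout" rule are assumptions, not the actual code.

```go
package main

import "time"

// defaultDeadline is the 30 second default mentioned above.
const defaultDeadline = 30 * time.Second

// effectiveDeadline returns how long to wait for the SIP bridge.
// requested is a caller-supplied deadline (0 means unset) and
// ringingTimeout is 0 when not provided. Hypothetical helper.
func effectiveDeadline(requested, ringingTimeout time.Duration) time.Duration {
	d := requested
	if d <= 0 {
		d = defaultDeadline
	}
	// If the deadline would expire before ringing can time out,
	// extend it so the bridge has a chance to return.
	if ringingTimeout > d {
		d = ringingTimeout
	}
	return d
}
```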
* Prevent leakage of previous codec after codec regression.
In the window between forwarder restart and determining codec, the old
codec packet could leak through. Prevent that by doing the restart and
codec determination atomically on a codec regression.
* tidy
* use locked function
* Use the optimal allocation function for opportunistic allocation.
Allocation functions set the `lastAllocation` state also.
This might have been causing an e2e failure with v1 client on migration.
* annotate args
When SetMaxSpatialLayer() is called with target/current layers in
InvalidLayerSpatial state, opportunistically initialize the target
layer to avoid dropped packets during async stream allocator
initialization.
Guards:
- Only sets target if not congestion-throttled (isDeficientLocked)
- Does not set current layer (deferred to keyframe-based forwarder start)
- Logs at Debug level to avoid log noise
This prevents undefined layer state during manual subscription
with immediate quality upgrades (WithAutoSubscribe(false) +
SetVideoQuality(HIGH)).
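
The guards can be sketched roughly as follows. Everything here is a simplified, hypothetical model: the real code holds a lock, logs at Debug level, and picks the target through the optimal allocation function, whereas this sketch simply uses the new maximum as the target.

```go
package main

// invalidLayerSpatial marks an unset spatial layer (value assumed for this sketch).
const invalidLayerSpatial = -1

// forwarder is a hypothetical, lock-free simplification of the forwarder state.
type forwarder struct {
	maxSpatial     int
	targetSpatial  int
	currentSpatial int
	deficient      bool // true when congestion-throttled
}

// setMaxSpatialLayer updates the max layer and, when both target and
// current layers are still invalid, opportunistically sets the target.
// It reports whether the target was initialized.
func (f *forwarder) setMaxSpatialLayer(max int) bool {
	f.maxSpatial = max
	if f.targetSpatial != invalidLayerSpatial || f.currentSpatial != invalidLayerSpatial {
		return false // already initialized, leave allocation to the allocator
	}
	if f.deficient {
		return false // do not upgrade while congestion-throttled
	}
	// Only the target is set; the current layer is deferred until a
	// keyframe starts the forwarder.
	f.targetSpatial = max
	return true
}
```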
Pion does not protect the stats getter, and using it after close could
cause a nil dereference. Do a couple of things:
1. Stop the timer that accesses peer connection stats before closing the peer
connection.
2. Do not access stats if the peer connection is already closed.
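
A sketch of both steps, assuming a wrapper that owns the stats ticker. The `statsSource` interface, `fakePC`, and `guarded` are hypothetical stand-ins and not Pion types.

```go
package main

import (
	"sync"
	"time"
)

// statsSource is a hypothetical stand-in for a peer connection whose
// stats getter is unsafe to call after Close.
type statsSource interface {
	GetStats() string
	Close() error
}

// fakePC is a trivial statsSource used for demonstration.
type fakePC struct{ statsCalls int }

func (f *fakePC) GetStats() string { f.statsCalls++; return "stats" }
func (f *fakePC) Close() error     { return nil }

// guarded wraps a statsSource and refuses stats access after close.
type guarded struct {
	mu     sync.Mutex
	closed bool
	pc     statsSource
	ticker *time.Ticker // periodic stats timer, may be nil
}

// stats returns the current stats, or false if already closed.
func (g *guarded) stats() (string, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return "", false
	}
	return g.pc.GetStats(), true
}

// close stops the stats timer first, then closes the underlying source.
func (g *guarded) close() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return nil
	}
	if g.ticker != nil {
		g.ticker.Stop()
	}
	g.closed = true
	return g.pc.Close()
}
```

Holding the mutex across `GetStats` means a concurrent `close` cannot tear down the connection in the middle of a stats read.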
Currently, the signal requests are counted on media side and signal
responses are counted on controller side. This does not provide the
granularity to check how many response messages each media node is
sending.
Seeing some cases where track subscriptions are slow under load. This
would be good to see if the media node is doing a lot of signal response
messages.
When doing code changes for dynamic rid, inadvertently relied on
ordering of quality in track info layers to pick the highest layer if
the requested quality is higher than available qualities.
@cnderrauber addressed it in
https://github.com/livekit/livekit/pull/3998. Just adding some more
robustness behind that by doing a full search when requested quality is
not available.
Tested using JS SDK demo app and picking different qualities from
subscriber side with adaptive streaming turned off.
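
A sketch of an order-independent lookup, for illustration. The `layer` type is a hypothetical simplification of a track info layer, and returning "not found" when every available quality is above the request is an assumption that the text above does not cover.

```go
package main

// layer is a hypothetical simplification of a track info layer.
type layer struct {
	quality int // higher means better quality
	spatial int // spatial layer index to forward
}

// spatialLayerFor scans all layers and returns the spatial layer of the
// highest available quality that does not exceed the requested one.
// An exact match wins; if the request is above everything available,
// the highest layer is returned regardless of slice ordering.
func spatialLayerFor(layers []layer, requested int) (int, bool) {
	best := -1
	for i, l := range layers {
		if l.quality > requested {
			continue
		}
		if best < 0 || l.quality > layers[best].quality {
			best = i
		}
	}
	if best < 0 {
		return 0, false
	}
	return layers[best].spatial, true
}
```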
Effectively reverts https://github.com/livekit/livekit/pull/3984.
Using padding only packets for audio dummy start introduces dependencies
on other services and is not a necessary change. Would have been good to
use padding only for audio also from t=0. We can re-visit this for
better compatibility down the line.
* Added optional "Power of Two Random Choices" algorithm for the node selector sort_by feature. The current, default behavior of picking the lowest-valued node remains.
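
For readers unfamiliar with the algorithm, a generic sketch follows. It is not the node selector's code: the `node` type, its `load` field (standing in for the sort_by metric), and the injected `intn` function are assumptions made for illustration.

```go
package main

// node is a hypothetical candidate; load stands in for the sort_by metric.
type node struct {
	id   string
	load float64
}

// pickTwoChoices samples two distinct nodes uniformly at random and
// returns the one with the lower load. intn must behave like
// math/rand's Intn: return an int in [0, n).
func pickTwoChoices(nodes []node, intn func(n int) int) (node, bool) {
	switch len(nodes) {
	case 0:
		return node{}, false
	case 1:
		return nodes[0], true
	}
	i := intn(len(nodes))
	j := intn(len(nodes) - 1)
	if j >= i {
		j++ // shift past i so the second pick is distinct and still uniform
	}
	if nodes[j].load < nodes[i].load {
		return nodes[j], true
	}
	return nodes[i], true
}
```

In real use, `rand.Intn` from `math/rand` can be passed as `intn`. Compared with always picking the lowest-valued node, this spreads bursts of simultaneous requests across nodes while still favoring lightly loaded ones.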
Seeing cases of `ConnectionTimeout` and `ResponseTimeout`.
So, logging destination identity in RPC request and also logging ACK and
response. Will pare back logs/log level of these messages after getting
some data.
Also a small change I noticed and had sitting in my local tree to set
the previous RTP marker on a padding packet.