When a participant is closing, RTCP readers should be cleaned up from
the factory even if the participant is expected to resume. The resumed
participant will get a new participant session and peer connection(s),
and everything will be set up again.
- New bucket API to pass in max packet size, sequence number offset,
and sequence number size as a generic type
- Move OWD estimator to mediatransportutil.
After adding more fields in
https://github.com/livekit/livekit/pull/4105/files, it was not even
logging. Access to one of the added fields must have ended up waiting on
a lock and blocked.
Unfortunately, the deadlock fix in https://github.com/pion/ice/pull/840
did not address the peer connection close hang.
Splitting the logs so that the base log still happens. The ordering is
based on reading the code and guessing what could still log, to see if
we get more of the logs and learn more about the state and which lock
ends up being the first one to block.
* Add checks for participant and sub-components close.
Looks like there might be some memory leak with participant sessions not
getting closed properly. Adding checks (to be cleaned up later) to see
if there is a consistent place where things might hang.
* init with right type
* Remove unnecessary goroutine, thank you @milos-lk
* clean up
Correct triple-d spelling of "address" field in transport logs.
I’m not sure whether this was intentional, but I noticed it
while creating Grafana queries and filters. This matters because
anyone filtering logs using the correct spelling may
unintentionally miss relevant data, leading to incomplete or
misleading analysis.
* Use sync.Pool for objects in packet path.
Seeing cases of forwarding latency spikes that align with GC.
This might be a bit overkill, but using sync.Pool for small,
short-lived objects in the packet path.
Before this, all of these were increasing in alloc_space heap profile
samples over time. With these changes, there is no increase (the lines
corresponding to getting from the pool do not even show up in heap
accounting when doing `list` in `pprof`).
* merge
* Paul feedback
* Forwarding latency measurement tweaks.
- prom transmission type public
- do not measure short-term values, as they are not used; this
potentially saves some lock contention time in the packet path. Adding
a separate method for that.
- Change latency/jitter summary reporting to `ns` also to match the
histogram.
* add GetShortStats
* Higher resolution forwarding latency histogram.
Was using the average latency/jitter of the last second to populate the
forwarding latency/jitter histogram. But it is too coarse, i.e., the
average value of latency/jitter is very low and those summarised
samples always end up in the lowest bucket.
A few things to address it
- record per packet forwarding latency in histogram
- adjust histogram bins to include smaller values
- Drop jitter histogram
This is a per-packet call, but prometheus histograms are supposedly
fast/lightweight. It would be good to get better resolution histograms,
hence doing this. Please let me know if there are performance concerns.
* typo
* one more typo
* Debug high forwarding latency missing.
* log highest
* log condition
* update log
* log
* log
* change log
* Track start up delay.
Digging into forwarding latency, there are a few things
1. Seems to be caused due to forwarding packets queued before bind. They
would be in the queue till bind. There are two ways it is showing up
a. Bind itself is delayed and releasing queued packets causes the
high forwarding latency.
b. There is a significant gap between bind and first packet being
pulled off the queue to be forwarded, in one example 100ms.
(a) is understandable if the signalling delays things. We can drop
these packets without forwarding, or indicate in the packet that it is
a queued packet and drop it from the forwarding latency calculation.
Dropping is probably better, as downstream components like egress will
see a burst in these situations.
(b) looks like go scheduling latency? Unsure.
Logging more to understand this better.
* log start
* Use buffered indicator to exclude from forwarding latency.
Buffered packets live in the queue for a while before Bind releases
them. They have high(ish) queuing latency and are not a true
representation of forwarding latency.
* Log write count atomic.
* Return write count from WriteRTP.
Apologies for the frequent changes on this. With relays, the down track
could write to several targets. So, use a count to have an accurate
indication of how many subscribers were written to.
Packets not being forwarded were getting included in forwarding stats
calculation and skewing the measurement towards a smaller number.
The latency measurement does not include the batch IO of packets on
send. With a 2ms batching, that will add an average latency of 1ms.
* Add prom histogram for forwarding latency and jitter.
Using short term stats for histogram.
An example setting is
1s - short term
1m - long term
Using the 1s (short term) data for histogram. In that 1 second, all
packet forwarding latencies are averaged for latency and std. dev. of
the collection is used as jitter.
* try different staticcheck
* Broadcast cond var on RTX write.
High forwarding latency logs all show high queuing delay so far. From
code inspection, RTX writes were not signaling the cond var. Not sure if
that is the reason, but adding a signal there for further tests.
* Remove return values from writeRTX as they are not used
* if RingingTimeout is provided, deadline should be set to that timeout.
This is because the SIP bridge will not return until RingingTimeout
which may be longer than the 30 second default deadline.
* handle Deadline being "before" timeout.
* Prevent leakage of previous codec after codec regression.
In the window between forwarder restart and determining codec, the old
codec packet could leak through. Prevent that by doing the restart and
codec determination atomically on a codec regression.
* tidy
* use locked function
* Use the optimal allocation function for opportunistic allocation.
Allocation functions also set the `lastAllocation` state.
This might have been causing an e2e failure with a v1 client on migration.
* annotate args