Enabled by default.
Also, tweak the long term propagation delay a bit. The first propagation
delay itself was too high and the long term estimate was initialized
with a high value. Prevent that, and also ensure large negatives do not
have an effect by using a lower bound of 0. A lower bound of 0 is okay
as the main purpose is to track sustained high positive values.
Seeing cases (mostly across relay) of large first packet time adjustment
getting ignored. From data, it looks like the first packet is extremely
delayed (sometimes on the order of minutes), which does not make sense.
Adding some checks against media path, i.e. compare RTP timestamp from
sender report against expected RTP timestamp based on media path
arrivals and log deviations more than 5 seconds.
Another puzzling case. Trying to understand more.
Also, refactoring SetRtcpSenderReportData() function as it was getting
unwieldy.
* Handle cases of long mute/rollover of time stamp.
There are cases where the track is muted for long enough for timestamp
roll over to happen. There are no packets in that window (typically
there should be black frames (for video) or silence (for audio)). But,
maybe the pause based implementation of mute is causing this.
Anyhow, use time since last packet to gauge how much roll over should
have happened and use that to update time stamp. There will be rare
edge cases where this could also fail (e.g. packet time is affected
by propagation delay, so it could theoretically happen that mute/unmute
+ packet reception could happen exactly around that rollover point and
miscalculate, but that should be rare).
As this happens per packet on the receive side, changing time to
`UnixNano()` to make it more efficient to check this.
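A rough sketch of the rollover estimate, assuming packet times are kept
as `UnixNano()` values and timestamps are tracked in an extended (64-bit)
range. The function and parameter names are hypothetical.

```go
package main

// tsWrap is the size of the 32-bit RTP timestamp space.
const tsWrap = uint64(1) << 32

// extendAfterGap picks the extended timestamp for a packet carrying the
// 32-bit timestamp `ts`. It projects where the extended timestamp should
// be from the time elapsed since the last packet, then selects the value
// congruent to `ts` that is closest to that projection.
func extendAfterGap(lastExtTS uint64, lastPacketNano, nowNano int64, ts uint32, clockRate uint32) uint64 {
	elapsedSec := float64(nowNano-lastPacketNano) / 1e9
	if elapsedSec < 0 {
		elapsedSec = 0
	}
	expected := lastExtTS + uint64(elapsedSec*float64(clockRate))

	// Start in the same wrap-around cycle as the projection.
	candidate := (expected &^ (tsWrap - 1)) | uint64(ts)
	switch {
	case candidate > expected && candidate-expected > tsWrap/2 && candidate >= tsWrap:
		candidate -= tsWrap // projection sits just past a wrap, packet is before it
	case expected > candidate && expected-candidate > tsWrap/2:
		candidate += tsWrap // projection sits just before a wrap, packet is after it
	}
	return candidate
}
```

At a 90 kHz clock the 32-bit timestamp wraps roughly every 13 hours, so a
50000 second gap projects one full rollover.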
* spelling
* tests
* test util
* tests
With RTX, some clients use very old packets for probing. Check for
duplicate before logging warning about old packet/negative sequence
number jump.
Also, double the history so that duplicate tracking is better. Adds
about 1/2 KB per RTP stream.
* Log first time adjustment total.
Seeing cases where the first time is 400ms+ before start time.
Possible it is getting that much adjustment, but would be good to see
how much total adjustment happens.
* log propagation delay
Seeing some unexplained large jumps on remotes across relay. Unclear if
there was a jump on origin side at some point. Reducing threshold for
large jump so that we can catch unexpected jumps more.
* Simplify time stamp calculation on switches.
Trying to simplify time stamp calculation on restarts.
The additional checks take effect rarely and are not worth the extra
complication.
Also, doing the reference time stamp in extended range.
The challenge with that is when the publisher migrates, the extended
timestamp could change post migration (i.e. the new node would not
know about earlier rollovers). To address that, maintain an offset that
is updated on resync.
* WIP
* Revert to resume threshold
* typo
* clean up
* Handle large jumps in RTCP sender report timestamp.
Seeing cases of RTCP Sender Report spaced apart by more than half the
RTP Timestamp range. Maybe a case of laptop going to sleep and waking
up. Handle it using time diff from last report and calculating expected
timestamp.
* try go 1.22
* Connection quality LOST only if RTCP is also not available.
It is possible that sender stops all layers of video due to some
constraint (CPU or bandwidth). Packet reception going dry due to
that should not trigger `LOST` quality.
Add last received RTCP time also to distinguish the case
of real `LOST` and sender stopping traffic.
Some bits to watch for
- With audio, RTCP reports could be more than 5 seconds apart (5 seconds
is the default interval for connection quality scorer), but audio
senders usually send silence packets even when there is no input.
So audio completely stopping can be considered `LOST`.
- With video, have to observe if all clients continue to send RTCP even
if all layers are stopped.
- RTCP bandwidth is not supposed to exceed the primary stream bandwidth.
libwebrtc calculates that and spaces out RTCP reports accordingly.
That is the reason why audio reports are that far apart. If a video
stream is encoded at a very low bit rate, it could also be sending
RTCP rarely. So, there is the case of LOST being indistinguishable
from sender stopping all layers. But, this should be a rare case.
* typo
* Prevent large spikes in propagation delay
A few tweaks
- Large spike in propagation delay due to congested channel results in
long term estimate getting high value. Ignore outliers in long term
estimate.
- Introduce a new field for adjusted arrival time as adjusting the
arrival time in place meant it got applied again across the relay and
that caused different propagation delay on remote nodes.
- Reset path change counters as long as there is any sample that is not
higher than the multiple of long term. There was a case of
o Sample with high value that triggered path change start.
o Then some samples with high enough delta, but did not meet the
criteria for increasing counter further.
o Some time later, another sample met the threshold and that triggered
a path change re-init.
* do not adapt to large delta
* Tweak adaptation to increase in propagation delay.
A couple of issues
- RTCP Sender Report rate will vary based on underlying track bitrate.
(at least in theory, not all entities will do it though, for example
SFU does standard rate of one per three seconds irrespective of track
bit rate). So, adapt the long term estimate of propagation delay delta
based on spacing of reports.
- Re-init of propagation delay to adapt to path change was taking the
last value before the switch. But, that one value could have been an
outlier and accepting it is not great. So, adapt spike time
propagation delay in a smoother fashion to ensure that all values
during spike contribute to the final value.
* clean up
- When audio is muted, server injects silence frames which moves the
time stamp forward and adjusts offset. That cannot be used against
publisher side sender report. Use a pinned version.
- Ignore small changes to propagation delay even while checking for
sharp increase. That was spamming a lot for small changes, i.e.
existing delta is 100 microseconds or so and the new one is 300
microseconds. Also rename to `longTerm` from `smoothed` as it is a
slowly varying long term estimate of propagation delay delta. And slow
down that adaptation more.
* Forward publisher sender report.
Publisher side RTCP sender report is rebased to SFU time base
and used to send sender report to subscriber.
Will wait to merge till previous versions are out as this will require a
bunch of testing.
* - Add rebased report drift
- update protocol dep
- fix path change check, it has to check against delta of propagation
delay and not propagation delay as the two side clocks could be way
off.
* Use start time stamp to calculate down stream sender report.
With first packet time adjustment, using the first time stamp is more
accurate.
This still suffers if the up stream clock rate changes (happens in cases
like noise suppression which is not well understood). Will be looking at
pass through of sender report from publisher to subscriber.
* similar log strings
* avoid early sender reports
* log messages
* Reduce first packet adjustment threshold to 15 seconds
* Do not restart on receiver side.
Restart with wrap back causes issues in the forwarding path
as the subscriber assumes the extended type from receiver side does
not restart.
Restart was an attempt to include as many packets as possible, but
in practice is not super useful. So, taking it out. Can clean up
a bit more stuff, but want to run this first and check for any oddities.
* fix test
The buffer is not for padding packets. So, calculate
adjusted sequence numbers before comparing against size.
Also, it is possible that invalidated slot is accessed
due to not being able to exclude padding range. This was
causing time stamp reset to 0. Will remove the error log
after this goes out and the condition does not show up
for a few days.
* More fine grained filtering NACKs after a key frame.
There are applications with periodic key frame.
So, a packet lost before a key frame will not be retransmitted.
But, decoder could wait (jitter buffer, play out time) and cause
a stutter.
Idea behind disabling NACKs after key frame was another knob to
throttle retransmission bit rate. But, with spaced out retransmissions
and max retransmissions per sequence number, there are throttles.
This would provide more throttling, but affects some applications.
So, disabling filtering NACKs after a key frame.
Introducing another flag to disallow layers. This would still be quite
useful, i.e. under congestion the stream allocator would move the
target lower. But, because of congestion, higher layer would have lost
a bunch of packets. Client would NACK those. Retransmitting those higher
layer packets would congest the channel more. The new flag (default
enabled) would disallow retransmission of higher layers. This was
happening before this change also, just splitting out the flag for more
control.
* split flag
* Log skew in clock rate.
Remember seeing sender report time stamp moving backward
across mute with replaceTrack(null). Not able to reproduce
it in JS sample app, but have seen it elsewhere.
Logging to understand it better. Wondering if the sender report
should be reset on time stamp moving backward or if we should drop
backwards moving reports.
* set threshold at 20%
* Use 32-bit time stamp to get reference time stamp on a switch.
With relay and dyncast and migration, it is possible that different
layers of a simulcast get out of sync in terms of extended type,
i.e. layer 0 could keep running and its timestamp could have
wrapped around and bumped the extended timestamp, while another layer
could start and stop and never see that wrap.
One possible solution is sending the extended timestamp across relay.
But, that breaks down during migration if publisher has started afresh.
Subscriber could still be using extended range.
So, use 32-bit timestamp to infer reference timestamp and patch it with
expected extended time stamp to derive the extended reference.
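One way to sketch that patching step, assuming the expected extended
timestamp is already known on the subscriber side. The function name is
hypothetical.

```go
package main

// extendReference returns the extended timestamp that is congruent to
// ref32 (mod 2^32) and closest to expectedExt. The signed 32-bit
// difference handles wrap-around in either direction. It assumes
// expectedExt is large enough that the result does not go below zero.
func extendReference(ref32 uint32, expectedExt uint64) uint64 {
	diff := int32(ref32 - uint32(expectedExt))
	return uint64(int64(expectedExt) + int64(diff))
}
```

Because only the low 32 bits of the reference are used, it does not
matter whether the layer that produced it knows about earlier rollovers.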
* use calculated value
* make it test friendly
Seeing a large positive gap which I am not able to explain.
Wondering if at some other time, a large negative is happening
and the large positive is just a correction.