* Push/pull for connection stats/quality scoring.
Was not happy with pure pull method missing a window because
of RTCP RR timing is slightly off for audio and using a much
larger window of data in the next update.
That also resulted in RTP stats getting some bits of code.
As that is per-packet processing, was not a good idea.
Switching to push-pull method.
For up track, it is pull, i. e. connection stats worker will pull stats.
For down track, there is a new notification about receiver report
reception. Using this to check for time to run stats. And adding a bit
of tolerance for processing window (currently set so that as long as it
is > 95% of usual processing interval). This allows two things
- for video, RTCP RR are more frequent, but we will still not process
till enough time has passed
- for audio, RTCP RR could be once in 5 seconds or so. Can process when
it is available rather than miss a window and use a much larger window
later.
* uber atomic
* Connectino quality misc changes
1. Call scorer.Update() with nil stat when no data available so that
scorer can synthesise window with proper window time.
2. Substract out loss in interval to account for packets not sent at
all.
3. Fix `packetsNotFound` variable in `getIntervalStats`. I remember this
working at some point. Not sure if I fat fingered in another PR and
deleted the increment line.
4. Logging a bit more when no packets expected. Those can get noisy
especially when track is muted. But, seeing some unexplained
instances of no packets leading to quality drop. So, temporary logging
to get a bit more information.
* correct spelling
* Limit packet score minimum to 0.0
* Make connection quality not too optimistic.
With score normalization, the quality indicator showed good
under conditions which should have normally showed some badness.
So, a few things in this PR
- Do not normalize scores
- Pick the weakest link as the representative score (moving away from
averaging)
- For down track direction, when reporting delta stats, take the number
of packets sent actually. If there are holes in the feed (upstream
packet loss), down tracks should not be penalised for that loss.
State of things in connection quality feature
- Audio uses rtcscore-go (with a change to accommodate RED codec). This
follows the E-model.
- Camera uses rtcscore-go. No change here. NOTE: THe rtscore here is
purely based on bits per pixel per frame (bpf). This has the following
existing issues (no change, these were already there)
o Does not take packet loss, jitter, rtt into account
o Expected frame rate is not available. So, measured frame rate is
used as expected frame rate also. If expected frame rate were available,
the score could be reduced for lower frame rates.
- Screen share tracks: No change. This uses the very old simple loss
based thresholding for scoring. As the bit rate varies a lot based on
content and rtcscore video algorithm used for camera relies on
bits per pixel per frame, this could produce a very low value
(large width/height encoded in a small number of bits because of static content)
and hence a low score. So, the old loss based thresholding is used.
* clean up
* update rtcscore pointer
* fix tests
* log lines reformat
* WIP commit
* WIP commit
* update mute of receiver
* WIP commit
* WIP commit
* start adding tests
* take min score if quality matches
* start adding bytes based scoring
* clean up
* more clean up
* Use Fuse
* log quality drop
* clean up debug log
* - Use number of windows for wait to make things simpler
- track no layer expected case
- always update transition
- always call updateScore
* Change lock scope of access to RTCP sender report data.
Forwarder calls back to get time stamp offset.
Holding buffer lock is a much bigger scoped lock.
Reduce lock scope and cache latest sender report under its own lock.
And use that cache when calculating time stamp offset.
* move sr cache to stream tracker manager for re-use in relay
* cache before spread
* Use purely RR based RTT.
With normalization of NTP time stamp to local time,
don't need to keep track of NTP time of publisher + local time of
when a report is sent. RTT calculations can happen with RR only.
Also, do not log errors when RTT cannot be calculated due to
no last SR. This can happen if the receiver sends an RR before
it receives an SR. As SFU is doing SRs once in 5 seconds, it is
possible some RRs happen before the first SR.
* use error type
* correct error name
* Use local time base for NTP in RTCP Sender Report for downtracks.
More details in comments in code.
* Remove debug
* RTCPSenderReportInfo -> RTCPSenderReportDataExt
* Get rid of sender report data pointer checks
* some additional logging
* Do not use local time stamp when sending RTCP Sender Report
As local time does not take into account the transmission delay
of publisher side sender report, using local time to calculate
offset is not accurate.
Calculate NTP time stamp based on difference in RTP time.
Notes in code about some shortcomings of this, but should
get better RTT numbers. I think RTT numbers were bloated because of
using local time stamp.
* WIP commit
* comment
* clean up
* remove unused stuff
* cleaner comment
* remove unused stuff
* remove unused stuff
* more comments
* TrackSender method to handle RTCP sender report data
* fix test
* push rtcp sender report data to down tracks
* Need payload type for codec id mapping in relay protocol
* rename variable a bit
When switching from local -> remote or remote -> local,
the forwarder state is cached and restored after the switch
to ensure continuity in sequence number /time stamp.
But, if the forwarder had not started before the switch,
the sequence number always starts at 1 because of seeding.
So, do not see unless forwarder was started before the switch.
* Seed snapshots
- For one cycle after seeding, delta snap shot can get a huge gap
because of snapshot iitializing from start if not present. Not
a huge deal sa it should not affect functionality, but saving/restoring
(at least with down track) snap shot is a big deal. So just do it.
- Have been seeing a bunch of cases of delta stats getting a lot of
packets due to out-of-order (what seems like) receiver report. So,
save the receiver report and log it when out-of-order is detected
to understand if they are closely spaced or something else could be
happening.
* Remove comment that does not apply anymore
* log current time and RR
Have been seeing a few instances of "too many packets expected in delta"
when trying to generate RTCP SR on down track. Actual sequence numbers
indicate that start is after the end.
As down track RTPStats are driven by receiver report, wondering if we
are getting RTCP_RR out-of-order somehow causing this to happen.
Cannot find any other reason for this.
So, accepting RTCP_RR based update only if the sequence number is higher
than existing and also logging a warning with sequence numbers if they
look out-of-order.
* Cache RTPStats and seed on re-use
When a cached down track is re-used, RTPStats was not cached.
This caused sender reports getting out-of-sync with the remote side.
Cache RTPStats and seed it on re-use.
* staticcheck
- Do not update jitter on padding only packet.
Padding only packet may not have proper timestamp.
If it does, it probably has the time stamp of the
last packet with payload. That will also affect
jitter calculation, i. e. wall clock time is moving,
but RTP time is the same.
- Do not send `onMaxLayer` changed on bind.
It was probably racing with update when max layer
is updated when adaptive stream is off. There is
no need to send that update as the default would
be OFF. It will be enabled when adaptive stream
subscription turns it on or when max layer is
set when down track bind happens and adaptive stream
is off.
* Use media payload size in scoring.
Subtract out header bytes when calculating score.
This does not seem to affect the score (under perfect conditions),
but, using header bytes will inflate the bit rate and
will affect scoring.
* Add header bytes to ToProto
* protocol pointer
* fix test
With rapid changes to subscription settings, use of a goroutine
could end up processing dynacast needs for that subscriber in
a different order. So, record the susbcription needs of a subscriber
in the callback and process the data in a go routine.
* WIP commit
* WIP commit
* Remove debug
* Revert to reduce diff
* Fix tests
* Determine spatial layer from track info quality if non-simulcast
* Adjust for invalid layer on no rid, previously that function was returning 0 for no rid case
* Fall back to top level width/height if there are no layers
* Use duration from RTPDeltaInfo
* Stats of NACKs acked and number of repeated NACKs.
Also making a change in delta stat to drop negative packet loss
counts to 0. Because of windowing it is a legitimate case.
The receiver could have seen a loss in window we are measuring
and in the subsequent window, the receiver could have gotten
a retransmission and reduced the packet loss count resulting
in a negative delta. When we report negative delta, it could
get dropped by analytics validator. That will be lost data.
Avoid that.
* Remove unused code
* Pick up latest protocol
* Remove `Head` field from `ExtPacket` structure.
Although we do not intend to, but if packets get out-of-order
in the forwarding path (maybe reading in multiple goroutines
or using some worker pool to distribute packets), the `Head`
indicator could lead to wrong behaviour. It is possible that
at the receiver, the order is
- Seq Num N, Head = true
- N + 1, Head = true
If the forwarding path sees `N + 1` first, the Head flag
when it sees `N` packet is incorrect and will lead to incorrect
behaviour.
The alternative check is very simple. So, remove `Head` flag.
* Remove unused field
* Use delta stats throughout and avoid calculating deltas in telemetry
* Fix a few things after testing
* Remove debug
* Fix tests
* delete instead of setting to nil
* Point to the latest protocol
* Handle an edge in layer lock.
A very edge case
- Available layer: [0, 1, 2], but bitrate is not yet available.
We set it to layer 2 awaiting measurement.
- Measurement for layers 0 and 1 come through.
- Still no key frame for layer 2.
- Finalize layers runs and sees that bitrate is available for 0 and 1.
It finalizes layer 1.
- Layer 1 key frame comes (because we asked key frame of layer 2,
publisher sends key frame for all layers). Locks to layer 1.
- No more events happen to switch to layer 2.
Changes
-------
- Move bit rate measurement to StreamTrackerManager. Allows re-use
in relay.
- Make bit rate availability (from zero -> non-zero) an event
and let it flow through the stream allocation flow so that we
always have an event when bit rate measurement becomes available.
This gets rid of finalization which I was unhappy with it anyway as
it was polling every second.
- Removing REMB stuff from buffer. We do not use it.
It is incorrect anyway. REMB should be ay peer connection level.
* Fix test
* fix test
* Simplify allocate
* Simplify/clean up
This still does not address root cause of large loss, but at least
does not display crazy thing like packets = 0, but packet rate is 45/s.
Also, RLock in ToString() as there are bits of structure used in
stringification.
* Key frames
- Keep track of key frame stats
- Split out PLI from down track used for purpose of layer locking.
This will give us a good picture of down stream issues forcing a PLI.
- Use key frame requester whenever there is a layer lock required.
Not just the first key frame. With the synchronous thing, the counter
was just ridiculously high like 150 or something because of all
the initial padding packets. Also, use RTT in key frame requester.
* send first PLI before waiting
* Turn off key frame requester when disabled
* simplify