Commit Graph

72 Commits

Author SHA1 Message Date
Raja Subramanian 396371312b Use variables for score -> quality mapping (#2268)
* Use variables for score -> quality mapping

* spelling
2023-11-28 11:51:21 +05:30
Raja Subramanian 5f76d1adcc Introduce DISCONNECTED connection quality. (#2265)
* Introduce `DISCONNECTED` connection quality.

Currently, this state happens when any up stream track does not
send any packets in an analysis window when it is expected to send
packets.

This can be used by participants to know the quality of a potentially
disconnected participant. Previously, it took 20 - 30 seconds for
the stale timeout to kick in and disconnect the limbo participant which
triggered a participant update through which other participants knew
about it.

Previously, `POOR` quality was also overloaded to denote that the
up stream is not sending any packets. With this change, that is a
separate indicator, i. e. `DISCONNECTED`.

* clean up

* Update deps

* spelling
2023-11-27 23:06:53 +05:30
Raja Subramanian 2cf751d261 Use timer in scorer lock scope. (#2066)
Using time from outside make anachronous samples in expected
distance/bit rate measurement. So, have to let the time be
snap shotted in scorer lock scope.
2023-09-13 01:38:34 +05:30
Raja Subramanian 254a35543d Fix down stream stats. (#2063)
Need to pass in the correct time. Previously streaming start was
determined by another delta snap shot which as removed for efficiency.
Did not realise that we were passing in zero time for stats.

Also, revert of the change (the part which did not re-pause) from this
PR (https://github.com/livekit/livekit/pull/2037). That change affects
other paths. The edge it was trying to fix is more rare. Need to think
about a way which covers all cases.
2023-09-12 08:34:28 +05:30
Raja Subramanian 1b20b8f1ac Make interface for connection stats. (#2056)
* Make interface for connection stats.

Implement suggestion from @paulwe to clean that up a bit.

* fix test
2023-09-11 08:39:33 +05:30
Raja Subramanian c09d8d0878 Split RTPStats into receiver and sender. (#2055)
* Split RTPStats into receiver and sender.

For receiver, short types are input and need to calculate extended type.

For sender (subscriber), it can operate only in extended type.
This makes the subscriber side a little simpler and should make it more
efficient as it can do simple comparisons in extended type space.

There was also an issue with subscriber using shorter type and
calculating extended type. When subscriber starts after the publisher
has already rolled over in sequence number OR timestamp, when
subsequent publisher side sender reports are used to adjust subscriber
time stamps, they were out of whack. Using extended type on subscriber
does not face that.

* fix test

* extended types from sequencer

* log
2023-09-11 07:33:39 +05:30
Raja Subramanian b95670f56b Removing one snapshot in down track. (#2047)
Profiling showed updating jitter going through the snapshot maps.
With the reduction of one, there should only be one snapshot
and hopefully that should gain some cycles back.
2023-09-07 22:22:00 +05:30
David Zhao 981fb7cac7 Adding license notices (#1913)
* Adding license notices

* remove from config
2023-07-27 16:43:19 -07:00
Raja Subramanian 5459bd2931 Push track quality to poor on a bandwidth constrained pause. (#1867)
* Push track quality to poor on a bandwidth constrained pause.

* add tests

* scale distance by divisor

* fix test distance to desired

* wait longer for subscription manager to reconcile
2023-07-11 15:29:35 +05:30
Raja Subramanian e6f5f2f344 Prevent anachronous sample reading. (#1863)
* Prevenet anachronous sample reading.

Not so pretty way of solving this. Please let me know if you have
thoughts.

Passing in time allows testing easier. But, that also leads to
time reversal problems. Example scenario
1. Connection stats worker gets a time and initiates quality
   calculation.
2. A layer transition is recorded after that.
3. By the time, scorer is called to calculate score with time from Step
   1, there is time reversal and results in anachronous sample.

One option is to use a scorer lock in connection stats module and wrap
all calls to scorer in that lock, but that does not prevent the passed
in time stamps themselves getting out of order. Also, stand alond use
of scorer in some other context will be problematic.

Doing the hybrid thing of taking current time in scorer if passed in
time is zero so that scorer lock domain controls it.

* use zero time everywhere in normal flow

* make APIs with and without time passed in as Paul suggested
2023-07-10 08:39:52 +05:30
Raja Subramanian e3954d1d64 Use timed aggregator. (#1843)
* Use timed aggregator.

For aggregate bitrate and average distance from desired.

Also, clean up debug added to track leak.

* update deps
2023-07-01 10:21:15 +05:30
Raja Subramanian 496656627e Logging more to understand layer transition leak better. (#1840) 2023-06-30 11:59:53 +05:30
Raja Subramanian cea41e4189 Discount out-of-order packets in downstream score. (#1831)
* Discount out-of-order packets in downstream score.

More notes inline.

* correct comment

* clean up comment
2023-06-27 17:44:53 +05:30
Raja Subramanian 72ed5b19f7 Use receiver report stats for loss/rtt/jitter. (#1781)
* Use receiver report stats for loss/rtt/jitter.

Reversing a bit of https://github.com/livekit/livekit/pull/1664.
That PR did two snapshots (one based on what SFU is sending
and one based on combination of what SFU is sending reconciled with
stats reported from client via RTCP Receiver Report). That PR
reported SFU only view to analytics. But, that view does not have
information about loss seen by client in the downstream.
Also, that does not have RTT/jitter information. The rationale behind
using SFU only view is that SFU should report what it sends irrespective
of client is receiving or not. But, that view did not have proper
loss/RTT/jitter.

So, switch back to reporting SFU + receiver report reconciled view.
The down side is that when receiver reports are not receiver,
packets sent/bytes sent will not be reported to analytics.

An option is to report SFU only view if there are no receiver reports.
But, it becomes complex because of the offset. Receiver report would
acknowledge certain range whereas SFU only view could be different
because of propagation delay. To simplify, just using the reconciled
view to report to analytics. Using the available view will require
a bunch more work to produce accurate data.
(NOTE: all this started due to a bug where RTCP was not restarted on
a track resume which killed receiver reports and we went on this path
to distinguish between publisher stopping vs RTCP receiver report not
happening)

One optimisation to here here concerns the check to see if publisher is sending data.
Using a full DeltaInfo for that is an overkill. Can do a lighter weight
for that later.

* return available streams

* fix test
2023-06-09 23:31:25 +05:30
Raja Subramanian 1d3faefc5e More scoring tweaks (#1719)
1. Completely removing RTT and jitter from score calculation.
   Need to do more work there.
   a. Jitter is slow moving (RFC 3550 formula is designed that way).
      But, we still get high values at times. Ideally, that should
      penalise the score, but due to jitter buffer, effect may not be
      too bad.
   b. Need to smooth RTT. It is based on receiver report and if one
      sample causes a high number, score could be penalised
      (this was being used in down track direction only). One option
      is to smooth it like the jitter formula above and try using it.
      But, for now, disabling that also.

2. When receiving lesser number of packets (for example DTX), reduce the
   weight of packet loss with a quadratic relationship to packet loss
   ratio. Previously using a square root and it was potentially
   weighting it too high. For example, if only 5 packets were received
   due to DTX instead of 50, we were still giving 30% weight
   (sqrt(0.1)). Now, it gets 1% weight. So, if one of those 5 packets
   were lost (20% packet loss ratio), it still does not get much weight
   as the number of packets is low.,

3. Slightly slower decrease in score (in EWMA)

4. When using RED, increase packet loss weight thresholds to be able to
   take more loss before penalizing score.
2023-05-18 20:16:43 +05:30
Raja Subramanian 28a8a808f2 Do not add empty video layers in stats. (#1685) 2023-05-05 08:59:08 +05:30
Raja Subramanian 50ab72a5f8 DownTrack scoring when RR is not received. (#1664) 2023-04-28 14:50:06 +05:30
Raja Subramanian c1c4e8aea0 Include packetsMissing field in string representation (#1659)
* Include packetsMissing field in string representation

* do not set stub directly
2023-04-27 14:39:05 +05:30
Raja Subramanian 9db46bb866 Avoid divide-by-zero and NaN (#1656) 2023-04-26 21:29:25 +05:30
Raja Subramanian 09c0b25787 Ensure that RR is not received for a while before running scorer on nil (#1653)
data.

Without the check, it was getting tripped by publisher not publishing
any data. Both conditions returned nil, but in one case, the receiver
report should have been received, but no movement in number of packets.
2023-04-24 23:39:30 +05:30
Raja Subramanian a9fe9f331c Run quality scorer when there are no streams. (#1633)
* Run quality scorer when there are no streams.

In the down stream direction, receiver report is used for scoring.
If there are no receiver reports, it should go to `dry` state and report
poor quality.

Update scorer on dry condition only when update score has not happened
for longer than some multiple of update interval. Cannot update on every
interval when there are no streams as receiver report might be just
missed. Waiting for longer to ensure that report is definitely not
received.

* update last stats time
2023-04-19 13:05:43 +05:30
Raja Subramanian 59961c1992 Aggregate method for RTPDeltaInfo (#1562) 2023-03-30 13:34:52 +05:30
Raja Subramanian de86ccb3df Calculate stats duration. (#1554) 2023-03-28 07:18:31 +05:30
Raja Subramanian db2e9f1f8b Use layer 2 for SVC always. (#1542)
This fixes the case of screen share forwarding. We should probably also
look at proper AddTrack. The problem was that
- AddTrack used two layers for screen share from JS sample app
- Track was published with rid = f. Given that and the track info,
  consistent layer mapping set the layer as 1.
- `getBufferLocked` always uses the highest layer for SVC
- Between the two, when down track was requesting PLI, there was
  no buffer at the requested layer and hence no PLI went out.

A few other notes
- Tried locking SVC to layer 0 (instead of layer 2), but that resulted
  in PLI layer lock spamming. It did not happen in v1.3.0 of the server
  though. Not sure what causes that. Need to investigate later.
  But, that does not happen when using layer 2 buffer as SVC buffer.
- When using layer 2 for SVC, the PLI throttle config will be using that
  of layer 2. Is that okay?
- `buffer` structure should maintain more stats about spatial layers for
  SVC case so that layer stats can be reported to analytics/scoring etc.
- In general, `buffer` may need some more hooks to make it SVC aware so
  that it can handle various spetial layer aware/specific bits.
2023-03-23 13:16:14 +05:30
David Colburn 191a9e8014 update core to 0.0.5 (#1540)
* update core

* sort imports

* fix typos

* redundant types
2023-03-22 16:53:23 -07:00
Raja Subramanian e7c5872758 Dependent RTT/jitter control. (#1537) 2023-03-22 11:59:32 +05:30
Raja Subramanian f782c8956d Extend range of GOOD scores. (#1536)
Empirically, the experience is not bad for a larger range.
So, triggering POOR too early causes confusion.
2023-03-22 11:36:30 +05:30
Raja Subramanian f770f0cb67 Use pointer to struct in logging (#1530) 2023-03-19 21:57:35 +05:30
Raja Subramanian aeefbb080e Account for time before measurement available in connection quality. (#1528) 2023-03-19 18:34:56 +05:30
Raja Subramanian ed2eaaabb2 Add layer mute notification (#1522)
* Layer mute

* clean up

* clean up

* set max temporal layer seen on down track add
2023-03-15 15:24:17 +05:30
Raja Subramanian 5bef98dc2a Switching to layer based quality for camera tracks also. (#1520)
There are cases where the layer bit rate configuration is such that
the expected bitrate difference is very high. For example,
setting up layer 2 (f) layer for 1.7 Mbps and layer 1 (h) for 180 kbps.
With bitrate based quality, a layer drop results in going to `POOR`
quality rating. With layer based, it will drop one level only.

Also, cleaning up the distance to desired calculation a bit.
2023-03-15 11:51:14 +05:30
Raja Subramanian c2335968de Prevent evaluation over small wkndow. (#1516)
With push model (i. e. connection quality evaluation triggered
by reception of RTCP receiver report), it is possible that a report
is received quickly after a track is started (especially with video).
Those should not trigger a quality evaluation.

Set `lastStatsAt` in `Start` routine and ensure that start has been
called and enough time has passed since last stats time to avoid
small windows.
2023-03-14 16:27:39 +05:30
Raja Subramanian c70aa616a9 Expected vs actual Layer based connection quality. (#1509)
* Expected vs actual Layer based connection quality.

With VBR streams (like screen share), bit rate is not a good indicator
of whether desired layer (spatial/temporal) is achieved due to high
variance.

Using expected vs actual layer (i. e. distance to desired) can capture
any short fall and include it in quality scoring.

This PR uses distance to desired, i. e. how many steps it would take to
go from actual spatial/temporal -> desired spatial/temporal and that
distance is propotionally used (currently it is just linear) to decrease
score.

* wire up layer transitions for screen share tracks
2023-03-10 13:08:36 +05:30
Raja Subramanian e893d30fd0 Use EWMA (Exponentially Weighted Moving Average) for score updates. (#1507)
* Use EWMA (Exponentially Weighted Moving Average) for score updates.

Makes code simpler, but makes it harder to test as the inflection points
are not exact.

Score falls a bit slower to be conservative on dropping quality too
quickly. Still fall factor is higher (i. e. newer scores get more
weight) than rise factor (i. e. newer scores get lower weight).
Slower rise factor to introduce hysteresis on things climibing back too
quickly.

In the extreme case, asympttotic conditions could cause unexpected
results. For example, having 4% loss of video continously will never
drop quality to `POOR`. It will get close to 60, but it will always
stay above 60 forever and hence quality will never drop to POOR.
Maybe, need some sort of variable thresholding to deal with that. But,
that is an extreme case and may not happen in real life.

* remove unused stuff
2023-03-09 13:52:01 +05:30
Raja Subramanian 14b0b48b15 Push/pull for connection stats/quality scoring. (#1505)
* Push/pull for connection stats/quality scoring.

Was not happy with pure pull method missing a window because
of RTCP RR timing is slightly off for audio and using a much
larger window of data in the next update.

That also resulted in RTP stats getting some bits of code.
As that is per-packet processing, was not a good idea.

Switching to push-pull method.
For up track, it is pull, i. e. connection stats worker will pull stats.

For down track, there is a new notification about receiver report
reception. Using this to check for time to run stats. And adding a bit
of tolerance for processing window (currently set so that as long as it
is > 95% of usual processing interval). This allows two things
- for video, RTCP RR are more frequent, but we will still not process
  till enough time has passed
- for audio, RTCP RR could be once in 5 seconds or so. Can process when
  it is available rather than miss a window and use a much larger window
  later.

* uber atomic
2023-03-09 11:51:20 +05:30
Raja Subramanian 99601e6d41 Handle the case of no packets in down stream tracks better. (#1500) 2023-03-07 14:32:43 +05:30
Raja Subramanian 04269c100c Connection quality misc changes (#1496)
* Connectino quality misc changes

1. Call scorer.Update() with nil stat when no data available so that
   scorer can synthesise window with proper window time.
2. Substract out loss in interval to account for packets not sent at
   all.
3. Fix `packetsNotFound` variable in `getIntervalStats`. I remember this
   working at some point. Not sure if I fat fingered in another PR and
   deleted the increment line.
4. Logging a bit more when no packets expected. Those can get noisy
   especially when track is muted. But, seeing some unexplained
   instances of no packets leading to quality drop. So, temporary logging
   to get a bit more information.

* correct spelling

* Limit packet score minimum to 0.0
2023-03-07 09:08:19 +05:30
Raja Subramanian 9e327b1f3c Connection quality (#1490)
* Make connection quality not too optimistic.

With score normalization, the quality indicator showed good
under conditions which should have normally showed some badness.

So, a few things in this PR
- Do not normalize scores
- Pick the weakest link as the representative score (moving away from
  averaging)
- For down track direction, when reporting delta stats, take the number
  of packets sent actually. If there are holes in the feed (upstream
  packet loss), down tracks should not be penalised for that loss.

State of things in connection quality feature
- Audio uses rtcscore-go (with a change to accommodate RED codec). This
  follows the E-model.
- Camera uses rtcscore-go. No change here. NOTE: THe rtscore here is
  purely based on bits per pixel per frame (bpf). This has the following
  existing issues (no change, these were already there)
  o Does not take packet loss, jitter, rtt into account
  o Expected frame rate is not available. So, measured frame rate is
    used as expected frame rate also. If expected frame rate were available,
    the score could be reduced for lower frame rates.
- Screen share tracks: No change. This uses the very old simple loss
  based thresholding for scoring. As the bit rate varies a lot based on
  content and rtcscore video algorithm used for camera relies on
  bits per pixel per frame, this could produce a very low value
  (large width/height encoded in a small number of bits because of static content)
  and hence a low score. So, the old loss based thresholding is used.

* clean up

* update rtcscore pointer

* fix tests

* log lines reformat

* WIP commit

* WIP commit

* update mute of receiver

* WIP commit

* WIP commit

* start adding tests

* take min score if quality matches

* start adding bytes based scoring

* clean up

* more clean up

* Use Fuse

* log quality drop

* clean up debug log

* - Use number of windows for wait to make things simpler
- track no layer expected case
- always update transition
- always call updateScore
2023-03-05 12:55:04 +05:30
Raja Subramanian fe0502c886 Demote some stable logs to Debugw (#1158)
* Demote some stable logs to Debugw

* Add 'discard message from' to ignore list
2022-11-11 10:17:47 +05:30
David Zhao 02537a121d Store initial track MimeType in TrackInfo (#1065) 2022-09-30 23:33:22 -07:00
Raja Subramanian c03003becf Logging some connection quality stuff to get some data. (#1008)
* Logging some connection quality stuff to get some data.

Setting it at 4.5 as normalised scores are higher.

* log average score
2022-09-15 17:16:59 +05:30
Raja Subramanian d76f7811e9 An attempt to use consistent layer mapping (#986)
* WIP commit

* Consistent layers.

* slight re-arrangement of code

* log mime

* fix tests

* map -> array
2022-09-07 09:57:31 +05:30
Raja Subramanian c75f38bce6 Protect against looking up dimensions for invalid spatial layer (#977)
Also use loss based scoring when track dimensions are not available.
2022-09-03 00:59:47 +05:30
Raja Subramanian b5c023f986 Connection quality changes (#913)
* WIP commit

* Connection quality changes

- Fix Firefox showing poor quality
  o The issue was that we were using max available layer and
    calculating quality. The rationale being that even if
    server sends dynacast messages, client may not implement
    dynacast and still stream all layers. But, with Firefox
    (maybe a Firefox bug), it sends some small amount of
    data on layer 2 even when that layer is disabled.
    Guessing it is probing (or actually we might be using
    some small value for high layers as Firefox cannot turn off
    layers). That higher layer gets used in quality calculation.
    As the bit rate on that layer is extremely low, it yields low
    score.

    Fixed by considering the max expected layer. That is of most
    interest. Yes, clients may ignore dynacast and stream all layers,
    but, max expected is the one of interest. So, look for
    quality in the max expected layer and not max available layer.
- Lots of clean up around connection quality stuff
  o Use a dynamic scaling thing to ensure that we do not get bitten
    by absolute values. Calculate best possible scenario score and
    map that to maximum MOS score. This will ensure that different
    codecs, different settings do not mess up the scoring. For example,
    a client might use 1 Mbps for 720p, but a different client could
    use 2 Mbps for 720p. As an SFU/infrastructure middlebox, we do
    not have control over quality at those rates. We can only ensure
    that streaming happens smoothly at those rates. So, in that
    example, for client 1, 1 Mbps will map to MOS 5.0 and for client 2,
    2 Mbps will map to MOS 5.0. Any impairments after that will
    reflect in the score.
  o Penalise for missing target layer by one level for one layer missed.
  o Move tests to connection quality directory. The participant test
    was not super useful.

* Add missed file

* Remove debug code

* use more constants and initialise normalisation factor

* rtcscore pointer
2022-08-15 13:21:07 +05:30
Raja Subramanian dbcc53f04e Use media payload size in scoring. (#912)
* Use media payload size in scoring.

Subtract out header bytes when calculating score.
This does not seem to affect the score (under perfect conditions),
but, using header bytes will inflate the bit rate and
will affect scoring.

* Add header bytes to ToProto

* protocol pointer

* fix test
2022-08-14 13:22:58 +05:30
David Zhao f09885825e Return ServerInfo to clients on join (#904)
* checkpoint

* Return ServerInfo in join response

* also include node information

* less verbose quality score

* update go modules
2022-08-10 17:04:17 -07:00
David Zhao 53f51c8cb0 Logging cleanup (#843)
* Logging cleanup

Changes log levels to better match significance

* fix lock
2022-07-21 00:39:49 -07:00
Raja Subramanian c15eeeff2b Run connection quality worker every 5 seconds. (#795)
With a small window, the quality is volatile even on small disturbances.
For example losing 2 audio packets in a 2 second window could
drop the quality metric.
2022-06-30 09:10:18 +05:30
Raja Subramanian 2c48eafd6e Retain previous audio score if number of packets is low (#793)
* Retain previous audio score if number of packets is low

* better comment and correct spelling
2022-06-29 14:48:06 +05:30
Raja Subramanian 45ed8ce85a Look for stable mex expected layer before calculating score. (#774) 2022-06-21 17:24:34 +05:30