103 Commits

Author SHA1 Message Date
Raja Subramanian 248d73948d Guard against timestamp inversion in RED -> Opus conversion. (#4414)
* Guard against timestamp inversion in RED -> Opus conversion.

Seeing timestamp inversion (sequence number is +1, but timestamp is
-960, i.e. 20ms) in the RED -> Opus conversion path. Not able to spot
any bugs in code. So, logging details upon detection and also dropping
the packet. If not dropped, downstream components like Egress treat it
as a big timestamp jump (because sequence number is moving forward) and
try to adjust pts which ends up causing drops.

* do not log time reversal at the start

* typo
2026-03-31 17:08:13 +05:30
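The guard described in 248d73948d could look roughly like the Go sketch below. It is an illustration under assumptions, not the actual LiveKit code: the type `inversionGuard` and the method `shouldDrop` are hypothetical names. At Opus's 48 kHz RTP clock, a timestamp delta of -960 corresponds to 20 ms going backwards.

```go
package main

import "fmt"

// inversionGuard is a hypothetical helper (not LiveKit's implementation)
// that remembers the last forwarded sequence number and timestamp.
type inversionGuard struct {
	started bool
	lastSN  uint16
	lastTS  uint32
}

// shouldDrop reports whether a packet shows timestamp inversion:
// the sequence number moved forward while the timestamp moved backward.
// The first packet is never flagged, since there is nothing to compare against.
func (g *inversionGuard) shouldDrop(sn uint16, ts uint32) bool {
	if !g.started {
		g.started = true
		g.lastSN, g.lastTS = sn, ts
		return false
	}
	// Signed differences of the wrapped unsigned values are roll-over safe.
	snDiff := int16(sn - g.lastSN)
	tsDiff := int32(ts - g.lastTS)
	if snDiff > 0 && tsDiff < 0 {
		return true // inversion: keep previous state and drop
	}
	if snDiff > 0 {
		g.lastSN, g.lastTS = sn, ts
	}
	return false
}

func main() {
	g := &inversionGuard{}
	fmt.Println(g.shouldDrop(100, 48000))     // first packet
	fmt.Println(g.shouldDrop(101, 48000-960)) // sequence +1, timestamp -960
	fmt.Println(g.shouldDrop(102, 48960))     // back to normal progression
}
```

Dropping the inverted packet keeps the last good state, so the next well-ordered packet is compared against it rather than against the bad timestamp.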
Anunay Maheshwari 0c33b8c671 chore: move codecs/mime stuff to protocol (#4255) 2026-01-20 20:54:32 +05:30
Raja Subramanian 2510b9462e Taking a bunch of go modernize suggestions. (#4194)
This is not all of it as it is not possible (or at least I do not know
of a way) to get all suggestions for a repo/project. Did this via loop
searching mainly and taking the modernize suggestions.
2025-12-25 16:55:58 +05:30
Raja Subramanian e0aea17a06 Flush stats on close (#3713) 2025-06-09 09:33:34 -07:00
Raja Subramanian f0edfbba8d Fix receiver rtt/jitter. (#3487) 2025-03-04 21:22:17 +05:30
Raja Subramanian 05dfd30d5b Take RTT and jitter from receiver view while reporting track stats for (#3483)
* Take RTT and jitter from receiver view while reporting track stats for
down stream tracks.

* adjust jitter in aggregate
2025-03-03 18:48:37 +05:30
Raja Subramanian 1c3c70a136 Check for nil in sender view stats. (#3426)
Seeing some panic due to sender view being nil. It is possible to have
nil sender view and not-nil receiver view. For analytics, only sender
view is used. Handle nil properly.
2025-02-13 10:55:23 +05:30
Raja Subramanian 7fef374b19 Split down stream snapshot into sender view and receiver view. (#3422)
Receiver view is used for connection quality.

Sender view is used for analytics. One thing that this introduces is
that sender view uses the packet loss information from receiver view as
true loss is available only in the RTCP Receiver Reports received from
the remote side. So, the time alignment is off, i.e. receiver report
happens periodically and it includes information till the time at which
it was sent from remote side, but sender could have sent more packets
after that time.

The split should ensure that analytics does not rely on the remote side
sending proper receiver reports, albeit at the cost of a slight
misalignment of the loss statistic for remotes that send RTCP RR (which
should be the majority of cases).
2025-02-11 16:05:00 +05:30
Raja Subramanian 9551c52c85 Try 2 to consolidate mime type (#3407)
* Normalize mime type and add utilities.

An attempt to normalize mime type and avoid string compares remembering
to do case insensitive search.

Not the best solution. Open to ideas. But, define our own mime types
(just in case Pion changes things and Pion also does not have red mime
type defined which should be easy to add though) and tried to use it everywhere.
But, as we get a bunch of callbacks and info from Pion, needed conversion in
more places than I anticipated. And also makes it necessary to carry
that cognitive load of what comes from Pion and needing to process it
properly.

* more locations

* test

* Paul feedback

* MimeType type

* more consolidation

* Remove unused

* test

* test

* mime type as int

* use string method

* Pass error details and timeouts. (#3402)

* go mod tidy (#3408)

* Rename CHANGELOG to CHANGELOG.md (#3391)

Enables markdown features in this otherwise already markdown'ish formatted document

* Update config.go to properly process bool env vars (#3382)

Fixes issue https://github.com/livekit/livekit/issues/3381

* fix(deps): update go deps (#3341)

Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Use a Twirp server hook to send API call details to telemetry. (#3401)

* Use a Twirp server hook to send API call details to telemetry.

* mage generate and clean up

* Add project_id

* deps

* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
  participant_id, track_id as appropriate
- Store status as int

* deps

* Update pkg/sfu/mime/mimetype.go

* Fix prefer codec test

* handle down track mime changes

---------

Co-authored-by: Denys Smirnov <dennwc@pm.me>
Co-authored-by: Philzen <Philzen@users.noreply.github.com>
Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
Co-authored-by: cnderrauber <zengjie9004@gmail.com>
2025-02-10 10:44:15 +05:30
Raja Subramanian f45e3613ac Correct reason for poor/lost score. (#3397)
No functional change, just logging reason was confusing.
Also, log no packets case. Seeing some instances in staging where there
are periods of no packets received. Trying to understand better.
2025-02-04 00:59:56 +05:30
Raja Subramanian cfe3178542 Reconcile RTP stats with RTX data. (#3252)
* RTX RTPStats

* WIP

* RTCP RTX handler

* reconcile rtx

* cache size

* clean up

* test

* clean up
2024-12-15 14:33:02 +05:30
cnderrauber 54f9f7de51 upgrade to pion/webrtc v4 (#3213) 2024-11-28 16:05:38 +08:00
Raja Subramanian 9ac48e2984 Grab time under lock. (#3100)
Revert part of my previous commit. I vaguely remembered there was a reason for
having code like that, but did not remember the details and ended up
consolidating. The issue is that time needs to be grabbed under lock so
that two events happening close to each other do not get order swapped.
2024-10-15 22:39:43 +05:30
Raja Subramanian 8b604df32a Set FEC enabled properly in connection stats module. (#3098)
* Set FEC enabled properly in connection stats module.

With RED, the FEC indication is in primary codec.

Also, clean up some bits that were not necessary (TrackInfoAvailable is
not needed)

TODO: There are still a couple of things to figure out
- If codec is RED, Opus is added as second codec synthetically using
  https://github.com/livekit/livekit/blob/33098337fc17705bbdb3283c7a7034aa6b2f3745/pkg/rtc/mediaengine.go#L31
  which hard codes FEC enabled. Ideally, we should get the primary
  codec parameters from SDP offer.
- The WebRTCReceiver does not have information about primary codec. For
  now, just setting FEC to true when RED is enabled. It is okay as it
  just affects when we declare quality drops, but ideally the primary
  codec should be retrieved from SDP offer.

* clean up and comment

* full prop check
2024-10-15 17:39:42 +05:30
Raja Subramanian d052caa104 Use PPS mode rather than max to adjust packet loss weight. (#3095) 2024-10-14 20:16:19 +05:30
Raja Subramanian a8da4872b1 Drop quality a bit faster on score trending lower to be more responsive. (#3093)
Also, logging a bit more about quality changes to understand why
high(ish) loss does not drop quality. Will remove the loss thresholded
logging after collecting some data.
2024-10-14 17:21:42 +05:30
Raja Subramanian f154b236b5 Fix down stream packet loss reporting. (#3092)
* Fix down stream packet loss reporting.

* format
2024-10-14 11:08:10 +05:30
Raja Subramanian 8ac33a868c Splitting out rtp stats stuff into its own package. (#3060)
* Splitting out rtp stats stuff into its own package.

Going to be making some lighter versions of these.
Will be cleaner to have all of these grouped together.
So, as a first step, just making a package for it.

* tests
2024-10-03 15:51:24 +05:30
Raja Subramanian d53f732ada Do not take padding packets into account in max pps calculation (#2990) 2024-09-09 11:08:50 +05:30
Raja Subramanian 787b8450e9 Record out-of-packet count/rate in prom. (#2980)
* Record out-of-packet count/rate in prom.

Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.

Will set up a protocol PR if it is okay.

* deps
2024-09-07 00:19:54 +05:30
Raja Subramanian f9f761b223 Demote some less useful/noisy logs. (#2743) 2024-05-29 12:05:18 +05:30
Raja Subramanian 71b5ffed93 Less confusing variable name (#2706) 2024-05-08 16:10:49 +05:30
Raja Subramanian af0b0c4734 Connection quality LOST only if RTCP is also not available. (#2670)
* Connection quality LOST only if RTCP is also not available.

It is possible that sender stops all layers of video due to some
constraint (CPU or bandwidth). Packet reception going dry due to
that should not trigger `LOST` quality.

Add last received RTCP time also to distinguish the case
of real `LOST` and sender stopping traffic.

Some bits to watch for
- With audio, RTCP reports could be more than 5 seconds apart (5 seconds
  is the default interval for connection quality scorer), but audio
  senders usually send silence packets even when there is no input.
  So audio completely stopping can be considered `LOST`.
- With video, have to observe if all clients continue to send RTCP even
  if all layers are stopped.
- RTCP bandwidth is not supposed to exceed the primary stream bandwidth.
  libwebrtc calculates that and spaces out RTCP reports accordingly.
  That is the reason why audio reports are that far apart. If a video
  stream is encoded at a very low bit rate, it could also be sending
  RTCP rarely. So, there is the case of LOST being indistinguishable
  from sender stopping all layers. But, this should be a rare case.

* typo
2024-04-21 23:35:24 +05:30
Raja Subramanian ec41d20f81 Reduce RED weight in half. (#2648) 2024-04-12 20:39:53 +05:30
wanshuangcheng e1b68012a1 chore: fix typos in comment (#2634)
Signed-off-by: wanshuangcheng <wanshuangcheng@outlook.com>
2024-04-10 09:27:48 -07:00
Raja Subramanian 63b1fba082 Add start/end time to AnalyticsStream. (#2618)
* Add start/end time to AnalyticsStream.

* fix test
2024-04-03 12:23:18 +05:30
Raja Subramanian 45581433cc Add option to enable bitrate based scoring (#2600) 2024-03-27 18:45:53 +05:30
Raja Subramanian ea66eae9f5 Start moving things to structured logging (#2527) 2024-02-29 14:35:19 +05:30
Denys Smirnov f5eb6c8a95 Update usage of core.Fuse. (#2519) 2024-02-28 03:48:58 +02:00
Raja Subramanian 174e69c81d Restore min score to 30. (#2435)
Was at 20 when LOST was introduced, but was going to 20 even under
non-LOST conditions. When there are packets, want the min to be at 30.
Going down to 20 resulted in reporting LOST quality even when packets
were flowing (although they were experiencing heavy loss and quality
would have been very bad, yet they are not lost).

Also, sample warning about adding packet to bucket even more.
2024-02-02 08:52:52 +05:30
Raja Subramanian a2053dfd94 ConnectionQuality DISCONNECTED -> LOST (#2276) 2023-11-29 23:17:17 +05:30
Raja Subramanian 396371312b Use variables for score -> quality mapping (#2268)
* Use variables for score -> quality mapping

* spelling
2023-11-28 11:51:21 +05:30
Raja Subramanian 5f76d1adcc Introduce DISCONNECTED connection quality. (#2265)
* Introduce `DISCONNECTED` connection quality.

Currently, this state happens when any up stream track does not
send any packets in an analysis window when it is expected to send
packets.

This can be used by participants to know the quality of a potentially
disconnected participant. Previously, it took 20 - 30 seconds for
the stale timeout to kick in and disconnect the limbo participant which
triggered a participant update through which other participants knew
about it.

Previously, `POOR` quality was also overloaded to denote that the
up stream is not sending any packets. With this change, that is a
separate indicator, i.e. `DISCONNECTED`.

* clean up

* Update deps

* spelling
2023-11-27 23:06:53 +05:30
Raja Subramanian 2cf751d261 Use timer in scorer lock scope. (#2066)
Using time from outside makes for anachronous samples in expected
distance/bit rate measurement. So, have to let the time be
snapshotted in scorer lock scope.
2023-09-13 01:38:34 +05:30
Raja Subramanian 254a35543d Fix down stream stats. (#2063)
Need to pass in the correct time. Previously streaming start was
determined by another delta snapshot which was removed for efficiency.
Did not realise that we were passing in zero time for stats.

Also, revert of the change (the part which did not re-pause) from this
PR (https://github.com/livekit/livekit/pull/2037). That change affects
other paths. The edge it was trying to fix is more rare. Need to think
about a way which covers all cases.
2023-09-12 08:34:28 +05:30
Raja Subramanian 1b20b8f1ac Make interface for connection stats. (#2056)
* Make interface for connection stats.

Implement suggestion from @paulwe to clean that up a bit.

* fix test
2023-09-11 08:39:33 +05:30
Raja Subramanian c09d8d0878 Split RTPStats into receiver and sender. (#2055)
* Split RTPStats into receiver and sender.

For receiver, short types are input and need to calculate extended type.

For sender (subscriber), it can operate only in extended type.
This makes the subscriber side a little simpler and should make it more
efficient as it can do simple comparisons in extended type space.

There was also an issue with subscriber using shorter type and
calculating extended type. When subscriber starts after the publisher
has already rolled over in sequence number OR timestamp, when
subsequent publisher side sender reports are used to adjust subscriber
time stamps, they were out of whack. Using extended type on subscriber
does not face that.

* fix test

* extended types from sequencer

* log
2023-09-11 07:33:39 +05:30
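As an illustration of the "short type in, extended type out" step that c09d8d0878 describes for the receiver side, here is a minimal Go sketch of extending 16-bit RTP sequence numbers to 64 bits. The `extender` type and its `extend` method are hypothetical names, and starting one cycle in (so that packets older than the first one do not underflow) is an assumption of this sketch, not something stated in the commit.

```go
package main

import "fmt"

// extender is a hypothetical helper that unwraps 16-bit sequence numbers
// into a monotonically comparable 64-bit space.
type extender struct {
	started bool
	highest uint64 // highest extended sequence number seen so far
}

// extend maps a 16-bit sequence number to its extended value by taking the
// signed, roll-over-safe distance from the highest value seen so far.
func (e *extender) extend(sn uint16) uint64 {
	if !e.started {
		e.started = true
		// Assumption: start one cycle in, leaving room for older packets.
		e.highest = uint64(sn) + 1<<16
		return e.highest
	}
	diff := int16(sn - uint16(e.highest))
	ext := uint64(int64(e.highest) + int64(diff))
	if diff > 0 {
		e.highest = ext
	}
	return ext
}

func main() {
	e := &extender{}
	for _, sn := range []uint16{65534, 65535, 0, 65535, 1} {
		fmt.Println(sn, e.extend(sn))
	}
}
```

Once both sides work in the extended space, comparisons are plain integer comparisons, which is the simplification the commit mentions for the subscriber side.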
Raja Subramanian b95670f56b Removing one snapshot in down track. (#2047)
Profiling showed updating jitter going through the snapshot maps.
With the reduction of one, there should only be one snapshot
and hopefully that should gain some cycles back.
2023-09-07 22:22:00 +05:30
David Zhao 981fb7cac7 Adding license notices (#1913)
* Adding license notices

* remove from config
2023-07-27 16:43:19 -07:00
Raja Subramanian 5459bd2931 Push track quality to poor on a bandwidth constrained pause. (#1867)
* Push track quality to poor on a bandwidth constrained pause.

* add tests

* scale distance by divisor

* fix test distance to desired

* wait longer for subscription manager to reconcile
2023-07-11 15:29:35 +05:30
Raja Subramanian e6f5f2f344 Prevent anachronous sample reading. (#1863)
* Prevent anachronous sample reading.

Not so pretty way of solving this. Please let me know if you have
thoughts.

Passing in time makes testing easier. But, that also leads to
time reversal problems. Example scenario
1. Connection stats worker gets a time and initiates quality
   calculation.
2. A layer transition is recorded after that.
3. By the time scorer is called to calculate score with the time from
   Step 1, there is a time reversal, which results in an anachronous sample.

One option is to use a scorer lock in connection stats module and wrap
all calls to scorer in that lock, but that does not prevent the passed
in time stamps themselves getting out of order. Also, standalone use
of scorer in some other context will be problematic.

Doing the hybrid thing of taking current time in scorer if passed in
time is zero so that scorer lock domain controls it.

* use zero time everywhere in normal flow

* make APIs with and without time passed in as Paul suggested
2023-07-10 08:39:52 +05:30
Raja Subramanian e3954d1d64 Use timed aggregator. (#1843)
* Use timed aggregator.

For aggregate bitrate and average distance from desired.

Also, clean up debug added to track leak.

* update deps
2023-07-01 10:21:15 +05:30
Raja Subramanian 496656627e Logging more to understand layer transition leak better. (#1840) 2023-06-30 11:59:53 +05:30
Raja Subramanian cea41e4189 Discount out-of-order packets in downstream score. (#1831)
* Discount out-of-order packets in downstream score.

More notes inline.

* correct comment

* clean up comment
2023-06-27 17:44:53 +05:30
Raja Subramanian 72ed5b19f7 Use receiver report stats for loss/rtt/jitter. (#1781)
* Use receiver report stats for loss/rtt/jitter.

Reversing a bit of https://github.com/livekit/livekit/pull/1664.
That PR did two snapshots (one based on what SFU is sending
and one based on combination of what SFU is sending reconciled with
stats reported from client via RTCP Receiver Report). That PR
reported SFU only view to analytics. But, that view does not have
information about loss seen by client in the downstream.
Also, that does not have RTT/jitter information. The rationale behind
using SFU only view is that SFU should report what it sends irrespective
of client is receiving or not. But, that view did not have proper
loss/RTT/jitter.

So, switch back to reporting SFU + receiver report reconciled view.
The down side is that when receiver reports are not received,
packets sent/bytes sent will not be reported to analytics.

An option is to report SFU only view if there are no receiver reports.
But, it becomes complex because of the offset. Receiver report would
acknowledge certain range whereas SFU only view could be different
because of propagation delay. To simplify, just using the reconciled
view to report to analytics. Using the available view will require
a bunch more work to produce accurate data.
(NOTE: all this started due to a bug where RTCP was not restarted on
a track resume which killed receiver reports and we went on this path
to distinguish between publisher stopping vs RTCP receiver report not
happening)

One optimisation to do here concerns the check to see if publisher is sending data.
Using a full DeltaInfo for that is an overkill. Can do a lighter weight
for that later.

* return available streams

* fix test
2023-06-09 23:31:25 +05:30
Raja Subramanian 1d3faefc5e More scoring tweaks (#1719)
1. Completely removing RTT and jitter from score calculation.
   Need to do more work there.
   a. Jitter is slow moving (RFC 3550 formula is designed that way).
      But, we still get high values at times. Ideally, that should
      penalise the score, but due to jitter buffer, effect may not be
      too bad.
   b. Need to smooth RTT. It is based on receiver report and if one
      sample causes a high number, score could be penalised
      (this was being used in down track direction only). One option
      is to smooth it like the jitter formula above and try using it.
      But, for now, disabling that also.

2. When receiving lesser number of packets (for example DTX), reduce the
   weight of packet loss with a quadratic relationship to packet loss
   ratio. Previously used a square root and it was potentially
   weighting it too high. For example, if only 5 packets were received
   due to DTX instead of 50, we were still giving 30% weight
   (sqrt(0.1)). Now, it gets 1% weight. So, if one of those 5 packets
   were lost (20% packet loss ratio), it still does not get much weight
   as the number of packets is low.

3. Slightly slower decrease in score (in EWMA)

4. When using RED, increase packet loss weight thresholds to be able to
   take more loss before penalizing score.
2023-05-18 20:16:43 +05:30
Raja Subramanian 28a8a808f2 Do not add empty video layers in stats. (#1685) 2023-05-05 08:59:08 +05:30
Raja Subramanian 50ab72a5f8 DownTrack scoring when RR is not received. (#1664) 2023-04-28 14:50:06 +05:30
Raja Subramanian c1c4e8aea0 Include packetsMissing field in string representation (#1659)
* Include packetsMissing field in string representation

* do not set stub directly
2023-04-27 14:39:05 +05:30
Raja Subramanian 9db46bb866 Avoid divide-by-zero and NaN (#1656) 2023-04-26 21:29:25 +05:30