For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it look like there is a massive rate spike.
While this change is not a fix, this should soften the impact.
Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
On a resume, the signal stats will call `ParticipantLeft`. Although it
explicitly says not to send events, it could still close the stats
worker.
To handle that, we created a stats worker if needed in the
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before the previous signal connection closes
the stats worker.
A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.
I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
* De-centralize some configs to where they are used.
And make default variables.
Renaming a bit, but these are all internal config and have not been
added to documented config.
* Keep documented config as is.
* test
* typo
* Some misc clean up.
- Have been seeing counterfeiter warnings about efficiency for a while
with go:generate declaration multiple times in the same package.
Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`
* spacing
* revert some deletes as the complaint was in analytics service only
* Declare in package only once.
Although the warning is about `go:generate` appearing multiple times when
directly giving the interface to generate, having `go:generate` multiple
times in a package even with `-generate` ends up generating once per
invocation. Once per package is enough to run the generation just once.
* Add counter for pub&sub time metrics
The pub&sub time shows large values in migration related cases like
muted/disabled migration, where the subscription time depends on
when the publisher unmutes the track (sending RTP packets
after migration). Add a counter to distinguish these, since we
can't control the time in such cases and the first subscription
attempt is more meaningful than those cases.
* Add info log for high publish delay
* Record out-of-packet count/rate in prom.
Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.
Will set up a protocol PR if it is okay.
* deps
* Ref count the stats worker.
NOTE: Don't like this much, but wanted to open this to get some 👀 on
it and get feedback.
There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
events.
In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.
The closed stats worker got reaped in `FlushStats` after three minutes.
So, all events after that did not have a worker and hence went
unreported. That included the participant_left webhook, because it relied
on checking if a participant was ever connected and that needed to check
the worker state.
Using a ref count to keep track of joins/leaves, and not closing the
worker until the ref count goes down to 0.
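A minimal sketch of the ref counting idea, assuming a hypothetical `refCountedWorker` type (the real stats worker has more state and different method names):

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedWorker is a hypothetical stand-in for the stats worker.
type refCountedWorker struct {
	mu       sync.Mutex
	refCount int
	closed   bool
}

// Join is called once per entity (signal stats, media stats) reporting a join.
func (w *refCountedWorker) Join() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.refCount++
}

// Leave drops one reference and closes the worker only when none remain.
// It returns true if this call closed the worker.
func (w *refCountedWorker) Leave() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.closed || w.refCount == 0 {
		return false
	}
	w.refCount--
	if w.refCount == 0 {
		w.closed = true
		return true
	}
	return false
}

// IsClosed reports whether the last reference has been released.
func (w *refCountedWorker) IsClosed() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.closed
}

func main() {
	w := &refCountedWorker{}
	w.Join() // signal stats, old connection
	w.Join() // media stats

	// Resume: the old signal connection closes, but media stats still hold a reference.
	fmt.Println("closed after old signal leave:", w.Leave())

	w.Join() // signal stats, new connection
	w.Leave()
	fmt.Println("closed after final leave:", w.Leave())
}
```

In this sketch the old signal connection closing during a resume only decrements the count, so the worker stays alive for the media stats and the new signal connection.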
* create a stats worker on resume
* revert incorrect changes
* transfer connected state
* transfer connected state when creating worker
* resolve participant on a resume
* speed up track publication
Add metrics for track publication and subscription
Return EnabledCodecs in JoinResponse so client can
choose codec without server side codec fallback
Cache remote webrtc track without AddTrackRequest to
let client send publisher offer before AddTrackRequest response
* go mod
* clean code
* Use Deque in ops queue.
Standardizing some uses
- Change OpsQueue to use Deque so that it can grow/shrink as necessary and
need not worry about channel getting full and dropping events.
- Change StreamAllocator and TelemetryService to use OpsQueue so that
they also need not worry about channel size and overflows.
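A rough sketch of why a growable queue avoids the channel-full problem. It assumes a plain slice as a stand-in for the Deque, and the `opsQueue` type here is hypothetical and much simpler than the real OpsQueue:

```go
package main

import (
	"fmt"
	"sync"
)

// opsQueue is a hypothetical, simplified queue. Enqueue never blocks or drops,
// because pending ops live in a growable slice instead of a fixed-size channel.
type opsQueue struct {
	mu      sync.Mutex
	ops     []func()
	stopped bool
	wake    chan struct{} // capacity 1: a pending wake-up is enough
	done    chan struct{}
}

func newOpsQueue() *opsQueue {
	q := &opsQueue{wake: make(chan struct{}, 1), done: make(chan struct{})}
	go q.process()
	return q
}

// Enqueue appends the op and nudges the processor without blocking.
func (q *opsQueue) Enqueue(op func()) {
	q.mu.Lock()
	if q.stopped {
		q.mu.Unlock()
		return
	}
	q.ops = append(q.ops, op)
	q.mu.Unlock()
	q.notify()
}

// Stop lets already queued ops run, then waits for the processor to exit.
func (q *opsQueue) Stop() {
	q.mu.Lock()
	q.stopped = true
	q.mu.Unlock()
	q.notify()
	<-q.done
}

func (q *opsQueue) notify() {
	select {
	case q.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// process runs ops one at a time, in enqueue order, on a single goroutine.
func (q *opsQueue) process() {
	defer close(q.done)
	for {
		<-q.wake
		for {
			q.mu.Lock()
			if len(q.ops) == 0 {
				stopped := q.stopped
				q.mu.Unlock()
				if stopped {
					return
				}
				break
			}
			op := q.ops[0]
			q.ops[0] = nil // release the reference for the garbage collector
			q.ops = q.ops[1:]
			q.mu.Unlock()
			op()
		}
	}
}

func main() {
	q := newOpsQueue()
	sum := 0
	for i := 1; i <= 1000; i++ {
		i := i
		q.Enqueue(func() { sum += i })
	}
	q.Stop()
	fmt.Println(sum)
}
```

The trade-off in this sketch is unbounded memory growth if the processor falls far behind, in exchange for never dropping an event.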
* Address feedback
* delete obvious comment
* clean up
* Participant traffic load.
Capturing information about participant traffic
- Upstream/Downstream
- Audio/Video/Data
- Packets/Bytes
This captures a notion of how much traffic load a participant is
generating.
Can be used to make allocation decisions.
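One way to picture the dimensions listed above is a small keyed counter table. All type and method names in this sketch are hypothetical and are not the types added in this change:

```go
package main

import "fmt"

// Direction and Kind mirror the dimensions in the description (names assumed).
type Direction int

const (
	Upstream Direction = iota
	Downstream
)

type Kind int

const (
	Audio Kind = iota
	Video
	Data
)

// trafficKey identifies one (direction, kind) bucket.
type trafficKey struct {
	Dir  Direction
	Kind Kind
}

// Counters holds the packet and byte totals for one bucket.
type Counters struct {
	Packets uint64
	Bytes   uint64
}

// TrafficLoad accumulates counters per direction and kind.
type TrafficLoad map[trafficKey]Counters

// Record adds packets and bytes to the matching bucket.
func (t TrafficLoad) Record(dir Direction, kind Kind, packets, bytes uint64) {
	key := trafficKey{Dir: dir, Kind: kind}
	c := t[key]
	c.Packets += packets
	c.Bytes += bytes
	t[key] = c
}

// TotalBytes sums bytes across all kinds for one direction.
func (t TrafficLoad) TotalBytes(dir Direction) uint64 {
	var total uint64
	for key, c := range t {
		if key.Dir == dir {
			total += c.Bytes
		}
	}
	return total
}

func main() {
	load := TrafficLoad{}
	load.Record(Upstream, Video, 10, 12_000)
	load.Record(Upstream, Audio, 50, 5_000)
	load.Record(Downstream, Video, 30, 36_000)
	fmt.Println("upstream bytes:", load.TotalBytes(Upstream))
	fmt.Println("downstream bytes:", load.TotalBytes(Downstream))
}
```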
* Clean up
* SIP patches
* reporter goroutine
* unlock
* move traffic stats from protocol
* check type
* Use a worker to report signal/data stats.
Was checking if reporting is needed on every update.
The check is wasted work if volume of signal/data messages is high
as reporting happens only once in 10 seconds.
Changing to a worker based on a timer. And also aligning with
telemetry reporting interval which defaults to 30 seconds.
* Remove unused constant