Commit Graph

181 Commits

Author SHA1 Message Date
Raja Subramanian 3b0077f2fe Log connection quality changes. (#3311)
Also remove the connection quality drop prom as it is unused and also
adds state/complexity.
2025-01-07 10:58:31 +05:30
cnderrauber 54f9f7de51 upgrade to pion/webrtc v4 (#3213) 2024-11-28 16:05:38 +08:00
Raja Subramanian baf47db834 Publish data and signal bytes once every 30 seconds. (#3212)
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it looks like there is a massive rate spike.

While this change is not a fix, this should soften the impact.

Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
2024-11-28 09:21:44 +05:30
Raja Subramanian cc22306047 Attempt to fix missing participant left webhook. (#3173)
On a resume, the signal stats will call `ParticipantLeft`. Although, it
explicity says not to send events, it could still close the stats
worker.

To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.

A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.

I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
2024-11-14 10:59:15 +05:30
Raja Subramanian 86383b2271 De-centralize some configs to where they are used. (#3162)
* De-centralize some configs to where they are used.

And make default variables.

Renaming a bit, but these are all internal config and have not been
added to documented config.

* Keep documented config as is.

* test

* typo
2024-11-08 12:47:30 +05:30
Raja Subramanian 365e63230d Some misc clean up. (#3156)
* Some misc clean up.

- Have been seeing counterfeiter warnings about efficiency for a while
  with go:generate declaration multiple times in the same package.
  Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`

* spacing

* revert some deletes as the complaint was in analytics service only

* Declare in package only once.

Although the warning is about go:generate multiple times when directly
giving the interface to generate, have `go:generate` multiple times in a
package even with `-generate` ends up generating once per invocation.
Once per package is enough to run the generation just once.
2024-11-04 11:26:41 +05:30
Raja Subramanian 49b75e94a6 Consolidate operations on LocalNode. (#3140) 2024-10-25 18:57:23 +05:30
cnderrauber cf59267631 Add counter for pub&sub time metrics (#3084)
* Add counter for pub&sub time metrics

The pub&sub shows large value in migration related case like
muted/disabled migration, the subscription time depends on
the time when publisher unmute the track(sending rtp packet
after migration), add a counter to distinguish since we
can't control the time in such cases and the first subscription
attemps also is more meaningful than those cases.

* Add info log for high publish delay
2024-10-11 12:07:24 +08:00
Paul Wells 4deaac2f3f replace proto.Clone calls (#3024)
* replace proto.Clone calls

* deps

* tests
2024-09-18 22:47:33 -07:00
cnderrauber 978db00034 Add sdk, participant_kind to pub sub metrics (#3023)
* exclude go client from track publication metric

* add sdk,participant_kind lables

* fix test
2024-09-19 10:42:47 +08:00
Raja Subramanian 787b8450e9 Record out-of-packet count/rate in prom. (#2980)
* Record out-of-packet count/rate in prom.

Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.

Will set up a protocol PR if it is okay.

* deps
2024-09-07 00:19:54 +05:30
Raja Subramanian bec7453a1f Recreate stats worker on resume if needed. (#2982)
* Ref count the stats worker.

NOTE: Don't liek this much, but wanted to open this get some 👀 on
this and get feedback.

There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
event.

In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.

The closed stats worker got reaped in `FlushStats` after three minutes.

So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.

Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.

* create a stats worker on resume

* revert incorrect changes

* transfer connected state

* transfer connected state when creating worker

* resolve participant on a resume
2024-09-06 23:58:03 +05:30
cnderrauber 947e8f5909 Speed up track publication (#2952)
* speed up track publication

Add metrics for track publication and subscription

Return EnabledCodecs in JoinResponse so client can
choose codec without server side codec fallback

Cache remote webrtc track without AddTrackRequest to
let client send publisher offer before AddTrackRequest response

* go mod

* clean code
2024-08-23 18:38:32 +08:00
Paul Wells afda860162 prevent race in telemetry worker cleanup (#2879) 2024-07-18 03:37:45 -07:00
Lukas Herman 8a229fda9d add participant session duration metric (#2801) 2024-06-17 17:52:08 -04:00
David Zhao ecf1175832 Generate and send uuid with analytics (#2790)
* Generate and send uuid with analytics

* go mod
2024-06-13 23:00:50 -07:00
Paul Wells d95b59de58 update protocol (#2764)
* update protocol

* deps
2024-06-05 23:50:54 -07:00
Paul Wells f1886ece42 update protocol (#2760)
* update protocol

* deps
2024-06-05 19:46:34 -07:00
David Zhao b99650aaf6 Send NodeID with analytics events (#2749) 2024-06-02 09:09:55 -07:00
cnderrauber 7ed1284b96 report average forward metrics (#2737)
* report average forward metrics

* unused parameter
2024-05-28 17:03:18 +08:00
cnderrauber 2288e402ac register forward metrics (#2735) 2024-05-27 15:47:01 +08:00
Paul Wells 38470f378b add message bytes metric (#2731) 2024-05-26 14:01:13 -07:00
cnderrauber e6aa36fdd6 Add forward stats (#2725)
* Add forward metrics

* ignore packets was not forwarded

* rename
2024-05-24 17:43:28 +08:00
Paul Wells 9a5db132eb add room/participant name limit (#2704)
* add room/participant name limit

* defaults

* simplify

* omitempty

* handle 0 config

* fix race

* unlock

* tidy
2024-05-06 17:25:18 -07:00
Paul Wells ac1b0e38ca store active stats workers in list (#2690)
* store active stats workers in list

* test

* single node cleanup

* cleanup

* cleanup

* cleanup
2024-04-26 17:24:10 -07:00
cnderrauber f239f8bff1 Fix SubParticipant twice when paticipant left (#2672) 2024-04-23 16:09:02 +08:00
Mathew Kamkar 10c8582a6b get cpu stats from cgroup, remove env (#2636)
* get cpu stats from cgroup, remove env

* undo rand seed removal

* tests
2024-04-08 21:15:17 -07:00
Raja Subramanian 63b1fba082 Add start/end time to AnalyticsStream. (#2618)
* Add start/end time to AnalyticsStream.

* fix test
2024-04-03 12:23:18 +05:30
Paul Wells f1c991c547 skip logging retry message when ws disconnections before signal finishes (#2604) 2024-03-29 06:30:12 -07:00
Raja Subramanian 14321f21bf Make OpsQueueParams to make it easier to understand args. (#2578) 2024-03-14 10:27:24 +05:30
Paul Wells ad341d41f5 start telemetry participant worker to collect signal stats (#2538)
* start telemetry participant worker to collect signal stats

* format

* resolve room

* tidy
2024-03-03 02:47:51 -08:00
Denys Smirnov f5eb6c8a95 Update usage of core.Fuse. (#2519) 2024-02-28 03:48:58 +02:00
Raja Subramanian 7649e4ffab Post data and signal stats once in 5 minutes (#2518) 2024-02-27 15:45:32 +05:30
Paul Wells e5b8e25064 use shared psrpc utils (#2506)
* use shared psrpc utils

* fix

* deps
2024-02-24 00:38:49 -08:00
Raja Subramanian 5ac5bd236a Let track events go through after participant close. (#2487)
* Let track events go through after participant close.

Also, reducing lock scope in telemetry service.

* use shadow
2024-02-17 13:40:07 +05:30
Mathew Kamkar 7508560fde larger buckets for jitter prometheus histogram (#2468) 2024-02-09 12:09:51 -08:00
Raja Subramanian b71d373f4a Use Deque in ops queue. (#2418)
* Use Seque in ops queue.

Standardizing some uses
- Change OpsQueue to use Deque so that it can grow/shrink as necessary and
  need not worry about channel getting full and dropping events.
- Change StreamAllocator and TelemetryService to use OpsQueue so that
  they also need not worry about channel size and overflows.

* Address feedback

* delete obvious comment

* clean up
2024-01-28 13:48:30 +05:30
Paul Wells c726cbf2ba increase max session start time bin size (#2380) 2024-01-12 03:49:23 -08:00
Paul Wells 2fe2a9c9f2 add session start time metric (#2377) 2024-01-11 23:23:51 -08:00
shishirng 3770fbce64 Analytics: send local node room state/info (#2335)
* Analytics: send local node room state/info

Signed-off-by: shishir gowda <shishir@livekit.io>
2023-12-22 18:59:04 -05:00
Raja Subramanian dcff75a516 Record number of data messages in prometheus. (#2282) 2023-12-01 16:10:57 +05:30
Raja Subramanian 53542b09a0 Participant traffic load. (#2262)
* Participant traffic load.

Capturing information about participant traffic
- Upstream/Downstream
- Audio/Video/Data
- Packets/Bytes

This captures a notion of how much traffic load a participant is
generating.

Can be used to make allocation decisions.

* Clean up

* SIP patches

* reporter goroutine

* unlock

* move traffic stats from protocol

* check type
2023-11-26 23:05:00 +05:30
Raja Subramanian 56dd399684 Use a worker to report signal/data stats. (#2260)
* Use a worker to report signal/data stats.

Was checking if reporting is needed on every update.
The check is wasted work if volume of signal/data messages is high
as reporting happens only once in 10 seconds.

Changing to a worker based on a timer. And also aligning with
telemetry reporting interval which defaults to 30 seconds.

* Remove unused constant
2023-11-22 11:47:15 +05:30
Paul Wells f4a984d446 preallocate prometheus packet counters (#1942) 2023-08-08 01:06:14 -07:00
David Zhao 981fb7cac7 Adding license notices (#1913)
* Adding license notices

* remove from config
2023-07-27 16:43:19 -07:00
Benjamin Pracht 552e3758d5 Add IngressUpdated event (#1775) 2023-06-16 10:58:49 -07:00
David Zhao f71544e27a Do not send ParticipantJoined webhook if connection was resumed (#1795)
* Do not send ParticipantJoined webhook if connection was resumed

* isResume -> isMigration
2023-06-15 15:39:04 -07:00
shishirng 2dd4e1365b Send EgressUpdated event (#1792)
Signed-off-by: shishir gowda <shishir@livekit.io>
2023-06-14 18:56:07 -04:00
David Zhao 7e5a7ae79f Fixed windows build (#1768) 2023-06-04 00:17:25 -07:00
Benjamin Pracht e7879a46fc Add ingress telemetry support (#1763) 2023-06-02 17:38:19 -07:00