Commit Graph

181 Commits

Author SHA1 Message Date
Raja Subramanian
3b0077f2fe Log connection quality changes. (#3311)
Also remove the connection quality drop Prometheus metric, as it is
unused and adds state/complexity.
2025-01-07 10:58:31 +05:30
cnderrauber
54f9f7de51 upgrade to pion/webrtc v4 (#3213) 2024-11-28 16:05:38 +08:00
Raja Subramanian
baf47db834 Publish data and signal bytes once every 30 seconds. (#3212)
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating the rate using a much shorter window (like
2 - 5 seconds) makes it look like there is a massive rate spike.

While this change is not a fix, this should soften the impact.

Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
2024-11-28 09:21:44 +05:30
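
The arithmetic behind the spike described in baf47db834 can be sketched as follows. This is an illustrative calculation only: the function names and the steady 1 Mbps workload are assumptions made for the example, not code or figures from the repository.

```go
package main

import "fmt"

// apparentBps returns the bit rate a consumer would compute if bytes
// accumulated over publishEverySec seconds (at a steady steadyBps) were
// attributed to a much shorter rateWindowSec window.
// Hypothetical helper for illustration.
func apparentBps(steadyBps, publishEverySec, rateWindowSec float64) float64 {
	accumulatedBytes := steadyBps / 8 * publishEverySec
	return accumulatedBytes * 8 / rateWindowSec
}

// smoothedBps returns the average rate contributed by a spike of spikeBps
// lasting spikeSec seconds when averaged over a windowSec reporting window.
// Hypothetical helper for illustration.
func smoothedBps(spikeBps, spikeSec, windowSec float64) float64 {
	return spikeBps * spikeSec / windowSec
}

func main() {
	const steady = 1e6 // assumed steady 1 Mbps data stream

	// Publishing once every 5 minutes, rate computed over a 5 second window.
	fmt.Printf("5 min publish:  %.1f Mbps\n", apparentBps(steady, 300, 5)/1e6)

	// Publishing once every 30 seconds, same 5 second window.
	fmt.Printf("30 sec publish: %.1f Mbps\n", apparentBps(steady, 30, 5)/1e6)

	// A 100 Mbps spike lasting 10 seconds, averaged over 5 minutes.
	fmt.Printf("smoothed spike: %.1f Mbps\n", smoothedBps(100e6, 10, 300)/1e6)
}
```

Under these assumptions, the 5 minute accumulation looks like 60 Mbps and the 30 second accumulation like 6 Mbps, against a true rate of 1 Mbps. That matches the commit's framing that the change softens the effect without fixing it.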
Raja Subramanian
cc22306047 Attempt to fix missing participant left webhook. (#3173)
On a resume, the signal stats will call `ParticipantLeft`. Although it
explicitly says not to send events, it could still close the stats
worker.

To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.

A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.

I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
2024-11-14 10:59:15 +05:30
Raja Subramanian
86383b2271 De-centralize some configs to where they are used. (#3162)
* De-centralize some configs to where they are used.

And make default variables.

Renaming a bit, but these are all internal config and have not been
added to documented config.

* Keep documented config as is.

* test

* typo
2024-11-08 12:47:30 +05:30
Raja Subramanian
365e63230d Some misc clean up. (#3156)
* Some misc clean up.

- Have been seeing counterfeiter warnings about efficiency for a while
  with go:generate declaration multiple times in the same package.
  Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`

* spacing

* revert some deletes as the complaint was in analytics service only

* Declare in package only once.

Although the warning is about go:generate multiple times when directly
giving the interface to generate, having `go:generate` multiple times in
a package, even with `-generate`, ends up generating once per invocation.
Once per package is enough to run the generation just once.
2024-11-04 11:26:41 +05:30
Raja Subramanian
49b75e94a6 Consolidate operations on LocalNode. (#3140) 2024-10-25 18:57:23 +05:30
cnderrauber
cf59267631 Add counter for pub&sub time metrics (#3084)
* Add counter for pub&sub time metrics

The pub&sub time shows large values in migration-related cases such as
muted/disabled migration, where the subscription time depends on
when the publisher unmutes the track (sending RTP packets
after migration). Add a counter to distinguish these cases, since we
can't control the time in them and the first subscription
attempt is more meaningful than those cases.

* Add info log for high publish delay
2024-10-11 12:07:24 +08:00
Paul Wells
4deaac2f3f replace proto.Clone calls (#3024)
* replace proto.Clone calls

* deps

* tests
2024-09-18 22:47:33 -07:00
cnderrauber
978db00034 Add sdk, participant_kind to pub sub metrics (#3023)
* exclude go client from track publication metric

* add sdk,participant_kind labels

* fix test
2024-09-19 10:42:47 +08:00
Raja Subramanian
787b8450e9 Record out-of-packet count/rate in prom. (#2980)
* Record out-of-packet count/rate in prom.

Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.

Will set up a protocol PR if it is okay.

* deps
2024-09-07 00:19:54 +05:30
Raja Subramanian
bec7453a1f Recreate stats worker on resume if needed. (#2982)
* Ref count the stats worker.

NOTE: Don't like this much, but wanted to open this to get some 👀 on
this and get feedback.

There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
events.

In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.

The closed stats worker got reaped in `FlushStats` after three minutes.

So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.

Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.

* create a stats worker on resume

* revert incorrect changes

* transfer connected state

* transfer connected state when creating worker

* resolve participant on a resume
2024-09-06 23:58:03 +05:30
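
The ref-counting idea from the first bullet of bec7453a1f can be sketched roughly as below. This is a minimal illustration, not the repository's implementation: `refCountedWorker` and its methods are hypothetical names, and the later bullets suggest the merged change may have taken a different route (recreating the worker on resume).

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedWorker is a hypothetical stand-in for a per-participant stats
// worker shared by the signal-bytes counter and the media stats reporter.
type refCountedWorker struct {
	mu       sync.Mutex
	refCount int
	closed   bool
}

// Join records that one more entity is using the worker.
func (w *refCountedWorker) Join() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.refCount++
}

// Leave drops one reference. The worker closes only when the last reference
// is dropped; the return value reports whether this call closed it.
func (w *refCountedWorker) Leave() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.closed || w.refCount == 0 {
		return false
	}
	w.refCount--
	if w.refCount == 0 {
		w.closed = true
		return true
	}
	return false
}

// IsClosed reports whether the worker has been closed.
func (w *refCountedWorker) IsClosed() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.closed
}

func main() {
	w := &refCountedWorker{}
	w.Join() // old signal connection
	w.Join() // media stats
	w.Join() // new signal connection after resume

	// The old WebSocket closes during the resume; the worker must survive.
	fmt.Println("closed after old signal leaves:", w.Leave())

	w.Leave() // media stats leaves
	fmt.Println("closed after last leave:", w.Leave())
}
```

With a count in place, the old signal connection closing during a resume no longer closes the worker, so later events still have somewhere to go.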
cnderrauber
947e8f5909 Speed up track publication (#2952)
* speed up track publication

Add metrics for track publication and subscription

Return EnabledCodecs in JoinResponse so client can
choose codec without server side codec fallback

Cache remote webrtc track without AddTrackRequest to
let client send publisher offer before AddTrackRequest response

* go mod

* clean code
2024-08-23 18:38:32 +08:00
Paul Wells
afda860162 prevent race in telemetry worker cleanup (#2879) 2024-07-18 03:37:45 -07:00
Lukas Herman
8a229fda9d add participant session duration metric (#2801) 2024-06-17 17:52:08 -04:00
David Zhao
ecf1175832 Generate and send uuid with analytics (#2790)
* Generate and send uuid with analytics

* go mod
2024-06-13 23:00:50 -07:00
Paul Wells
d95b59de58 update protocol (#2764)
* update protocol

* deps
2024-06-05 23:50:54 -07:00
Paul Wells
f1886ece42 update protocol (#2760)
* update protocol

* deps
2024-06-05 19:46:34 -07:00
David Zhao
b99650aaf6 Send NodeID with analytics events (#2749) 2024-06-02 09:09:55 -07:00
cnderrauber
7ed1284b96 report average forward metrics (#2737)
* report average forward metrics

* unused parameter
2024-05-28 17:03:18 +08:00
cnderrauber
2288e402ac register forward metrics (#2735) 2024-05-27 15:47:01 +08:00
Paul Wells
38470f378b add message bytes metric (#2731) 2024-05-26 14:01:13 -07:00
cnderrauber
e6aa36fdd6 Add forward stats (#2725)
* Add forward metrics

* ignore packets that were not forwarded

* rename
2024-05-24 17:43:28 +08:00
Paul Wells
9a5db132eb add room/participant name limit (#2704)
* add room/participant name limit

* defaults

* simplify

* omitempty

* handle 0 config

* fix race

* unlock

* tidy
2024-05-06 17:25:18 -07:00
Paul Wells
ac1b0e38ca store active stats workers in list (#2690)
* store active stats workers in list

* test

* single node cleanup

* cleanup

* cleanup

* cleanup
2024-04-26 17:24:10 -07:00
cnderrauber
f239f8bff1 Fix SubParticipant twice when participant left (#2672) 2024-04-23 16:09:02 +08:00
Mathew Kamkar
10c8582a6b get cpu stats from cgroup, remove env (#2636)
* get cpu stats from cgroup, remove env

* undo rand seed removal

* tests
2024-04-08 21:15:17 -07:00
Raja Subramanian
63b1fba082 Add start/end time to AnalyticsStream. (#2618)
* Add start/end time to AnalyticsStream.

* fix test
2024-04-03 12:23:18 +05:30
Paul Wells
f1c991c547 skip logging retry message when ws disconnects before signal finishes (#2604) 2024-03-29 06:30:12 -07:00
Raja Subramanian
14321f21bf Make OpsQueueParams to make it easier to understand args. (#2578) 2024-03-14 10:27:24 +05:30
Paul Wells
ad341d41f5 start telemetry participant worker to collect signal stats (#2538)
* start telemetry participant worker to collect signal stats

* format

* resolve room

* tidy
2024-03-03 02:47:51 -08:00
Denys Smirnov
f5eb6c8a95 Update usage of core.Fuse. (#2519) 2024-02-28 03:48:58 +02:00
Raja Subramanian
7649e4ffab Post data and signal stats once in 5 minutes (#2518) 2024-02-27 15:45:32 +05:30
Paul Wells
e5b8e25064 use shared psrpc utils (#2506)
* use shared psrpc utils

* fix

* deps
2024-02-24 00:38:49 -08:00
Raja Subramanian
5ac5bd236a Let track events go through after participant close. (#2487)
* Let track events go through after participant close.

Also, reducing lock scope in telemetry service.

* use shadow
2024-02-17 13:40:07 +05:30
Mathew Kamkar
7508560fde larger buckets for jitter prometheus histogram (#2468) 2024-02-09 12:09:51 -08:00
Raja Subramanian
b71d373f4a Use Deque in ops queue. (#2418)
* Use Deque in ops queue.

Standardizing some uses
- Change OpsQueue to use Deque so that it can grow/shrink as necessary and
  need not worry about channel getting full and dropping events.
- Change StreamAllocator and TelemetryService to use OpsQueue so that
  they also need not worry about channel size and overflows.

* Address feedback

* delete obvious comment

* clean up
2024-01-28 13:48:30 +05:30
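
The motivation in b71d373f4a (a queue that grows as needed instead of a fixed-size channel that can fill up and drop events) can be sketched as below. This is an assumption-laden illustration: `opsQueue` and its methods are hypothetical, and a plain slice stands in for whatever deque type the repository actually uses.

```go
package main

import (
	"fmt"
	"sync"
)

// opsQueue is a hypothetical unbounded operations queue. Enqueue never
// blocks and never drops, because the backing slice grows on demand.
type opsQueue struct {
	mu   sync.Mutex
	ops  []func()
	wake chan struct{} // buffered(1): signals that work may be pending
	stop chan struct{}
	done chan struct{}
}

func newOpsQueue() *opsQueue {
	q := &opsQueue{
		wake: make(chan struct{}, 1),
		stop: make(chan struct{}),
		done: make(chan struct{}),
	}
	go q.run()
	return q
}

// Enqueue appends an operation and nudges the worker without blocking.
// Operations enqueued after Stop are not guaranteed to run.
func (q *opsQueue) Enqueue(op func()) {
	q.mu.Lock()
	q.ops = append(q.ops, op)
	q.mu.Unlock()
	select {
	case q.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

func (q *opsQueue) run() {
	defer close(q.done)
	for {
		select {
		case <-q.wake:
			q.drain()
		case <-q.stop:
			q.drain() // run whatever is still queued before exiting
			return
		}
	}
}

// drain runs queued operations in FIFO order until the queue is empty.
func (q *opsQueue) drain() {
	for {
		q.mu.Lock()
		if len(q.ops) == 0 {
			q.mu.Unlock()
			return
		}
		op := q.ops[0]
		q.ops[0] = nil // release the reference for the garbage collector
		q.ops = q.ops[1:]
		q.mu.Unlock()
		op()
	}
}

// Stop asks the worker to finish pending operations and waits for it.
func (q *opsQueue) Stop() {
	close(q.stop)
	<-q.done
}

func main() {
	q := newOpsQueue()
	var processed []int
	for i := 0; i < 10000; i++ {
		i := i
		q.Enqueue(func() { processed = append(processed, i) })
	}
	q.Stop()
	fmt.Println("processed:", len(processed))
}
```

The trade-off is that memory is no longer bounded by a channel capacity, so a stalled consumer shows up as queue growth rather than as dropped events.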
Paul Wells
c726cbf2ba increase max session start time bin size (#2380) 2024-01-12 03:49:23 -08:00
Paul Wells
2fe2a9c9f2 add session start time metric (#2377) 2024-01-11 23:23:51 -08:00
shishirng
3770fbce64 Analytics: send local node room state/info (#2335)
* Analytics: send local node room state/info

Signed-off-by: shishir gowda <shishir@livekit.io>
2023-12-22 18:59:04 -05:00
Raja Subramanian
dcff75a516 Record number of data messages in prometheus. (#2282) 2023-12-01 16:10:57 +05:30
Raja Subramanian
53542b09a0 Participant traffic load. (#2262)
* Participant traffic load.

Capturing information about participant traffic
- Upstream/Downstream
- Audio/Video/Data
- Packets/Bytes

This captures a notion of how much traffic load a participant is
generating.

Can be used to make allocation decisions.

* Clean up

* SIP patches

* reporter goroutine

* unlock

* move traffic stats from protocol

* check type
2023-11-26 23:05:00 +05:30
Raja Subramanian
56dd399684 Use a worker to report signal/data stats. (#2260)
* Use a worker to report signal/data stats.

Was checking if reporting is needed on every update.
The check is wasted work if volume of signal/data messages is high
as reporting happens only once in 10 seconds.

Changing to a worker based on a timer. And also aligning with
telemetry reporting interval which defaults to 30 seconds.

* Remove unused constant
2023-11-22 11:47:15 +05:30
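
The change described in 56dd399684 (a cheap counter update on the hot path, with reporting moved to a timer-driven worker) might look roughly like this. It is a sketch under assumptions: `byteReporter`, `newByteReporter`, and `defaultReportInterval` are hypothetical names, with the interval set to the 30 second default the commit message mentions. It uses `atomic.Uint64`, which requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// defaultReportInterval mirrors the 30 second telemetry interval mentioned
// in the commit message. The constant name is hypothetical.
const defaultReportInterval = 30 * time.Second

// byteReporter accumulates byte counts and reports them from a worker
// goroutine on a timer, so Add does no reporting work itself.
type byteReporter struct {
	pending atomic.Uint64
	stop    chan struct{}
	done    chan struct{}
}

func newByteReporter(interval time.Duration, report func(bytes uint64)) *byteReporter {
	r := &byteReporter{stop: make(chan struct{}), done: make(chan struct{})}
	go func() {
		defer close(r.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				r.flush(report)
			case <-r.stop:
				r.flush(report) // final flush so nothing is lost on close
				return
			}
		}
	}()
	return r
}

// flush reports and resets the pending count, skipping empty intervals.
func (r *byteReporter) flush(report func(bytes uint64)) {
	if n := r.pending.Swap(0); n > 0 {
		report(n)
	}
}

// Add is the hot path: a single atomic add per message.
func (r *byteReporter) Add(n uint64) { r.pending.Add(n) }

// Close stops the worker after one last flush.
func (r *byteReporter) Close() {
	close(r.stop)
	<-r.done
}

func main() {
	var total uint64
	// A short interval keeps the demo quick; production code would
	// presumably use something like defaultReportInterval.
	r := newByteReporter(10*time.Millisecond, func(n uint64) { total += n })
	for i := 0; i < 1000; i++ {
		r.Add(100)
	}
	time.Sleep(30 * time.Millisecond)
	r.Close()
	fmt.Println("total bytes reported:", total)
}
```

The report callback runs only on the worker goroutine, so it needs no extra locking as long as its results are read after `Close` returns.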
Paul Wells
f4a984d446 preallocate prometheus packet counters (#1942) 2023-08-08 01:06:14 -07:00
David Zhao
981fb7cac7 Adding license notices (#1913)
* Adding license notices

* remove from config
2023-07-27 16:43:19 -07:00
Benjamin Pracht
552e3758d5 Add IngressUpdated event (#1775) 2023-06-16 10:58:49 -07:00
David Zhao
f71544e27a Do not send ParticipantJoined webhook if connection was resumed (#1795)
* Do not send ParticipantJoined webhook if connection was resumed

* isResume -> isMigration
2023-06-15 15:39:04 -07:00
shishirng
2dd4e1365b Send EgressUpdated event (#1792)
Signed-off-by: shishir gowda <shishir@livekit.io>
2023-06-14 18:56:07 -04:00
David Zhao
7e5a7ae79f Fixed windows build (#1768) 2023-06-04 00:17:25 -07:00
Benjamin Pracht
e7879a46fc Add ingress telemetry support (#1763) 2023-06-02 17:38:19 -07:00