Commit Graph

70 Commits

Author SHA1 Message Date
Raja Subramanian b81bac0ec3 Key telemetry stats worker using combination of roomID, participantID (#4323)
Test / test (push) Failing after 17s
Release to Docker / docker (push) Failing after 3m42s
* Key telemetry stats work using combination of roomID, participantID

With forwarded participant, the same participantID can existing in two
rooms.

NOTE: This does not yet allow a participant session to report its
events/track stats into multiple rooms. That would require regitering
multiple listeners (from rooms a participant is forwarded to).

* missed file

* data channel stats

* PR comments + pass in room name so that telemetry events have proper room name also
2026-02-16 13:56:13 +05:30
Anunay Maheshwari 0c33b8c671 chore: move codecs/mime stuff to protocol (#4255) 2026-01-20 20:54:32 +05:30
Raja Subramanian b8ddd0f98c Taking interface{} -> any modernize bits (#4204) 2025-12-28 05:22:12 +05:30
Raja Subramanian cbb2c61787 Publish/Unpublish counter match. (#4173)
Published counter was bumped up only when not migrating in, but it was
decremented when a migrating participant leaves without expectation to
resume. That could have resulted in negative counts.

Always change counters irrespective of migration or expected to resume
on leave. Control events send based on migration/resume.
2025-12-18 10:16:27 +05:30
Raja Subramanian f01008f876 Revert telemetry stats worker wait configuration. (#4151)
Mostly reverting https://github.com/livekit/livekit/pull/4148. Leaving
the one bit to pass in a wait time to `Flush`.
2025-12-12 10:56:25 +05:30
Raja Subramanian 97099cae3e Configurable telemetry stats worker clean up wait. (#4148)
* Configurable telemetry stats worker clean up wait.

* make worker clean up wait setting atomic
2025-12-11 11:25:32 +05:30
Paul Wells 3d73703152 add idempotent reference count to telemetry stats worker (#3964)
* add idempotent reference guard to telemetry stats worker

* tidy

* sync

* tidy
2025-09-29 02:35:16 -07:00
Benjamin Pracht e5cbb22777 Allow specifying extra webhooks with egress requests (#3597) 2025-04-09 16:20:21 -07:00
Raja Subramanian 8cc17f8f8b Rework node stats a bit. (#3555)
* Rework node stats a bit.

Related protocol PR - https://github.com/livekit/protocol/pull/1023

- Make a config for node stats measurements. Wanted to put the config in
  `routing` package, but a circular dependency forced me to put in
   config.go
- Make rate calculations explicit, i. e. requested via config.
  Previously, it had some odd checks to decide when to calculate rate
  and it would have been calculating over different windows.
- Report signal/data channel bytes every 5 seconds to stats collection
  module. Previously, it was doing it every 30 seconds and that meant
  some windows could have had a large spike
  NOTE: Still need to think about this for load calculations as a large
  number of participants leaving could flush in a small window and that
  could report a large spike in bytes/packets. Maybe need to ignore
  signal bytes for load calculation?

* deps

* use default node stats config if given config is nil

* split out node stats into a struct for re-use

* update config
2025-03-27 12:42:19 +05:30
Raja Subramanian 1ae2e48c2e Webhook analytics event. (#3423)
* Webhook analytics event.

* deps

* generate

* nil notifier
2025-02-13 10:39:45 +05:30
Raja Subramanian 9551c52c85 Try 2 to consolidate mime type (#3407)
* Normalize mime type and add utilities.

An attempt to normalize mime type and avoid string compares remembering
to do case insensitive search.

Not the best solution. Open to ideas. But, define our own mime types
(just in case Pion changes things and Pion also does not have red mime
type defined which should be easy to add though) and tried to use it everywhere.
But, as we get a bunch of callbacks and info from Pion, needed conversion in
more places than I anticipated. And also makes it necessary to carry
that cognitive load of what comes from Pion and needing to process it
properly.

* more locations

* test

* Paul feedback

* MimeType type

* more consolidation

* Remove unused

* test

* test

* mime type as int

* use string method

* Pass error details and timeouts. (#3402)

* go mod tidy (#3408)

* Rename CHANGELOG to CHANGELOG.md (#3391)

Enables markdown features in this otherwise already markdown'ish formatted document

* Update config.go to properly process bool env vars (#3382)

Fixes issue https://github.com/livekit/livekit/issues/3381

* fix(deps): update go deps (#3341)

Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Use a Twirp server hook to send API call details to telemetry. (#3401)

* Use a Twirp server hook to send API call details to telemetry.

* mage generate and clean up

* Add project_id

* deps

* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
  participant_id, track_id as appropriate
- Store status as int

* deps

* Update pkg/sfu/mime/mimetype.go

* Fix prefer codec test

* handle down track mime changes

---------

Co-authored-by: Denys Smirnov <dennwc@pm.me>
Co-authored-by: Philzen <Philzen@users.noreply.github.com>
Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
Co-authored-by: cnderrauber <zengjie9004@gmail.com>
2025-02-10 10:44:15 +05:30
Raja Subramanian 99afbf587b Use a Twirp server hook to send API call details to telemetry. (#3401)
* Use a Twirp server hook to send API call details to telemetry.

* mage generate and clean up

* Add project_id

* deps

* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
  participant_id, track_id as appropriate
- Store status as int

* deps
2025-02-07 16:16:41 +05:30
Raja Subramanian 14e65f1459 AnalyticsEvent for generic reports (#3400)
* AnalyticsEvent for generic reports

* deps
2025-02-05 10:13:43 +05:30
Raja Subramanian baf47db834 Publish data and signal bytes once every 30 seconds. (#3212)
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it looks like there is a massive rate spike.

While this change is not a fix, this should soften the impact.

Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
2024-11-28 09:21:44 +05:30
Raja Subramanian cc22306047 Attempt to fix missing participant left webhook. (#3173)
On a resume, the signal stats will call `ParticipantLeft`. Although, it
explicity says not to send events, it could still close the stats
worker.

To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.

A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.

I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
2024-11-14 10:59:15 +05:30
Raja Subramanian 86383b2271 De-centralize some configs to where they are used. (#3162)
* De-centralize some configs to where they are used.

And make default variables.

Renaming a bit, but these are all internal config and have not been
added to documented config.

* Keep documented config as is.

* test

* typo
2024-11-08 12:47:30 +05:30
Raja Subramanian 365e63230d Some misc clean up. (#3156)
* Some misc clean up.

- Have been seeing counterfeiter warnings about efficiency for a while
  with go:generate declaration multiple times in the same package.
  Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`

* spacing

* revert some deletes as the complaint was in analytics service only

* Declare in package only once.

Although the warning is about go:generate multiple times when directly
giving the interface to generate, have `go:generate` multiple times in a
package even with `-generate` ends up generating once per invocation.
Once per package is enough to run the generation just once.
2024-11-04 11:26:41 +05:30
Raja Subramanian bec7453a1f Recreate stats worker on resume if needed. (#2982)
* Ref count the stats worker.

NOTE: Don't liek this much, but wanted to open this get some 👀 on
this and get feedback.

There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
event.

In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.

The closed stats worker got reaped in `FlushStats` after three minutes.

So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.

Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.

* create a stats worker on resume

* revert incorrect changes

* transfer connected state

* transfer connected state when creating worker

* resolve participant on a resume
2024-09-06 23:58:03 +05:30
Paul Wells afda860162 prevent race in telemetry worker cleanup (#2879) 2024-07-18 03:37:45 -07:00
Paul Wells 9a5db132eb add room/participant name limit (#2704)
* add room/participant name limit

* defaults

* simplify

* omitempty

* handle 0 config

* fix race

* unlock

* tidy
2024-05-06 17:25:18 -07:00
Paul Wells ac1b0e38ca store active stats workers in list (#2690)
* store active stats workers in list

* test

* single node cleanup

* cleanup

* cleanup

* cleanup
2024-04-26 17:24:10 -07:00
Raja Subramanian 14321f21bf Make OpsQueueParams to make it easier to understand args. (#2578) 2024-03-14 10:27:24 +05:30
Paul Wells ad341d41f5 start telemetry participant worker to collect signal stats (#2538)
* start telemetry participant worker to collect signal stats

* format

* resolve room

* tidy
2024-03-03 02:47:51 -08:00
Raja Subramanian 5ac5bd236a Let track events go through after participant close. (#2487)
* Let track events go through after participant close.

Also, reducing lock scope in telemetry service.

* use shadow
2024-02-17 13:40:07 +05:30
Raja Subramanian b71d373f4a Use Deque in ops queue. (#2418)
* Use Seque in ops queue.

Standardizing some uses
- Change OpsQueue to use Deque so that it can grow/shrink as necessary and
  need not worry about channel getting full and dropping events.
- Change StreamAllocator and TelemetryService to use OpsQueue so that
  they also need not worry about channel size and overflows.

* Address feedback

* delete obvious comment

* clean up
2024-01-28 13:48:30 +05:30
shishirng 3770fbce64 Analytics: send local node room state/info (#2335)
* Analytics: send local node room state/info

Signed-off-by: shishir gowda <shishir@livekit.io>
2023-12-22 18:59:04 -05:00
David Zhao 981fb7cac7 Adding license notices (#1913)
* Adding license notices

* remove from config
2023-07-27 16:43:19 -07:00
Benjamin Pracht 552e3758d5 Add IngressUpdated event (#1775) 2023-06-16 10:58:49 -07:00
David Zhao f71544e27a Do not send ParticipantJoined webhook if connection was resumed (#1795)
* Do not send ParticipantJoined webhook if connection was resumed

* isResume -> isMigration
2023-06-15 15:39:04 -07:00
Benjamin Pracht e7879a46fc Add ingress telemetry support (#1763) 2023-06-02 17:38:19 -07:00
David Colburn 2ccee369a6 update notifier (#1702) 2023-05-09 20:52:22 -07:00
David Zhao 40ceddd18b Integrate QueuedNotifier, fixes out-of-order delivery (#1615) 2023-04-15 01:20:23 -07:00
David Colburn 108b251045 egress updated webhook (#1555) 2023-03-27 16:34:44 -07:00
David Zhao 5ff72a99b9 Report publish & subscribe RTPStats as Telemetry events (#1506) 2023-03-10 10:28:54 -08:00
shishirng 8856ce6422 Bump up interval for sending telemetry stats to 30 seconds (#1430)
Signed-off-by: shishir gowda <shishir@livekit.io>
2023-02-16 15:53:58 -05:00
David Zhao 2851a8ac98 Improved robustness of subscription stack (#1382)
UpdateSubscription had a shortcoming where when it couldn't find the
participant, it ignored the request.

This PR further removes the reliance of current publisher state from
subscribers.
- SubscribeToTrack only takes in a trackID
- Introduced RoomTrackManager to maintain all published tracks to a room
- Added TrackUnpublished event to clearly indicate when a track has been removed
- SubscribeRequested event no longer include information about the publisher
2023-02-06 18:08:26 -08:00
cnderrauber 8b6dab780c Add reconnect reason and signal rtt calculation (#1381)
* Add connect reason and signal rtt calculate

* Update protocol

* solve comment
2023-02-06 11:12:25 +08:00
David Zhao b023c531c2 Fix incorrect unsubscribed track telemetry (#1350)
- only log unsubscribe on close if Track was actually bound
- update subscribed counters even if a failure had been logged before
2023-01-30 10:16:21 -08:00
David Zhao cd6b8b80b9 feat: SubscriptionManager to consolidate subscription handling (#1317)
Added a new manager to handle all subscription needs. Implemented using reconciler pattern. The goals are:

improve subscription resilience by separating desired state and current state
reduce complexity of synchronous processing
better detect failures with the ability to trigger full reconnect
2023-01-24 23:06:16 -08:00
Paul Wells 1ef7c46fd7 publish stream stats to prometheus (#1313)
* add prometheus stats for rtt/jitter/packet loss

* add track source to metrics

* better packet loss bins

* add track type to metrics

* remove source from AnalyticsStat

* regenerate telemetry service fake

* compute loss from per stream packet count
2023-01-19 19:37:15 -08:00
David Zhao 732309a8c1 Added track success & muted events (#1308)
Related to livekit/protocol#273

This PR adds:
- ParticipantResumed - for when ICE restart or migration had occurred
- TrackPublishRequested - when we initiate a publication
- TrackSubscribeRequested - when we initiate a subscription
- TrackMuted - publisher muted track
- TrackUnmuted - publisher unmuted track
- TrackPublish/TrackSubcribe events will indicate when those actions have been successful, to differentiate.
2023-01-15 15:40:20 -08:00
David Zhao 120335da00 Allow skipping of sending ParticipantJoined analytics event (#1236)
In certain scenarios such as migration, we do not want a duplicate event
to be sent when the participant is reconnecting. The Prometheus metric
should still be updated though.
2022-12-18 22:09:20 -08:00
David Zhao 33902a9f2a Do not send ParticipantLeft webhook event unless connected successfully. (#1234)
Fixes #1130
2022-12-18 17:37:55 -08:00
shishirng 8dc5a899a9 Create stats worker for participant on Active if not exists (#1059)
on migration, participants don't send JOIN event, so create it in
PARTICIPANT_ACTIVE event
2022-09-29 17:26:43 -04:00
David Colburn 803046b882 Auto egress (#1011)
* auto egress

* fix room service test

* reuse StartTrackEgress

* add timestamp

* update prefixed filename explicitly

* update protocol

* clean up telemetry

* fix telemetry tests

* separate room internal storage

* auto participant egress

* remove custom template url

* fix internal key

* use map for stats workers

* remove sync.Map

* remove participant composite
2022-09-21 12:04:19 -07:00
Raja Subramanian 29039b4e76 Use a go routine to clean up stats workers. (#836)
* Use a go routine to clean up stats workers.

It is possible that certain events (like TrackUnpublished) can
happen after the participant is closed. For webhooks pertaining
to those events, need details like room name/id. So,reap stats
workers a little while after the participant left event happens.

* handle data race report

* log analytics worker reap

* debug log
2022-07-18 11:47:43 +05:30
David Colburn fbbcbe77df Remove recording (#811)
* remove recorder service

* update protocol
2022-07-05 18:39:32 -07:00
cnderrauber f958fbcc1c simulcast codecs support (#720)
simulcast codecs support 

Co-authored-by: David Zhao <dz@livekit.io>
2022-05-27 19:55:50 +08:00
David Zhao 79296d0939 Fixed concurrent modification to map (#702)
Synchronizes access to stats worker maps. Previously it was accessed
from both OpsQueue goroutine and run() worker
2022-05-20 13:45:13 -07:00
David Zhao b821a0997d Use common logging init functions (#633)
* Use common logging init functions

* update protocol commit

* fix tests
2022-04-20 00:15:11 -07:00