216 Commits

Author SHA1 Message Date
Paul Wells
c8bb2578be Rename log field "pID" to "participantID" for consistency (#4365)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 04:32:02 -07:00
Raja Subramanian
b81bac0ec3 Key telemetry stats worker using combination of roomID, participantID (#4323)
* Key telemetry stats worker using combination of roomID, participantID

With a forwarded participant, the same participantID can exist in two
rooms.

NOTE: This does not yet allow a participant session to report its
events/track stats into multiple rooms. That would require registering
multiple listeners (from rooms a participant is forwarded to).

* missed file

* data channel stats

* PR comments + pass in room name so that telemetry events have proper room name also
2026-02-16 13:56:13 +05:30
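The keying change described in the commit above can be pictured with a small Go sketch. This is not the code from the PR: `workerKey`, `statsWorker`, `registry` and `getOrCreate` are hypothetical names, and the IDs are made up. It only shows why a composite `(roomID, participantID)` key keeps a forwarded participant's workers separate per room.

```go
package main

import "fmt"

// workerKey is a hypothetical composite key; the real field names and types
// in the telemetry package may differ.
type workerKey struct {
	roomID        string
	participantID string
}

// statsWorker is a stand-in for the per-participant stats worker.
type statsWorker struct {
	roomName string
	packets  int
}

// registry maps (roomID, participantID) to a worker.
type registry struct {
	workers map[workerKey]*statsWorker
}

func newRegistry() *registry {
	return &registry{workers: make(map[workerKey]*statsWorker)}
}

// getOrCreate returns the worker for the pair, creating it on first use.
func (r *registry) getOrCreate(roomID, roomName, participantID string) *statsWorker {
	k := workerKey{roomID: roomID, participantID: participantID}
	w, ok := r.workers[k]
	if !ok {
		w = &statsWorker{roomName: roomName}
		r.workers[k] = w
	}
	return w
}

func main() {
	r := newRegistry()
	// The same participant ID appears in two rooms (a forwarded participant).
	a := r.getOrCreate("RM_a", "room-a", "PA_1")
	b := r.getOrCreate("RM_b", "room-b", "PA_1")
	a.packets += 10
	b.packets += 3
	fmt.Println(len(r.workers), a.packets, b.packets, a != b)
}
```

Keyed by participantID alone, the second lookup would have returned the first room's worker and the two rooms' stats would have been merged.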
Raja Subramanian
77c858f001 Log stats worker when it is not closable. (#4313) 2026-02-11 13:18:34 +05:30
Paul Wells
3cca718072 use separate allocation for signal stats telemetry guard (#4281) 2026-02-03 02:13:49 -08:00
Paul Wells
333f0349d1 clear reference guard when resetting signal stats (#4279) 2026-02-03 01:54:03 -08:00
Anunay Maheshwari
0c33b8c671 chore: move codecs/mime stuff to protocol (#4255) 2026-01-20 20:54:32 +05:30
Raja Subramanian
b8ddd0f98c Taking interface{} -> any modernize bits (#4204) 2025-12-28 05:22:12 +05:30
Raja Subramanian
cbb2c61787 Publish/Unpublish counter match. (#4173)
Published counter was bumped up only when not migrating in, but it was
decremented when a migrating participant leaves without expectation to
resume. That could have resulted in negative counts.

Always change counters irrespective of migration or expectation to resume
on leave. Control whether events are sent based on migration/resume.
2025-12-18 10:16:27 +05:30
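A minimal sketch of the symmetric counter handling described in the commit above, under the assumption that the counter and the event emission are decided independently. `publishStats`, `onPublish`, `onUnpublish` and the event strings are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// publishStats is a hypothetical stand-in for the published-track counter
// and the events emitted alongside it.
type publishStats struct {
	published int
	events    []string
}

// onPublish always bumps the counter; the event is sent only when the
// participant is not migrating in.
func (s *publishStats) onPublish(migratingIn bool) {
	s.published++
	if !migratingIn {
		s.events = append(s.events, "track_published")
	}
}

// onUnpublish always decrements the counter; the event is sent only when
// the participant is not expected to resume.
func (s *publishStats) onUnpublish(expectedToResume bool) {
	s.published--
	if !expectedToResume {
		s.events = append(s.events, "track_unpublished")
	}
}

func main() {
	s := &publishStats{}
	s.onPublish(true)    // migrating in: counted, no event
	s.onUnpublish(false) // leaves without resume: counted, event sent
	fmt.Println(s.published, s.events)
}
```

With the old asymmetric behaviour (increment skipped on migrate-in, decrement still applied), the same sequence would have left the counter at -1.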
Raja Subramanian
f01008f876 Revert telemetry stats worker wait configuration. (#4151)
Mostly reverting https://github.com/livekit/livekit/pull/4148. Leaving
the one bit to pass in a wait time to `Flush`.
2025-12-12 10:56:25 +05:30
Raja Subramanian
97099cae3e Configurable telemetry stats worker clean up wait. (#4148)
* Configurable telemetry stats worker clean up wait.

* make worker clean up wait setting atomic
2025-12-11 11:25:32 +05:30
Paul Wells
c6e6c0215f add debug metric for tracking references (#4134) 2025-12-07 11:39:21 -08:00
Raja Subramanian
7f10e18bac Record join/publish/subscribe cancellations. (#4102)
To get a better picture of success/failure rate.
2025-11-25 14:06:02 +05:30
Raja Subramanian
f8b994d491 Forwarding latency measurement tweaks. (#4080)
* Forwarding latency measurement tweaks.

- prom transmission type public
- do not measure short term values as they are not used, which potentially
  saves some lock contention time in the packet path. Adding a separate
  method for that.
- Change latency/jitter summary reporting to `ns` also to match the
  histogram.

* add GetShortStats
2025-11-13 18:39:49 +05:30
Raja Subramanian
4ce07bedeb Higher resolution forwarding latency histogram. (#4067)
* Higher resolution forwarding latency histogram.

Was using the average latency/jitter of last second to populate
forwarding latency/jitter histogram. But, it is too coarse, i. e. the
average value of latency/jitter is very low and those summarised samples
end up in the lowest bucket always.

A few things to address it
- record per packet forwarding latency in histogram
- adjust histogram bins to include smaller values
- Drop jitter histogram

This is a per packet call, but prometheus histogram is supposedly
fast/light weight. Would be good to get better resolution histograms.
Hence doing this. Please let me know if there are performance concerns.

* typo

* one more typo
2025-11-09 17:29:40 +05:30
Raja Subramanian
9d5c351d36 Fix prom units for forwarding latency/jitter. (#4045) 2025-11-02 14:38:25 +05:30
Raja Subramanian
e183657cff Add prom histogram for forwarding latency and jitter. (#4044)
* Add prom histogram for forwarding latency and jitter.

Using short term stats for histogram.

An example setting is
1s - short term
1m - long term

Using the 1s (short term) data for histogram. In that 1 second, all
packet forwarding latencies are averaged for latency and std. dev. of
the collection is used as jitter.

* try different staticcheck
2025-11-01 23:25:03 +05:30
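As a rough sketch of the short-term summary described in the commit above: latency is the mean of the window's per-packet forwarding latencies and jitter is their standard deviation. The function name `shortTermStats` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption; the commit message does not say which one is used.

```go
package main

import (
	"fmt"
	"math"
)

// shortTermStats returns the mean and the population standard deviation of
// the latencies (in nanoseconds) collected during one short-term window.
// Population std. dev. is an assumption made for this sketch.
func shortTermStats(latenciesNs []float64) (mean, jitter float64) {
	n := float64(len(latenciesNs))
	if n == 0 {
		return 0, 0
	}
	for _, l := range latenciesNs {
		mean += l
	}
	mean /= n

	var sumSq float64
	for _, l := range latenciesNs {
		d := l - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / n)
}

func main() {
	mean, jitter := shortTermStats([]float64{100, 200, 300})
	fmt.Printf("latency=%.1fns jitter=%.1fns\n", mean, jitter)
}
```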
Raja Subramanian
ca0d5ee972 Count request/response packets on both client and server side. (#4001)
Currently, the signal requests are counted on media side and signal
responses are counted on controller side. This does not provide the
granularity to check how many response messages each media node is
sending.

Seeing some cases where track subscriptions are slow under load. This
would be good to see if the media node is doing a lot of signal response
messages.
2025-10-14 16:58:36 +05:30
Paul Wells
b3ee219ccb fix stats worker closed condition (#3965)
* fix stats worker closed condition

* test

* tidy
2025-09-29 02:51:58 -07:00
Paul Wells
3d73703152 add idempotent reference count to telemetry stats worker (#3964)
* add idempotent reference guard to telemetry stats worker

* tidy

* sync

* tidy
2025-09-29 02:35:16 -07:00
Raja Subramanian
fa5f4ef33c Populate SDP cid in track info when available. (#3845)
* Populate SDP cid in track info when available.

- Adding SDP cid to TrackInfo. Browsers like FF use a different stream
  id for AddTrack and the actual SDP offer. So, have to look up using both
  on server side. To make it easier, store both (only if different) in
  TrackInfo.
- Use a codec in TrackInfo for audio also. There is some discussion
  around doing simulcast codec for audio so that something like PSTN can
  use G.711 without any transcoding. So, just keep it consistent between
  audio and video.
- Populate SDP cid when SDP offer is received. It could populate a
  pending track or an already published track if the new offer is for a
  back up codec where the primary codec is already published.
- Passed around parsed offer to more places to avoid parsing multiple
  times.
- Clean up MediaTrack interface a bit and remove unneeded methods.

* WIP

* WIP

* deps

* stream allocator mime aware

* clean up

* populate SDP cid before munging

* interface methods
2025-08-13 10:53:16 +05:30
Raja Subramanian
eed27885e5 Send participant_connection_aborted when participant session is closed (#3848)
* Send `participant_connection_aborted` when participant session is closed
without becoming `ACTIVE`.

There is one sticky case. If there is a migration and the migration
fails, this will send `participant_connection_aborted` even though the
participant may have connected properly on the previous node.

* deps
2025-08-13 10:36:31 +05:30
Raja Subramanian
10103449c5 Add country label to edge prom stats. (#3816)
* Add country label to edge prom stats.

* data channel country stats

* test

* pub/sub time country
2025-07-24 13:23:05 +05:30
Paul Wells
630aa7d970 implement observability for room metrics (#3712)
* implement observability for room metrics

* deps

* test

* test

* Raja feedback

* cleanup
2025-06-09 09:32:58 -07:00
Raja Subramanian
fc867c5b8e Webhook prom stats (#3697) 2025-06-04 14:31:28 -07:00
Benjamin Pracht
28dfac14e0 Use exported GetEgressNotifyOptions (#3604) 2025-04-11 09:45:27 -07:00
Benjamin Pracht
e5cbb22777 Allow specifying extra webhooks with egress requests (#3597) 2025-04-09 16:20:21 -07:00
Raja Subramanian
1c8307c72c Use cgroup for memstats. (#3573)
* Use cgroup for memstats.

* deps
2025-04-05 11:54:36 +05:30
Raja Subramanian
3238ab8d77 Calculate rates for memory used and total. (#3570)
Calculating rate for total does seem odd, but keeping it consistent/lined
up with used memory calculation.
2025-04-02 10:23:38 +05:30
Raja Subramanian
8cc17f8f8b Rework node stats a bit. (#3555)
* Rework node stats a bit.

Related protocol PR - https://github.com/livekit/protocol/pull/1023

- Make a config for node stats measurements. Wanted to put the config in
  `routing` package, but a circular dependency forced me to put in
   config.go
- Make rate calculations explicit, i. e. requested via config.
  Previously, it had some odd checks to decide when to calculate rate
  and it would have been calculating over different windows.
- Report signal/data channel bytes every 5 seconds to stats collection
  module. Previously, it was doing it every 30 seconds and that meant
  some windows could have had a large spike
  NOTE: Still need to think about this for load calculations as a large
  number of participants leaving could flush in a small window and that
  could report a large spike in bytes/packets. Maybe need to ignore
  signal bytes for load calculation?

* deps

* use default node stats config if given config is nil

* split out node stats into a struct for re-use

* update config
2025-03-27 12:42:19 +05:30
Raja Subramanian
fe673bb257 Send regressed codec upstream stats to analytics. (#3532)
* Send regressed codec upstream stats to analytics.

There is more work to do for analytics for simulcast codec, i. e. when
both codecs are published. Shorter term, ensure that analytics are sent
for regressed codec if active.

* lock access to regressed codec received
2025-03-18 12:46:24 +05:30
Paul Wells
3167266495 add datapacket stream metrics (#3450)
* add datapacket stream metrics

* normalize mime type
2025-02-19 22:28:10 -08:00
Raja Subramanian
1ae2e48c2e Webhook analytics event. (#3423)
* Webhook analytics event.

* deps

* generate

* nil notifier
2025-02-13 10:39:45 +05:30
Raja Subramanian
9551c52c85 Try 2 to consolidate mime type (#3407)
* Normalize mime type and add utilities.

An attempt to normalize mime types and avoid string compares that rely on
remembering to do a case-insensitive search.

Not the best solution. Open to ideas. But, define our own mime types
(just in case Pion changes things and Pion also does not have red mime
type defined which should be easy to add though) and tried to use it everywhere.
But, as we get a bunch of callbacks and info from Pion, needed conversion in
more places than I anticipated. And also makes it necessary to carry
that cognitive load of what comes from Pion and needing to process it
properly.

* more locations

* test

* Paul feedback

* MimeType type

* more consolidation

* Remove unused

* test

* test

* mime type as int

* use string method

* Pass error details and timeouts. (#3402)

* go mod tidy (#3408)

* Rename CHANGELOG to CHANGELOG.md (#3391)

Enables markdown features in this otherwise already markdown'ish formatted document

* Update config.go to properly process bool env vars (#3382)

Fixes issue https://github.com/livekit/livekit/issues/3381

* fix(deps): update go deps (#3341)

Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Use a Twirp server hook to send API call details to telemetry. (#3401)

* Use a Twirp server hook to send API call details to telemetry.

* mage generate and clean up

* Add project_id

* deps

* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
  participant_id, track_id as appropriate
- Store status as int

* deps

* Update pkg/sfu/mime/mimetype.go

* Fix prefer codec test

* handle down track mime changes

---------

Co-authored-by: Denys Smirnov <dennwc@pm.me>
Co-authored-by: Philzen <Philzen@users.noreply.github.com>
Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
Co-authored-by: cnderrauber <zengjie9004@gmail.com>
2025-02-10 10:44:15 +05:30
Raja Subramanian
99afbf587b Use a Twirp server hook to send API call details to telemetry. (#3401)
* Use a Twirp server hook to send API call details to telemetry.

* mage generate and clean up

* Add project_id

* deps

* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
  participant_id, track_id as appropriate
- Store status as int

* deps
2025-02-07 16:16:41 +05:30
Raja Subramanian
14e65f1459 AnalyticsEvent for generic reports (#3400)
* AnalyticsEvent for generic reports

* deps
2025-02-05 10:13:43 +05:30
Raja Subramanian
3b0077f2fe Log connection quality changes. (#3311)
Also remove the connection quality drop prom as it is unused and also
adds state/complexity.
2025-01-07 10:58:31 +05:30
cnderrauber
54f9f7de51 upgrade to pion/webrtc v4 (#3213) 2024-11-28 16:05:38 +08:00
Raja Subramanian
baf47db834 Publish data and signal bytes once every 30 seconds. (#3212)
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it look like there is a massive rate spike.

While this change is not a fix, this should soften the impact.

Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
2024-11-28 09:21:44 +05:30
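The arithmetic behind the example in the commit above can be worked through in a few lines of Go. The helper `mbps` is hypothetical; the 5-second rate window is taken from the "2 - 5 seconds" range mentioned in the message.

```go
package main

import "fmt"

// mbps converts a byte count accumulated over a window into megabits/second.
func mbps(bytes, windowSeconds float64) float64 {
	return bytes * 8 / windowSeconds / 1e6
}

func main() {
	// A 100 Mbps burst lasting 10 seconds carries 125 MB.
	burstBytes := 100e6 / 8 * 10

	// Rate over the full 5-minute reporting window: the burst is smoothed out.
	fmt.Printf("%.2f Mbps\n", mbps(burstBytes, 300))

	// The same bytes flushed at once and attributed to a 5-second rate
	// window: looks like a massive spike.
	fmt.Printf("%.2f Mbps\n", mbps(burstBytes, 5))
}
```

Averaged over 300 seconds the burst reads as about 3.33 Mbps, while the same bytes attributed to a 5-second window read as 200 Mbps, which is twice the real peak.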
Raja Subramanian
cc22306047 Attempt to fix missing participant left webhook. (#3173)
On a resume, the signal stats will call `ParticipantLeft`. Although it
explicitly says not to send events, it could still close the stats
worker.

To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.

A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.

I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
2024-11-14 10:59:15 +05:30
Raja Subramanian
86383b2271 De-centralize some configs to where they are used. (#3162)
* De-centralize some configs to where they are used.

And make default variables.

Renaming a bit, but these are all internal config and have not been
added to documented config.

* Keep documented config as is.

* test

* typo
2024-11-08 12:47:30 +05:30
Raja Subramanian
365e63230d Some misc clean up. (#3156)
* Some misc clean up.

- Have been seeing counterfeiter warnings about efficiency for a while
  with go:generate declaration multiple times in the same package.
  Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`

* spacing

* revert some deletes as the complaint was in analytics service only

* Declare in package only once.

Although the warning is about go:generate multiple times when directly
giving the interface to generate, having `go:generate` multiple times in a
package even with `-generate` ends up generating once per invocation.
Once per package is enough to run the generation just once.
2024-11-04 11:26:41 +05:30
Raja Subramanian
49b75e94a6 Consolidate operations on LocalNode. (#3140) 2024-10-25 18:57:23 +05:30
cnderrauber
cf59267631 Add counter for pub&sub time metrics (#3084)
* Add counter for pub&sub time metrics

The pub&sub time metrics show large values in migration-related cases
such as muted/disabled migration, where the subscription time depends on
when the publisher unmutes the track (i.e. sends RTP packets after
migration). Add a counter to distinguish these cases, since we can't
control the time in them and the first subscription attempt is more
meaningful.

* Add info log for high publish delay
2024-10-11 12:07:24 +08:00
Paul Wells
4deaac2f3f replace proto.Clone calls (#3024)
* replace proto.Clone calls

* deps

* tests
2024-09-18 22:47:33 -07:00
cnderrauber
978db00034 Add sdk, participant_kind to pub sub metrics (#3023)
* exclude go client from track publication metric

* add sdk,participant_kind labels

* fix test
2024-09-19 10:42:47 +08:00
Raja Subramanian
787b8450e9 Record out-of-packet count/rate in prom. (#2980)
* Record out-of-packet count/rate in prom.

Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.

Will set up a protocol PR if it is okay.

* deps
2024-09-07 00:19:54 +05:30
Raja Subramanian
bec7453a1f Recreate stats worker on resume if needed. (#2982)
* Ref count the stats worker.

NOTE: Don't like this much, but wanted to open this to get some 👀 on
this and get feedback.

There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
event.

In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.

The closed stats worker got reaped in `FlushStats` after three minutes.

So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.

Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.

* create a stats worker on resume

* revert incorrect changes

* transfer connected state

* transfer connected state when creating worker

* resolve participant on a resume
2024-09-06 23:58:03 +05:30
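The ref-counting idea from the commit above, sketched in Go under simplifying assumptions. `refCountedWorker`, `join` and `leave` are hypothetical names; the real worker carries much more state, and the PR's actual implementation may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedWorker is a hypothetical stand-in for the stats worker. Each
// entity that reports ParticipantJoined (signal stats, media stats) holds
// one reference.
type refCountedWorker struct {
	mu     sync.Mutex
	refs   int
	closed bool
}

// join is called on ParticipantJoined and takes a reference.
func (w *refCountedWorker) join() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.refs++
}

// leave is called on ParticipantLeft. It drops a reference and closes the
// worker only when the count reaches zero. It reports whether the worker
// is closed.
func (w *refCountedWorker) leave() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.refs > 0 {
		w.refs--
	}
	if w.refs == 0 {
		w.closed = true
	}
	return w.closed
}

func main() {
	w := &refCountedWorker{}
	w.join() // signal stats
	w.join() // media stats

	// On resume the old web socket closes and signal stats call leave.
	fmt.Println(w.leave()) // worker stays open for the media stats
	fmt.Println(w.leave()) // last reference gone, worker closes
}
```

Without the count, the first `leave` would close the worker and later events for the session would have nowhere to go.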
cnderrauber
947e8f5909 Speed up track publication (#2952)
* speed up track publication

Add metrics for track publication and subscription

Return EnabledCodecs in JoinResponse so client can
choose codec without server side codec fallback

Cache remote webrtc track without AddTrackRequest to
let client send publisher offer before AddTrackRequest response

* go mod

* clean code
2024-08-23 18:38:32 +08:00
Paul Wells
afda860162 prevent race in telemetry worker cleanup (#2879) 2024-07-18 03:37:45 -07:00
Lukas Herman
8a229fda9d add participant session duration metric (#2801) 2024-06-17 17:52:08 -04:00