livekit

mirror of https://github.com/livekit/livekit.git synced 2026-07-02 20:02:14 +00:00

Author	SHA1	Message	Date
Raja Subramanian	1b69630a28	Prometheus metric for join latency. (#4616 ) * Prometheus metric for join latency. Also including a couple of other failures in the signal connection path and moving the signal connected to after all that. Not doing counters for the new signal failure paths. I should not have done for the other two I added a little while ago also ( validation failure and start participant failure) as those are not scalable to keep adding to node stats. Will probably remove those two from node stats later. Can add those counters if they are useful. * deprecate signal failed counters	2026-06-22 22:07:32 +05:30
Raja Subramanian	8d2b827f44	Add prom metrics for peer connection state. (#4574 ) * Add prom metrics for peer connection state. By direction (PUBLISHER vs SUBSCRIBER) and state ("started" -> "connected"). This gives a way to track peer connections failing to finish establishment. The RTC active count can be useful for primary peer connection, but not for non-primary. This counter can be used to track any and can generally be used to understand success/failure rate of peer connection establishment. * add a couple of more states * clean up and avoid duplicate reporting fully established * staticcheck	2026-06-09 16:11:03 +05:30
Raja Subramanian	835ef1b353	Metrics for participant active, i. e. fully established. (#4557 ) * Metrics for participant active, i. e. fully established. - Egress stub for v2 API - Fix the participant canceled counter 🤦 - Add active counter -> this is incremented when a participant becomes active, i. e. primary peer connection established. Can be used to monitor node wise connection establishment issues. - Add signalling validation fail counter. With this, we have - signalling validation fail - signalling failed --> this is when the `startSession` fails - signalling connected -> signalling is successful and can send back joinResponse to client on media connection side - rtc_init -> start - rtc_connected -> participant session created (joined) - rtc_active -> primary peer connection established - rtc_canceled -> could not proceed with RTC connection due to not being able to resume. * signalling counters deps * revert pion/webrtc to 4.2.12 to get SCTP without interleaving * go back to pion/webrtc 4.2.11 and sctp 1.9.5	2026-06-03 19:50:19 +05:30
Paul Wells	2dd5e63207	telemetry: split webhook-processed hook out of NewTelemetryService (#4548 ) * telemetry: split webhook-processed hook registration out of NewTelemetryService NewTelemetryService used to register a notifier processed-hook on the inner *telemetryService directly. That made it impossible for downstream wrappers (e.g. cloud's TelemetryService that overrides Webhook to fan out to a v3 observability pipeline) to intercept webhook events without double-firing the legacy emission. Lift the registration into a new exported helper RegisterWebhookHook, and have the standalone server's wire provider createTelemetryService call it right after construction so behavior is unchanged for callers that don't wrap the service.	2026-05-27 09:40:55 -07:00
Ninad Pundalik	145689e627	Start tracking Twirp method request latency in prometheus too, not just in logs (#4545 ) * Start tracking Twirp method request latency in prometheus too, not just datadog * Simplify latency tracking, do it in the logger itself	2026-05-26 14:53:16 +05:30
cnderrauber	89faaeba82	Apply ttl check only when authenticate allocation creating (#4526 ) * Apply ttl check only when authenticate allocation creating The TTL check added in security enhancement #4505 could reject allocation/permission refreshes, causing long-lived sessions to disconnect when the TURN credential expired. Only check the TTL on allocation creation, to prevent abuse of leaked credentials while keeping long-lived sessions working.	2026-05-15 14:55:05 +08:00
Raja Subramanian	701a37c2d1	Convert sort.Slice -> slices.SortFunc (#4472 ) * Convert sort.Slice -> slices.SortFunc * active speaker loudness in descending order	2026-04-23 15:12:24 +05:30
Paul Wells	2a04bc3ca8	fix publisher frame count reporting for simulcast streams (#4457 )	2026-04-16 11:08:33 -07:00
Raja Subramanian	69aa94797b	Some drive-by clean up (#4452 )	2026-04-15 12:23:33 +05:30
Paul Wells	c8bb2578be	Rename log field "pID" to "participantID" for consistency (#4365 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 04:32:02 -07:00
Raja Subramanian	b81bac0ec3	Key telemetry stats worker using combination of roomID, participantID (#4323 ) * Key telemetry stats work using combination of roomID, participantID With forwarded participant, the same participantID can exist in two rooms. NOTE: This does not yet allow a participant session to report its events/track stats into multiple rooms. That would require registering multiple listeners (from rooms a participant is forwarded to). * missed file * data channel stats * PR comments + pass in room name so that telemetry events have proper room name also	2026-02-16 13:56:13 +05:30
Raja Subramanian	77c858f001	Log stats worker when it is not closable. (#4313 )	2026-02-11 13:18:34 +05:30
Paul Wells	3cca718072	use separate allocation for signal stats telemetry guard (#4281 )	2026-02-03 02:13:49 -08:00
Paul Wells	333f0349d1	clear reference guard when resetting signal stats (#4279 )	2026-02-03 01:54:03 -08:00
Anunay Maheshwari	0c33b8c671	chore: move codecs/mime stuff to protocol (#4255 )	2026-01-20 20:54:32 +05:30
Raja Subramanian	b8ddd0f98c	Taking interface{} -> any modernize bits (#4204 )	2025-12-28 05:22:12 +05:30
Raja Subramanian	cbb2c61787	Publish/Unpublish counter match. (#4173 ) Published counter was bumped up only when not migrating in, but it was decremented when a migrating participant leaves without expectation to resume. That could have resulted in negative counts. Always change counters irrespective of migration or expected to resume on leave. Control events send based on migration/resume.	2025-12-18 10:16:27 +05:30
Raja Subramanian	f01008f876	Revert telemetry stats worker wait configuration. (#4151 ) Mostly reverting https://github.com/livekit/livekit/pull/4148. Leaving the one bit to pass in a wait time to `Flush`.	2025-12-12 10:56:25 +05:30
Raja Subramanian	97099cae3e	Configurable telemetry stats worker clean up wait. (#4148 ) * Configurable telemetry stats worker clean up wait. * make worker clean up wait setting atomic	2025-12-11 11:25:32 +05:30
Paul Wells	c6e6c0215f	add debug metric for tracking references (#4134 )	2025-12-07 11:39:21 -08:00
Raja Subramanian	7f10e18bac	Record join/publish/subscribe cancellations. (#4102 ) To get better picture of success/failure rate.	2025-11-25 14:06:02 +05:30
Raja Subramanian	f8b994d491	Forwarding latency measurement tweaks. (#4080 ) * Forwarding latency measurement tweaks. - prom transmission type public - do not measure short term values as it is not used and saves some lock contention time in packet path potentially. Adding a separate method for that. - Change latency/jitter summary reporting to `ns` also to match the histogram. * add GetShortStats	2025-11-13 18:39:49 +05:30
Raja Subramanian	4ce07bedeb	Higher resolution forwarding latency histogram. (#4067 ) * Higher resolution forwarding latency histogram. Was using the average latency/jitter of last second to populate forwarding latency/jitter histogram. But, it is too coarse, i. e. the average value of latency/jitter is very low and those summarised samples end up in the lowest bucket always. A few things to address it - record per packet forwarding latency in histogram - adjust histogram bins to include smaller values - Drop jitter histogram This is a per packet call, but prometheus histogram is supposedly fast/light weight. Would be good to get better resolution histograms. Hence doing this. Please let me know if there are performance concerns. * typo * one more typo	2025-11-09 17:29:40 +05:30
Raja Subramanian	9d5c351d36	Fix prom units for forwarding latency/jitter. (#4045 )	2025-11-02 14:38:25 +05:30
Raja Subramanian	e183657cff	Add prom histogram for forwarding latency and jitter. (#4044 ) * Add prom histogram for forwarding latency and jitter. Using short term stats for histogram. An example setting is 1s - short term 1m - long term Using the 1s (short term) data for histogram. In that 1 second, all packet forwarding latencies are averaged for latency and std. dev. of the collection is used as jitter. * try different staticcheck	2025-11-01 23:25:03 +05:30
Raja Subramanian	ca0d5ee972	Count request/response packets on both client and server side. (#4001 ) Currently, the signal requests are counted on media side and signal responses are counted on controller side. This does not provide the granularity to check how many response messages each media node is sending. Seeing some cases where track subscriptions are slow under load. This would be good to see if the media node is doing a lot of signal response messages.	2025-10-14 16:58:36 +05:30
Paul Wells	b3ee219ccb	fix stats worker closed condition (#3965 ) * fix stats worker closed condition * test * tidy	2025-09-29 02:51:58 -07:00
Paul Wells	3d73703152	add idempotent reference count to telemetry stats worker (#3964 ) * add idempotent reference guard to telemetry stats worker * tidy * sync * tidy	2025-09-29 02:35:16 -07:00
Raja Subramanian	fa5f4ef33c	Populate SDP cid in track info when available. (#3845 ) * Populate SDP cid in track info when available. - Adding SDP cid to TrackInfo. Browsers like FF uses a different stream id for AddTrack and actual SDP offer. So, have to look up using both on server side. To make it easier, store both (only if different) in TrackInfo. - Use a codec in TrackInfo for audio also. There is some discussion around doing simulcast codec for audio so that something like PSTN can use G.711 without any transcoding. So, just keep it consistent between audio and video. - Populate SDP cid when SDP offer is received. It could populate a pending track or an already published track if the new offer is for a back up codec where the primary codec is already published. - Passed around parsed offer to more places to avoid parsing multiple times. - Clean up MediaTrack interface a bit and remove unneeded methods. * WIP * WIP * deps * stream allocator mime aware * clean up * populate SDP cid before munging * interface methods	2025-08-13 10:53:16 +05:30
Raja Subramanian	eed27885e5	Send `participant_connection_aborted` when participant session is closed (#3848 ) * Send `participant_connection_aborted` when participant session is closed without becoming `ACTIVE`. There is one sticky case. If there is a migration and the migration fails, this will send `participant_connection_aborted` even though the participant may have connected properly on the previous node. * deps	2025-08-13 10:36:31 +05:30
Raja Subramanian	10103449c5	Add country label to edge prom stats. (#3816 ) * Add country label to edge prom stats. * data channel country stats * test * pub/sub time country	2025-07-24 13:23:05 +05:30
Paul Wells	630aa7d970	implement observability for room metrics (#3712 ) * implement observability for room metrics * deps * test * test * Raja feedback * cleanup	2025-06-09 09:32:58 -07:00
Raja Subramanian	fc867c5b8e	Webhook prom stats (#3697 )	2025-06-04 14:31:28 -07:00
Benjamin Pracht	28dfac14e0	Use exported GetEgressNotifyOptions (#3604 )	2025-04-11 09:45:27 -07:00
Benjamin Pracht	e5cbb22777	Allow specifying extra webhooks with egress requests (#3597 )	2025-04-09 16:20:21 -07:00
Raja Subramanian	1c8307c72c	Use cgroup for memstats. (#3573 ) * Use cgroup for memstats. * deps	2025-04-05 11:54:36 +05:30
Raja Subramanian	3238ab8d77	Calculate rates for memory used and total. (#3570 ) Calculating rate for total does seem odd, but keeping it consistent/lined up with used memory calculation.	2025-04-02 10:23:38 +05:30
Raja Subramanian	8cc17f8f8b	Rework node stats a bit. (#3555 ) * Rework node stats a bit. Related protocol PR - https://github.com/livekit/protocol/pull/1023 - Make a config for node stats measurements. Wanted to put the config in `routing` package, but a circular dependency forced me to put in config.go - Make rate calculations explicit, i. e. requested via config. Previously, it had some odd checks to decide when to calculate rate and it would have been calculating over different windows. - Report signal/data channel bytes every 5 seconds to stats collection module. Previously, it was doing it every 30 seconds and that meant some windows could have had a large spike NOTE: Still need to think about this for load calculations as a large number of participants leaving could flush in a small window and that could report a large spike in bytes/packets. Maybe need to ignore signal bytes for load calculation? * deps * use default node stats config if given config is nil * split out node stats into a struct for re-use * update config	2025-03-27 12:42:19 +05:30
Raja Subramanian	fe673bb257	Send regressed codec upstream stats to analytics. (#3532 ) * Send regressed codec upstream stats to analytics. There is more work to do for analytics for simulcast codec, i. e. when both codecs are published. Shorter term, ensure that analytics are sent for regressed codec if active. * lock access to regressed codec received	2025-03-18 12:46:24 +05:30
Paul Wells	3167266495	add datapacket stream metrics (#3450 ) * add datapacket stream metrics * normalize mime type	2025-02-19 22:28:10 -08:00
Raja Subramanian	1ae2e48c2e	Webhook analytics event. (#3423 ) * Webhook analytics event. * deps * generate * nil notifier	2025-02-13 10:39:45 +05:30
Raja Subramanian	9551c52c85	Try 2 to consolidate mime type (#3407 ) * Normalize mime type and add utilities. An attempt to normalize mime type and avoid string compares remembering to do case insensitive search. Not the best solution. Open to ideas. But, define our own mime types (just in case Pion changes things and Pion also does not have red mime type defined which should be easy to add though) and tried to use it everywhere. But, as we get a bunch of callbacks and info from Pion, needed conversion in more places than I anticipated. And also makes it necessary to carry that cognitive load of what comes from Pion and needing to process it properly. * more locations * test * Paul feedback * MimeType type * more consolidation * Remove unused * test * test * mime type as int * use string method * Pass error details and timeouts. (#3402) * go mod tidy (#3408) * Rename CHANGELOG to CHANGELOG.md (#3391) Enables markdown features in this otherwise already markdown'ish formatted document * Update config.go to properly process bool env vars (#3382) Fixes issue https://github.com/livekit/livekit/issues/3381 * fix(deps): update go deps (#3341) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Use a Twirp server hook to send API call details to telemetry. (#3401) * Use a Twirp server hook to send API call details to telemetry. 
* mage generate and clean up * Add project_id * deps * - Redact requests - Do not store responses - Extract top level fields room_name, room_id, participant_identity, participant_id, track_id as appropriate - Store status as int * deps * Update pkg/sfu/mime/mimetype.go * Fix prefer codec test * handle down track mime changes --------- Co-authored-by: Denys Smirnov <dennwc@pm.me> Co-authored-by: Philzen <Philzen@users.noreply.github.com> Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Paul Wells <paulwe@gmail.com> Co-authored-by: cnderrauber <zengjie9004@gmail.com>	2025-02-10 10:44:15 +05:30
Raja Subramanian	99afbf587b	Use a Twirp server hook to send API call details to telemetry. (#3401 ) * Use a Twirp server hook to send API call details to telemetry. * mage generate and clean up * Add project_id * deps * - Redact requests - Do not store responses - Extract top level fields room_name, room_id, participant_identity, participant_id, track_id as appropriate - Store status as int * deps	2025-02-07 16:16:41 +05:30
Raja Subramanian	14e65f1459	AnalyticsEvent for generic reports (#3400 ) * AnalyticsEvent for generic reports * deps	2025-02-05 10:13:43 +05:30
Raja Subramanian	3b0077f2fe	Log connection quality changes. (#3311 ) Also remove the connection quality drop prom as it is unused and also adds state/complexity.	2025-01-07 10:58:31 +05:30
cnderrauber	54f9f7de51	upgrade to pion/webrtc v4 (#3213 )	2024-11-28 16:05:38 +08:00
Raja Subramanian	baf47db834	Publish data and signal bytes once every 30 seconds. (#3212 ) For applications with heavy data usage, accumulating data bytes over 5 minutes and then calculating rate using a much shorter window (like 2 - 5 seconds) makes it look like there is a massive rate spike. While this change is not a fix, this should soften the impact. Need a better way to handle different parts of the system operating at different frequencies. Can use rate in the reporting window, but that will miss the spikes. Maybe that is okay. For example, if the reporting window is 5 minutes and there was a 100 Mbps spike for about 10 seconds of it, it would get smoothed out.	2024-11-28 09:21:44 +05:30
Raja Subramanian	cc22306047	Attempt to fix missing participant left webhook. (#3173 ) On a resume, the signal stats will call `ParticipantLeft`. Although it explicitly says not to send events, it could still close the stats worker. To handle that, we created a stats worker if needed in `ParticipantResume` notification in this PR (https://github.com/livekit/livekit/pull/2982), but that is not enough as that event could happen before previous signal connection closes the stats worker. A new stats worker does get created when `ParticipantJoined` is called by the new signal connection, but it does not transfer connected state. So, when the client leaves, `ParticipantLeft` is not sent. I am not seeing why we should not transfer connected state always given that it is the same participant SID/session. But, I have a feeling that I am missing some corner case. Please let me know if I am missing something here.	2024-11-14 10:59:15 +05:30
Raja Subramanian	86383b2271	De-centralize some configs to where they are used. (#3162 ) * De-centralize some configs to where they are used. And make default variables. Renaming a bit, but these are all internal config and have not been added to documented config. * Keep documented config as is. * test * typo	2024-11-08 12:47:30 +05:30
Raja Subramanian	365e63230d	Some misc clean up. (#3156 ) * Some misc clean up. - Have been seeing counterfeiter warnings about efficiency for a while with go:generate declaration multiple times in the same package. Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives - A bit more readability on parameters passed to `sendLeave` * spacing * revert some deletes as the complaint was in analytics service only * Declare in package only once. Although the warning is about go:generate multiple times when directly giving the interface to generate, have `go:generate` multiple times in a package even with `-generate` ends up generating once per invocation. Once per package is enough to run the generation just once.	2024-11-04 11:26:41 +05:30


225 Commits