Release to Docker / docker (push) Failing after 3m42s
* Key telemetry stats work using combination of roomID, participantID
With forwarded participant, the same participantID can existing in two
rooms.
NOTE: This does not yet allow a participant session to report its
events/track stats into multiple rooms. That would require regitering
multiple listeners (from rooms a participant is forwarded to).
* missed file
* data channel stats
* PR comments + pass in room name so that telemetry events have proper room name also
Published counter was bumped up only when not migrating in, but it was
decremented when a migrating participant leaves without expectation to
resume. That could have resulted in negative counts.
Always change counters irrespective of migration or expected to resume
on leave. Control events send based on migration/resume.
* Rework node stats a bit.
Related protocol PR - https://github.com/livekit/protocol/pull/1023
- Make a config for node stats measurements. Wanted to put the config in
`routing` package, but a circular dependency forced me to put in
config.go
- Make rate calculations explicit, i. e. requested via config.
Previously, it had some odd checks to decide when to calculate rate
and it would have been calculating over different windows.
- Report signal/data channel bytes every 5 seconds to stats collection
module. Previously, it was doing it every 30 seconds and that meant
some windows could have had a large spike
NOTE: Still need to think about this for load calculations as a large
number of participants leaving could flush in a small window and that
could report a large spike in bytes/packets. Maybe need to ignore
signal bytes for load calculation?
* deps
* use default node stats config if given config is nil
* split out node stats into a struct for re-use
* update config
* Normalize mime type and add utilities.
An attempt to normalize mime type and avoid string compares remembering
to do case insensitive search.
Not the best solution. Open to ideas. But, define our own mime types
(just in case Pion changes things and Pion also does not have red mime
type defined which should be easy to add though) and tried to use it everywhere.
But, as we get a bunch of callbacks and info from Pion, needed conversion in
more places than I anticipated. And also makes it necessary to carry
that cognitive load of what comes from Pion and needing to process it
properly.
* more locations
* test
* Paul feedback
* MimeType type
* more consolidation
* Remove unused
* test
* test
* mime type as int
* use string method
* Pass error details and timeouts. (#3402)
* go mod tidy (#3408)
* Rename CHANGELOG to CHANGELOG.md (#3391)
Enables markdown features in this otherwise already markdown'ish formatted document
* Update config.go to properly process bool env vars (#3382)
Fixes issue https://github.com/livekit/livekit/issues/3381
* fix(deps): update go deps (#3341)
Generated by renovateBot
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Use a Twirp server hook to send API call details to telemetry. (#3401)
* Use a Twirp server hook to send API call details to telemetry.
* mage generate and clean up
* Add project_id
* deps
* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
participant_id, track_id as appropriate
- Store status as int
* deps
* Update pkg/sfu/mime/mimetype.go
* Fix prefer codec test
* handle down track mime changes
---------
Co-authored-by: Denys Smirnov <dennwc@pm.me>
Co-authored-by: Philzen <Philzen@users.noreply.github.com>
Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
Co-authored-by: cnderrauber <zengjie9004@gmail.com>
* Use a Twirp server hook to send API call details to telemetry.
* mage generate and clean up
* Add project_id
* deps
* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
participant_id, track_id as appropriate
- Store status as int
* deps
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it looks like there is a massive rate spike.
While this change is not a fix, this should soften the impact.
Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
On a resume, the signal stats will call `ParticipantLeft`. Although, it
explicity says not to send events, it could still close the stats
worker.
To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.
A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.
I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
* De-centralize some configs to where they are used.
And make default variables.
Renaming a bit, but these are all internal config and have not been
added to documented config.
* Keep documented config as is.
* test
* typo
* Some misc clean up.
- Have been seeing counterfeiter warnings about efficiency for a while
with go:generate declaration multiple times in the same package.
Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`
* spacing
* revert some deletes as the complaint was in analytics service only
* Declare in package only once.
Although the warning is about go:generate multiple times when directly
giving the interface to generate, have `go:generate` multiple times in a
package even with `-generate` ends up generating once per invocation.
Once per package is enough to run the generation just once.
* Ref count the stats worker.
NOTE: Don't liek this much, but wanted to open this get some 👀 on
this and get feedback.
There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
event.
In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.
The closed stats worker got reaped in `FlushStats` after three minutes.
So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.
Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.
* create a stats worker on resume
* revert incorrect changes
* transfer connected state
* transfer connected state when creating worker
* resolve participant on a resume
* Use Seque in ops queue.
Standardizing some uses
- Change OpsQueue to use Deque so that it can grow/shrink as necessary and
need not worry about channel getting full and dropping events.
- Change StreamAllocator and TelemetryService to use OpsQueue so that
they also need not worry about channel size and overflows.
* Address feedback
* delete obvious comment
* clean up
UpdateSubscription had a shortcoming where when it couldn't find the
participant, it ignored the request.
This PR further removes the reliance of current publisher state from
subscribers.
- SubscribeToTrack only takes in a trackID
- Introduced RoomTrackManager to maintain all published tracks to a room
- Added TrackUnpublished event to clearly indicate when a track has been removed
- SubscribeRequested event no longer include information about the publisher
Added a new manager to handle all subscription needs. Implemented using reconciler pattern. The goals are:
improve subscription resilience by separating desired state and current state
reduce complexity of synchronous processing
better detect failures with the ability to trigger full reconnect
* add prometheus stats for rtt/jitter/packet loss
* add track source to metrics
* better packet loss bins
* add track type to metrics
* remove source from AnalyticsStat
* regenerate telemetry service fake
* compute loss from per stream packet count
Related to livekit/protocol#273
This PR adds:
- ParticipantResumed - for when ICE restart or migration had occurred
- TrackPublishRequested - when we initiate a publication
- TrackSubscribeRequested - when we initiate a subscription
- TrackMuted - publisher muted track
- TrackUnmuted - publisher unmuted track
- TrackPublish/TrackSubcribe events will indicate when those actions have been successful, to differentiate.
In certain scenarios such as migration, we do not want a duplicate event
to be sent when the participant is reconnecting. The Prometheus metric
should still be updated though.
* Use a go routine to clean up stats workers.
It is possible that certain events (like TrackUnpublished) can
happen after the participant is closed. For webhooks pertaining
to those events, need details like room name/id. So,reap stats
workers a little while after the participant left event happens.
* handle data race report
* log analytics worker reap
* debug log