Release to Docker / docker (push) Failing after 3m42s
* Key telemetry stats work using combination of roomID, participantID
With forwarded participant, the same participantID can existing in two
rooms.
NOTE: This does not yet allow a participant session to report its
events/track stats into multiple rooms. That would require regitering
multiple listeners (from rooms a participant is forwarded to).
* missed file
* data channel stats
* PR comments + pass in room name so that telemetry events have proper room name also
Published counter was bumped up only when not migrating in, but it was
decremented when a migrating participant leaves without expectation to
resume. That could have resulted in negative counts.
Always change counters irrespective of migration or expected to resume
on leave. Control events send based on migration/resume.
* Forwarding latency measurement tweaks.
- prom transmission type public
- do not measure short term values as it is not used and saves some lock
contention time in packet path potentially. Adding a separate method
for that.
- Change latency/jitter summary reporting to `ns` also to match the
histogram.
* add GetShortStats
* Higher resolution forwarding latency histogram.
Was using the average latency/jitter of last second to populate
forwarding latency/jitter histogram. But, it is too coarse, i. e. the
average value of latency/jitter is very low and those summarised samples
end up in the lowest bucket always.
A few things to address it
- record per packet forwarding latency in histogram
- adjust histogram bins to include smaller values
- Drop jitter histogram
This is a per packet call, but prometheus histogram is supposedly
fast/light weight. Would be good to get better resolution histograms.
Hence doing this. Please let me know if there are performance concerns.
* typo
* one more typo
* Add prom histogram for forwarding latency and jitter.
Using short term stats for histogram.
An example setting is
1s - short term
1m - long term
Using the 1s (short term) data for histogram. In that 1 second, all
packet forwarding latencies are averaged for latency and std. dev. of
the collection is used as jitter.
* try different staticcheck
Currently, the signal requests are counted on media side and signal
responses are counted on controller side. This does not provide the
granularity to check how many response messages each media node is
sending.
Seeing some cases where track subscriptions are slow under load. This
would be good to see if the media node is doing a lot of signal response
messages.
* Populate SDP cid in track info when available.
- Adding SDP cid to TrackInfo. Browsers like FF uses a different stream
id for AddTrack and actual SDP offer. So, have to look up using both
on server side. To make it easier, store both (only if different) in
TrackInfo.
- Use a codec in TrackInfo for audio also. There is some discussion
around doing simulcast codec for audio so that something like PSTN can
use G.711 without any transcoding. So, just keep it consistent between
audio and video.
- Populate SDP cid when SDP offer is received. It could populate a
pending track or an already published track if the new offer is for a
back up codec where the primary codec is already published.
- Passed around parsed offer to more places to avoid parsing multiple
times.
- Clean up MediaTrack interface a bit and remove unneeded methods.
* WIP
* WIP
* deps
* stream allocator mime aware
* clean up
* populate SDP cid before munging
* interface methods
* Send `participant_connection_aborted` when participant session is closed
without becoming `ACTIVE`.
There is one sticky case. If there is a migration and the migration
fails, this will send `participant_connection_aborted` even though the
participant may have connected properly on the previous node.
* depsg
* Rework node stats a bit.
Related protocol PR - https://github.com/livekit/protocol/pull/1023
- Make a config for node stats measurements. Wanted to put the config in
`routing` package, but a circular dependency forced me to put in
config.go
- Make rate calculations explicit, i. e. requested via config.
Previously, it had some odd checks to decide when to calculate rate
and it would have been calculating over different windows.
- Report signal/data channel bytes every 5 seconds to stats collection
module. Previously, it was doing it every 30 seconds and that meant
some windows could have had a large spike
NOTE: Still need to think about this for load calculations as a large
number of participants leaving could flush in a small window and that
could report a large spike in bytes/packets. Maybe need to ignore
signal bytes for load calculation?
* deps
* use default node stats config if given config is nil
* split out node stats into a struct for re-use
* update config
* Send regressed codec upstream stats to analytics.
There is more work to do for analytics for simulcast codec, i. e. when
both codecs are published. Shorter term, ensure that analytics are sent
for regressed codec if active.
* lock access to regressed codec received
* Normalize mime type and add utilities.
An attempt to normalize mime type and avoid string compares remembering
to do case insensitive search.
Not the best solution. Open to ideas. But, define our own mime types
(just in case Pion changes things and Pion also does not have red mime
type defined which should be easy to add though) and tried to use it everywhere.
But, as we get a bunch of callbacks and info from Pion, needed conversion in
more places than I anticipated. And also makes it necessary to carry
that cognitive load of what comes from Pion and needing to process it
properly.
* more locations
* test
* Paul feedback
* MimeType type
* more consolidation
* Remove unused
* test
* test
* mime type as int
* use string method
* Pass error details and timeouts. (#3402)
* go mod tidy (#3408)
* Rename CHANGELOG to CHANGELOG.md (#3391)
Enables markdown features in this otherwise already markdown'ish formatted document
* Update config.go to properly process bool env vars (#3382)
Fixes issue https://github.com/livekit/livekit/issues/3381
* fix(deps): update go deps (#3341)
Generated by renovateBot
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Use a Twirp server hook to send API call details to telemetry. (#3401)
* Use a Twirp server hook to send API call details to telemetry.
* mage generate and clean up
* Add project_id
* deps
* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
participant_id, track_id as appropriate
- Store status as int
* deps
* Update pkg/sfu/mime/mimetype.go
* Fix prefer codec test
* handle down track mime changes
---------
Co-authored-by: Denys Smirnov <dennwc@pm.me>
Co-authored-by: Philzen <Philzen@users.noreply.github.com>
Co-authored-by: Pablo Fuente Pérez <pablofuenteperez@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
Co-authored-by: cnderrauber <zengjie9004@gmail.com>
* Use a Twirp server hook to send API call details to telemetry.
* mage generate and clean up
* Add project_id
* deps
* - Redact requests
- Do not store responses
- Extract top level fields room_name, room_id, participant_identity,
participant_id, track_id as appropriate
- Store status as int
* deps
For applications with heavy data usage, accumulating data bytes over 5
minutes and then calculating rate using a much shorter window (like 2 -
5 seconds) makes it looks like there is a massive rate spike.
While this change is not a fix, this should soften the impact.
Need a better way to handle different parts of the system operating at
different frequencies. Can use rate in the reporting window, but that
will miss the spikes. Maybe that is okay. For example, if the reporting
window is 5 minutes and there was a 100 Mbps spike for about 10 seconds
of it, it would get smoothed out.
On a resume, the signal stats will call `ParticipantLeft`. Although, it
explicity says not to send events, it could still close the stats
worker.
To handle that, we created a stats worker if needed in
`ParticipantResume` notification in this PR
(https://github.com/livekit/livekit/pull/2982), but that is not enough
as that event could happen before previous signal connection closes the
stats worker.
A new stats worker does get created when `ParticipantJoined` is called
by the new signal connection, but it does not transfer connected state.
So, when the client leaves, `ParticipantLeft` is not sent.
I am not seeing why we should not transfer connected state always given
that it is the same participant SID/session. But, I have a feeling that
I am missing some corner case. Please let me know if I am missing
something here.
* De-centralize some configs to where they are used.
And make default variables.
Renaming a bit, but these are all internal config and have not been
added to documented config.
* Keep documented config as is.
* test
* typo
* Some misc clean up.
- Have been seeing counterfeiter warnings about efficiency for a while
with go:generate declaration multiple times in the same package.
Address that: https://github.com/maxbrunsfeld/counterfeiter?tab=readme-ov-file#step-2b---add-counterfeitergenerate-directives
- A bit more readability on parameters passed to `sendLeave`
* spacing
* revert some deletes as the complaint was in analytics service only
* Declare in package only once.
Although the warning is about go:generate multiple times when directly
giving the interface to generate, have `go:generate` multiple times in a
package even with `-generate` ends up generating once per invocation.
Once per package is enough to run the generation just once.
* Add counter for pub&sub time metrics
The pub&sub shows large value in migration related case like
muted/disabled migration, the subscription time depends on
the time when publisher unmute the track(sending rtp packet
after migration), add a counter to distinguish since we
can't control the time in such cases and the first subscription
attemps also is more meaningful than those cases.
* Add info log for high publish delay
* Record out-of-packet count/rate in prom.
Adding a field to AnalyticsStream to make this easier to report.
Let me know if adding to AnalyticsStream is not ok.
Will set up a protocol PR if it is okay.
* deps
* Ref count the stats worker.
NOTE: Don't liek this much, but wanted to open this get some 👀 on
this and get feedback.
There are two entities, one for counting signal bytes and another for
media stats. They both send `ParticipantJoined` and `ParticipantLeft`
event.
In the case of a participant resume, as the old web socket
connection is closed, that triggers a signal stats counter close. That
would call `ParticipantLeft` and that would close the stats worker.
The closed stats worker got reaped in `FlushStats` after three minutes.
So, all events after that did not have a worker and hence went
unreported including missing participant_left webhook because it relied
on checking if a participant was ever connected and that needed to check
the worker state.
Using a ref count to keep track of join/leaves. And not close the worker
until ref count goes down to 0.
* create a stats worker on resume
* revert incorrect changes
* transfer connected state
* transfer connected state when creating worker
* resolve participant on a resume
* speed up track publication
Add metrics for track publication and subscription
Return EnabledCodecs in JoinResponse so client can
choose codec without server side codec fallback
Cache remote webrtc track without AddTrackRequest to
let client send publisher offer before AddTrackRequest response
* go mod
* clean code