Currently, the signal requests are counted on media side and signal
responses are counted on controller side. This does not provide the
granularity to check how many response messages each media node is
sending.
Seeing some cases where track subscriptions are slow under load. This
would be good to see if the media node is doing a lot of signal response
messages.
* Rework node stats a bit.
Related protocol PR - https://github.com/livekit/protocol/pull/1023
- Make a config for node stats measurements. Wanted to put the config in
`routing` package, but a circular dependency forced me to put in
config.go
- Make rate calculations explicit, i. e. requested via config.
Previously, it had some odd checks to decide when to calculate rate
and it would have been calculating over different windows.
- Report signal/data channel bytes every 5 seconds to stats collection
module. Previously, it was doing it every 30 seconds and that meant
some windows could have had a large spike
NOTE: Still need to think about this for load calculations as a large
number of participants leaving could flush in a small window and that
could report a large spike in bytes/packets. Maybe need to ignore
signal bytes for load calculation?
* deps
* use default node stats config if given config is nil
* split out node stats into a struct for re-use
* update config
* De-centralize some configs to where they are used.
And make default variables.
Renaming a bit, but these are all internal config and have not been
added to documented config.
* Keep documented config as is.
* test
* typo
Because we aren't able to get CPU count/load info on Windows, they are
stubbed out to return placeholders. This restores compatibility to run
on Windows.
Added a new manager to handle all subscription needs. Implemented using reconciler pattern. The goals are:
improve subscription resilience by separating desired state and current state
reduce complexity of synchronous processing
better detect failures with the ability to trigger full reconnect
* initial commit
* add correct label
* clean up
* more cleanup on adding stats
* cleanup
* move things to pub and sub monitors, ensure stats are correctly updated
* fix merge conflict
* Fix panic on MacOS (#1296)
* fixing last feedback
Co-authored-by: Raja Subramanian <raja.gobi@tutanota.com>
Exclude NACK count from being a trigger to refresh stats.
Since NACKs are updated instantaneously without having to wait for
Telemetry updates that occurs every 10s, having even a single NACK
could cause us to compute averages prematurely.
* Increase frequency of status updates and longer avail. threshold.
* better fix.
* fix room close test failure due to slow peer connection Close
* Perform avg computation more frequently if data has changed