961 Commits

Author SHA1 Message Date
Jacob Gelman fcc46d3c45 Use camel case log name in DataBlobKey (#4633) 2026-06-29 21:33:55 -07:00
Jacob Gelman a47e21b6cb Data track schema metadata (#4622)
* Async attributes on participant.

How is it different from existing participant attributes?
1. Async attributes can be added one at a time.
2. These are not included in `ParticipantInfo`.
3. Get an attribute by participant identity and async attribute ID as
   and when needed.

* clean up

* get full definitions, not just ids

* listener OnDataTrackSchema

* name length config

* data blob

* deps

* static check

* Add missing request ID

* Update protocol commit

* Wire up StoreDataBlobResponse

* Pass request ID through in GetDataBlobResponse

* Pin protocol for schema metadata

* Pass through schema and frame encoding

* Support custom encoding identifiers

* Rename config key

* Increase default length to 32

* Make log messages more generic

* Use getters with built-in null check

* Do not bump deps

* Rename function

* Use protocol v1.48.1 release

---------

Co-authored-by: boks1971 <raja.gobi@tutanota.com>
2026-06-29 10:44:08 -07:00
cnderrauber 2aec61c11b update webrtc to fix interop issue with bundled datachannel (#4631) 2026-06-29 10:33:51 +08:00
renovate[bot] 930a2b6ad7 Update module github.com/urfave/cli/v3 to v3.10.0 (#4612)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-06-28 14:57:35 -07:00
cnderrauber eb3de092f0 update pion/sctp (#4623)
* update pion/sctp

* webrtc & datachannel
2026-06-25 21:43:07 +08:00
Raja Subramanian 1faab0c48e Add support for data blob (a. k. a. async participant attributes) (#4619)
* Async attributes on participant.

How is it different from existing participant attributes?
1. Async attributes can be added one at a time.
2. These are not included in `ParticipantInfo`.
3. Get an attribute by participant identity and async attribute ID as
   and when needed.

* clean up

* get full definitions, not just ids

* listener OnDataTrackSchema

* name length config

* data blob

* deps

* static check

* Add missing request ID

* Update protocol commit

* Wire up StoreDataBlobResponse

* Pass request ID through in GetDataBlobResponse

* deps

* atomic

* sctp at 1.9.5

* remove proto clone

---------

Co-authored-by: Jacob Gelman <3182119+ladvoc@users.noreply.github.com>
2026-06-24 14:42:37 +05:30
Raja Subramanian 1b69630a28 Prometheus metric for join latency. (#4616)
* Prometheus metric for join latency.

Also including a couple of other failures in the signal connection path
and moving the signal connected to after all that.

Not doing counters for the new signal failure paths. I should not have
added counters for the other two I introduced a little while ago either
(validation failure and start participant failure), as it is not
scalable to keep adding those to node stats. Will probably remove those two
from node stats later. Can add those counters if they are useful.

* deprecate signal failed counters
2026-06-22 22:07:32 +05:30
Ryan Gaus 86a79f83fc fix: report participant capabilities in ParticipantInfo (#4606) 2026-06-22 09:23:33 -04:00
Denys Smirnov 35b5390c27 Update protocol. (#4601) 2026-06-18 13:29:21 +02:00
Paul Wells 12a023ae45 agent: thread attributes map from dispatch to job (#4598)
* agent: thread simulation flag from dispatch to job

Reads simulation from AgentDispatch / RoomAgentDispatch and copies it
onto Job in agent.LaunchJob and the inline room-agent path so workers
see the flag.

Stacked on top of livekit/protocol#1629.

* agent: replace simulation bool with attributes map

Threads the renamed attributes map (was bool simulation) from dispatch
to job and bumps the protocol pseudo-version.

* deps
2026-06-16 01:53:01 -07:00
shishirng 08ab361e8e [WIP] rtc: add RestartSessionTimer to re-anchor participant session duration (#4566)
* rtc: add RestartSessionTimer to re-anchor participant session duration

Exposes ParticipantImpl.RestartSessionTimer so the session timer can be
re-anchored to the actual join time. Duration is only ever emitted once
the participant becomes active, so re-anchoring at join keeps pre-join
wall-clock out of the reported/billed duration. Adds the method to the
LocalParticipant interface (fake regenerated) and a local protocol
replace to pick up SessionTimer.Reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* tidy

* update protocol

* report ended at for inactive sessions

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
2026-06-11 10:02:25 -07:00
Raja Subramanian 8efc94eb68 Update DTLS to v3.1.4 (#4587) 2026-06-11 12:23:55 +05:30
cnderrauber 7dc6877738 Preserve original expiry when refreshing token (#4580)
Avoid shortening the token expiration time on refresh, which
caused client reconnects to fail after the network had been
down for a long time (>5min).
2026-06-10 14:51:10 +08:00
Raja Subramanian fd452212c7 Update mediatransportutil to get ICE candidate timeout config (#4572) 2026-06-08 12:42:58 +05:30
renovate[bot] c4e41872c5 Update go deps to v1.17.2 (#4462)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-06-07 23:08:05 -07:00
renovate[bot] dc8e0310ad Update go deps to v4 (#4482)
* Update go deps to v4

Generated by renovateBot

* update dockertest to v4

* fix

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: David Zhao <dz@livekit.io>
2026-06-07 23:07:40 -07:00
Raja Subramanian 6590570d7c Pin pion/dtls to v3.1.2 (#4570) 2026-06-06 20:36:10 +05:30
Paul Wells cdbbee1f8e deps: bump protocol + psrpc to latest tips (#4565)
* deps: bump psrpc + protocol + cloud-protocol + backend-common to latest

* deps: go get -u sweep + bump counterfeiter

Direct deps bumped via go get -u ./...:
  clipperhouse/displaywidth 0.10.0 -> 0.11.0
  clipperhouse/uax29/v2 2.6.0 -> 2.7.0
  fatih/color 1.18.0 -> 1.19.0
  florianl/go-tc 0.4.7 -> 0.4.8
  hashicorp/go-version 1.8.0 -> 1.9.0
  livekit/protocol -> a7a83da5 (latest)
  ua-parser bumped
  urfave/cli/v3 3.8.0 -> 3.9.0
  otel/sdk 1.43.0 -> 1.44.0
  yaml.in/yaml/v2 2.4.2 -> 2.4.4

Indirect: mdlayher/netlink + socket bumped, mattn/* bumped, olekukonko/*
bumped, counterfeiter v6.11.1 -> v6.12.2.

Newer-major audit: no actionable majors. +incompatible: twitchtv/twirp
v8.1.3 (upstream choice, stays); docker/cli and docker/docker indirect.

Notable stuck patterns worth a separate cleanup:
- pkg/errors v0.9.1 direct dep (unmaintained; stdlib supplants)
- go.uber.org/atomic + multierr direct deps
- ory/dockertest/v3 v3.12.0 (v4 is available — cascade has migrated
  cloud-protocol, backend-common, psrpc to v4).

* deps: pin pion/webrtc/v4 v4.2.11 + pion/sctp v1.9.5

* deps: pin protocol+psrpc+MTU to landed versions
2026-06-05 14:48:51 -07:00
Raja Subramanian 835ef1b353 Metrics for participant active, i. e. fully established. (#4557)
* Metrics for participant active, i. e. fully established.

- Egress stub for v2 API
- Fix the participant canceled counter 🤦
- Add active counter -> this is incremented when a participant becomes
  active, i. e. primary peer connection established. Can be used to
  monitor node-wise connection establishment issues.
- Add signalling validation fail counter.

With this, we have
- signalling validation fail
- signalling failed --> this is when the `startSession` fails
- signalling connected -> signalling is successful and can send back
  joinResponse to client

on media connection side
- rtc_init -> start
- rtc_connected -> participant session created (joined)
- rtc_active -> primary peer connection established
- rtc_canceled -> could not proceed with RTC connection due to not being
  able to resume.

* signalling counters deps

* revert pion/webrtc to 4.2.12 to get SCTP without interleaving

* go back to pion/webrtc 4.2.11 and sctp 1.9.5
2026-06-03 19:50:19 +05:30
cnderrauber 356ae211a3 Config documentation for advertise_internal_ip and skip_external_ip_validation (#4552)
See https://github.com/livekit/mediatransportutil/pull/88
2026-06-01 14:37:08 +08:00
Raja Subramanian 062d12197f Use NACKQuueInterface type. (#4538)
And some extra logging for subscription permission when it fails.
2026-05-21 23:00:51 +05:30
Paul Wells 019a6640ae rtc: report participant kind code and details (#4534)
* rtc: report participant kind code and details

Plumb ParticipantKind and KindDetails through MediaTrack and
BytesTrackStats so track-level reporting can record the numeric kind
code plus details codes on every participant_session aggregation,
alongside the existing Kind string. Also picks up the new kind fields
on resolved BytesSignalStats participants.

Adds deployment/agentID/version to the agent worker logger.
2026-05-18 23:20:52 -07:00
cnderrauber 89faaeba82 Apply ttl check only when authenticate allocation creating (#4526)
* Apply ttl check only when authenticate allocation creating

The TTL check added in security enhancement #4505 could reject
allocation/permission refreshes, causing long-lived sessions to
disconnect when the TURN credential expired.
Only check the TTL when creating an allocation, to prevent abuse of
leaked credentials while keeping long-lived sessions working.
2026-05-15 14:55:05 +08:00
Denys Smirnov 8b79ec9e47 Support SIP auth realm for inbound. (#4522) 2026-05-14 10:45:16 +02:00
Raja Subramanian 4b8db3cfe5 Add integration test for TURN auth failures (#4524)
* Add integration test for TURN auth failures

Covers four credential-corruption scenarios against the TURN server
embedded in a single-node server: unparseable username, wrong password,
expired username, and unknown API key. Each case drives a raw pion
turn.Client Allocate and asserts the server rejects with a TURN error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* TURN auth test: cover password expiry binding

The TURN password's hash includes the expiry along with the secret and
participant ID. Add two cases that exercise this binding: a password
generated for a different expiry than the username's, and a password
generated without any expiry component paired with a username that has
one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:48:27 +05:30
Denys Smirnov ba366fc712 Fix SIP media config upgrade. (#4511) 2026-05-07 10:12:45 +02:00
Paul Wells 8fbc5adfce update protocol for protojson (#4510) 2026-05-07 00:55:00 -07:00
Denys Smirnov 8ffcef93b2 Update protocol to support SIP media config. (#4509) 2026-05-06 18:18:21 +02:00
Paul Wells 803999efad rename agent environment to deployment (#4506)
* rename agent environment to deployment

* deps
2026-05-05 14:19:40 -07:00
Paul Wells 253f977d32 add duration seconds reporting (#4500)
* add duration seconds reporting

* deps

* deps
2026-05-02 06:19:23 -07:00
Paul Wells ffab3bd308 add agent environment (#4498)
* add agent environment

* lint

* psrpc error

* deps
2026-05-01 19:30:06 -07:00
Raja Subramanian ccdf23c8a6 Use mediatransportutil/codec package, no functional change (#4497) 2026-05-01 20:06:29 +05:30
David Chen 743d9c8b3a add support for client capabilities (#4461)
* update protocol version

* only check for client capability to strip packet trailer
2026-04-27 17:58:36 -07:00
Raja Subramanian fc47e47866 Close peer connection unconditionally to unblock set local/remote (#4485)
* Close peer connection unconditionally to unblock set local/remote
description operations.

Have been chasing a leak where participants have a lot of connectivity
issues and analysed a goref with Claude. Output below.

Jo Turk quickly patched sctp for reported issue -
https://github.com/pion/sctp/pull/465.

This PR moves the peer connection close to before waiting for events
queue to be drained as event queue could be blocked on
`SetLocal/RemoteDescription` hanging.

The scenario is a bit far-fetched as a lot of things have to happen, but
it does point to a scenario where things could hang. Remains to be seen
if this helps. Note that closing the peer connection early could mean
the contained objects (like data channels) could all be closed as part
of the peer connection close. But, still keeping the explicit clean up
path (which should effectively become no-op) to minimise changes.

------------------------------------------------------------------

The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:

  1. SCTPTransport.Start is synchronous in the SetRemoteDescription op

  The stuck stack:
  PeerConnection.SetRemoteDescription.func2  (peerconnection.go:1363)
    → startRTP → startSCTP
      → SCTPTransport.Start         (sctptransport.go:141)
        → DataChannel.open          (datachannel.go:178)
          → datachannel.Dial → Client → Stream.WriteSCTP
            → Association.sendPayloadData    (association.go:3141)  ← blocks here
  SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.

  2. The wait has no deadline

  Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline — never armed, because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can never fire.

  3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate

  At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:
  // association.go:3138-3148
  if a.blockWrite {
      for a.writePending {
          a.lock.Unlock()
          select {
          case <-ctx.Done():        // never (no deadline)
          case <-a.writeNotify:     // only fires when writeLoop fully drains pendingQueue
          }
          a.lock.Lock()
      }
      a.writePending = true
  }

  4. writeNotify only fires after the writeLoop drains everything

  The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0, i.e., the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.

  5. There is no association-level abort timer for data writes

  At association.go:764:
  assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)
  noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association never transitions out of established on its own.

  What it takes to wake it up

  Only an external close: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). But, and this is the kicker, readLoop's defer at association.go:976-996 closes everything except it does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).

  So the trigger sequence on this pod was almost certainly:
  1. Peer establishes ICE+DTLS+SCTP, association goes established.
  2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
  3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
  4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
  5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.

  Realistic fixes (in order of how clean they are)

  1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
  2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close() — Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
  3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
  4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription — open them lazily after PC reaches connected so a stuck open never blocks signaling.

  Without (1), even (2) leaves the writer goroutine itself parked forever — but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.

* revert probe stop change

* handle nil offer
2026-04-27 21:38:46 +05:30
renovate[bot] 639406eb96 Update module github.com/pion/ice/v4 to v4.2.3 (#4481)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-26 22:33:26 -07:00
Anunay Maheshwari 9ee06635d6 feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules (#4466)
* feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules

* update deps
2026-04-22 12:49:36 +05:30
Raja Subramanian 6c81f67858 Add subscriber stream start event notification (#4449) 2026-04-14 22:08:31 +05:30
renovate[bot] ea7b9c6fe1 Update module github.com/livekit/protocol to v1.45.3 (#4435)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-11 16:17:10 -07:00
renovate[bot] 97378368dd Update go deps (major) (#3179)
* Update go deps

Generated by renovateBot

* update api usage

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: David Zhao <dz@livekit.io>
2026-04-11 14:28:33 -07:00
renovate[bot] d6aef547ce Update go deps (#3862)
Generated by renovateBot

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2026-04-11 13:59:25 -07:00
cnderrauber 5dc4e90d82 Apply IPFilter when get local ip (#4440)
Fix #4437
See https://github.com/livekit/mediatransportutil/pull/80
2026-04-10 08:59:53 +08:00
Anunay Maheshwari ff7fd7ed56 feat(agent-dispatch): add job restart policy (#4401)
* feat(agent-dispatch): add job restart policy

* deps
2026-03-27 21:32:04 +05:30
cnderrauber 9474c807c0 route participant reads through PSRPC instead of Redis (#4387)
rel: #4373
2026-03-24 16:25:11 +08:00
Raja Subramanian bc3aeaf3f8 Update grpc to address CVE-2026-33186 (#4381)
That updates the Go version.
2026-03-23 22:01:17 +05:30
Théo Monnom 89410df74c handle AGENT_ERROR disconnect reason (#4339) 2026-03-17 23:00:16 -07:00
Denys Smirnov 4a9e004555 Update protocol. (#4367) 2026-03-16 17:26:56 +02:00
cnderrauber e963953817 Refine ipv6 support (#4352)
* Refine ipv6 support

* go mod

* check ipv4 is set in turn
2026-03-09 20:43:00 +08:00
He Chen cb7dc2d02a TEL-405: support originating calls from custom domains (#4349) 2026-03-06 12:25:40 -08:00
Raja Subramanian f51b27328f Update pion/webrtc and deps to update dtls (#4326)
DTLS v3.1.0 was retracted - https://github.com/pion/dtls/pull/800
2026-02-19 13:43:20 +05:30
Benjamin Pracht 03e90dd762 Fix for some CodeQL reported issues (#4314)
One of the fixes is in the updated protocol (logging of a request message that includes the TURN server settings, including the password).
2026-02-11 10:15:12 -08:00