Voice Connections

Voice connections operate in a similar fashion to the Gateway connection. However, they use a different set of payloads and a separate UDP-based connection for RTC data transmission. Because UDP is generally used for both receiving and transmitting RTC data, your client must be able to receive UDP packets, even through a firewall or NAT (see UDP Hole Punching for more information). The Discord voice servers implement functionality (see IP Discovery) for discovering the local machine's remote UDP IP/Port, which can assist in some network configurations. If you cannot support a UDP connection, you may implement a WebRTC connection instead.

Audio and video from a "Go Live" stream require a separate connection to another voice server. Only microphone and camera data are sent over the normal connection.

Voice Gateway

To ensure that you have the most up-to-date information, please use version 9. Otherwise, the events and commands documented here may not reflect what you receive over the socket. Video is only fully supported on Gateway v5 and above.

Gateway Versions

Version	Status	Change
9	Recommended	Added `channel_id` to Opcode 0 Identify and Opcode 7 Resume
8	Recommended	Added buffered resuming
7	Available	Added Opcode 17 Channel Options Update
6	Available	Added Opcode 16 Voice Backend Version
5	Available	Added Opcode 15 Media Sink Wants
4	Available	Changed speaking status from boolean to bitmask
3	Deprecated	Added video functionality, consolidated Opcode 8 Hello payload
2	Deprecated	Changed Gateway heartbeat reply to Opcode 6 Heartbeat ACK
1	Deprecated	Initial version

Gateway Commands

Name	Description
Identify	Start a new voice connection
Resume	Resume a dropped connection
Heartbeat	Maintain an active WebSocket connection
Media Sink Wants	Indicate the desired media stream quality
Select Protocol	Select the voice protocol and mode
Session Update	Indicate the client's supported codecs
Speaking	Indicate the user's speaking state
Video	Indicate the user's video state
Voice Backend Version	Request the current voice backend version
DAVE Protocol Transition Ready	Indicate that a DAVE transition is ready
MLS Key Package	Send an MLS key package
MLS Commit Welcome	Send an MLS commit and optional welcome
MLS Invalid Commit Welcome	Report an invalid MLS commit or welcome
No Route	Indicate that no RTC route was available

Gateway Events

Name	Description
Hello	Defines the heartbeat interval
Heartbeat ACK	Acknowledges a received client heartbeat
DAVE Protocol Execute Transition	Execute a prepared DAVE protocol or MLS group transition
DAVE Protocol Prepare Epoch	Prepare a DAVE protocol version or MLS epoch transition
DAVE Protocol Prepare Transition	Prepare a transition away from the current DAVE protocol
Clients Connect	A user connected to voice, also sent on initial connection to inform the client of existing users
Client Flags	Contains the flags of a user that connected to voice, also sent on initial connection for each existing user
Client Platform	Contains the platform type of a user that connected to voice, also sent on initial connection for each existing user
Client Disconnect	A user disconnected from voice
Media Sink Wants	Requested media stream quality updated
MLS Announce Commit Transition	Dispatches the winning MLS commit for the current epoch
MLS External Sender Package	Provides the voice server's MLS external sender package
MLS Proposals	Dispatches MLS proposals that group members must process
MLS Welcome	Welcomes a pending member into the MLS group
Ready	Contains SSRC, IP/Port, experiment, and encryption mode information
Resumed	Acknowledges a successful connection resume
Session Description	Acknowledges a successful protocol selection and contains the information needed to send/receive RTC data
Session Update	Client session description changed
Speaking	User speaking state updated
Voice Backend Version	Current voice backend version information, as requested by the client

Connecting to Voice

Retrieving Voice Server Information

The first step in connecting to a voice server (and in turn, a guild's voice channel or private channel) is formulating a request that can be sent to the Gateway, which will return information about the voice server we will connect to. Because Discord's voice platform is widely distributed, users should never cache or save the results of this call. To inform the Gateway of our intent to establish voice connectivity, we first send an Update Voice State payload.

If our request succeeded, the Gateway will respond with two events—a Voice State Update event and a Voice Server Update event—meaning you must properly wait for both events before continuing. The first will contain a new key, session_id, and the second will provide voice server information we can use to establish a new voice connection.

With this information, we can move on to establishing a voice WebSocket connection.

When changing channels within the same guild, it is possible to receive a Voice Server Update with the same endpoint as the existing session. However, the token will be changed and you cannot re-use the previous session during a channel change, even if the endpoint remains the same.
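
The wait for both events can be sketched as a small collector. This is a minimal illustration, not a definitive implementation: the class name `VoiceServerInfo` and its method are hypothetical, the event names are assumed to follow the main Gateway's dispatch naming, and the caller is assumed to pass in only Voice State Update events that belong to the current user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceServerInfo:
    """Hypothetical helper that collects the fields needed for a voice connection."""

    session_id: Optional[str] = None
    token: Optional[str] = None
    endpoint: Optional[str] = None

    def handle_dispatch(self, event_name: str, data: dict) -> None:
        # Event names are an assumption based on main Gateway dispatch naming.
        if event_name == "VOICE_STATE_UPDATE":
            # Assumed to be pre-filtered to the current user's own voice state.
            self.session_id = data.get("session_id")
        elif event_name == "VOICE_SERVER_UPDATE":
            # Always take the new token: it changes on a channel change even
            # when the endpoint stays the same.
            self.token = data.get("token")
            self.endpoint = data.get("endpoint")

    @property
    def complete(self) -> bool:
        """True once both events have supplied their fields."""
        return None not in (self.session_id, self.token, self.endpoint)


info = VoiceServerInfo()
info.handle_dispatch("VOICE_STATE_UPDATE", {"session_id": "30f32c5d54ae86130fc4a215c7474263"})
info.handle_dispatch("VOICE_SERVER_UPDATE", {"token": "66d29164ee8cd919", "endpoint": "voice.example.invalid:443"})
print(info.complete)
```

Because the results should never be cached, a fresh collector would be created for every connection attempt.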

Establishing a Voice WebSocket Connection

Once we retrieve a session_id, token, and endpoint information, we can connect and handshake with the voice server over another secure WebSocket. Unlike the Gateway endpoint we receive in a Get Gateway request, the endpoint received from our Voice Server Update payload does not contain a URL protocol, so some libraries may require manually prepending it with wss:// before connecting. Once connected to the voice WebSocket endpoint, we can immediately send an Opcode 0 Identify payload:

Identify Structure

Field	Type	Description
server_id	snowflake	The ID of the guild, private channel, stream, or lobby being connected to
channel_id ¹	snowflake	The ID of the channel being connected to
user_id	snowflake	The ID of the current user
session_id	string	The session ID of the current session
token	string	The voice token for the current session
video?	boolean	Whether this connection supports video (default false)
streams?	array[stream object]	Simulcast streams to send
max_dave_protocol_version?	integer	The maximum DAVE protocol version supported by the client (default 0)

¹ Only required for Gateway v9 and above.

Stream Structure

Field	Type	Description
type	string	The type of media stream to send
rid	string	The RTP stream ID, conventionally the stringified quality
quality?	integer	The media quality to send (0-100, default 0)
active?	boolean	Whether the stream is active (default false)
max_bitrate? ¹	integer	The maximum bitrate to send in bps
max_framerate?	integer	The maximum framerate to send in fps
max_resolution?	stream resolution object	The maximum resolution to send
ssrc?	integer	The SSRC of the stream
rtx_ssrc? ²	integer	The SSRC of the retransmission stream

¹ Not sent by the voice server.

² If omitted for a negotiated video stream, clients should derive the RTX SSRC as the primary stream ssrc + 1.

Media Type

Value	Description
video	Video
screen ¹	Screenshare
test	Speed test

¹ For stream connections, clients may offer screen in Identify. The voice server will still populate the negotiated stream as video in Ready, as video is the actual underlying media type.

Stream Resolution Structure

Field	Type	Description
type	string	The resolution type to use
width	number	The fixed resolution width, or 0 for source
height	number	The fixed resolution height, or 0 for source

Resolution Type

Value	Description
fixed	Fixed resolution
source	Source resolution

Example Identify

{
  "op": 0,
  "d": {
    "server_id": "41771983423143937",
    "channel_id": "127121515262115840",
    "user_id": "104694319306248192",
    "session_id": "30f32c5d54ae86130fc4a215c7474263",
    "token": "66d29164ee8cd919",
    "video": true,
    "streams": [
      { "type": "video", "rid": "100", "quality": 100 },
      { "type": "video", "rid": "50", "quality": 50 }
    ],
    "max_dave_protocol_version": 1
  }
}
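
As a rough sketch of the steps above, the following builds the WebSocket URL and an Identify payload from the collected values. The function names `voice_ws_url` and `build_identify` are hypothetical, and the sketch does not cover how the Gateway version is selected when connecting.

```python
import json


def voice_ws_url(endpoint: str) -> str:
    """Prepend wss:// when the endpoint carries no URL protocol."""
    return endpoint if "://" in endpoint else f"wss://{endpoint}"


def build_identify(server_id: str, channel_id: str, user_id: str,
                   session_id: str, token: str, *, video: bool = False,
                   max_dave_protocol_version: int = 0) -> str:
    """Serialize an Opcode 0 Identify payload (channel_id included for v9)."""
    payload = {
        "op": 0,
        "d": {
            "server_id": server_id,
            "channel_id": channel_id,
            "user_id": user_id,
            "session_id": session_id,
            "token": token,
            "video": video,
            "max_dave_protocol_version": max_dave_protocol_version,
        },
    }
    return json.dumps(payload)


print(voice_ws_url("voice.example.invalid:443"))
print(build_identify("41771983423143937", "127121515262115840",
                     "104694319306248192",
                     "30f32c5d54ae86130fc4a215c7474263",
                     "66d29164ee8cd919"))
```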

The voice server should respond with an Opcode 2 Ready payload, which informs us of our SSRCs and connection information:

Ready Structure

Field	Type	Description
ssrc	integer	The SSRC of the user's voice connection
ip	string	The IP address of the voice server
port	integer	The port of the voice server
modes	array[string]	Supported transport encryption modes
experiments	array[string]	Available voice experiments
streams	array[stream object]	Populated simulcast streams

Example Ready

{
  "op": 2,
  "d": {
    "ssrc": 12871,
    "ip": "127.0.0.1",
    "port": 1234,
    "modes": ["aead_aes256_gcm_rtpsize", "aead_xchacha20_poly1305_rtpsize"],
    "experiments": ["fixed_keyframe_interval"],
    "streams": [
      {
        "type": "video",
        "ssrc": 12872,
        "rtx_ssrc": 12873,
        "rid": "50",
        "quality": 50,
        "active": false
      },
      {
        "type": "video",
        "ssrc": 12874,
        "rtx_ssrc": 12875,
        "rid": "100",
        "quality": 100,
        "active": false
      }
    ]
  }
}

When streams is populated, the voice server has assigned local send SSRCs for the offered simulcast streams. Use each stream's ssrc and rtx_ssrc when announcing local video state with Opcode 12 Video, and when configuring a WebRTC packetizer.
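
A minimal sketch of reading the assigned SSRCs out of a Ready payload follows. The helper name `stream_ssrcs` is hypothetical; the fallback of `ssrc + 1` applies the derivation rule from the Stream Structure notes when rtx_ssrc is missing.

```python
def stream_ssrcs(ready_d: dict) -> dict:
    """Map each stream rid to its (ssrc, rtx_ssrc) pair from Ready's `d` object."""
    result = {}
    for stream in ready_d.get("streams", []):
        ssrc = stream["ssrc"]
        rtx_ssrc = stream.get("rtx_ssrc")
        if rtx_ssrc is None:
            # Derive the retransmission SSRC when the server omitted it.
            rtx_ssrc = ssrc + 1
        result[stream["rid"]] = (ssrc, rtx_ssrc)
    return result


ready_d = {
    "ssrc": 12871,
    "streams": [
        {"type": "video", "ssrc": 12872, "rtx_ssrc": 12873, "rid": "50"},
        {"type": "video", "ssrc": 12874, "rid": "100"},
    ],
}
print(stream_ssrcs(ready_d))
```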

Establishing a Voice Connection

Once we receive the properties of a voice server from our Ready payload, we can proceed to the final step of voice connections, which entails establishing and handshaking a connection for RTC data. First, we establish either a UDP connection using the Ready payload data, or prepare a WebRTC SDP. We then send an Opcode 1 Select Protocol with details about our connection:

Select Protocol Structure

Field	Type	Description
protocol	string	The voice protocol to use
data	?protocol data | string	The voice connection data or WebRTC SDP
rtc_connection_id?	string	The UUID RTC connection ID, used for analytics
codecs?	array[codec object]	The supported audio/video codecs
experiments?	array[string]	The received voice experiments or selected experiments to use

Protocol Type

Value	Description
udp	Standard UDP voice connection
webrtc	WebRTC voice connection
~~webrtc-p2p~~	~~WebRTC peer-to-peer voice connection~~

Protocol Data Structure

Field	Type	Description
address ¹	string	The discovered IP address of the client
port ¹	integer	The discovered UDP port of the client
mode	string	The transport encryption mode to use

¹ These fields are only used to receive RTC data. If you only wish to send frames and do not care about receiving, you can randomize these values.

Codec Structure

Field	Type	Description
name	string	The name of the codec
type	string	The type of codec
priority ¹	integer	The preferred priority of the codec as a multiple of 1000 (unique per `type`)
payload_type ²	integer	The dynamic RTP payload type of the codec
rtx_payload_type?	integer	The dynamic RTP payload type of the retransmission codec (video-only)
encode?	boolean	Whether the client supports encoding this codec (default true)
decode?	boolean	Whether the client supports decoding this codec (default true)

¹ For audio, Opus is the only available codec and should be priority 1000.

² No payload type should be set to 96, as it is reserved for probe packets.

Supported Codecs

Providing codecs is optional due to backwards compatibility with old clients and bots that do not handle video. If the client does not provide any codecs, the server assumes an Opus audio codec with a payload type of 120 and no specific video codec.

Codec support is used by the server to negotiate a send codec per-client that all other clients can decode. If multiple are supported, the one with the lowest priority will be chosen. If the client does not support any codec others can decode, the server will choose the client's highest priority encode codec. If no codecs are supported, the server will fall back to H264.

Type	Name	Status
audio	opus	Required
video	AV1	Preferred
video	H265	Preferred
video	H264	Default
video	VP8	Available
video	VP9	Available

Example Select Protocol

{
  "op": 1,
  "d": {
    "protocol": "udp",
    "data": {
      "address": "127.0.0.1",
      "port": 1337,
      "mode": "aead_aes256_gcm_rtpsize"
    },
    "codecs": [
      {
        "name": "opus",
        "type": "audio",
        "priority": 1000,
        "payload_type": 120
      },
      {
        "name": "AV1",
        "type": "video",
        "priority": 1000,
        "payload_type": 101,
        "rtx_payload_type": 102,
        "encode": false,
        "decode": true
      },
      {
        "name": "H264",
        "type": "video",
        "priority": 2000,
        "payload_type": 103,
        "rtx_payload_type": 104,
        "encode": true,
        "decode": true
      }
    ],
    "rtc_connection_id": "d6b92f64-40df-48eb-8bce-7facb043149a",
    "experiments": ["fixed_keyframe_interval"]
  }
}

Transport Encryption Mode

The RTP size variants determine the unencrypted size of the RTP header in the same way as SRTP, which considers CSRCs and (optionally) the extension preamble to be part of the unencrypted header. The deprecated variants use a fixed size unencrypted header for RTP.

The Gateway will report what encryption modes are available in Opcode 2 Ready. Compatible modes will always include aead_xchacha20_poly1305_rtpsize but may not include aead_aes256_gcm_rtpsize depending on the underlying hardware. You must support aead_xchacha20_poly1305_rtpsize. You should prefer to use aead_aes256_gcm_rtpsize when it is available.

Value	Name	Nonce	Status
aead_aes256_gcm_rtpsize	AEAD AES256 GCM (RTP Size)	32-bit incremental integer value appended to payload	Preferred
aead_xchacha20_poly1305_rtpsize	AEAD XChaCha20 Poly1305 (RTP Size)	32-bit incremental integer value appended to payload	Required
xsalsa20_poly1305_lite_rtpsize	XSalsa20 Poly1305 Lite (RTP Size)	32-bit incremental integer value appended to payload	Deprecated
aead_aes256_gcm	AEAD AES256-GCM	32-bit incremental integer value appended to payload	Deprecated
xsalsa20_poly1305	XSalsa20 Poly1305	Copy of RTP header	Deprecated
xsalsa20_poly1305_suffix	XSalsa20 Poly1305 (Suffix)	24 random bytes	Deprecated
xsalsa20_poly1305_lite	XSalsa20 Poly1305 (Lite)	32-bit incremental integer value appended to payload	Deprecated
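
The preference rule above (use aead_aes256_gcm_rtpsize when offered, otherwise fall back to the required aead_xchacha20_poly1305_rtpsize) could be sketched as follows. The function name `choose_mode` is hypothetical, and the sketch deliberately ignores the deprecated modes.

```python
PREFERRED_MODES = (
    "aead_aes256_gcm_rtpsize",          # preferred when the server offers it
    "aead_xchacha20_poly1305_rtpsize",  # required, expected to always be offered
)


def choose_mode(offered_modes: list) -> str:
    """Pick the transport encryption mode to send in Select Protocol."""
    for mode in PREFERRED_MODES:
        if mode in offered_modes:
            return mode
    raise ValueError("no supported non-deprecated encryption mode was offered")


print(choose_mode(["aead_xchacha20_poly1305_rtpsize", "xsalsa20_poly1305"]))
```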

Finally, the voice server will respond with an Opcode 4 Session Description that includes the mode and secret_key, a 32 byte array used for sending and receiving RTC data:

Session Description Structure

Field	Type	Description
audio_codec ¹	string	The audio codec to use
video_codec ¹	string	The video codec to use
media_session_id	string	The media session ID, used for analytics
mode?	string	The transport encryption mode to use, not applicable to WebRTC
secret_key?	array[integer]	The 32 byte secret key used for encryption, not applicable to WebRTC
sdp?	string	The WebRTC session description protocol
keyframe_interval?	integer	The keyframe interval in milliseconds
bandwidth_estimation_experiment?	string	The selected bandwidth estimation experiment
dave_protocol_version?	integer	The DAVE protocol version to use, where 0 indicates no DAVE support

¹ Note that these describe the codecs the client should send. Other clients may send media in a different codec that you indicated decode support for.

Example Session Description

{
  "op": 4,
  "d": {
    "audio_codec": "opus",
    "media_session_id": "89f1d62f166b948746f7646713d39dbb",
    "mode": "aead_aes256_gcm_rtpsize",
    "secret_key": [ ... ],
    "video_codec": "H264",
    "dave_protocol_version": 1
  }
}

We can now start sending and receiving RTC data over the previously established UDP or WebRTC connection.

Session Updates

At any time, the client may update the codecs it supports by sending an Opcode 14 Session Update. If a user joins who does not support the current codecs, or a user indicates that they no longer support the current codecs, the voice server will send an Opcode 14 Session Update to the affected clients.

This may also be sent to update the current media_session_id or keyframe_interval.

Session Update Structure (Send)

Field	Type	Description
codecs	array[codec object]	The supported audio/video codecs

Session Update Structure (Receive)

Field	Type	Description
audio_codec?	string	The new audio codec to use
video_codec?	string	The new video codec to use
media_session_id?	string	The new media session ID, used for analytics
keyframe_interval?	integer	The keyframe interval in milliseconds

End-to-End Encryption

Since September 2024, Discord has migrated voice and video in private channels, voice channels, and streams to use end-to-end encryption (E2EE) through the DAVE protocol. When any DAVE protocol is enabled for a call, the full contents of media frames sent and received by call participants are end-to-end encrypted.

This section is a high-level overview of how to support Discord's audio & video end-to-end encryption (DAVE) protocol, centered around the Gateway opcodes necessary for the protocol. The most thorough documentation on the DAVE protocol is found in the protocol whitepaper. You may additionally be able to leverage or refer to Discord's open-source library libdave to assist your implementation. The exact format of the DAVE protocol opcodes is detailed in the opcodes section of the protocol whitepaper.

When a call is E2EE, all members of the call exchange keys via a Messaging Layer Security (MLS) group. This group is used to derive per-sender ratcheted media keys (known only to the participants of the group) to encrypt/decrypt media frames sent in the call.

Binary Websocket Messages

To reduce overhead, some of the new DAVE protocol opcodes are sent as binary instead of JSON text. See the format column in voice opcodes to identify them. Client-to-server binary messages start with a 1-byte opcode followed by the payload. Server-to-client binary messages on Gateway v8 and above include a 2-byte sequence number before the opcode:

Field	Type	Description	Size
Sequence ¹	Unsigned short (big endian)	Sequence number	2 bytes
Opcode	Unsigned byte	Opcode value	1 byte
Payload	Binary data	Format defined by opcode	Variable bytes

¹ Sequence numbers are only sent from the server to the client on Gateway v8 and above. See Buffered Resume for further details on how sequence numbers are used when present.
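
The binary framing described above can be sketched with the standard `struct` module. The function names are hypothetical, and the caller is assumed to already know which Gateway version the connection uses.

```python
import struct
from typing import Optional, Tuple


def parse_server_binary(message: bytes, gateway_version: int) -> Tuple[Optional[int], int, bytes]:
    """Split a server-to-client binary message into (sequence, opcode, payload)."""
    if gateway_version >= 8:
        if len(message) < 3:
            raise ValueError("binary message too short for sequence and opcode")
        # Big-endian unsigned short sequence number, then a one-byte opcode.
        sequence, opcode = struct.unpack_from(">HB", message, 0)
        return sequence, opcode, message[3:]
    if len(message) < 1:
        raise ValueError("binary message too short for opcode")
    return None, message[0], message[1:]


def build_client_binary(opcode: int, payload: bytes) -> bytes:
    """Client-to-server messages carry only the opcode byte and the payload."""
    return bytes([opcode]) + payload


print(parse_server_binary(b"\x00\x0a\x19abc", 8))
```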

Indicating DAVE Protocol Support

Include the highest DAVE protocol version you support in Opcode 0 Identify as max_dave_protocol_version. Sending version 0, or omitting the max_dave_protocol_version field, indicates no DAVE protocol support.

The voice Gateway specifies the initial protocol version in Opcode 4 Session Description under dave_protocol_version. This may be any non-discontinued protocol version equal to or less than your supported protocol version.

Protocol Transitions

The voice server negotiates protocol version and MLS group transitions to ensure the continuity of media being sent for the call. This can occur when the call is upgrading/downgrading to/from E2EE (in the initial transition phase), changing protocol versions, or when the MLS group is changing.

Some opcodes include a transition ID. After preparing local state necessary to perform the transition, send Opcode 23 DAVE Protocol Transition Ready to indicate to the Gateway that you are ready to execute the transition. When all participants are ready or when a timeout has been reached, the Gateway dispatches Opcode 22 DAVE Protocol Execute Transition to confirm execution of the transition. The transition execution is what indicates to media senders that they can begin sending media with the new protocol context (e.g. without E2EE after a downgrade, with a new protocol version after a protocol version change, or using a new key ratchet after a group participant change).

Downgrade

Downgrades to protocol version 0 are announced via Opcode 21 DAVE Protocol Prepare Transition. This can occur during the transition phase when a client that does not support the protocol joins the call. When this transition is executed, senders should stop sending media using the protocol format.

Version Change & Upgrade

Protocol version transitions (including upgrades from protocol version 0) are announced via Opcode 24 DAVE Protocol Prepare Epoch. In addition to the transition_id, this opcode includes the epoch for the upcoming MLS epoch.

Receiving Opcode 24 DAVE Protocol Prepare Epoch with epoch = 1 indicates that a new MLS group is being created. Participants must:

Prepare a local MLS group with the parameters appropriate for the DAVE protocol version
Generate and send Opcode 26 MLS Key Package to deliver a new MLS key package to the Gateway

When the epoch is greater than 1, the protocol version of the existing MLS group is changing.

When the transition is executed, senders must start sending media using the new protocol context (e.g. formatted for the new protocol version or using a new key ratchet).

MLS Group Changes

When the participants of the MLS group must change, existing participants receive an Opcode 29 MLS Announce Commit Transition, whereas new members being added to the group receive Opcode 30 MLS Welcome. Both opcodes include the transition ID and binary MLS Commit or MLS Welcome message.

To prepare for the protocol transition, existing group members must apply the commit to progress their local MLS group to the correct next state. Opcode 23 DAVE Protocol Transition Ready is sent when the MLS commit has been processed.

Welcomed members send Opcode 23 DAVE Protocol Transition Ready after successfully joining the group received in the MLS Welcome message.

External Sender

The voice server must be an external sender of the MLS group, so that it can send external MLS proposals to add and remove call participants when appropriate (i.e. proposing the addition of new members when they connect and the removal of previous members when they disconnect).

DAVE protocol participants only process proposals which arrive from the external sender, and not from any other group members. The external sender only sends Add or Remove proposals.

The Gateway uses Opcode 25 MLS External Sender Package to provide the external sender public key and credential to MLS group participants. This message may be sent immediately on Gateway connect or at a later time when the call is upgrading to use the DAVE protocol.

Group creators must include the external sender they receive from the Gateway in their MLS group extensions when creating the group. Welcomed group members ensure that the expected external sender extension is present in the group they are about to join.

Joining the MLS Group

Except for the initial creation of the first group for the call, joining the MLS group always occurs after receiving Opcode 30 MLS Welcome.

Key Packages

To be proposed for addition to the MLS group, pending members must send an MLS key package via Opcode 26 MLS Key Package. Key packages are only used one time, and a new key package must be generated each time a pending member is waiting to be added or re-added to the group.

Identity Public Key

MLS participants use an asymmetric keypair for MLS message signatures and authentication. The public key of this keypair is included in the key package and MLS tree. It is known to other participants in the call and is leveraged for out-of-band identity verification.

You can choose to generate a new ephemeral keypair for every protocol call or use the same persistent keypair at all times. Keys can be uploaded and verified using Upload Voice Public Key and Verify Voice Public Key respectively.

Initial Group

When there is not yet an MLS group (e.g. a transport-only encrypted call is upgrading or two members have just joined a new call), all pending group members create a local group using the MLS parameters defined by the DAVE protocol version and including the voice server external sender received via Opcode 25 MLS External Sender Package. Every pending member of the group has the chance to produce the initial commit that creates the MLS group with epoch = 1.

Pending group members receive add proposals for every other pending group member from the Gateway. If an additional pending member joins while there is not yet an MLS group, they receive all in-flight proposal messages.

Proposal and commit handling follows the same process whether or not there is an established group. See Proposals and Commits.

Welcome

Pending group members receive a welcome message from another group member which adds them to the MLS group. This is dispatched from the Gateway via Opcode 30 MLS Welcome.

Invalid Group

If the group received in an Opcode 30 MLS Welcome or Opcode 29 MLS Announce Commit Transition is unprocessable, the member receiving the unprocessable message sends Opcode 31 MLS Invalid Commit Welcome to the Gateway. Additionally, the local group state is reset and a new key package is generated and sent to the Gateway via Opcode 26 MLS Key Package.

This causes the Gateway to propose the removal and re-addition of the requesting member.

Proposals and Commits

The Gateway dispatches proposals which must be appended or revoked via Opcode 27 MLS Proposals. All members of the established or pending MLS group must append or revoke the proposals they receive, and then produce an MLS commit message and optionally an MLS welcome message (when committing add proposals which add new members) which they send to the Gateway via Opcode 28 MLS Commit Welcome.

In each epoch, the Gateway dispatches the "winning" commit via Opcode 29 MLS Announce Commit Transition and optionally the associated welcome messages via Opcode 30 MLS Welcome. The Gateway broadcasts the first valid commit and welcome(s) it sees in the given epoch, and drops any commits later received for the out-of-date epoch. All dispatched unrevoked proposals in the epoch must be included in the commit for it to be valid. All members added in the epoch must be welcomed for the welcome to be valid.

Payload Format

Some fields in the protocol frame payload use ULEB128 encoding. This is a variable-length code compression to represent arbitrarily large unsigned integers in a small number of bytes.

Field	Type	Description	Size
Media Frame	Binary data	Interleaved unencrypted and encrypted media frame	Variable bytes
Authentication Tag	Binary data	Truncated AES128-GCM AEAD Authentication Tag	8 bytes
Nonce	ULEB128	Truncated synchronization nonce	Variable bytes
Unencrypted Ranges	ULEB128	Unencrypted range offset and length pairs	Variable bytes
Supplemental Data Size	Unsigned integer (big endian)	Byte size of supplemental data	1 byte
Magic Marker	Binary data	`0xFAFA` marker to assist with protocol frame identification	2 bytes
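
For readers unfamiliar with ULEB128, here is a minimal encoder and decoder sketch. It follows the general ULEB128 scheme (seven value bits per byte, least significant group first, high bit set on every byte except the last); the function names are illustrative only.

```python
from typing import Tuple


def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128."""
    if value < 0:
        raise ValueError("ULEB128 only represents unsigned integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # final group, high bit clear
            return bytes(out)


def uleb128_decode(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one ULEB128 value, returning (value, offset after the value)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated ULEB128 value")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


print(uleb128_encode(300).hex())
print(uleb128_decode(uleb128_encode(300)))
```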

Media Frame

The encrypted frame transformer is codec-aware and processes incoming encoded frames from WebRTC to determine which ranges must be left unencrypted so that they can pass through the WebRTC packetizer and depacketizer.

All of the (potentially discontiguous) encrypted ranges are joined together, in their order in the original frame, to be encrypted as one block of plaintext, using the AES128-GCM AEAD encryption described below.

All of the (potentially discontiguous) unencrypted ranges from the frame are joined together and included as additional data to be authenticated by the AEAD ciphersuite. This ensures the SFU is unable to include or replace content in user media frames.

In the resulting interleaved protocol media frame, the unencrypted ranges remain unmodified in their original location from the incoming frame. Encrypted ranges are replaced by their associated ciphertext range. The encrypting frame transformer may mutate the encoded frame it receives to ensure it can pass through the packetizer and depacketizer in an expected and reproducible manner.

Authentication Tag

The authentication tag is an 8-byte truncated version of the authentication tag resulting from the AEAD encryption.

Nonce

The ULEB128 nonce is a variable length representation of the nonce used for encryption/decryption.

Unencrypted Ranges

The unencrypted ranges identify which portions of the interleaved protocol media frame are plaintext and which are ciphertext. Each included range is represented as a byte offset and byte size pair, with both encoded using ULEB128. Unencrypted ranges are ordered by their ascending byte offset. The encrypting frame transformer is codec-aware, and processes each incoming encoded frame to determine the unencrypted ranges for the frame. The decrypting frame transformer deserializes the unencrypted ranges from the protocol supplemental data, and reconstructs the merged additional data and ciphertext necessary for decryption.
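
The reconstruction step on the receiving side can be sketched as follows, assuming the ranges have already been decoded into `(offset, size)` pairs and that `media_frame` is the interleaved media frame without its supplemental data. The function name `split_frame` is hypothetical.

```python
from typing import Iterable, Tuple


def split_frame(media_frame: bytes,
                unencrypted_ranges: Iterable[Tuple[int, int]]) -> Tuple[bytes, bytes]:
    """Return (additional_data, ciphertext) from an interleaved media frame."""
    additional_data = bytearray()
    ciphertext = bytearray()
    cursor = 0
    for offset, size in unencrypted_ranges:
        # Ranges must be ascending, non-overlapping, and inside the frame.
        if offset < cursor or offset + size > len(media_frame):
            raise ValueError("invalid unencrypted range")
        ciphertext += media_frame[cursor:offset]              # encrypted gap
        additional_data += media_frame[offset:offset + size]  # plaintext range
        cursor = offset + size
    ciphertext += media_frame[cursor:]  # anything after the last range
    return bytes(additional_data), bytes(ciphertext)


print(split_frame(b"HHcccTTcc", [(0, 2), (5, 2)]))
```

The two returned buffers correspond to the merged additional data and the merged ciphertext described above.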

Supplemental Data Size

The supplemental data size is the sum of bytes required for:

8-byte authentication tag
Variable length ULEB128 nonce
Variable length ULEB128 unencrypted ranges
1 byte supplemental data size
2 byte magic marker

Magic Marker

The magic marker is a constant 2-byte value 0xFAFA. This is used by media receivers to detect protocol frames as well as by the SFU to avoid sending protocol frames to non-protocol-supporting receivers during transition periods.
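
Putting the payload format together, a receiver could locate the supplemental data by working backwards from the magic marker. The following is a sketch under the field layout given in this section only; `parse_protocol_frame` and `_read_uleb128` are hypothetical names, and a production parser would need the additional validation described in the protocol whitepaper.

```python
from typing import List, Optional, Tuple

MAGIC_MARKER = b"\xfa\xfa"
# 8-byte tag + at least 1 nonce byte + 1 size byte + 2 marker bytes.
MIN_SUPPLEMENTAL_SIZE = 12


def _read_uleb128(data: bytes, pos: int, end: int) -> Tuple[int, int]:
    """Decode one ULEB128 value from data[pos:end]."""
    result, shift = 0, 0
    while True:
        if pos >= end:
            raise ValueError("truncated ULEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_protocol_frame(frame: bytes) -> Optional[Tuple[bytes, bytes, int, List[Tuple[int, int]]]]:
    """Return (media_frame, tag, truncated_nonce, ranges), or None if not a protocol frame."""
    if len(frame) < MIN_SUPPLEMENTAL_SIZE or frame[-2:] != MAGIC_MARKER:
        return None
    supplemental_size = frame[-3]
    if supplemental_size < MIN_SUPPLEMENTAL_SIZE or supplemental_size > len(frame):
        raise ValueError("invalid supplemental data size")
    start = len(frame) - supplemental_size
    end = len(frame) - 3  # stop before the size byte and the marker
    tag = frame[start:start + 8]
    nonce, pos = _read_uleb128(frame, start + 8, end)
    ranges = []
    while pos < end:
        offset, pos = _read_uleb128(frame, pos, end)
        size, pos = _read_uleb128(frame, pos, end)
        ranges.append((offset, size))
    return frame[:start], tag, nonce, ranges


example = b"media" + b"\x01" * 8 + b"\x05" + b"\x00\x02" + b"\x0e" + MAGIC_MARKER
print(parse_protocol_frame(example))
```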

Payload Encryption

Media frames are encrypted for E2EE using AES128-GCM. Depending on the protocol, some bytes may be left unencrypted to allow for packetization and depacketization of frames. For more detail, see the codec handling section of the protocol whitepaper.

Sender Key Derivation

Each media sender has a ratcheted per-sender key. There is a new per-sender ratchet created in each MLS group epoch. The initial secret for each sender's ratchet is an exported 16-byte secret from the MLS group. Keys are retrieved from the ratchet via a generation counter derived from the most-significant byte of the 4-byte nonce.

For very long lived epochs, the nonce wrap-around must be handled so the generation does not also wrap back around to 0.

See the sender key derivation section of the protocol whitepaper for the detailed process.

Authentication Tag

The authentication tag resulting from the AES128-GCM encryption is truncated to 8 bytes. Some implementations accept the desired tag length as a parameter, whereas others always return the full 16-byte tag, from which the 8 trailing bytes should be removed.

Nonce

The nonce passed to the AES128-GCM encryption and decryption functions is a full 12-byte nonce, but the protocol only uses at most 4-bytes. The 12-byte nonce can be expanded from a 4-byte truncated nonce by setting the 8 most significant bytes of the nonce to zero, with the 4 least significant bytes carrying the value of the truncated nonce.

The generation used for the sender's key ratchet is retrieved from the most-significant byte of the 4-byte nonce (i.e. the 4th least significant byte of the full 12-byte nonce).
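
A small sketch of the two operations described above follows. The function names are hypothetical. Placing the eight zero bytes first in the buffer, and the byte order of the trailing four bytes, are assumptions that this section does not settle, so `byteorder` is left as an explicit argument to be confirmed against the protocol whitepaper.

```python
def generation_from_nonce(truncated_nonce: int) -> int:
    """Key ratchet generation: the most significant byte of the 4-byte nonce."""
    if not 0 <= truncated_nonce <= 0xFFFFFFFF:
        raise ValueError("truncated nonce must fit in 4 bytes")
    return truncated_nonce >> 24


def expand_nonce(truncated_nonce: int, byteorder: str) -> bytes:
    """Expand a 4-byte truncated nonce into the 12-byte AES128-GCM nonce.

    Assumption: the 8 zero bytes come first and the truncated nonce fills
    the last 4 bytes; `byteorder` ("big" or "little") must be verified.
    """
    if not 0 <= truncated_nonce <= 0xFFFFFFFF:
        raise ValueError("truncated nonce must fit in 4 bytes")
    return bytes(8) + truncated_nonce.to_bytes(4, byteorder)


print(generation_from_nonce(0x02000010))
print(expand_nonce(0x02000010, "big").hex())
```

This sketch does not handle nonce wrap-around for very long-lived epochs, which the sender key derivation section calls out separately.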

AEAD Additional Data

The additional data passed to the AEAD encryption and decryption functions is the concatenation of all unencrypted ranges from the frame. This ensures that the SFU cannot modify any unencrypted content in the frame without being detected by receivers.

Heartbeating

In order to maintain your WebSocket connection, you need to continuously send heartbeats at the interval determined in Opcode 8 Hello.

This is sent at the start of the connection. Be warned that the Opcode 8 Hello structure differs by Gateway version. Versions below v3 follow a flat structure without op or d fields, including only a single heartbeat_interval field. Be sure to expect this different format based on your version.

This heartbeat interval is the minimum interval you should heartbeat at. You can heartbeat at a faster interval if you wish. For example, the web client uses a heartbeat interval of min(heartbeat_interval, 5000) if the Gateway version is v4 or above, and heartbeat_interval * 0.1 otherwise. The desktop client uses the provided heartbeat interval if the Gateway version is v4 or above, and heartbeat_interval * 0.25 otherwise.
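
The two client behaviours described above translate directly into code. The function names are illustrative, and the intervals are in milliseconds, as in the Hello payload.

```python
def web_client_interval(heartbeat_interval: float, gateway_version: int) -> float:
    """Interval used by the web client, per the description above."""
    if gateway_version >= 4:
        return min(heartbeat_interval, 5000)
    return heartbeat_interval * 0.1


def desktop_client_interval(heartbeat_interval: float, gateway_version: int) -> float:
    """Interval used by the desktop client, per the description above."""
    if gateway_version >= 4:
        return heartbeat_interval
    return heartbeat_interval * 0.25


print(web_client_interval(41250, 8))
print(desktop_client_interval(41250, 8))
```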

Hello Structure

Field	Type	Description
v	integer	The voice server version
heartbeat_interval	integer	The minimum interval (in milliseconds) the client should heartbeat at

Example Hello

{
  "op": 8,
  "d": {
    "v": 8,
    "heartbeat_interval": 41250
  }
}

The Gateway may request a heartbeat from the client in some situations by sending an Opcode 3 Heartbeat. When this occurs, the client should immediately send an Opcode 3 Heartbeat without waiting the remainder of the current interval.

After receiving Opcode 8 Hello, you should send Opcode 3 Heartbeat—which contains an integer nonce—every elapsed interval:

Heartbeat Structure

Field	Type	Description
t	integer	A unique integer nonce (e.g. the current unix timestamp)
seq_ack?	integer	The last received sequence number

Example Heartbeat

{
  "op": 3,
  "d": {
    "t": 1501184119561,
    "seq_ack": 10
  }
}

Since Gateway v8, heartbeat messages must include seq_ack which contains the sequence number of the last numbered message received from the gateway. See Buffered Resume for more information. Previous versions follow a flat structure, with the d field representing the t field in both the Heartbeat and Heartbeat ACK structure.
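A minimal sketch of the version split described above, assuming the caller tracks the last received sequence number itself. The helper names are hypothetical.

```javascript
// Build an Opcode 3 Heartbeat for the given Gateway version.
function buildHeartbeat(gatewayVersion, nonce, lastSeq) {
  if (gatewayVersion >= 8) {
    // v8 and above: structured payload carrying seq_ack.
    return { op: 3, d: { t: nonce, seq_ack: lastSeq } };
  }
  // Earlier versions: flat structure, d is the nonce itself.
  return { op: 3, d: nonce };
}

// Read the echoed nonce out of an Opcode 6 Heartbeat ACK.
function heartbeatAckNonce(gatewayVersion, ack) {
  return gatewayVersion >= 8 ? ack.d.t : ack.d;
}
```

Comparing the echoed nonce against the one last sent is one way to detect a stale or missing ACK.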

In return, you will be sent back an Opcode 6 Heartbeat ACK that contains the previously sent nonce:

Example Heartbeat ACK

{
  "op": 6,
  "d": {
    "t": 1501184119561
  }
}

UDP Connections

UDP is the most likely protocol that clients will use. First, we open a UDP connection to the IP and port provided in the Ready payload. If required, we can now perform an IP Discovery using this connection. Once we've fully discovered our external IP and UDP port, we can then tell the voice WebSocket what it is by sending a Select Protocol as outlined above, and receive our Session Description to begin sending/receiving RTC data.

IP Discovery

Routers on the Internet generally mask or obfuscate UDP ports through a process called NAT. Most users who implement voice will want to utilize IP discovery to find their external IP and port, which will then be used for receiving voice communications. To retrieve your external IP and port, send the following UDP packet to your voice port (all numeric fields are big endian):

Field	Type	Description	Size
Type	Unsigned short (big endian)	Values `0x1` and `0x2` indicate request and response, respectively	2 bytes
Length	Unsigned short (big endian)	Message length excluding Type and Length fields (value `70`)	2 bytes
SSRC	Unsigned integer (big endian)	The SSRC of the user	4 bytes
Address	Null-terminated string	The external IP address of the user	64 bytes
Port	Unsigned short (big endian)	The external port number of the user	2 bytes
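The layout in the table adds up to 74 bytes (2 + 2 + 4 + 64 + 2). The sketch below builds a request and parses a response with that layout. The function names are hypothetical, and leaving Address and Port zeroed in the request is an assumption, since the table does not say what a request carries there.

```javascript
const IP_DISCOVERY_SIZE = 74; // Type + Length + SSRC + Address + Port

function buildIpDiscoveryRequest(ssrc) {
  const packet = new Uint8Array(IP_DISCOVERY_SIZE);
  const view = new DataView(packet.buffer); // DataView defaults to big endian
  view.setUint16(0, 0x1); // Type: request
  view.setUint16(2, 70); // Length, excluding Type and Length
  view.setUint32(4, ssrc >>> 0); // SSRC
  return packet; // assumption: Address and Port stay zeroed in a request
}

function parseIpDiscoveryResponse(packet) {
  if (packet.length < IP_DISCOVERY_SIZE) throw new Error("packet too short");
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  if (view.getUint16(0) !== 0x2) throw new Error("not a discovery response");
  const addressBytes = packet.subarray(8, 72);
  const end = addressBytes.indexOf(0); // null terminator
  const address = new TextDecoder().decode(
    addressBytes.subarray(0, end === -1 ? addressBytes.length : end)
  );
  return { ssrc: view.getUint32(4), address, port: view.getUint16(72) };
}
```

The parsed address and port are the values to report in Select Protocol.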

UDP Ping

Clients may also send a small UDP ping on the same socket. Pinging should start shortly after the UDP socket connects, using a 5 second timeout. Successful responses may be used as the UDP RTT. Media receivers should filter these packets before RTP/RTCP decoding.

UDP Ping Structure

Field	Type	Description	Size
Magic	Unsigned integer (big endian)	`0x1337CAFE` for requests, `0x1337F00D` for responses	4 bytes
Sequence	Unsigned integer	Client-chosen sequence echoed by the response	4 bytes
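A small sketch of building a ping and recognizing its response, so that these packets can be filtered out before RTP/RTCP decoding. The function names are hypothetical, and encoding Sequence as big endian is an assumption because the table does not state its byte order.

```javascript
const UDP_PING_REQUEST = 0x1337cafe;
const UDP_PING_RESPONSE = 0x1337f00d;

function buildUdpPing(sequence) {
  const packet = new Uint8Array(8);
  const view = new DataView(packet.buffer);
  view.setUint32(0, UDP_PING_REQUEST); // Magic, big endian
  view.setUint32(4, sequence >>> 0); // assumption: Sequence is big endian too
  return packet;
}

// Returns the echoed sequence, or null when the packet is not a ping response.
function readUdpPingResponse(packet) {
  if (packet.length < 8) return null;
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  if (view.getUint32(0) !== UDP_PING_RESPONSE) return null;
  return view.getUint32(4);
}
```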

Sending and Receiving Media

Despite the heading, the UDP transport carries all voice, camera, and stream media. Audio is encoded with Opus at 48kHz stereo. Video is encoded with the selected codec from the Session Description, then packetized according to that codec's RTP payload format.

UDP media uses RTP for media packets and RTCP for sender reports, receiver reports, and video feedback. IP discovery and UDP ping packets are Discord UDP control packets and are not RTP or RTCP.

The outbound media pipeline is:

Encode an audio or video frame
Apply DAVE to the encoded frame if the session has an active usable DAVE transition
Packetize the encoded frame as RTP
Add RTP header extensions for audio level, speaking state, transport sequence, playout delay, RID, and other negotiated metadata as applicable
Encrypt the RTP packet with the transport secret_key and selected mode
Send the encrypted packet to the selected UDP endpoint

The inbound pipeline is the reverse: receive UDP, decrypt transport encryption, parse RTP or RTCP, undo RTX/NACK repair when applicable, depacketize encoded frames, decrypt DAVE when applicable, then decode or dispatch the media.

Transport encryption between the client and the selective forwarding unit (SFU) is still used even in E2EE calls.

In RTP-size AEAD modes, the encrypted UDP packet carries a small nonce suffix that must be stripped before decrypting. The packet size protected by the AEAD authentication tag includes the RTP header and encrypted body, so decryptors should not treat the suffix as RTP payload.

When receiving media, the sender is identified by caching SSRC mappings from Speaking and Video events. Audio-only clients can usually rely on the Speaking event arriving before media, but full media clients should still treat SSRC mapping as state: video, RTX, and stream SSRCs are announced separately and can change when users enable camera, stream, or simulcast layers.

RTP Packet Structure

Field	Type	Description	Size
Version + Flags ¹	Unsigned byte	The RTP version and flags; version 2 with no padding, extension, or CSRCs is `0x80`	1 byte
Payload Type ²	Unsigned byte	Marker bit plus the payload type (`0x78` with the default Opus configuration)	1 byte
Sequence	Unsigned short (big endian)	The RTP sequence number, wraps at `65535`	2 bytes
Timestamp	Unsigned integer (big endian)	The RTP timestamp; Opus commonly advances by `960` per 20ms frame; video uses a 90kHz clock	4 bytes
SSRC	Unsigned integer (big endian)	The SSRC for the media or RTX stream	4 bytes
CSRCs?	array[integer]	Optional contributing sources when the CSRC count flag is non-zero	n bytes
Extension? ³	Binary data	Optional RTP header extension block, usually using the one-byte extension profile `0xBEDE`	n bytes
Payload	Binary data	Encrypted audio, video, or RTX payload	n bytes

¹ If sending an RTP header extension, set the extension bit (1 << 4).

² For video, set the marker bit (1 << 7) on the final RTP packet of an encoded frame.

³ With RTP-size AEAD transport modes, the clear authenticated data is only the fixed RTP header, any CSRCs, and the 4 byte RTP extension preamble. The individual RTP extension elements are encrypted with the RTP payload.
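To make the table and footnotes concrete, the sketch below builds the fixed 12-byte RTP header and computes how many leading bytes stay clear as authenticated data in RTP-size AEAD modes (fixed header, CSRCs, and the 4-byte extension preamble). The function names are hypothetical, and the builder does not emit CSRCs or extensions.

```javascript
// Build a fixed RTP header: version 2, no padding, extension, or CSRCs.
function buildRtpHeader({ payloadType, marker = false, sequence, timestamp, ssrc }) {
  const header = new Uint8Array(12);
  const view = new DataView(header.buffer);
  header[0] = 0x80;
  header[1] = (marker ? 0x80 : 0x00) | (payloadType & 0x7f);
  view.setUint16(2, sequence & 0xffff); // wraps at 65535
  view.setUint32(4, timestamp >>> 0);
  view.setUint32(8, ssrc >>> 0);
  return header;
}

// Number of leading bytes that remain unencrypted in RTP-size AEAD modes.
function rtpSizeClearLength(packet) {
  const csrcCount = packet[0] & 0x0f;
  const hasExtension = (packet[0] & 0x10) !== 0;
  return 12 + 4 * csrcCount + (hasExtension ? 4 : 0);
}
```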

Native RTP Header Extensions

Discord uses the one-byte RTP header extension profile (0xBEDE). Extension IDs are negotiated out of band by the Discord client and RTC worker rather than through a public SDP document on UDP connections.

ID	URI	Applies to	Description
1	`urn:ietf:params:rtp-hdrext:ssrc-audio-level`	Audio	One byte. The high bit is voice activity and the lower 7 bits are audio level
2	`urn:ietf:params:rtp-hdrext:toffset`	Video	RTP timestamp offset from the send time
3	`http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time`	Audio, video	Compact send-time value used by congestion control
4	`urn:3gpp:video-orientation`	Video	Encoded camera orientation
5	`http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01`	Video	Transport sequence number used for congestion control
6	`http://www.webrtc.org/experiments/rtp-hdrext/playout-delay`	Video	Minimum and maximum receiver playout delay. Discord expects this on video packets
7	`http://www.webrtc.org/experiments/rtp-hdrext/video-content-type`	Video	Indicates normal video or screen content
8	`http://www.webrtc.org/experiments/rtp-hdrext/video-timing`	Video	Optional encode/decode timing metadata
9	`https://discord.com/#rtp-hdrext/2018-07-29/speaker`	Audio	Custom Discord speaking extension
10	`urn:ietf:params:rtp-hdrext:sdes:mid`	Audio, video	Media section identifier
11	`urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id`	Audio, video	RID for primary media packets, such as `100` or `50`
12	`urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id`	Audio, video	RID for repaired packets, normally RTX video packets

Audio RTP

Opus audio commonly uses payload type 120 unless a different payload type is selected by Session Description. A normal 20ms Opus frame advances the RTP timestamp by 960 samples at a 48kHz clock. See Voice Data Interpolation for the silence-frame shutdown behavior.

Clients should send a Speaking payload before sending audible Opus packets. The Discord speaking RTP extension can also carry packet-level speaking state, while the audio-level extension carries VAD and level metadata for receivers.

Discord Speaking Extension

The custom Discord speaking extension is audio RTP extension ID 9 with URI https://discord.com/#rtp-hdrext/2018-07-29/speaker. Its payload is a single byte that encodes the Gateway speaking flags value from the speaking field.

Extension bit	Speaking flag
`0x01`	`PRIORITY`
`0x02`	`VOICE`
`0x04`	`SOUNDSHARE`

When sending, let speaking_flags be the integer used in the Gateway speaking field. Convert it into the extension byte as:

extension = ((speaking_flags & 0x03) << 1) | ((speaking_flags & 0x04) >> 2)

This shifts VOICE and SOUNDSHARE one bit left and moves PRIORITY from 0x04 to 0x01. For example, VOICE | PRIORITY (0x05) becomes 0x03.

When receiving a speech packet, a missing extension ID 9 or an extension value of 0x00 should fall back to VOICE speaking. If bit 0x01 is set, receivers should also implicitly set VOICE. Opus silence packets clear speaking state regardless of this extension.
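The send-side formula and the receive-side fallbacks can be sketched together. The reverse mapping here is derived from the bit table above rather than quoted from the source, the names are hypothetical, and silence-packet handling is left to the caller.

```javascript
const SPEAKING_VOICE = 1 << 0;
const SPEAKING_SOUNDSHARE = 1 << 1;
const SPEAKING_PRIORITY = 1 << 2;

// Gateway speaking flags -> extension byte.
function speakingFlagsToExtension(speakingFlags) {
  return ((speakingFlags & 0x03) << 1) | ((speakingFlags & 0x04) >> 2);
}

// Extension byte (or undefined when extension ID 9 is missing) -> Gateway flags.
function extensionToSpeakingFlags(extension) {
  if (extension === undefined || extension === 0x00) return SPEAKING_VOICE;
  let flags = ((extension & 0x06) >> 1) | ((extension & 0x01) << 2);
  if (extension & 0x01) flags |= SPEAKING_VOICE; // PRIORITY implies VOICE
  return flags;
}
```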

Video RTP

Clients must send Video state before sending camera or stream video. The RTP SSRC must match one of the announced video streams, and the RTX SSRC must match that stream's repair SSRC.

Video RTP uses a 90kHz timestamp clock. Encoded frames may be split across multiple RTP packets; only the last packet for the frame should have the RTP marker bit set. Primary video packets use the negotiated video payload type and the primary stream RID extension. RTX packets use the negotiated RTX payload type and repaired RID extension.

RTCP and RTX

Discord uses RTCP sender reports and receiver reports for quality and clock information. Clients should send an RTCP Sender Report roughly every 5 seconds for each sent media SSRC. Discord sends RTCP Receiver Reports back to the client with packet loss, jitter, and timing feedback.

Video clients should also handle RTCP Generic NACK feedback. When the server reports missing video packets, retransmit them as RTX packets when they are still available. RTX packets use the stream's rtx_ssrc, an RTX payload type, and a payload beginning with the original RTP sequence number followed by the original media payload. Receivers map RTX packets back to the primary media SSRC before depacketizing.
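The RTX payload layout described above can be sketched as follows, assuming the original sequence number is written as a 2-byte big-endian value like the RTP sequence field itself. The helper names are hypothetical.

```javascript
// Wrap an original media payload for retransmission.
function buildRtxPayload(originalSequence, originalPayload) {
  const payload = new Uint8Array(2 + originalPayload.length);
  new DataView(payload.buffer).setUint16(0, originalSequence & 0xffff);
  payload.set(originalPayload, 2);
  return payload;
}

// Recover the original sequence number and media payload on receive.
function splitRtxPayload(rtxPayload) {
  if (rtxPayload.length < 2) throw new Error("RTX payload too short");
  const view = new DataView(rtxPayload.buffer, rtxPayload.byteOffset, rtxPayload.byteLength);
  return { originalSequence: view.getUint16(0), payload: rtxPayload.subarray(2) };
}
```

The retransmitted packet still needs its own RTP header with the stream's rtx_ssrc, the RTX payload type, and a sequence number from the RTX stream.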

RTCP packets are protected by the same transport encryption. For RTP-size AEAD modes, RTCP feedback packets keep the RTCP header clear as authenticated data and encrypt the feedback body.

Media Receive Loop

RTP and RTCP are multiplexed on the same UDP socket. The clear packet header is enough to route packets before transport decryption: RTCP packets use RTCP packet types such as Sender Report (200), Receiver Report (201), RTPFB (205), and PSFB (206), while RTP packets use the negotiated media payload types.
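One way to route packets from the clear header is to look at the second byte. The sketch below treats values 192 through 223 as RTCP, which is the conventional RTP/RTCP multiplexing range and covers the packet types listed above; applying that exact range to Discord traffic is an assumption, and the function name is hypothetical.

```javascript
// Classify a UDP datagram before transport decryption.
function classifyUdpPacket(packet) {
  if (packet.length < 2 || packet[0] >> 6 !== 2) return "other"; // not RTP version 2
  const second = packet[1];
  return second >= 192 && second <= 223 ? "rtcp" : "rtp";
}
```

IP discovery and UDP ping packets do not start with RTP version 2, so they fall into "other" here.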

After transport decryption, parse RTP headers, extensions, payload type, sequence, timestamp, and SSRC before dispatching to a decoder. Audio payloads are decoded as Opus. Video payloads must first be reordered, repaired through RTX when possible, depacketized according to the negotiated codec, DAVE-decrypted when applicable, and then decoded.

Receivers should route media to sinks by user ID when the SSRC is known. During short races where RTP arrives before the matching Gateway state, implementations can queue briefly by SSRC or route to an SSRC-based fallback sink, then attach the user ID once the Speaking or Video event arrives.

Quality of Service

Discord utilizes RTCP packets to monitor connection quality, synchronize audio and video, and repair lost video frames.

At minimum, media clients should parse RTCP Receiver Reports and send RTCP Sender Reports. Video clients should additionally parse RTCP transport feedback and Generic NACK so they can update congestion state and retransmit recently sent packets through RTX.

The voice server also uses Media Sink Wants to communicate desired send quality. While RTCP describes packet delivery and timing, Media Sink Wants describes what video layers and approximate pixel counts the SFU wants the sender to provide.

WebRTC Connections

WebRTC is the browser-compatible voice transport. Despite the name, modern Discord WebRTC voice is not peer-to-peer between users. The browser establishes a WebRTC connection to Discord's RTC worker/SFU, while the voice Gateway WebSocket continues to carry signaling, user, SSRC, video, and media-sink state.

WebRTC replaces the UDP-specific parts of the flow. It does not use IP Discovery, UDP protocol data, mode, secret_key, or the transport encryption modes from UDP connections. Browser media is protected by ICE, DTLS, and SRTP; when DAVE is active, encoded media frames are additionally encrypted with DAVE.

Peer Connection Configuration

Modern Discord WebRTC clients use Unified Plan and bundle all media onto one ICE/DTLS transport:

const pc = new RTCPeerConnection({
  bundlePolicy: "max-bundle",
  sdpSemantics: "unified-plan",
  encodedInsertableStreams: daveEnabled,
});

Create the base receive transceivers before the first offer. These establish stable media sections and mid values for answer generation:

const audio = pc.addTransceiver("audio", { direction: "recvonly" });
const video = pc.addTransceiver("video", { direction: "recvonly" });

When the local microphone or camera is enabled, replace the sender track on the matching transceiver and set its direction to sendrecv. When a local track is removed, replace it with null and set the direction back to recvonly. If the track identity changes, renegotiate.

If DAVE is enabled, attach encoded frame transforms to every sender and receiver before media flows. Browser support may be exposed through RTCRtpScriptTransform or through the older RTCRtpSender.createEncodedStreams() and RTCRtpReceiver.createEncodedStreams() APIs.

Local Offer Processing

The browser's full local offer is not sent to the voice server. Discord web clients derive three pieces of state from the offer:

SDP fragment: Sent as data in Select Protocol
Codec list: Sent as codecs in Select Protocol
Outbound streams: Kept locally to synthesize the eventual browser remote answer from the server-provided SDP data

Outbound Streams

For every media section in the browser offer, record:

Field	Source	Description
`type`	`m=<type>`	`audio` or `video`
`mid`	`a=mid:<mid>`	Browser media-section ID
`direction`	media direction attribute	One of `sendrecv`, `sendonly`, `recvonly`, `inactive`

This list is later used to generate one answer media section for each offered media section. Do not remove, reorder, or collapse entries in this list.

Codec Extraction

Extract codecs from the offer's a=rtpmap and a=fmtp lines. Discord clients advertise only codecs that are present in the browser offer. Opus is used for audio. For video, modern clients prefer H265 when it is enabled and present, otherwise they use H264 first, followed by VP8 and VP9.

For each codec:

Find the a=rtpmap:<payload> <codec>/<clock> entry for the codec name.
Set payload_type to that RTP payload number.
For video, find a matching RTX payload by locating an a=fmtp:<rtx-payload> apt=<payload> line whose apt points back to the video payload, then find the corresponding a=rtpmap:<rtx-payload> rtx/90000 line.
Set rtx_payload_type to the RTX payload number for video, or null for audio.
Assign codec priority by codec order within each media type, multiplied by 1000 on the wire.

The codec order used by modern clients is:

Audio: opus
Video: H265, H264, VP8, and VP9
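The extraction steps above can be sketched as a single pass over the offer. This is an illustration under assumptions: the `name`, `type`, and `priority` field names are guesses at the wire shape, priority is taken as the 1-based position among the codecs actually present times 1000, and the first matching a=rtpmap entry wins when a codec appears with several profiles.

```javascript
const CODEC_ORDER = { audio: ["opus"], video: ["H265", "H264", "VP8", "VP9"] };

function extractCodecs(sdp) {
  const rtpmaps = []; // { payload, name } in offer order
  const apt = new Map(); // RTX payload -> media payload it repairs
  for (const line of sdp.split(/\r?\n/)) {
    let match = /^a=rtpmap:(\d+)\s+([^/]+)\//.exec(line);
    if (match) rtpmaps.push({ payload: Number(match[1]), name: match[2].toLowerCase() });
    match = /^a=fmtp:(\d+)\s+apt=(\d+)/.exec(line);
    if (match) apt.set(Number(match[1]), Number(match[2]));
  }

  const codecs = [];
  for (const [type, names] of Object.entries(CODEC_ORDER)) {
    let position = 0;
    for (const name of names) {
      const entry = rtpmaps.find((r) => r.name === name.toLowerCase());
      if (!entry) continue; // only advertise codecs present in the offer
      position += 1;
      const rtx =
        type === "video"
          ? rtpmaps.find((r) => r.name === "rtx" && apt.get(r.payload) === entry.payload)
          : undefined;
      codecs.push({
        name,
        type,
        priority: position * 1000, // assumption: 1-based order times 1000
        payload_type: entry.payload,
        rtx_payload_type: rtx ? rtx.payload : null,
      });
    }
  }
  return codecs;
}
```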

For browser-generated offers, the browser chooses the dynamic payload types. Do not rewrite browser WebRTC payload types to UDP defaults. For example, Opus is usually payload type 111 in browser offers, not the default UDP Opus payload type 120. Non-browser WebRTC stacks that construct their own offer may choose different dynamic payload types, but the payload numbers must remain consistent across the local SDP, Select Protocol codec list, and generated answer.

Local SSRC Extraction

When a local media section is sendrecv, extract the local SSRCs from a=ssrc lines:

Media	Extracted value	Source
audio	Audio SSRC	First audio `a=ssrc:<ssrc> cname:...` in a `sendrecv` audio section
video	Video SSRC	First video `a=ssrc:<ssrc> cname:...` in a `sendrecv` video section
video	RTX SSRC	Last video `a=ssrc:<ssrc> cname:...` in a `sendrecv` video section

The local audio/video SSRCs are used for Speaking, Video, and DAVE sender state. The video and RTX SSRCs should also be reflected in the Video payload's stream parameters.

Select Protocol SDP Fragment

The data field in a WebRTC Select Protocol payload is not the full local SDP. It is a stripped SDP fragment built from the browser's local offer after ICE gathering has completed.

Build it from the full local SDP using this exact rule:

Keep every line matching ^a=(extmap-allow-mixed|ice-|fingerprint|extmap:).
Keep only a=rtpmap lines for Opus, VP8, and the RTX payload whose apt points at VP8.
Remove duplicates.
Join the remaining lines with \n.

Other video codecs are still advertised in the codecs field of Select Protocol when they are present in the browser offer; their a=rtpmap lines are not included in this SDP fragment.

In pseudocode:

const vp8Codec = codecs.find((codec) => codec.name === "VP8");

const data = localSdp
  .split(/\r?\n/)
  .filter((line) => {
    if (/^a=(extmap-allow-mixed|ice-|fingerprint|extmap:)/i.test(line)) return true;
    if (/^a=rtpmap:\d+\s+opus\//i.test(line)) return true;
    if (/^a=rtpmap:\d+\s+VP8\//i.test(line)) return true;
    const rtxRtpmap = /^a=rtpmap:(\d+)\s+rtx\//i.exec(line);
    return rtxRtpmap != null && Number(rtxRtpmap[1]) === vp8Codec?.rtx_payload_type;
  })
  .filter((line, index, lines) => lines.indexOf(line) === index)
  .join("\n");

The fragment intentionally does not contain v=, o=, s=, t=, m=, c=, a=group, a=mid, a=setup, a=rtcp-mux, a=sendrecv, a=recvonly, a=ssrc, a=fmtp, or a=rtcp-fb lines.

Example Select Protocol SDP Fragment

a=extmap-allow-mixed
a=ice-ufrag:9WZo
a=ice-pwd:vcfFowC3gQI1KHu0Fm5ZTXum
a=ice-options:trickle
a=fingerprint:sha-256 71:20:4C:BE:C2:D0:B7:9B:73:5B:4B:29:7C:32:41:25:D8:D2:BC:66:74:D3:93:98:B3:0D:01:F7:67:19:01:13
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid
a=rtpmap:111 opus/48000/2
a=extmap:14 urn:ietf:params:rtp-hdrext:toffset
a=extmap:13 urn:3gpp:video-orientation
a=extmap:5 http://www.webrtc.org/experiments/rtp-hdrext/playout-delay
a=extmap:6 http://www.webrtc.org/experiments/rtp-hdrext/video-content-type
a=extmap:7 http://www.webrtc.org/experiments/rtp-hdrext/video-timing
a=extmap:8 http://www.webrtc.org/experiments/rtp-hdrext/color-space
a=extmap:10 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:11 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=rtpmap:96 VP8/90000
a=rtpmap:97 rtx/90000

Server SDP Validation

For WebRTC, Session Description contains sdp instead of mode and secret_key. The server sdp must include the transport information needed to construct the browser remote answer.

Validate at least the following before generating the answer:

Required data	Required line pattern or field
DTLS fingerprint	`a=fingerprint:...`
ICE username fragment	`a=ice-ufrag:...`
ICE password	`a=ice-pwd:...`
ICE candidate	`a=candidate:...`
Connection address	`c=<nettype> <addrtype> <connection-address>`

The c= line must have at least three space-separated components. If any of these are absent, the SDP cannot produce a valid browser remote description.
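A minimal validation pass over the server sdp might look like the following. It only checks for the presence of the listed lines and the shape of the c= line; the function name and the labels it returns are hypothetical.

```javascript
// Returns the names of required pieces that are missing (empty when valid).
function validateServerSdp(sdp) {
  const lines = sdp.split(/\r?\n/);
  const has = (prefix) => lines.some((line) => line.startsWith(prefix));
  const missing = [];
  if (!has("a=fingerprint:")) missing.push("fingerprint");
  if (!has("a=ice-ufrag:")) missing.push("ice-ufrag");
  if (!has("a=ice-pwd:")) missing.push("ice-pwd");
  if (!has("a=candidate:")) missing.push("candidate");
  const connection = lines.find((line) => line.startsWith("c="));
  const parts = connection ? connection.slice(2).trim().split(/\s+/) : [];
  if (parts.length < 3) missing.push("connection");
  return missing;
}
```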

Generating the Browser Remote Answer

The sdp value from Session Description is not enough by itself to describe all remote users and browser transceivers. Discord web clients synthesize a complete RTCSessionDescription of type answer by combining:

The server sdp transport/codec template,
The selected audio_codec and video_codec,
The selected audio, video, and RTX payload types from the local offer,
The local offer's outbound stream list (type, mid, direction),
Known remote user audio/video SSRCs from Speaking and Video, and
The RTP header extensions extracted from the local offer.

The generated answer has the following session-level shape:

v=0
o=- 1420070400000 0 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE <space-separated mids>
a=msid-semantic: WMS *

The BUNDLE mids are the mid values of every generated media section that has a mid.

For every media section from the local offer's outbound stream list, generate exactly one answer media section, in the same order. The answer direction is based on the offer direction:

Offer direction	Answer direction when a remote SSRC is assigned	Answer direction when no remote SSRC is assigned
`recvonly`	`sendonly`	`inactive`
`sendonly`	`recvonly`	`recvonly`
`sendrecv`	`sendrecv`	`recvonly`
`inactive`	`inactive`	`inactive`
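The direction table translates directly into a lookup. The function name is hypothetical, and throwing on an unrecognized direction is a choice made for this sketch.

```javascript
// Answer direction for one media section, per the table above.
function answerDirection(offerDirection, hasRemoteSsrc) {
  switch (offerDirection) {
    case "recvonly":
      return hasRemoteSsrc ? "sendonly" : "inactive";
    case "sendonly":
      return "recvonly";
    case "sendrecv":
      return hasRemoteSsrc ? "sendrecv" : "recvonly";
    case "inactive":
      return "inactive";
    default:
      throw new Error(`unknown direction: ${offerDirection}`);
  }
}
```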

The remote answer must keep the same m-line count and order as the local offer. If another remote user is discovered and there are not enough inactive receive transceivers for that media type, add more recvonly transceivers and create a new local offer before generating the next answer. Do not reorder existing transceivers.

The transceiver and remote-user assignment rules above describe browser clients. Custom WebRTC stacks that construct their own local offer with fixed send/receive m-lines can generate a simpler answer for those offered m-lines, as long as the answer preserves the offer's m-line count and order and uses the negotiated transport, codec, SSRC, and RTP extension values consistently.

Each generated media section uses:

Property	Value
`m=` protocol	`UDP/TLS/RTP/SAVPF`
`a=setup`	`passive` for the answer
`a=mid`	The original offered media section's `mid`
`a=rtcp-mux`	Present
payloads	Selected codec payload, plus RTX payload for video when used

Custom WebRTC stacks that construct their own answers may include a=ice-lite for easier implementation.

Answer Audio Media Sections

For an audio media section:

SDP field	Value
`a=rtpmap`	Selected audio payload with `opus/48000/2`
`a=fmtp`	For Opus: `minptime=10;useinbandfec=1;usedtx=<0 or 1>`
`a=maxptime`	`60`
`a=rtcp-fb`	`transport-cc`, optionally `nack`, except in Firefox-specific handling
`a=extmap`	Audio level and transport-wide congestion control, when offered

usedtx is 0 when the local client is sending video and 1 otherwise.

Answer Video Media Sections

For a video media section:

SDP field	Value
`a=rtpmap`	Selected video payload with a 90 kHz clock rate
`a=fmtp`	`x-google-max-bitrate=<kbps>`
H264 `a=fmtp`	Also include `level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01f`
`a=rtcp-fb`	`ccm fir`, `nack`, `nack pli`, `goog-remb`, and `transport-cc`
RTX `a=rtpmap`	RTX payload with `rtx/90000`, when RTX is used
RTX `a=fmtp`	`apt=<video-payload>`
`a=extmap`	Video timestamp, orientation, congestion-control, and playout-delay extensions, when offered

Video answers include RTX by appending the RTX payload to the media payload list and adding the RTX rtpmap and fmtp lines.

Answer SSRC and MSID Lines

For an assigned stream, generate SSRC metadata from the remote user ID and SSRC. For a primary SSRC S, user ID U, and media sentinel a for audio or v for video:

a=ssrc:S cname:U-S
a=ssrc:S msid:U-S <sentinel>U-S
a=ssrc:S mslabel:U-S
a=ssrc:S label:<sentinel>U-S

In Unified Plan, browsers generally require the media-level a=msid and only the cname SSRC attribute:

a=msid:U-S <sentinel>U-S
a=ssrc:S cname:U-S

For video with RTX, include an FID SSRC group and matching SSRC metadata for the retransmission SSRC. The web SDP generator derives the answer-side retransmission SSRC as the primary video SSRC plus one; it does not consume the rtx_ssrc value from a received Video stream object when building the answer:

a=ssrc-group:FID <video-ssrc> <rtx-ssrc>
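As an illustration of the Unified Plan form, the sketch below emits the msid, cname, and optional FID lines for one assigned stream. The function name, the line ordering, and reusing the primary stream's cname for the retransmission SSRC are assumptions; only the line formats and the plus-one RTX derivation come from the text above.

```javascript
// SSRC/MSID lines for one assigned remote stream in a Unified Plan answer.
function unifiedPlanSsrcLines(userId, ssrc, kind, withRtx = false) {
  const sentinel = kind === "audio" ? "a" : "v";
  const id = `${userId}-${ssrc}`;
  const lines = [`a=msid:${id} ${sentinel}${id}`];
  if (kind === "video" && withRtx) {
    const rtxSsrc = ssrc + 1; // answer-side RTX SSRC is the primary plus one
    lines.push(`a=ssrc-group:FID ${ssrc} ${rtxSsrc}`);
    lines.push(`a=ssrc:${ssrc} cname:${id}`);
    lines.push(`a=ssrc:${rtxSsrc} cname:${id}`); // assumption: same cname
  } else {
    lines.push(`a=ssrc:${ssrc} cname:${id}`);
  }
  return lines;
}
```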

Transceivers, SSRCs, and Remote Users

Discord still uses voice Gateway events to identify users and SSRCs. WebRTC clients should not rely only on browser track arrival order to identify speakers.

Use Speaking events to learn a user's audio SSRC. Use Video events to learn a user's video SSRC and stream parameters, including any rtx_ssrc metadata reported by the Gateway. The browser answer assigns receive media sections from the primary audio and video SSRCs; RTX metadata in the answer is generated from the assigned primary video SSRC. When a new remote SSRC appears, ensure there is a receive transceiver available for the corresponding media type and renegotiate if necessary.

In Unified Plan, the generated remote description should assign incoming SSRCs to media sections by mid. When there are more remote audio or video streams than inactive receive transceivers, add new recvonly transceivers and create a new offer before applying the next answer. Do not reorder existing transceivers because browsers require remote answers to keep the offer's m-line order.

Incoming MediaStreamTrack objects can be mapped back to users using the SSRC/user mapping from Gateway events and, where present, the SDP msid/stream labels. Treat the voice Gateway SSRC mapping as authoritative.

Local Media State

The WebRTC transport uses normal browser MediaStreamTrack objects for local microphone and camera media. Muting should stop microphone media from being sent; web clients can do this by disabling the local audio track and sending a non-speaking state. Replacing the sender track with null is used when the local stream or track is removed. Camera changes require renegotiation when the video track changes.

When a local camera stream starts, clients should also send Opcode 12 Video with the current audio SSRC, video SSRC, RTX SSRC, and stream parameters. When camera stops, send another Video payload indicating an inactive or zero video SSRC state, depending on the negotiated state.

Screenshare/Go Live streams use separate stream connections, as described in Streams, even when the transport for that stream connection is WebRTC.

WebRTC RTP Header Extensions

The Select Protocol SDP fragment sends every a=extmap line from the local offer. During answer generation, include only extensions that make sense for the media section. Note that not all extensions are necessarily offered in every WebRTC client or available in every browser.

Browser clients let the browser serialize RTP header extensions. Custom packetizers should use the negotiated extension IDs from their SDP and include Discord-required extensions themselves; in particular, video packets are expected to carry the playout-delay extension when that extension is negotiated.

The common Discord extension URIs are listed in Native RTP Header Extensions, but the numeric IDs in WebRTC are the IDs from SDP, not the native UDP IDs. For example, a browser offer might use audio-level as a=extmap:1 ... and transport-wide CC as a=extmap:3 ..., while the native UDP video map uses transport-wide CC ID 5.
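Because the IDs differ per offer, a custom packetizer can look them up by URI instead of hard-coding them. A minimal sketch, with a hypothetical function name, that keeps the first ID seen for each URI:

```javascript
// Map each RTP header extension URI in an SDP to its negotiated numeric ID.
function extensionIdsFromSdp(sdp) {
  const ids = new Map();
  for (const line of sdp.split(/\r?\n/)) {
    // a=extmap:<id>[/<direction>] <uri> [attributes]
    const match = /^a=extmap:(\d+)(?:\/\S+)?\s+(\S+)/.exec(line);
    if (match && !ids.has(match[2])) ids.set(match[2], Number(match[1]));
  }
  return ids;
}
```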

Congestion, Quality, and Sink Wants

After WebRTC is connected, the client should continue sending Media Sink Wants for remote video streams.

WebRTC clients should monitor RTCPeerConnection.getStats() for packet loss, jitter, frames, bitrate, and round-trip time. Discord clients use these stats for connection quality, video quality, ping display, stream health, and analytics.

WebRTC and DAVE

DAVE negotiation uses the same voice Gateway fields and opcodes for UDP and WebRTC. The WebRTC-specific requirement is that encoded frame encryption/decryption must be attached to each relevant sender and receiver. If encoded transforms are unavailable, the client should advertise DAVE protocol version 0 and expect calls that require E2EE to close with the relevant voice close code.

Speaking

To notify the voice server that you are speaking or have stopped speaking, send an Opcode 5 Speaking payload:

Speaking Structure

Field	Type	Description
speaking ¹	integer	The speaking flags
ssrc	integer	The SSRC of the speaking user
user_id ²	snowflake	The user ID of the speaking user
delay? ³	integer	The speaking packet delay

¹ For Gateway v3 and below, this field is a boolean.

² Only sent by the voice server.

³ Not sent by the voice server.

Speaking Flags

Value	Name	Description
1 << 0	VOICE	Normal transmission of voice audio
1 << 1	SOUNDSHARE	Transmission of context audio for video, no speaking indicator
1 << 2	PRIORITY	Priority speaker, lowering audio of other speakers

Example Speaking (Send)

{
  "op": 5,
  "d": {
    "speaking": 5,
    "delay": 0,
    "ssrc": 1
  }
}

When a different user's speaking state is updated, and for each user with a speaking state at connection start, the voice server will send an Opcode 5 Speaking payload:

Example Speaking (Receive)

{
  "op": 5,
  "d": {
    "speaking": 5,
    "ssrc": 2,
    "user_id": "852892297661906993"
  }
}

Voice Data Interpolation

When there's a break in the sent data, the packet transmission shouldn't simply stop. Instead, send five frames of silence (0xF8, 0xFF, 0xFE) before stopping to avoid unintended Opus interpolation with subsequent transmissions.

Likewise, when you receive these five frames of silence, you know that the user has stopped speaking.
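A short sketch of both sides of this rule: producing the trailing silence frames when a transmission ends, and recognizing one on receive. The names are hypothetical, and counting five consecutive frames before clearing speaking state is left to the caller.

```javascript
const OPUS_SILENCE_FRAME = Uint8Array.of(0xf8, 0xff, 0xfe);
const SILENCE_FRAME_COUNT = 5;

// Frames to send, one per packet, after the last audible Opus frame.
function trailingSilenceFrames() {
  return Array.from({ length: SILENCE_FRAME_COUNT }, () => OPUS_SILENCE_FRAME.slice());
}

// True when a decrypted Opus payload is exactly the silence frame.
function isOpusSilenceFrame(payload) {
  return (
    payload.length === OPUS_SILENCE_FRAME.length &&
    payload.every((byte, index) => byte === OPUS_SILENCE_FRAME[index])
  );
}
```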

Video

To notify the voice server that you are sending video, send an Opcode 12 Video payload:

Video Structure

Field	Type	Description
audio_ssrc	integer	On send, this connection's audio SSRC from Ready. On receive, the remote user's audio SSRC associated with this video state
video_ssrc	integer	On send, the selected primary outbound video SSRC, or `0` when clearing video. On receive, the remote user's primary video SSRC
rtx_ssrc ¹	integer	On send, the RTX SSRC paired with `video_ssrc`, or `0` when clearing video. This should match the selected stream's `rtx_ssrc` when RTX is active
streams	array[stream object]	Current video stream state. For simulcast, this is the authoritative list of primary and RTX SSRCs for every layer. Send an empty array when clearing local video
user_id ²	snowflake	The user ID of the video user

¹ The top-level rtx_ssrc is not sent by the voice server. Received stream objects can still include rtx_ssrc.

² Only sent by the voice server.

Example Video (Send)

{
  "op": 12,
  "d": {
    "audio_ssrc": 13959,
    "video_ssrc": 13960,
    "rtx_ssrc": 13961,
    "streams": [
      {
        "type": "video",
        "rid": "100",
        "ssrc": 13960,
        "active": true,
        "quality": 100,
        "rtx_ssrc": 13961,
        "max_bitrate": 9000000,
        "max_framerate": 60,
        "max_resolution": {
          "type": "source",
          "width": 0,
          "height": 0
        }
      }
    ]
  }
}

When a different user's video state is updated, and for each user with a video state at connection start, the voice server will send an Opcode 12 Video payload:

Example Video (Receive)

{
  "op": 12,
  "d": {
    "user_id": "852892297661906993",
    "audio_ssrc": 13959,
    "video_ssrc": 13960,
    "streams": [
      {
        "ssrc": 13960,
        "rtx_ssrc": 13961,
        "rid": "100",
        "quality": 100,
        "max_resolution": {
          "width": 0,
          "type": "source",
          "height": 0
        },
        "max_framerate": 60,
        "active": true
      }
    ]
  }
}

Sending Video

Video state is negotiated in three places:

Identify advertises whether this voice connection supports video and which local simulcast RIDs the client supports.
Ready assigns the actual primary and RTX SSRCs for those streams.
Video announces which of those assigned streams are currently active.

The top-level video_ssrc/rtx_ssrc pair should point at the selected primary outbound stream. The streams array carries the full active video state, including every simulcast layer a receiver may map or request.

When a local source is paused, send another Video payload with that stream's active flag set to false. When resuming, send active: true and prefer sending a keyframe as soon as possible so receivers can decode without waiting for an old reference frame.
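The pause step can be sketched in Python as follows. This is a minimal illustration that assumes the client keeps its last sent Video payload as a dictionary; `set_stream_active` is a hypothetical helper name, not part of any library.

```python
import copy


def set_stream_active(video_payload: dict, rid: str, active: bool) -> dict:
    """Return a copy of an Opcode 12 Video payload with one layer's active flag changed."""
    updated = copy.deepcopy(video_payload)
    for stream in updated["d"]["streams"]:
        if stream["rid"] == rid:
            stream["active"] = active
    return updated


# The last Video payload this client sent (same values as the send example).
current = {
    "op": 12,
    "d": {
        "audio_ssrc": 13959,
        "video_ssrc": 13960,
        "rtx_ssrc": 13961,
        "streams": [
            {
                "type": "video",
                "rid": "100",
                "ssrc": 13960,
                "active": True,
                "quality": 100,
                "rtx_ssrc": 13961,
                "max_bitrate": 9000000,
                "max_framerate": 60,
                "max_resolution": {"type": "source", "width": 0, "height": 0},
            }
        ],
    },
}

# Payload to send when the local source is paused.
paused = set_stream_active(current, "100", False)
# Payload to send when it resumes; the encoder should also emit a keyframe.
resumed = set_stream_active(paused, "100", True)
```

Returning a copy rather than mutating in place keeps the previously sent state available for comparison, which is a design choice of this sketch rather than a protocol requirement.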

RTP packetization, header extensions, and RTX retransmission details are covered in Video RTP and RTCP and RTX.

Receiving Video

Receiving clients should cache SSRC ownership from every received Video payload:

audio_ssrc maps the user's audio stream.
Each stream ssrc maps primary video RTP for that user.
Each stream rtx_ssrc maps repaired video RTP for the same stream.
rid and quality identify the simulcast layer represented by the stream.

Video RTP packets are not self-describing enough to choose a user, stream, or sink without this state. If a packet arrives before the Video event that maps its SSRC, queue it briefly or drop it; do not assume all unknown video packets belong to the speaking audio SSRC.
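One possible shape for this cache is sketched below in Python. The class and field names (`SsrcRegistry`, `SsrcOwner`, `kind`) are hypothetical, and replacing all of a user's mappings on every Video payload is an assumption of this sketch, not documented server behavior.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SsrcOwner:
    user_id: str
    kind: str  # "audio", "video", or "rtx"
    rid: Optional[str] = None
    quality: Optional[int] = None


class SsrcRegistry:
    """Maps SSRCs to their owning user and stream, built from Video payloads."""

    def __init__(self) -> None:
        self._owners: Dict[int, SsrcOwner] = {}

    def apply_video(self, d: dict) -> None:
        user_id = d["user_id"]
        # Assumption: each Video payload carries the user's full current state,
        # so earlier mappings for that user are dropped first.
        self._owners = {
            ssrc: owner
            for ssrc, owner in self._owners.items()
            if owner.user_id != user_id
        }
        if d.get("audio_ssrc"):
            self._owners[d["audio_ssrc"]] = SsrcOwner(user_id, "audio")
        for stream in d.get("streams", []):
            rid, quality = stream.get("rid"), stream.get("quality")
            self._owners[stream["ssrc"]] = SsrcOwner(user_id, "video", rid, quality)
            if stream.get("rtx_ssrc"):
                self._owners[stream["rtx_ssrc"]] = SsrcOwner(user_id, "rtx", rid, quality)

    def lookup(self, ssrc: int) -> Optional[SsrcOwner]:
        """Return the owner of an SSRC, or None if no Video event has mapped it yet."""
        return self._owners.get(ssrc)


registry = SsrcRegistry()
registry.apply_video({
    "user_id": "852892297661906993",
    "audio_ssrc": 13959,
    "video_ssrc": 13960,
    "streams": [
        {"ssrc": 13960, "rtx_ssrc": 13961, "rid": "100", "quality": 100, "active": True}
    ],
})
```

A `None` result from `lookup` is the case where the packet should be queued briefly or dropped.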

Receivers request layers through Media Sink Wants. The SFU may still send packets during transitions, so receivers should tolerate short overlap between old and new layer choices.

Resuming Voice Connection

When your client detects that its connection has been severed, it should open a new WebSocket connection. Once the new connection has been opened, your client should send an Opcode 7 Resume payload:

Resume Structure

Field	Type	Description
server_id	snowflake	The ID of the guild, private channel, stream, or lobby being connected to
channel_id ²	snowflake	The ID of the channel being connected to
session_id	string	The session ID of the current session
token	string	The voice token for the current session
seq_ack? ¹	integer	The last received sequence number

¹ Only available on Gateway v8 and above.

² Only required for Gateway v9 and above.

Example Resume

{
  "op": 7,
  "d": {
    "server_id": "41771983423143937",
    "channel_id": "127121515262115840",
    "session_id": "30f32c5d54ae86130fc4a215c7474263",
    "token": "66d29164ee8cd919",
    "seq_ack": 10
  }
}

If successful, the voice server will respond with an Opcode 9 Resumed to signal that your client is now resumed:

Example Resumed

{
  "op": 9,
  "d": null
}

If the resume is unsuccessful (for example, due to an invalid session), the WebSocket connection will close with the appropriate close code. You should then follow the Connecting flow to reconnect.

Buffered Resume

Since version 8, the Gateway can resend buffered messages that have been lost upon resume. To support this, the Gateway includes a sequence number with all messages that may need to be re-sent.

Example Message With Sequence Number

{
  "op": 5,
  "d": {
    "speaking": 0,
    "delay": 0,
    "ssrc": 110
  },
  "seq": 10
}

A client using Gateway v8 or above must include the last sequence number it received as seq_ack in the d object of both the Opcode 3 Heartbeat and Opcode 7 Resume payloads. If no sequence-numbered messages have been received, seq_ack can be omitted or sent with a value of -1.

The Gateway uses a fixed-bit-length sequence number and handles wrapping it around. Since Gateway messages always arrive in order, a client only needs to retain the last sequence number it has seen.

If the session is successfully resumed, the Gateway will respond with an Opcode 9 Resumed and will re-send any messages that the client did not receive.

The resume may be unsuccessful if the buffer for the session no longer contains a message that has been missed. In this case the session will be closed and you should then follow the Connecting flow to reconnect.
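The client-side bookkeeping for buffered resume can be sketched in Python as below. `ResumeState` and its method names are hypothetical; the payload fields follow the Resume Structure table, and omitting seq_ack when nothing has been received is one of the two options described above.

```python
from typing import Optional


class ResumeState:
    """Tracks what a client needs in order to build an Opcode 7 Resume payload."""

    def __init__(self, server_id: str, channel_id: str, session_id: str, token: str) -> None:
        self.server_id = server_id
        self.channel_id = channel_id
        self.session_id = session_id
        self.token = token
        self.last_seq: Optional[int] = None

    def observe(self, message: dict) -> None:
        # Only some messages carry "seq". Messages arrive in order, so the
        # latest value is simply stored, with no comparison against the old one
        # (the Gateway handles wraparound).
        if "seq" in message:
            self.last_seq = message["seq"]

    def resume_payload(self) -> dict:
        d = {
            "server_id": self.server_id,
            "channel_id": self.channel_id,
            "session_id": self.session_id,
            "token": self.token,
        }
        if self.last_seq is not None:
            d["seq_ack"] = self.last_seq
        return {"op": 7, "d": d}


state = ResumeState(
    "41771983423143937",
    "127121515262115840",
    "30f32c5d54ae86130fc4a215c7474263",
    "66d29164ee8cd919",
)
fresh = state.resume_payload()  # no seq_ack yet
state.observe({"op": 5, "d": {"speaking": 0, "delay": 0, "ssrc": 110}, "seq": 10})
state.observe({"op": 9, "d": None})  # no "seq" key, so last_seq is unchanged
after = state.resume_payload()
```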

Connected Clients

Client Connections

At connection start, and when a client thereafter connects to voice, the voice server will send a series of events. This includes an Opcode 11 Clients Connect containing every connected user, as well as individual Opcode 18 Client Flags and Opcode 20 Client Platform for each user.

These events are meant to inform a new client of all existing clients and their flags/platform, and inform existing clients of a newly-connected client.

Clients Connect Structure

Field	Type	Description
user_ids	array[snowflake]	The IDs of the users that connected

Example Clients Connect

{
  "op": 11,
  "d": {
    "user_ids": ["852892297661906993"]
  }
}

Client Flags Structure

Field	Type	Description
user_id	snowflake	The ID of the user that connected
flags	?integer	The user's voice flags

Voice Flags

Value	Name	Description
1 << 0	CLIPS_ENABLED	User has clips enabled
1 << 1	ALLOW_VOICE_RECORDING	User has allowed their voice to be recorded in another user's clips
1 << 2	ALLOW_ANY_VIEWER_CLIPS	User has allowed stream viewers to clip them

Example Client Flags

{
  "op": 18,
  "d": {
    "user_id": "852892297661906993",
    "flags": 3
  }
}
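Because flags is a bitmask, it can be decoded with a flag enumeration. The Python sketch below uses the values from the Voice Flags table; `parse_voice_flags` is a hypothetical helper, and treating a null flags value as "no flags set" is an assumption.

```python
from enum import IntFlag
from typing import Optional


class VoiceFlags(IntFlag):
    CLIPS_ENABLED = 1 << 0
    ALLOW_VOICE_RECORDING = 1 << 1
    ALLOW_ANY_VIEWER_CLIPS = 1 << 2


def parse_voice_flags(value: Optional[int]) -> VoiceFlags:
    # flags is nullable in the payload; null is mapped to an empty flag set here.
    return VoiceFlags(value or 0)


flags = parse_voice_flags(3)  # value from the example payload
clips_enabled = VoiceFlags.CLIPS_ENABLED in flags
viewer_clips = VoiceFlags.ALLOW_ANY_VIEWER_CLIPS in flags
```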

Client Platform Structure

Field	Type	Description
user_id	snowflake	The ID of the user that connected
platform	?integer	The user's voice platform

Voice Platform

Value	Name	Description
0	DESKTOP	Desktop-based client
1	MOBILE	Mobile client
2	XBOX	Xbox integration
3	PLAYSTATION	PlayStation integration

Example Client Platform

{
  "op": 20,
  "d": {
    "user_id": "852892297661906993",
    "platform": 0
  }
}

Client Disconnections

When a user disconnects from voice, the voice server will send an Opcode 13 Client Disconnect. When received, the SSRCs of the user should be discarded.

Client Disconnect Structure

Field	Type	Description
user_id	snowflake	The ID of the user that disconnected

Example Client Disconnect

{
  "op": 13,
  "d": {
    "user_id": "852892297661906993"
  }
}
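Taken together, the connect, flags, platform, and disconnect events can drive a small client roster. The Python sketch below is one way to do that; `VoiceRoster` is a hypothetical name, and creating a roster entry when a flags or platform event arrives before Clients Connect is an assumption made for robustness.

```python
from typing import Dict


class VoiceRoster:
    """Tracks connected users from Opcodes 11, 13, 18, and 20."""

    def __init__(self) -> None:
        self.clients: Dict[str, dict] = {}

    def _entry(self, user_id: str) -> dict:
        return self.clients.setdefault(user_id, {"flags": None, "platform": None})

    def handle(self, message: dict) -> None:
        op, d = message["op"], message["d"]
        if op == 11:  # Clients Connect
            for user_id in d["user_ids"]:
                self._entry(user_id)
        elif op == 18:  # Client Flags
            self._entry(d["user_id"])["flags"] = d["flags"]
        elif op == 20:  # Client Platform
            self._entry(d["user_id"])["platform"] = d["platform"]
        elif op == 13:  # Client Disconnect
            # Any cached SSRCs for this user should be discarded at this point too.
            self.clients.pop(d["user_id"], None)


roster = VoiceRoster()
roster.handle({"op": 11, "d": {"user_ids": ["852892297661906993"]}})
roster.handle({"op": 18, "d": {"user_id": "852892297661906993", "flags": 3}})
roster.handle({"op": 20, "d": {"user_id": "852892297661906993", "platform": 0}})
snapshot = dict(roster.clients["852892297661906993"])
roster.handle({"op": 13, "d": {"user_id": "852892297661906993"}})
```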

Simulcasting

The voice server supports simulcasting, allowing clients to send multiple video layers and allowing receivers to request the layer that best fits the current view. A full-size focused video can request quality 100, while a thumbnail, background stream, or muted/off-screen user can request lower quality or 0.

Simulcast state is described by stream objects. The rid identifies the RTP stream ID, quality describes the layer's intended quality, ssrc identifies primary RTP, rtx_ssrc identifies retransmissions, and active tells receivers whether the sender currently intends to transmit that layer.

Camera video commonly offers two layers: one full-size stream at quality 100, and another reduced-quality stream at 50. Stream connections commonly offer one screen layer at quality 100. The client proposes RIDs in Identify, but the Ready payload assigns the real SSRCs.

Media Sink Wants is the control message for desired receive and send quality. A receiving client sends Opcode 15 Media Sink Wants to tell the SFU which remote SSRCs it wants and at what quality. The voice server may also send Opcode 15 Media Sink Wants to tell a sender which of its local SSRCs should currently be active or reduced.

The keys in the payload are primary media SSRCs, not RTX SSRCs. A special key of any applies to otherwise unspecified streams. Values are 0 through 100, where 0 disables a stream and 100 requests the highest available layer. The optional pixelCounts object gives the SFU approximate rendered pixel counts for each SSRC, which helps it choose between layers when a view is resized.
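The lookup rule for a single SSRC can be sketched in Python as follows. `wanted_quality` is a hypothetical helper; returning `None` when neither the SSRC key nor `any` is present is a choice of this sketch, since the fallback in that case is not specified here.

```python
from typing import Optional


def wanted_quality(wants: dict, ssrc: int) -> Optional[int]:
    """Resolve the requested quality (0-100) for a primary SSRC from a wants object."""
    key = str(ssrc)  # JSON object keys are strings
    if key in wants:
        return wants[key]
    # Fall back to the "any" key for otherwise unspecified streams.
    return wants.get("any")


wants = {"8964": 100, "any": 50, "pixelCounts": {"8964": 1189844.5769597634}}
focused = wanted_quality(wants, 8964)
other = wanted_quality(wants, 1234)
```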

A sender should treat server-sent wants as dynamic encoder input, not as the sole source of Video stream state. If a layer is wanted at 0, pause that layer and announce it inactive when appropriate. If a layer is wanted at a lower quality, reduce bitrate, resolution, framerate, or choose a lower RID rather than continuing to send the full layer. Receivers should keep sending updated wants as views appear, disappear, resize, pin, or move between foreground and background.

Media Sink Wants Structure

Field	Type	Description
{ssrc}?	integer	Desired quality for the stream with the matching SSRC key (0-100)
any?	integer	Desired quality for all otherwise unspecified streams (0-100)
pixelCounts?	object[integer, number]	Desired approximate pixel count for each stream, keyed by SSRC

Example Media Sink Wants

{
  "op": 15,
  "d": {
    "8964": 100,
    "any": 50,
    "pixelCounts": {
      "8964": 1189844.5769597634
    }
  }
}
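A receiver could build this payload from its current layout along the following lines. This Python sketch assumes a hypothetical `views` mapping from primary SSRC to a `(quality, pixel_count)` pair; always including the `any` key is a choice made here, since that field is optional.

```python
from typing import Dict, Optional, Tuple


def build_media_sink_wants(
    views: Dict[int, Tuple[int, Optional[float]]], any_quality: int = 0
) -> dict:
    """Build an Opcode 15 Media Sink Wants payload from the visible views."""
    d: dict = {"any": any_quality}
    pixel_counts: dict = {}
    for ssrc, (quality, pixels) in views.items():
        if not 0 <= quality <= 100:
            raise ValueError(f"quality for SSRC {ssrc} must be within 0-100")
        d[str(ssrc)] = quality
        if pixels is not None:
            pixel_counts[str(ssrc)] = pixels
    if pixel_counts:
        # pixelCounts is optional, so it is only added when there is data for it.
        d["pixelCounts"] = pixel_counts
    return {"op": 15, "d": d}


# One focused 1280x720 view; everything else requested at quality 50.
payload = build_media_sink_wants({8964: (100, 1280 * 720)}, any_quality=50)
```

Calling the builder again whenever a view appears, disappears, or is resized keeps the server's picture of the layout current.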

Voice Backend Version

For analytics, the client may want to receive information about the voice backend's current version. To do so, send an Opcode 16 Voice Backend Version with an empty payload.

Voice Backend Version Structure

Field	Type	Description
voice	string	The voice backend's version
rtc_worker	string	The WebRTC worker's version

Example Voice Backend Version (Send)

{
  "op": 16,
  "d": {}
}

In response, the voice server will send an Opcode 16 Voice Backend Version payload with the versions:

Example Voice Backend Version (Receive)

{
  "op": 16,
  "d": {
    "voice": "0.9.1",
    "rtc_worker": "0.3.35"
  }
}

No Route

If a client cannot establish any usable RTC route after selecting a protocol, it may send an Opcode 32 No Route with an empty payload. This informs the voice server that connection setup failed at the RTC transport layer.

Example No Route

{
  "op": 32,
  "d": {}
}

Streams

Stream connections operate in a similar fashion to regular voice connections. In fact, on the protocol side, they are identical and use all of the payloads and processes described above. The main differences are within the Gateway protocol, as streams are started and joined differently from regular voice connections.

Connecting to Streams

To start or join a stream, the client must first be connected to the voice instance that the stream is hosted on. Then, send a Create Stream or Watch Stream payload to the Gateway.

If your request succeeds, as with voice, you must wait for the Gateway to respond with two events: a Stream Create event and a Stream Server Update event. You can then use the information provided in these events to establish a connection to the stream server as outlined in Connecting to Voice. Note that the server_id and channel_id used when identifying will be provided in the Stream Create event.

Note that if joining a stream fails, the Gateway will instead respond with a Stream Delete event which will contain the reason for the failure.

Stream Media Connections

A stream uses a separate RTC connection from the parent voice channel. The parent voice connection must remain connected so the user remains in the voice instance, but the stream has its own WebSocket, transport, state, and media packets.

If the stream includes application or system audio, send that audio on the stream RTC connection as Opus. Clients should mark this with the SOUNDSHARE speaking flag rather than normal voice speaking, so viewers can treat it as contextual stream audio.

Stream viewers connect to the stream RTC connection and request the stream SSRCs they want with Media Sink Wants. Do not use the parent voice connection's audio or video SSRCs for stream media; the stream connection has its own SSRC namespace. Additionally, do not attempt to send media to streams you are viewing. Only stream owners should transmit data to the RTC connection.

For stream E2EE, clients must use a stream-specific MLS group ID rather than the voice channel ID. Current stream RTC uses the media-session ID, which is one less than the stream rtc_server_id, as the DAVE/MLS group ID.
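The group ID arithmetic can be sketched in Python as follows. `stream_mls_group_id` is a hypothetical helper, and handling the ID as a decimal string (as snowflakes appear in JSON) is an assumption; how the resulting value is encoded for DAVE/MLS is not covered here.

```python
def stream_mls_group_id(rtc_server_id: str) -> str:
    """Derive the media-session ID used as the DAVE/MLS group ID for a stream."""
    # Python integers are arbitrary precision, so 64-bit snowflakes do not lose
    # precision the way they would as floating-point numbers.
    return str(int(rtc_server_id) - 1)


group_id = stream_mls_group_id("1154749264822128640")
```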

If a stream becomes unavailable, reset RTP receive, RTCP feedback, DAVE, and transport state for that stream RTC connection. The parent voice connection and its media state are independent and unaffected.