Understanding the RTP Protocol
The Real-time Transport Protocol (RTP) is an Internet-standard transport protocol for real-time audio, video, and other time-sensitive data transmission, defined in RFC 3550. This comprehensive guide covers all essential aspects of RTP including packet structure, RTCP, profiles, signaling integration, and security.
Introduction
RTP provides end-to-end delivery services for media streams, including sequence numbering, timestamping, payload type identification, and monitoring. These features allow receivers to detect packet loss, restore proper order, and synchronize playback of incoming media.
RTP is typically used on top of UDP for its low overhead and latency (though it can operate over other transports). Unlike TCP, RTP/UDP does not guarantee delivery or ordering, nor does it provide congestion control or quality-of-service (QoS) guarantees itself. Instead, it tolerates some packet loss and reordering as a trade-off for timely delivery, making it well-suited for interactive communications where a lost packet is preferable to a delayed one. RTP is broadly used in VoIP telephony, video conferencing, live media streaming, and similar applications requiring real-time data transfer.
RTP is designed with companion protocols and profiles to fully support real-time sessions. Each RTP session is accompanied by the RTP Control Protocol (RTCP), which provides out-of-band control, feedback, and minimal session management. RTP itself is a flexible framework intended to be tailored by profiles for specific application domains. For example, the standard profile for audio/video conferencing (RTP/AVP, RFC 3551) defines codec-specific payload type assignments. RTP's open design allows extension or modification of header fields via profiles or header extensions, enabling broad adaptability. At the same time, RTP was built to scale from simple one-to-one calls to large multicast conferences: it supports multiple participants, multi-stream sessions, and can operate over IP multicast for group communication.
| Feature | Description |
|---|---|
| Transport | UDP (typically), can use other transports |
| Reliability | No guarantee - tolerates loss for timeliness |
| Companion Protocol | RTCP for control, feedback, and monitoring |
| Profiles | Extensible via profiles (AVP, AVPF, SAVP, etc.) |
| Scalability | Supports unicast and multicast, one-to-one to large conferences |
| Session Setup | External (SIP/SDP, H.323, WebRTC signaling) |
In summary, RTP's role is to carry real-time media streams with minimal delay, while leaving call setup, control, and other aspects to separate protocols. For instance, Session Initiation Protocol (SIP) is often used to negotiate and establish RTP sessions (exchanging IP addresses, ports, and codec info via SDP), after which RTP transports the actual media independently. This separation of signaling (SIP/SDP) and media transport (RTP/RTCP) is fundamental in VoIP and multimedia systems: SIP handles connection management, and RTP handles the voice/video transmission over the connection. RTCP, running alongside RTP on a parallel path, feeds back reception quality and participant information, allowing endpoints to monitor and adapt to network conditions.
RTP Packet Structure
Every RTP packet consists of a header followed by a payload (the media data). The fixed RTP header is 12 bytes long and contains fields that enable proper delivery and playback of real-time media. The header may be followed by optional CSRC identifiers and an extension header (if used). The payload carries the media data (audio frames, video packets, etc.), and padding may be added at the end if needed.
RTP Header Fields
| Field | Bits | Description |
|---|---|---|
| Version (V) | 2 | RTP version number. Always 2 for RFC 3550. |
| Padding (P) | 1 | If set, packet contains padding bytes at the end. Last byte indicates padding count. Padding may be used for certain encryption modes or to align packets to fixed block sizes. |
| Extension (X) | 1 | If set, header extension follows CSRC list. This allows application or profile-specific extensions to RTP (defined in RFC 3550 §5.3.1 and extended by RFC 5285/8285 for general use). |
| CSRC Count (CC) | 4 | Number of CSRC identifiers (0-15). These 32-bit CSRC IDs, if present, follow the SSRC field and list the sources that have contributed to the payload (used when a mixer combines streams). |
| Marker (M) | 1 | Profile-specific meaning. Video: last packet of frame. Audio: start of talkspurt. The interpretation is defined by the applicable RTP profile or payload format. |
| Payload Type (PT) | 7 | Media format identifier. Statically assigned (0-34 in RFC 3551) or dynamic (96-127). Combined with out-of-band signaling (SDP), ensures sender and receiver agree on the media encoding. |
| Sequence Number | 16 | Increments by 1 per packet. Starts random. Wraps at 65535. Allows receivers to detect packet loss and reorder packets that arrive out of order. |
| Timestamp | 32 | Sampling instant of first byte. Clock rate depends on payload format (8000 Hz for 8 kHz audio, 90 kHz for video). Essential for jitter calculation and synchronization. |
| SSRC | 32 | Synchronization source identifier. Randomly chosen, unique per session. Receivers use SSRC to demultiplex incoming RTP streams and track each source separately. |
| CSRC List | 0-480 | Contributing source IDs (used by mixers). Up to 15 × 32-bit IDs. When a mixer creates a combined stream, it includes original source SSRCs here. |
Detailed Field Descriptions
Sequence Number: The 16-bit packet sequence number increments by one for each RTP packet sent. It is initialized to a random value for each new RTP session. After reaching 65,535 it wraps around to 0. Using the sequence numbers, a receiver can reconstruct the original packet order even though RTP/UDP does not guarantee delivery order.
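Because the counter wraps, sequence numbers cannot be compared with a plain `<`. The following is a minimal sketch of wrap-aware comparison; the function names `seq_delta` and `count_missing` are illustrative and do not come from any RTP library.

```python
SEQ_MOD = 1 << 16  # sequence numbers are 16 bits wide


def seq_delta(a: int, b: int) -> int:
    """Signed distance from b to a, taking 16-bit wraparound into account."""
    diff = (a - b) % SEQ_MOD
    # Differences above half the number space are treated as negative.
    return diff - SEQ_MOD if diff >= SEQ_MOD // 2 else diff


def count_missing(prev_seq: int, new_seq: int) -> int:
    """Packets skipped between two consecutively received sequence numbers."""
    return max(seq_delta(new_seq, prev_seq) - 1, 0)


print(seq_delta(2, 65534))      # 4: the counter wrapped from 65535 to 0
print(count_missing(65534, 2))  # 3: packets 65535, 0 and 1 never arrived
print(seq_delta(65530, 5))      # -11: a late packet from before the wrap
```

A negative delta marks a packet that is older than the reference, which is how a jitter buffer can tell reordering apart from loss.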
Timestamp: The 32-bit timestamp reflects the sampling instant of the first byte in the payload. The timestamp clock rate depends on the payload format (for example, 8000 Hz for 8 kHz audio, 90 kHz for video in many profiles). Like the sequence number, the timestamp starts from a random base value. Each RTP packet's timestamp advances by the amount of time covered by the payload data of the preceding packet, expressed in clock ticks. When multiple media streams (e.g., audio and video) are in use, their RTP timestamps can be correlated via RTCP so that playback is synchronized.
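As a worked example of the timestamp arithmetic, the sketch below computes the per-packet increment for two cases. The 20 ms audio packet duration and the 30 fps frame rate are assumptions chosen for illustration; the text does not prescribe them.

```python
def timestamp_increment(clock_rate_hz: int, duration_ms: float) -> int:
    """RTP timestamp ticks covered by a payload of the given duration."""
    return round(clock_rate_hz * duration_ms / 1000)


def next_timestamp(current: int, increment: int) -> int:
    """Advance a timestamp, wrapping at 32 bits like the header field."""
    return (current + increment) % (1 << 32)


print(timestamp_increment(8000, 20))          # 160 ticks per 20 ms audio packet
print(timestamp_increment(90000, 1000 / 30))  # 3000 ticks per frame at 30 fps
print(hex(next_timestamp(0xFFFFFF60, 160)))   # 0x0 after wrapping
```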
SSRC: The 32-bit Synchronization Source identifier uniquely labels the source of a stream within an RTP session. Each RTP packet carries a single SSRC indicating its origin. The SSRC is randomly chosen by each sender to be globally unique within the session, so that if multiple participants are sending, their streams can be told apart. If an SSRC collision occurs (two senders accidentally pick the same value), the protocol has rules to resolve it (senders adopt new SSRCs and send RTCP "BYE" for the old ID).
Example RTP Header:
V=2, P=0, X=0, CC=0, M=1, PT=96, Seq=12345, Timestamp=0x30551980, SSRC=0x1A2B3C4D
This indicates: RTP version 2, no padding/extension, no CSRCs, marker set (end of frame), payload type 96 (dynamic - e.g., H.264), sequence 12345, with SSRC 0x1A2B3C4D identifying the stream's source. The payload data would follow, encoded according to the specified format.
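The example header can be reproduced byte for byte. This is a minimal sketch using Python's standard `struct` module; `pack_rtp_header` and `parse_rtp_header` are hypothetical helper names, and the code handles only the 12-byte fixed header (no CSRC list or extension).

```python
import struct


def pack_rtp_header(marker, payload_type, seq, timestamp, ssrc,
                    version=2, padding=0, extension=0, csrc_count=0):
    """Pack the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
    byte1 = (marker << 7) | payload_type
    # Network byte order: two bytes, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)


def parse_rtp_header(data: bytes) -> dict:
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "padding": (byte0 >> 5) & 1,
        "extension": (byte0 >> 4) & 1,
        "csrc_count": byte0 & 0x0F,
        "marker": byte1 >> 7,
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


header = pack_rtp_header(marker=1, payload_type=96, seq=12345,
                         timestamp=0x30551980, ssrc=0x1A2B3C4D)
print(header.hex())  # 80e03039305519801a2b3c4d
print(parse_rtp_header(header)["payload_type"])  # 96
```

The first byte `0x80` encodes V=2 with P, X and CC all zero; the second byte `0xe0` is the marker bit (0x80) plus payload type 96 (0x60).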
Header Extensions
In addition to the fixed header fields, RTP allows an optional header extension if the X bit is set. The extension, if present, starts with a 16-bit profile-defined identifier and a 16-bit length field (counting the 32-bit words that follow), followed by the extension data for fields that are not covered by the standard header. Originally, RFC 3550 defined this as a mechanism for a single extension per packet (intended for limited experimental use). This was later generalized by RFC 5285 (since obsoleted by RFC 8285), which introduced one-byte and two-byte header extension formats allowing multiple identified extension elements to be included.
Header extensions are negotiated via signaling (e.g., using the a=extmap attribute in SDP) and can carry extra metadata like audio levels, video frame metadata, etc., on a per-packet basis.
| Extension Type | Description | Negotiation |
|---|---|---|
| One-byte header | Up to 14 extension elements, each 1-16 bytes | SDP a=extmap |
| Two-byte header | Larger extensions, more flexibility | SDP a=extmap |
| Common uses | Audio levels, video orientation, timing info | Signaling agreement |
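To make the one-byte format concrete, here is a hedged sketch of a parser for an extension block in the RFC 8285 one-byte layout. The function name and the extension IDs in the example (1 and 3) are hypothetical; in a real session the meaning of each ID comes from the negotiated a=extmap lines.

```python
import struct

ONE_BYTE_PROFILE = 0xBEDE  # "defined by profile" value for one-byte headers


def parse_one_byte_extensions(ext: bytes) -> dict:
    """Parse an RTP header extension block that uses the one-byte format.

    `ext` starts at the 16-bit profile field that follows the CSRC list.
    Returns {extension_id: data_bytes}.
    """
    profile, length_words = struct.unpack("!HH", ext[:4])
    if profile != ONE_BYTE_PROFILE:
        raise ValueError("not a one-byte header extension block")
    body = ext[4:4 + 4 * length_words]
    elements = {}
    i = 0
    while i < len(body):
        if body[i] == 0:             # padding byte between or after elements
            i += 1
            continue
        ext_id = body[i] >> 4
        if ext_id == 15:             # ID 15 is reserved: stop parsing
            break
        size = (body[i] & 0x0F) + 1  # the length field stores size minus one
        elements[ext_id] = body[i + 1:i + 1 + size]
        i += 1 + size
    return elements


# Two elements (ID 1 with 1 byte, ID 3 with 2 bytes) plus 3 padding bytes.
block = struct.pack("!HH", 0xBEDE, 2) + bytes([0x10, 0x9E, 0x31, 0xAA, 0xBB, 0, 0, 0])
print(parse_one_byte_extensions(block))  # {1: b'\x9e', 3: b'\xaa\xbb'}
```

The 4-bit ID field is why the table lists at most 14 usable elements: ID 0 is padding and ID 15 is reserved.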
Padding
The RTP packet may end with padding bytes (if P flag is 1) which are not part of the media payload. Padding is often used when encryption algorithms require fixed block sizes or to align RTP packets to certain length boundaries. The last byte of padding contains a count of how many padding bytes are appended.
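A receiver removes padding by reading the count in the final byte. The sketch below assumes a packet with only the 12-byte fixed header (no CSRC list or header extension), and `strip_padding` is an illustrative name.

```python
def strip_padding(packet: bytes) -> bytes:
    """Return the packet without trailing padding if the P bit is set."""
    padding_flag = (packet[0] >> 5) & 0x01
    if not padding_flag:
        return packet
    pad_len = packet[-1]  # the last byte counts the padding, itself included
    # Assumes a bare 12-byte header; padding cannot eat into the header.
    if pad_len == 0 or pad_len > len(packet) - 12:
        raise ValueError("invalid padding length")
    return packet[:-pad_len]


header = bytes([0xA0, 0x00]) + bytes(10)          # V=2, P=1, rest zeroed
padded = header + b"abcd" + bytes([0, 0, 0, 4])   # 4 padding bytes
print(strip_padding(padded)[12:])  # b'abcd'
```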
RTP Control Protocol (RTCP)
RTP's data transport is augmented by the Real-time Transport Control Protocol (RTCP), defined in the same specification (RFC 3550). RTCP packets are sent periodically by each participant in an RTP session to convey control and quality information. Unlike RTP packets (which carry media), RTCP packets do not transport media payload; instead, they carry statistics and metadata about the RTP streams, and perform a few management functions for the session.
RTCP serves several important roles:
Quality of Service Feedback: RTCP provides feedback on network conditions to all session members. Each receiver reports how well it is receiving the RTP stream(s), including metrics like fraction of packets lost, cumulative packets lost, interarrival jitter, and the timestamp of the last Sender Report received. This feedback is contained in Receiver Report (RR) packets or in Sender Report (SR) packets (which include reception stats from the sender's perspective as a receiver of any other streams). These reports allow senders and network managers to gauge the quality (packet loss rates, jitter, round-trip time) and adapt if necessary (e.g., by reducing bitrates or switching codecs on poor links).
Inter-media Synchronization: RTCP Sender Reports include an NTP timestamp and the corresponding RTP timestamp of the sender's stream at a moment in time. This mapping allows receivers to synchronize multiple streams (for example, audio and video) from the same sender. By correlating RTP timestamps of different streams to a common wall-clock time (NTP), the receiver can play audio and video in lockstep. This is crucial in conferences where audio and video come from the same source but are separate RTP streams.
Participant Identification and Session Control: RTCP carries source descriptors (SDES), which include canonical identifiers and optional information for participants. For example, each participant sends a CNAME (Canonical Name) item – a persistent identifier like "user@host" that uniquely identifies the participant across restarts and SSRC changes. CNAME ties together multiple RTP streams from the same person (so that their audio and video streams can be associated). Other SDES items can include a user name, email, phone number, tool name, etc., which can be used to identify participants in loosely controlled sessions. RTCP's BYE packet is used to indicate a participant is leaving the session, helping with member management.
Minimal Session Control: While RTP/RTCP is not a full session signaling protocol, RTCP acts as a keep-alive and light coordination mechanism. It monitors the number of participants and can adjust its reporting interval to scale to large groups. It also provides a way to inform all participants of changes (e.g., BYE goodbye signals an endpoint departure). Any higher-level control (like inviting users, negotiating media formats) is outside RTCP's scope and handled by protocols like SIP or H.323.
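The inter-media synchronization role described above comes down to one mapping: each Sender Report pairs an RTP timestamp with a wall-clock time, and any later RTP timestamp can be converted relative to that pair. The sketch below uses made-up SR values and represents the NTP time as plain seconds for readability; `rtp_to_wallclock` is a hypothetical helper.

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float,
                     clock_rate_hz: int) -> float:
    """Map an RTP timestamp to the sender's wall clock using the last SR."""
    # Signed 32-bit difference so that timestamp wraparound is tolerated.
    diff = (rtp_ts - sr_rtp_ts) % (1 << 32)
    if diff >= 1 << 31:
        diff -= 1 << 32
    return sr_ntp_seconds + diff / clock_rate_hz


# Hypothetical SR values for an audio (48 kHz) and a video (90 kHz) stream.
audio_time = rtp_to_wallclock(1_048_000, 1_000_000, 100.0, 48_000)
video_time = rtp_to_wallclock(5_094_500, 5_000_000, 100.2, 90_000)
print(audio_time)                         # 101.0
print(round(video_time, 3))               # 101.25
print(round(video_time - audio_time, 3))  # 0.25: this frame plays 250 ms later
```

Once both streams are expressed on the same clock, the receiver can delay whichever stream is ahead so that audio and video sampled at the same instant are presented together.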
RTCP Packet Types
RTCP communication is done with compound packets that may contain several RTCP messages. The five basic RTCP packet types defined in RFC 3550 are:
| Type | Code | Name | Sender | Contents |
|---|---|---|---|---|
| SR | 200 | Sender Report | Active senders | NTP/RTP timestamps, packet/byte counts, reception reports |
| RR | 201 | Receiver Report | Non-senders | Fraction lost, cumulative loss, jitter, LSR, DLSR |
| SDES | 202 | Source Description | All participants | CNAME (required), NAME, EMAIL, PHONE, LOC, TOOL, NOTE |
| BYE | 203 | Goodbye | Leaving participant | SSRC of departing stream, optional reason |
| APP | 204 | Application-specific | Application-defined | Custom data for experimental features |
SR (Sender Report): Packet type 200, sent by active senders at the RTCP interval. It includes an NTP timestamp and the sender's RTP timestamp to enable synchronization, as well as sender's packet count and byte count. Following the sender info, an SR carries a set of reception report blocks (one per RTP stream the sender is receiving) with loss and jitter stats. The SR thus combines two functions: it provides sender info for the sender's own stream, and receiver info about others' streams.
RR (Receiver Report): Type 201, sent by participants that are not active senders. An RR contains one report block per RTP stream received, detailing the fraction of packets lost, cumulative loss, highest sequence number received, interarrival jitter, and timing info for round-trip delay calculation. Receivers send these to give feedback to each sender about the quality of reception. If there are no RTP packets received, the non-sending party still sends an RR with no report blocks as a keep-alive.
SDES (Source Description): Type 202, used to convey descriptive information about sources. The most crucial SDES item is CNAME, which every participant must send and which is included in every compound RTCP packet. CNAME is a unique identifier for the participant that remains constant even if they change SSRC (for example, due to collision or application restart). Other optional SDES items (NAME, EMAIL, PHONE, LOC, etc.) provide human-readable identification and contact info.
BYE (Goodbye): Type 203, indicates an endpoint is leaving the session. It contains the SSRC of the departing stream and may include a short reason for leaving. When a BYE is received, other participants remove that SSRC from their session participant list. If a participant just disappears (no BYE), others will eventually time-out that SSRC after several RTCP intervals of no reception.
APP (Application-specific): Type 204, a packet type earmarked for experimental or application-defined use. It allows organizations to define their own RTCP packet formats for features not covered by the standard types. Newer extensions like feedback messages use different type codes in extended profiles.
RTCP Compound Packet Structure
Per RFC 3550, each RTCP compound packet must:
- Start with SR or RR
- Include SDES with at least CNAME
- Optionally include BYE, APP, or other packets
This means that when an endpoint sends RTCP, it typically transmits: an SR (or RR) + SDES (with CNAME, and optionally NAME/EMAIL/etc.) + any other needed control packets, all concatenated into one UDP packet. This design amortizes overhead and ensures each RTCP report is identifiable.
    +--------+--------+--------+--------+
    | SR or RR (required first)         |
    +--------+--------+--------+--------+
    | SDES with CNAME (required)        |
    +--------+--------+--------+--------+
    | BYE (optional)                    |
    +--------+--------+--------+--------+
    | APP (optional)                    |
    +--------+--------+--------+--------+
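The compound structure can be illustrated by building the smallest valid compound packet, an empty RR followed by an SDES with a CNAME, and then walking it using each packet's length field. This is a sketch only: the helper names and the CNAME `alice@example.com` are assumptions, and report blocks are omitted.

```python
import struct


def rtcp_header(count: int, packet_type: int, body: bytes) -> bytes:
    """Prefix a body with the 4-byte RTCP header (V=2, P=0)."""
    assert len(body) % 4 == 0
    # The length field is the packet size in 32-bit words minus one,
    # which equals the number of words in the body.
    return struct.pack("!BBH", (2 << 6) | count, packet_type, len(body) // 4) + body


def empty_rr(ssrc: int) -> bytes:
    """Receiver Report (type 201) with no report blocks."""
    return rtcp_header(0, 201, struct.pack("!I", ssrc))


def sdes_cname(ssrc: int, cname: str) -> bytes:
    """SDES (type 202) with one chunk holding a single CNAME item (type 1)."""
    text = cname.encode("utf-8")
    chunk = struct.pack("!IBB", ssrc, 1, len(text)) + text + b"\x00"
    chunk += b"\x00" * (-len(chunk) % 4)  # pad the chunk to a 32-bit boundary
    return rtcp_header(1, 202, chunk)


def walk_compound(data: bytes) -> list:
    """Return the packet type of each RTCP packet in a compound packet."""
    types, offset = [], 0
    while offset + 4 <= len(data):
        _, packet_type, length_words = struct.unpack_from("!BBH", data, offset)
        types.append(packet_type)
        offset += 4 * (length_words + 1)
    return types


compound = empty_rr(0x1A2B3C4D) + sdes_cname(0x1A2B3C4D, "alice@example.com")
print(len(compound), walk_compound(compound))  # 36 [201, 202]
```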
RTCP Bandwidth Control
The frequency of RTCP packets is controlled to avoid too much overhead: by default, RTCP traffic is limited to ~5% of the session bandwidth. In a typical setting, RTP data consumes 95% and RTCP 5% of the configured bandwidth. Furthermore, when senders are a small minority of participants, the 5% RTCP share is split so that all senders together use about 1.25% (i.e. 25% of 5%) and receivers use 3.75% (the remaining 75% of 5%). This weighting reserves a fixed share for senders so that their reports stay frequent even when many receivers and few senders are present.
| Parameter | Value | Description |
|---|---|---|
| Total RTCP bandwidth | ~5% of session bandwidth | Prevents control overhead from dominating |
| Senders share | 25% of RTCP (1.25% total) | For SR packets |
| Receivers share | 75% of RTCP (3.75% total) | For RR packets |
| Minimum interval | 5 seconds (AVP profile) | Between RTCP reports |
| Scaling | Randomized, participant-based | Adapts to session size |
The RTCP interval for each participant is randomized and scaled by the number of participants, so that as a session grows, each node sends RTCP less frequently. This algorithm allows RTCP to scale to large multicast groups (hundreds or thousands of members) without overwhelming the network.
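The scaling rule can be sketched as follows. This is a simplified reading of the interval calculation in RFC 3550, not a complete implementation: it leaves out the shorter initial interval and timer reconsideration, and the 64 kbit/s session bandwidth and 100-byte average RTCP packet size in the example are assumed values.

```python
import random


def deterministic_interval(members, senders, we_sent, session_bw_bps,
                           avg_rtcp_size_bytes, t_min=5.0):
    """Deterministic RTCP interval in seconds, before randomization."""
    rtcp_bw = 0.05 * session_bw_bps / 8  # RTCP share in bytes per second
    n = members
    if senders <= 0.25 * members:
        # Few senders: they share 25% of the RTCP bandwidth, receivers 75%.
        if we_sent:
            rtcp_bw *= 0.25
            n = senders
        else:
            rtcp_bw *= 0.75
            n = members - senders
    return max(avg_rtcp_size_bytes * n / rtcp_bw, t_min)


def randomized_interval(td, rng=random):
    """Spread reports over [0.5, 1.5] * td, scaled by the e - 1.5 factor."""
    return td * rng.uniform(0.5, 1.5) / (2.71828 - 1.5)


print(deterministic_interval(4, 2, True, 64_000, 100))               # 5.0
print(round(deterministic_interval(1000, 2, False, 64_000, 100), 1))  # 332.7
print(deterministic_interval(1000, 2, True, 64_000, 100))            # 5.0
```

In the 1000-member example a receiver reports roughly every five and a half minutes, while the two senders still report at the 5-second minimum because they share their own 25% slice.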
Receiver Report Fields
| Field | Size | Description |
|---|---|---|
| SSRC of source | 32 bits | Which stream this report is about |
| Fraction lost | 8 bits | Packets lost / packets expected since last RR, as an 8-bit fixed-point fraction (value/256) |
| Cumulative lost | 24 bits | Total packets lost since session start |
| Extended highest seq | 32 bits | Highest sequence number received (with rollover) |
| Interarrival jitter | 32 bits | Statistical variance of packet inter-arrival time |
| Last SR (LSR) | 32 bits | Middle 32 bits of NTP timestamp from last SR |
| Delay since last SR (DLSR) | 32 bits | Time between receiving SR and sending this RR |
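Three of these fields are simple computations. The sketch below follows the formulas given in RFC 3550 (the jitter estimator with gain 1/16, the fixed-point loss fraction, and the round-trip time derived from LSR and DLSR); the function names and sample values are illustrative.

```python
def update_jitter(jitter, prev_transit, arrival_ts, rtp_ts):
    """One step of the interarrival jitter estimator.

    arrival_ts is the arrival time expressed in RTP timestamp units.
    Returns (new_jitter, transit) so the caller can keep the state.
    The caller seeds prev_transit from the first packet received.
    """
    transit = arrival_ts - rtp_ts
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16, transit


def fraction_lost(expected_interval, received_interval):
    """8-bit fixed-point loss fraction for the last reporting interval."""
    lost = expected_interval - received_interval
    if expected_interval == 0 or lost <= 0:
        return 0
    return (lost << 8) // expected_interval


def round_trip_seconds(arrival_ntp32, lsr, dlsr):
    """RTT seen by the sender; all inputs are in units of 1/65536 second."""
    return ((arrival_ntp32 - lsr - dlsr) % (1 << 32)) / 65536


print(update_jitter(0.0, 100, 1260, 1144))                 # (1.0, 116)
print(fraction_lost(100, 75))                              # 64, i.e. 64/256 = 25%
print(round_trip_seconds(0x00058000, 0x00050000, 0x4000))  # 0.25
```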
Overall, RTCP is what makes RTP a monitored transport: it provides the necessary feedback loop for adaptive streaming and gives all participants insight into the session. For example, an application can display network statistics (latency, loss) gathered via RTCP, or automatically switch video quality if reports indicate high packet loss. RTCP's design is modular – additional report types (like for video-specific feedback or detailed metrics) have been added in other RFCs (e.g., RFC 4585 defines RTP/AVPF for immediate feedback messages, RFC 3611 defines extended reports).
RTP Profiles and Payload Types
Profiles in RTP define how the protocol is used for specific classes of applications, including default payload type assignments and any header extensions or modifications. RTP itself (RFC 3550) is a general framework; a profile specification fills in the details for an application domain. The most widely used profile is the RTP Audio/Video Profile (AVP) specified in RFC 3551.
| Profile | RFC | Description |
|---|---|---|
| RTP/AVP | 3551 | Audio/Video Profile - standard A/V conferencing |
| RTP/AVPF | 4585 | AVP with Feedback - immediate RTCP feedback (PLI, NACK) |
| RTP/SAVP | 3711 | Secure AVP - SRTP encryption and authentication |
| RTP/SAVPF | 5124 | Secure AVPF - SRTP with feedback |
RTP/AVP defines a set of static payload type (PT) numbers for common audio and video encodings, and guidelines for using RTP in audio/video conferences. A profile also sets the default clock rates for payload types and might specify how often RTCP packets should be sent. For instance, RFC 3551 (AVP) recommends a minimum RTCP interval of 5 seconds.
The RTP/AVPF profile (RFC 4585) introduces new RTCP feedback message types (like Picture Loss Indication, PLI, or generic NACK) to support more interactive feedback for video streaming. Other feedback messages, such as Receiver Estimated Maximum Bitrate (REMB), build on the AVPF framework but are specified outside RFC 4585. Secure RTP (RTP/SAVP) from RFC 3711 is essentially the AVP profile with security enhancements (encryption and authentication of RTP/RTCP).
Static Payload Types (AVP Profile)
| PT | Encoding | Clock Rate | Type |
|---|---|---|---|
| 0 | PCMU (G.711 μ-law) | 8000 Hz | Audio |
| 3 | GSM | 8000 Hz | Audio |
| 4 | G723 | 8000 Hz | Audio |
| 8 | PCMA (G.711 A-law) | 8000 Hz | Audio |
| 9 | G722 | 8000 Hz | Audio |
| 18 | G729 | 8000 Hz | Audio |
| 31 | H261 | 90000 Hz | Video |
| 32 | MPV (MPEG-1/2 Video) | 90000 Hz | Video |
| 34 | H263 | 90000 Hz | Video |
| 96-127 | Dynamic | Negotiated | Any |
Dynamic Payload Type Negotiation
RTP allows payload type numbers to be negotiated dynamically since the pool of static types is limited. Dynamic PTs (typically 96 and above) have no predefined encoding and must be defined in the session setup (for example via an SDP offer/answer in SIP). SDP attributes like a=rtpmap: and a=fmtp: describe the codec name, clock rate, and any format parameters for each dynamic PT.
For instance, an SDP might map PT 96 to "H264/90000" indicating H.264 video with a 90 kHz timestamp clock, or PT 101 to "telephone-event/8000" for DTMF tones (RFC 4733 telephony events). RFC 4733 is a payload format, normally mapped to a dynamic PT, that carries telephone keypad tones and other telephony signals over RTP, illustrating that RTP payloads are not limited to "media" in the traditional sense.
    m=audio 4000 RTP/AVP 96 97
    a=rtpmap:96 opus/48000/2
    a=rtpmap:97 telephone-event/8000
    a=fmtp:97 0-16
    m=video 4002 RTP/AVP 98
    a=rtpmap:98 H264/90000
    a=fmtp:98 profile-level-id=42e01f
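A receiver turns these attributes into a lookup table keyed by payload type. The following is a minimal sketch with a hypothetical `parse_rtpmap` helper; it ignores a=fmtp parameters and, as a simplification, does not track which m= section each mapping belongs to.

```python
def parse_rtpmap(sdp: str) -> dict:
    """Build a payload type table from the a=rtpmap lines of an SDP body."""
    table = {}
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=rtpmap:"):
            continue
        pt_text, encoding = line[len("a=rtpmap:"):].split(" ", 1)
        parts = encoding.split("/")
        table[int(pt_text)] = {
            "codec": parts[0],
            "clock_rate": int(parts[1]),
            # The channel count is optional and defaults to 1.
            "channels": int(parts[2]) if len(parts) > 2 else 1,
        }
    return table


SDP = """\
m=audio 4000 RTP/AVP 96 97
a=rtpmap:96 opus/48000/2
a=rtpmap:97 telephone-event/8000
a=fmtp:97 0-16
m=video 4002 RTP/AVP 98
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42e01f
"""

table = parse_rtpmap(SDP)
print(table[96])  # {'codec': 'opus', 'clock_rate': 48000, 'channels': 2}
print(table[98]["codec"], table[98]["clock_rate"])  # H264 90000
```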
There are dozens of RTP payload format specifications (RFCs) for virtually every audio codec (Opus, AAC, etc.), video codec (H.264, VP8, AV1, etc.), text (RFC 4103 for real-time text), even MIDI (RFC 6295) and other data.
Marker Bit Usage
One key aspect defined in profiles is M bit usage. The Marker bit's meaning is profile-dependent:
| Media Type | M=1 Meaning | Purpose |
|---|---|---|
| Video | Last packet of frame | Frame boundary detection |
| Audio (silence suppression) | First packet after silence | Talkspurt indication |
| RFC 4733 DTMF | First packet of a DTMF event | Event start (the end is flagged by the E bit in the payload) |
Reserved Payload Types
The AVP profile specifies that PT 72-76 are reserved (not to be used for payloads) because those values correspond to RTCP packet type octets (200 through 204) if interpreted as an 8-bit number with the marker bit set, which aids in RTP/RTCP demultiplexing. This reservation was made so that if RTP and RTCP arrive on the same port, a packet whose second byte looks like "payload type" 72 or 73 with the marker bit set is recognized as an RTCP SR or RR.
In summary, RTP profiles (like AVP, AVPF, SAVP, etc.) build on the base RTP/RTCP specification to tailor it to specific use cases, while payload format specs define how to carry particular media formats in RTP. The existence of profiles means that RTP can accommodate new media types and features without redesigning the core protocol – one simply adds a new profile or payload format that all endpoints in a session agree to use. Despite this flexibility, all RTP variants retain the same fundamental packet structure and semantics at a high level, so that generic RTP monitoring and mixing tools can operate across profiles.
Session Setup and Signaling
It's important to understand that RTP itself does not provide session establishment, codec negotiation, or media session coordination – it only handles the transport of media once the session is agreed upon. In typical deployments (e.g., Voice over IP), these functions are handled by a separate signaling protocol such as SIP (Session Initiation Protocol) or H.323. SIP is used to set up, modify, and tear down calls or conferences, and it uses the Session Description Protocol (SDP) to negotiate media parameters between endpoints.
SDP conveys the details of the media streams: what codec(s) to use, what RTP payload type numbers correspond to those codecs, the IP addresses and UDP ports where each side will send/receive RTP and RTCP, whether media is sent or recv-only, etc.
A typical call flow goes like this: SIP INVITE messages are exchanged between endpoints (possibly through proxy servers) to agree on a session. The INVITE and 200 OK messages carry an SDP offer and answer. For example, Alice's phone might offer to send/receive audio with RTP at IP 10.1.1.5 port 4000 (RTP) and 4001 (RTCP) using codecs PCMU (payload type 0) and Opus (dynamic payload 98), and video at port 4002/4003 with H.264 (payload 102). Bob's phone replies with an answer selecting one audio codec (say PCMU) and one video codec, and provides his IP/port where he will receive RTP/RTCP.
Once this exchange is done, both sides know each other's RTP transport addresses (IP + port for RTP, and typically port+1 for RTCP unless multiplexing is signaled), the codec to use, and the payload type numbers that map to that codec. They then begin sending RTP packets to each other directly (and RTCP periodically), carrying the agreed media. From this point, SIP is out of the picture until something like a call hold or termination happens; the media flows independently of the SIP signaling.
Transport Ports
| Convention | RTP Port | RTCP Port | Notes |
|---|---|---|---|
| Traditional | Even number (e.g., 4000) | RTP + 1 (e.g., 4001) | Separate ports |
| RTCP Mux | Same as RTP | Same as RTP | SDP: a=rtcp-mux |
| Demultiplexing | PT 0-127 | PT >= 200 (RTCP types) | By second byte (payload type / packet type field) |
Traditionally, RTP uses an even-numbered UDP port and the next odd-numbered port for RTCP. SDP indicates this in the "m=" lines (e.g., m=audio 4000 RTP/AVP 0 98 means audio via RTP/AVP on port 4000, and implicitly RTCP on 4001 unless otherwise specified).
Modern usage can negotiate RTP/RTCP multiplexing on the same port (called "RTCP mux"), which is signaled by an SDP attribute a=rtcp-mux. If both sides agree, RTP and RTCP packets are sent to the same UDP port. RFC 5761 updates RTP to describe this multiplexing and recommends using it to ease NAT traversal and simplify port management.
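When both protocols share a port, the receiver has to classify each datagram itself. The sketch below uses a widely used heuristic: treat second-byte values 192-223 as RTCP, which matches RFC 5761's guidance that RTP payload types 64-95 be avoided on multiplexed sessions. The function name is hypothetical and the check is an approximation, not a full validity test.

```python
def classify_packet(data: bytes) -> str:
    """Guess whether a datagram on a multiplexed port is RTP or RTCP."""
    if len(data) < 2 or data[0] >> 6 != 2:
        return "unknown"  # too short, or not RTP version 2
    second_byte = data[1]
    # RTCP packet types 200-204 (SR, RR, SDES, BYE, APP) sit in this range.
    # Read as RTP, the same byte would be M=1 with payload type 64-95.
    if 192 <= second_byte <= 223:
        return "rtcp"
    return "rtp"


print(classify_packet(bytes([0x80, 0xE0])))  # rtp  (M=1, PT=96)
print(classify_packet(bytes([0x80, 200])))   # rtcp (Sender Report)
print(200 & 0x7F, 201 & 0x7F)                # 72 73
```

The last line shows why certain payload type values are off limits: masking the RTCP SR and RR type codes down to 7 bits yields 72 and 73.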
Codec and Payload Negotiation
The mapping of dynamic payload type numbers to actual codec formats is achieved through signaling (SDP rtpmap and fmtp attributes). Both ends maintain a table of PT → codec (with clock rate, channels, etc.). During the session, the RTP packets simply refer to the codec via the PT in each packet.
If a sender needs to switch codecs mid-session (to adapt to bandwidth or due to a participant's capabilities), it can do so by sending RTP with a different payload type (one that was agreed on in SDP). For example, an RTP stream might switch from PT 0 (PCMU audio) to PT 8 (PCMA) on the fly, and the receiver will decode accordingly since those were agreed. However, introducing a codec not agreed on is not allowed without renegotiation.
SDP Media Description
    v=0
    o=alice 2890844526 2890844526 IN IP4 10.1.1.5
    s=VoIP Call
    c=IN IP4 10.1.1.5
    t=0 0
    m=audio 4000 RTP/AVP 0 8 96
    a=rtpmap:0 PCMU/8000
    a=rtpmap:8 PCMA/8000
    a=rtpmap:96 opus/48000/2
    a=rtcp-mux
    a=sendrecv
Session Management
SIP (or other signaling such as Jingle over XMPP, application-defined signaling in WebRTC, or RTSP for streaming) can manage when media starts/stops, put sessions on hold (by indicating port 0 or sendonly/recvonly in SDP), add new media streams, remove streams, etc. RTP itself will just start or stop receiving packets accordingly. RTCP can signal some things like BYE when an endpoint is done, but higher-level coordination (like a SIP BYE to end the call) usually precedes or accompanies the end of RTP flow.
In multicast scenarios or conference bridges, a session may not involve SIP at all – sometimes static configuration or other means (like an announcement that "send RTP to multicast 224.x.y.z:5000") is used. But the prevalent usage in VoIP and unified communications is RTP with SIP/SDP. This combination is so common that people loosely talk about "SIP calls" carrying voice, but technically SIP does the call setup, RTP does the heavy lifting of media transport.
A practical takeaway is: if troubleshooting a call, one can examine SIP messages to see if a call was established and codec agreed, and then examine RTP packets to see if media flowed. The separation also means quality issues (choppy audio) are often due to RTP path problems (packet loss, jitter) rather than SIP issues.
Mixers and Translators
RTP is designed to support more complex network topologies than simple point-to-point streams. It introduces the concepts of mixers and translators – intermediate nodes that participate in an RTP session to facilitate scenarios like multi-party mixing, media format conversion, or firewall traversal. These devices operate at the RTP layer (not just IP routing), so they understand RTP headers and can modify or generate RTP packets on behalf of other participants.
| Feature | Translator | Mixer |
|---|---|---|
| SSRC handling | Preserves original SSRCs | Creates new SSRC, lists originals in CSRC |
| Stream output | Separate streams forwarded | Single combined stream |
| Use cases | Relay, transcoding, firewall traversal | Audio conferencing, video compositing |
| Timing | Maintains original timing | Becomes timing master |
| Bandwidth | Sum of all streams | Single stream (reduced) |
| RTCP | Forwards or adjusts | Generates own SR, sends RR upstream |
Translator
An RTP translator is an intermediate system that forwards RTP packets, possibly with some changes, while preserving the original sender's SSRC identity. Translators are typically used to bridge disjoint network domains or to convert encodings. Examples include:
- A translator that takes a multicast RTP stream and replicates it as unicast RTP streams to individual clients (multicast-to-unicast relay)
- An encryption translator that decrypts incoming secure RTP and re-encrypts it with a different key for a different domain
- A translator that transcodes audio (e.g., from a high-bandwidth codec to a low-bandwidth codec) without mixing multiple sources together
The key is that a translator does not mix streams: it passes through each source stream separately, maintaining the original SSRC. If it changes the media encoding, it must adjust RTP sequence numbers and timestamps accordingly so as not to break the receiver's processing. Simple translators might not modify RTP at all and just forward packets (e.g., an RTP packet reflector or an IPv4-to-IPv6 RTP gateway).
Notably, a translator doesn't generate its own RTP stream – it's relaying someone else's stream. Therefore, translators do not have their own SSRC unless they need to send their own RTCP or act as a partial participant.
Mixer
An RTP mixer receives RTP streams from one or more sources, combines or processes them, and outputs a new RTP stream with its own SSRC that represents the mixture. Mixers are used when combining multiple media streams is desired, such as in audio conference bridges (summing audio from several participants into one stream) or video mixers (compositing several video feeds into one layout).
A mixer is an active participant in the RTP session; it assigns a new SSRC for the mixed stream it generates (and it will send RTCP reports for that SSRC as a sender). To preserve information about the contributors, a mixer uses the CSRC list in the RTP header of each mixed packet to list the SSRCs of the sources that went into the mix.
For instance, if a mixer M is combining audio from sources S1 and S2, the RTP packets sent by M will have SSRC=M, and in each packet's header CSRC list it might include S1 and S2 (the contributing sources for that packet's audio). This way, downstream receivers can display indicators of which participants are currently speaking.
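The M, S1, S2 example can be written out as bytes. This sketch builds only the header of a mixed packet and reads the CSRC list back; the SSRC values are made up, and the audio mixing itself is out of scope here.

```python
import struct


def mixer_header(mixer_ssrc, csrcs, seq, timestamp, payload_type=0):
    """Fixed RTP header plus CSRC list for a packet emitted by a mixer."""
    if len(csrcs) > 15:
        raise ValueError("the 4-bit CC field allows at most 15 CSRCs")
    byte0 = (2 << 6) | len(csrcs)  # V=2, P=0, X=0, CC=len(csrcs)
    header = struct.pack("!BBHII", byte0, payload_type, seq, timestamp,
                         mixer_ssrc)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def read_csrcs(packet: bytes) -> list:
    """Extract the contributing source IDs from a received packet."""
    cc = packet[0] & 0x0F
    return list(struct.unpack_from(f"!{cc}I", packet, 12))


M, S1, S2 = 0xAAAAAAAA, 0x11111111, 0x22222222  # hypothetical SSRC values
packet = mixer_header(M, [S1, S2], seq=1, timestamp=160)
print(len(packet))                           # 20: 12-byte header + 2 CSRCs
print([hex(c) for c in read_csrcs(packet)])  # ['0x11111111', '0x22222222']
```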
A mixer also needs to adjust timing – inputs may have different RTP timestamps and clock drift. The mixer will generate new timestamps for its outgoing packets (it is the timing master for its stream). By generating its own synchronized stream, a mixer breaks end-to-end timing: receivers cannot directly synchronize an original source's stream with another if a mixer is in between.
The advantages of mixers include bandwidth efficiency and simplicity for receivers: e.g., in a 10-party audio conference, instead of each endpoint sending 9 separate streams to others, a central audio mixer can combine and send one stream to each with everyone's voices mixed. The disadvantages are reduced flexibility (clients can't individually control volume or selection of participants in the mix).
Topology Considerations
Both mixers and translators connect multiple RTP "clouds" or groups. All participants linked by mixers/translators form a single RTP session logically, sharing a common SSRC space. For example, in a mixed conference with one mixer, all end systems plus the mixer share the same session and all SSRCs must be unique across them.
To illustrate: imagine two local networks connected by an RTP translator (for example, because a firewall separates them). All senders on Net A and Net B choose unique SSRCs. The translator forwards their RTP packets with the SSRC intact, changing only the IP address/port and, if it transcodes, the payload encoding. The session is still one RTP session: everyone's SSRC is visible to everyone.
Alternatively, consider an MCU (multipoint control unit) acting as a central mixer for video: each participant sends their video to the MCU. The MCU (mixer) composes a grid of videos and sends out a single video stream from SSRC=MCU to all receivers. That stream's CSRC list might include the SSRCs of everyone currently visible in the grid.
In summary, mixers and translators allow RTP to be used flexibly in multi-party or complex networks. They are explicitly supported by the protocol (RFC 3550 devotes Section 7 to their behavior). A translator keeps streams separate and unchanged in identity, whereas a mixer merges streams and becomes the new source.
Security Considerations (SRTP)
The RTP/RTCP specification initially included an optional encryption method (specified in RFC 3550 Section 9.1) that could encrypt RTP and RTCP packets for confidentiality. However, the built-in method, which defaults to DES in CBC mode, is weak by modern standards and provides no authentication. Recognizing the need for robust security, the IETF developed Secure RTP (SRTP), defined in RFC 3711, as a profile of RTP that provides encryption, message authentication, and integrity for RTP and RTCP.
SRTP is nowadays the de facto way to secure RTP in applications like SIP calls (when using ZRTP or SDES key exchange) and is mandatory in WebRTC (via DTLS-SRTP).
| Component | Encryption | Purpose |
|---|---|---|
| RTP Header | Cleartext | SSRC, seq, timestamp visible for routing |
| Payload | AES Encrypted | Media content protected |
| Auth Tag | HMAC-SHA1 | Integrity verification (typically 10 bytes) |
SRTP basics: SRTP is essentially an additional layer that processes RTP packets before transmission and after reception. It encrypts the RTP payload with a strong cipher (AES in counter mode by default) and uses an authentication algorithm (HMAC-SHA1 by default) to ensure integrity, so that packets cannot be tampered with undetected.
The RTP header is left in the clear so that intermediaries can still parse SSRCs, sequence numbers, etc., and so that header compression or mixers can function if needed; header extension contents can optionally be encrypted with the RFC 6904 extension, depending on policy. Each SRTP packet carries a cryptographic authentication tag after the RTP payload, computed over both the header and the encrypted payload. Similarly, SRTCP secures RTCP packets.
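To make the authentication tag concrete, here is a minimal sketch of how an 80-bit HMAC-SHA1 tag could be computed and checked. It assumes the session authentication key has already been derived, no MKI is in use, and the payload is already encrypted; it performs no encryption or key derivation itself. The function names are hypothetical. Following RFC 3711, the sender's rollover counter (ROC) is appended to the authenticated data but is not transmitted.

```python
import hashlib
import hmac
import struct

TAG_LEN = 10  # 80-bit tag, the usual default length

def srtp_auth_tag(auth_key, packet, roc):
    """HMAC-SHA1 over (RTP header + encrypted payload || ROC), truncated."""
    mac = hmac.new(auth_key, packet + struct.pack("!I", roc), hashlib.sha1)
    return mac.digest()[:TAG_LEN]

def verify_srtp_tag(auth_key, protected, roc):
    """Split off the trailing tag and compare it in constant time."""
    packet, tag = protected[:-TAG_LEN], protected[-TAG_LEN:]
    return hmac.compare_digest(tag, srtp_auth_tag(auth_key, packet, roc))

# Illustrative values only: an all-zero key and a dummy packet.
key = bytes(20)
rtp = bytes([0x80, 0x00, 0x00, 0x01]) + bytes(8) + b"ciphertext"
protected = rtp + srtp_auth_tag(key, rtp, roc=0)
ok = verify_srtp_tag(key, protected, roc=0)
```

A production stack would rely on a vetted SRTP library rather than hand-rolled code like this.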
SRTP itself defines the cipher framework; it is signaled as an RTP profile (called RTP/SAVP – Secure Audio/Video Profile) in protocols like SDP. For instance, an SDP can indicate m=audio 4000 RTP/SAVP 0 to signal that RTP will be run with SRTP.
SRTP Key Exchange Methods
SRTP itself does not define how keys are exchanged; the actual encryption keys must be established by a separate key management mechanism:
| Method | RFC | Description | Usage |
|---|---|---|---|
| SDES | 4568 | Keys carried in SDP (requires protected signaling; prohibited in WebRTC) | Legacy SIP |
| DTLS-SRTP | 5764 | In-band DTLS handshake | WebRTC (mandatory) |
| ZRTP | 6189 | In-call DH exchange | Opportunistic encryption |
| MIKEY | 3830 | Multimedia Internet KEYing | Group scenarios |
DTLS-SRTP
DTLS-SRTP (RFC 5764) defines a standard way to perform the SRTP key exchange using a DTLS handshake run in-band on the RTP/RTCP ports. This is what WebRTC and many modern systems use; it ensures keys are agreed securely and, with ephemeral Diffie-Hellman cipher suites, provides perfect forward secrecy. With DTLS-SRTP, RTP/RTCP packets can share the same port with DTLS and STUN (for ICE): the receiver demultiplexes them by examining the first byte of each packet.
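The first-byte demultiplexing can be sketched as follows. The ranges used here are the ones given in RFC 7983, which updates the scheme in RFC 5764; the function name and return labels are hypothetical, and ZRTP and TURN channel ranges are lumped into "unknown" for brevity.

```python
def classify_first_byte(datagram):
    """Guess the protocol of a UDP datagram from its first byte."""
    if not datagram:
        return "empty"
    b = datagram[0]
    if b <= 3:
        return "stun"
    if 20 <= b <= 63:
        return "dtls"
    if 128 <= b <= 191:
        return "rtp_or_rtcp"        # version bits '10' land in this range
    return "unknown"

kind = classify_first_byte(bytes([0x80, 0x00]))  # an RTP packet with V=2
```

Distinguishing RTP from RTCP on a shared port is a separate step (see RFC 5761) based on the payload type / packet type byte.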
DTLS-SRTP provides:
- Perfect forward secrecy (DH key exchange)
- Certificate fingerprint verification (via SDP)
- Multiplexing on same port as RTP/RTCP/STUN
SDP attributes for DTLS-SRTP:
    a=fingerprint:sha-256 AB:CD:EF:...
    a=setup:actpass
    a=rtcp-mux
Security Benefits
By using SRTP, we gain:
- Confidentiality – media content is not exposed on the wire
- Integrity/Authentication – packets cannot be forged or altered without detection
- Replay protection – SRTP has a replay window and sequence counter to drop old or duplicated packets
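The replay protection mentioned in the list can be illustrated with a sliding-window bitmap. This is a simplified sketch, not the SRTP algorithm verbatim: it assumes the caller already has the full packet index (rollover counter combined with the sequence number), and the class name is hypothetical. In a real receiver the window is updated only after the packet's authentication tag has been verified.

```python
class ReplayWindow:
    """Sliding window over packet indices; RFC 3711 asks for at least 64 entries."""

    def __init__(self, size=64):
        self.size = size
        self.highest = -1   # highest index accepted so far
        self.bitmap = 0     # bit i set => index (highest - i) was already seen

    def check_and_update(self, index):
        """Return True if the index is new, False if the packet should be dropped."""
        if index > self.highest:
            shift = index - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False                    # older than the window
        if self.bitmap & (1 << offset):
            return False                    # duplicate / replayed
        self.bitmap |= 1 << offset          # late but new: accept once
        return True

window = ReplayWindow()
results = [window.check_and_update(i) for i in (0, 5, 3, 3, 5)]
```

In this usage the first three packets are accepted and the two repeats are rejected.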
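The overhead of SRTP is minimal: typically a 10-byte authentication tag per packet (a shorter 4-byte tag is also defined), plus an optional master key identifier (MKI). With the default AES counter mode cipher, no initialization vector (IV) is transmitted, because it is derived from the session salt, the SSRC, and the packet index, and the payload is not padded. SRTP was designed to be bandwidth efficient so that it can be used even on low-bandwidth calls.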
Additional Security Considerations
One must note that using SRTP doesn't change how RTP looks to the networking layer (it's still UDP packets, just with encrypted payloads). However, some features like mixers or translators need to be aware: a mixer can't decode and mix an SRTP stream unless it has the keys, so typically mixing is done either after decryption (if mixer is trusted with keys) or at endpoints.
Besides encryption, RTP/RTCP sessions can be a vector for denial of service (sending floods of RTP to a target, or misbehaving by sending excessive RTCP). Implementations should verify that RTP/RTCP packets they process come from expected addresses and possibly rate-limit certain reports. Also, participants should choose strong random SSRCs and CNAMEs to avoid fingerprinting.
Practical note: If you see a=crypto lines in SDP or a=fingerprint (for DTLS), those are indicators of SRTP being used. If using Wireshark to debug, SRTP packets won't decode as audio unless you supply the keys (since they're encrypted), whereas plain RTP will.
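As a rough illustration of the practical note above, the following sketch scans an SDP body for the usual SRTP indicators. The function name and the labels it returns are hypothetical, and the SDP text is a made-up fragment; a real parser would also track which attributes belong to which media section.

```python
def srtp_hints(sdp_text):
    """Collect hints that a session description negotiates SRTP."""
    hints = set()
    for raw in sdp_text.splitlines():
        line = raw.strip()
        if line.startswith("a=crypto:"):
            hints.add("sdes")                 # keys carried in SDP (RFC 4568)
        elif line.startswith("a=fingerprint:"):
            hints.add("dtls-srtp")            # certificate fingerprint for DTLS
        elif line.startswith("m="):
            fields = line.split()
            # The third field is the transport profile, e.g. RTP/SAVP.
            if len(fields) >= 3 and "SAVP" in fields[2]:
                hints.add("secure-profile")
    return hints

example = "v=0\r\nm=audio 4000 RTP/SAVP 0\r\na=rtcp-mux\r\n"
found = srtp_hints(example)
```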
Monitoring RTP with VoIPmonitor
Understanding RTP protocol theory is essential, but in production environments you need tools to monitor, analyze, and troubleshoot RTP streams in real time. VoIPmonitor is an open-source network packet sniffer with a commercial frontend, designed specifically for VoIP quality monitoring and troubleshooting. It passively captures and analyzes SIP signaling, RTP/RTCP streams, and WebRTC traffic to provide comprehensive visibility into call quality.
Real-Time RTP Stream Analysis
VoIPmonitor performs deep inspection of every RTP packet, extracting and correlating all the header fields discussed in this article:
| RTP Element | What VoIPmonitor Monitors | Practical Benefit |
|---|---|---|
| Sequence Numbers | Detects gaps, duplicates, and out-of-order packets | Pinpoint exact moment of packet loss |
| Timestamps | Calculates inter-arrival jitter per RFC 3550 | Measure network timing stability |
| SSRC/CSRC | Tracks individual streams and mixer sources | Correlate quality per participant |
| Payload Type | Identifies codec changes during call | Detect codec negotiation issues |
| Marker Bit | Detects frame boundaries and talkspurts | Analyze silence suppression behavior |
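The interarrival jitter estimate from RFC 3550 that the table refers to can be sketched as below. This illustrates the RFC's running-average formula, not VoIPmonitor's actual implementation; the class name is hypothetical, arrival times are assumed to be in seconds, and RTP timestamp wraparound is ignored for brevity.

```python
class JitterEstimator:
    """Running interarrival jitter: J = J + (|D| - J) / 16."""

    def __init__(self, clock_rate):
        self.clock_rate = clock_rate
        self.prev_transit = None
        self.jitter = 0.0               # kept in RTP timestamp units

    def update(self, arrival_seconds, rtp_timestamp):
        # Transit time in timestamp units; only differences matter,
        # so the sender and receiver clocks need not be synchronized.
        transit = arrival_seconds * self.clock_rate - rtp_timestamp
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter

    def jitter_ms(self):
        return self.jitter * 1000.0 / self.clock_rate

est = JitterEstimator(clock_rate=8000)
# Three evenly paced 20 ms packets, then one that arrives 10 ms late.
for arrival, ts in [(0.00, 0), (0.02, 160), (0.04, 320), (0.07, 480)]:
    est.update(arrival, ts)
```

The division by 16 is the noise-reduction gain chosen in the RFC; a single late packet therefore moves the estimate by only one sixteenth of its deviation.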
Quality Metrics and MOS Calculation
VoIPmonitor calculates voice quality metrics in real-time using the ITU-T G.107 E-model, providing:
- MOS (Mean Opinion Score) - Predicts perceived voice quality on a 1-5 scale based on measured network impairments
- R-Factor - Transmission rating factor (0-100) that feeds into MOS calculation
- Jitter - Packet delay variation measured from RTP timestamps, as described in the RTCP Receiver Report Fields section
- Packet Loss - Both total and burst loss patterns that severely impact quality
- PDV (Packet Delay Variation) - One-way delay measurements when possible
The MOS calculation incorporates all these factors plus codec-specific impairment factors (Ie) to predict how users would rate the call quality:
| MOS Range | Quality | User Perception |
|---|---|---|
| 4.3 - 5.0 | Excellent | Toll quality, no perceivable degradation |
| 4.0 - 4.3 | Good | Slight degradation, acceptable for business |
| 3.6 - 4.0 | Fair | Some users dissatisfied |
| 3.1 - 3.6 | Poor | Many users dissatisfied |
| 1.0 - 3.1 | Bad | Nearly all users dissatisfied |
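For reference, the E-model's standard mapping from R-factor to estimated MOS (ITU-T G.107) can be written as a small function. This is a sketch of the published conversion formula only, with a hypothetical function name; it does not reproduce how R itself is computed from delay, loss, and codec impairments.

```python
def r_to_mos(r):
    """Convert an E-model R-factor to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
    # The polynomial dips marginally below 1 for very small R, so clamp it.
    return max(1.0, mos)

default_quality = r_to_mos(93.2)   # roughly 4.4
```

Note that this conversion tops out at 4.5, so values in the upper part of the "Excellent" band are not reachable through the E-model mapping alone.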
RTCP Data Integration
VoIPmonitor leverages both computed metrics (from RTP packet analysis) and RTCP reports sent by endpoints. The Receiver Report data provides endpoint-verified quality measurements, giving you two independent views:
- Capture-point metrics - Computed from the RTP packets as seen at the monitoring point
- Receiver-side metrics - Reported via RTCP from the actual receiving endpoint
Comparing these helps identify whether quality issues are in the network path or at the endpoints.
Packet-Level Deep Inspection
For detailed troubleshooting, VoIPmonitor allows drill-down to individual RTP packets with Wireshark-like detail views. You can:
- Examine exact timing and sequence numbers
- Identify specific packets causing quality degradation
- View full packet captures (PCAP) for any call
- Analyze RTP stream graphs showing jitter and loss over time
- Detect one-way audio and silence/clipping issues
Advanced Detection Features
Beyond basic metrics, VoIPmonitor provides specialized detection:
- One-way audio detection - Automatically identifies asymmetric audio flow with visual comparison
- Silence detection - Finds calls with unusual silence patterns
- Audio clipping - Detects signal overload causing distortion
- SRTP decryption - Analyze encrypted RTP when keys are available
Deployment Options
VoIPmonitor operates as a passive network sniffer, meaning it monitors traffic without affecting call quality:
- SPAN/Mirror ports - Connect to switch mirror port
- Network TAPs - Hardware TAPs for high-volume environments
- RSPAN/ERSPAN - Remote monitoring across network segments
- SBC integration - Native packet duplication from AudioCodes, Ribbon, Oracle SBC, Cisco CUBE
This passive approach is ideal for production monitoring where you need visibility into RTP quality without impacting the actual media path.
For more information, visit voipmonitor.org or see the VoIPmonitor documentation.
Troubleshooting RTP
| Issue | Symptoms | RTCP Indicator | Solution |
|---|---|---|---|
| Packet loss | Choppy audio, video artifacts | High fraction lost in RR | Check network path, QoS |
| High jitter | Audio gaps, video stuttering | High jitter value in RR | Increase jitter buffer, check network |
| One-way audio | Only one party hears | No RTP received | Check NAT, firewall, SDP IPs |
| No media | Complete silence | No RR/SR packets | Verify signaling, check ports |
| Codec mismatch | Garbled audio | PT doesn't match expected | Verify SDP negotiation |
| Clock drift | A/V desync over time | Compare SR timestamps | Use RTCP for sync |
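When reading the "fraction lost" indicator from the table, it helps to remember that the RR field is an 8-bit fixed-point number (lost packets divided by expected packets, scaled by 256). The sketch below follows the calculation outlined in RFC 3550 Appendix A.3; the function names are hypothetical, and the cap at 255 is an added safeguard so the value fits in 8 bits.

```python
def compute_fraction_lost(expected_interval, received_interval):
    """Fraction lost for one reporting interval, as an 8-bit value."""
    lost = expected_interval - received_interval
    if expected_interval == 0 or lost <= 0:
        return 0                        # duplicates can make 'lost' negative
    return min(255, (lost << 8) // expected_interval)

def fraction_lost_percent(field):
    """Turn the 8-bit RR field back into a percentage."""
    return (field & 0xFF) * 100.0 / 256.0

field = compute_fraction_lost(expected_interval=100, received_interval=95)
percent = fraction_lost_percent(field)  # a little under 5 %
```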
Wireshark RTP Analysis
Key filters for RTP analysis:
    rtp                      # All RTP packets
    rtcp                     # All RTCP packets
    rtp.ssrc == 0x1234abcd   # Specific stream
    rtp.marker == 1          # Frame boundaries
    rtcp.pt == 200           # Sender Reports
    rtcp.pt == 201           # Receiver Reports
Telephony > RTP > RTP Streams - shows all streams with statistics
VoIPmonitor for Production Environments
While Wireshark is excellent for ad-hoc packet analysis, production VoIP environments benefit from dedicated monitoring solutions. VoIPmonitor provides:
- Automatic call correlation - Links RTP streams to SIP calls without manual filtering
- Historical analysis - Search and analyze past calls by quality metrics
- Alerting - Real-time notifications when MOS drops below threshold
- Scalability - Handle 100,000+ concurrent calls
- PCAP export - Generate Wireshark-compatible captures for any call
This makes it ideal for ongoing quality assurance and rapid troubleshooting in carrier and enterprise environments.
Quick Reference Tables
RTP Header Bit Layout
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           Timestamp                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                              SSRC                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           CSRC list                           |
    |                             ....                              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
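The bit layout above maps directly onto a small parser. The following is a minimal sketch, with `RtpHeader` and `parse_rtp_header` as hypothetical names; it reads only the fixed header and CSRC list and does not handle header extensions or padding.

```python
import struct
from typing import NamedTuple

class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: tuple

def parse_rtp_header(packet):
    """Decode the 12-byte fixed header plus any CSRC entries."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet, 0)
    cc = b0 & 0x0F
    if len(packet) < 12 + 4 * cc:
        raise ValueError("truncated CSRC list")
    csrcs = struct.unpack_from(f"!{cc}I", packet, 12)
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )

# A hand-built header: V=2, marker set, PT=8, seq=0x1234, ts=160.
raw = bytes([0x80, 0x88, 0x12, 0x34]) + (160).to_bytes(4, "big") + (0x1234ABCD).to_bytes(4, "big")
header = parse_rtp_header(raw)
```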
Clock Rates by Media Type
| Media Type | Typical Clock Rate | Examples |
|---|---|---|
| Narrowband audio | 8000 Hz | G.711, G.729, GSM |
| Wideband audio | 16000 Hz | G.722.1, AMR-WB |
| Full-band audio | 48000 Hz | Opus |
| Video | 90000 Hz | H.264, VP8, H.265 |
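The clock rate determines how far the RTP timestamp advances from one packet (or video frame) to the next. A short worked sketch, assuming fixed 20 ms audio packets and a constant video frame rate (both assumptions for illustration, with hypothetical function names):

```python
def audio_timestamp_step(clock_rate_hz, packet_ms):
    """Timestamp units covered by one audio packet of the given duration."""
    return clock_rate_hz * packet_ms // 1000

def video_timestamp_step(frames_per_second, clock_rate_hz=90000):
    """Timestamp units between consecutive frames at a constant frame rate."""
    return clock_rate_hz // frames_per_second

g711_step = audio_timestamp_step(8000, 20)     # 160 units per 20 ms packet
opus_step = audio_timestamp_step(48000, 20)    # 960 units per 20 ms packet
video_step = video_timestamp_step(30)          # 3000 units per frame at 30 fps
```

All packets that carry pieces of the same video frame share one timestamp, so the step applies per frame rather than per packet.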
Conclusion
We have covered the essential aspects of RTP as specified in RFC 3550 and related RFCs, including its packet format, companion RTCP protocol, usage with SIP/SDP, profiles, mixers/translators, and security enhancements. RTP (v2) has proven to be a flexible, scalable framework for real-time communication, standing the test of time since its first standardization (RFC 1889 in 1996, obsoleted by RFC 3550 in 2003).
Its design separates concerns: media content delivery is handled efficiently at the transport layer, while details of session setup and control are left to higher-level protocols. RTP's extensions and profiles (from AVP to AVPF to SAVP and beyond) demonstrate how it can evolve: new codecs, feedback mechanisms, and security features have been incorporated without altering the fundamental protocol. This makes RTP a cornerstone of VoIP, streaming, and telepresence systems – from traditional SIP phone calls to WebRTC peer-to-peer video chats, chances are RTP is carrying the media under the hood.
For those looking to deepen their understanding beyond this summary, the full RFC 3550 is recommended reading (it includes many details like jitter calculation, RTCP scheduling algorithms, etc.). Additionally, RFC 3551 (A/V Profile) provides insight into payload type mappings and considerations for audio/video usage.
In summary, RTP provides the real-time delivery capabilities – sequencing, timing, mixing, feedback – that enable interactive voice, video, and other media applications to work over the unpredictable Internet. With the help of RTCP, it can adapt to network conditions and provide monitoring. Through profiles and signaling, it can support any media format securely and efficiently. This makes RTP a powerful and indispensable tool in the Internet protocol suite for real-time communication.
References
Primary Standards
- RFC 3550 - RTP: A Transport Protocol for Real-Time Applications
- RFC 3551 - RTP Profile for Audio and Video Conferences (AVP)
- RFC 3711 - The Secure Real-time Transport Protocol (SRTP)
- RFC 4585 - Extended RTP Profile for RTCP-Based Feedback (AVPF)
Extensions and Updates
- RFC 5761 - Multiplexing RTP Data and Control Packets on a Single Port
- RFC 5764 - DTLS Extension to Establish Keys for SRTP
- RFC 6051 - Rapid Synchronisation of RTP Flows
- RFC 6222 - Guidelines for Choosing RTCP Canonical Names (CNAMEs) (obsoleted by RFC 7022)
- RFC 8285 - A General Mechanism for RTP Header Extensions
- RFC 3611 - RTP Control Protocol Extended Reports (RTCP XR)
- RFC 5104 - Codec Control Messages in AVPF
Payload Formats
- RFC 4733 - RTP Payload for DTMF Digits, Telephony Tones
- RFC 6184 - RTP Payload Format for H.264 Video
- RFC 7587 - RTP Payload Format for Opus Speech and Audio Codec
- RFC 4103 - RTP Payload for Text Conversation