Understanding the RTP Protocol

The Real-time Transport Protocol (RTP) is an Internet-standard transport protocol for real-time audio, video, and other time-sensitive data transmission, defined in RFC 3550. This comprehensive guide covers all essential aspects of RTP including packet structure, RTCP, profiles, signaling integration, and security.

Introduction

RTP provides end-to-end delivery services for media streams, including sequence numbering, timestamping, payload type identification, and monitoring. These features allow receivers to detect packet loss, restore proper order, and synchronize playback of incoming media.

RTP is typically used on top of UDP for its low overhead and latency (though it can operate over other transports). Unlike TCP, RTP/UDP does not guarantee delivery or ordering, nor does it provide congestion control or quality-of-service (QoS) guarantees itself. Instead, it tolerates some packet loss and reordering as a trade-off for timely delivery, making it well-suited for interactive communications where a lost packet is preferable to a delayed one. RTP is broadly used in VoIP telephony, video conferencing, live media streaming, and similar applications requiring real-time data transfer.

RTP is designed with companion protocols and profiles to fully support real-time sessions. Each RTP session is accompanied by the RTP Control Protocol (RTCP), which provides out-of-band control, feedback, and minimal session management. RTP itself is a flexible framework intended to be tailored by profiles for specific application domains. For example, the standard profile for audio/video conferencing (RTP/AVP, RFC 3551) defines codec-specific payload type assignments. RTP's open design allows extension or modification of header fields via profiles or header extensions, enabling broad adaptability. At the same time, RTP was built to scale from simple one-to-one calls to large multicast conferences: it supports multiple participants, multi-stream sessions, and can operate over IP multicast for group communication.

**RTP Key Characteristics**
Feature	Description
Transport	UDP (typically), can use other transports
Reliability	No guarantee - tolerates loss for timeliness
Companion Protocol	RTCP for control, feedback, and monitoring
Profiles	Extensible via profiles (AVP, AVPF, SAVP, etc.)
Scalability	Supports unicast and multicast, one-to-one to large conferences
Session Setup	External (SIP/SDP, H.323, WebRTC signaling)

In summary, RTP's role is to carry real-time media streams with minimal delay, while leaving call setup, control, and other aspects to separate protocols. For instance, Session Initiation Protocol (SIP) is often used to negotiate and establish RTP sessions (exchanging IP addresses, ports, and codec info via SDP), after which RTP transports the actual media independently. This separation of signaling (SIP/SDP) and media transport (RTP/RTCP) is fundamental in VoIP and multimedia systems: SIP handles connection management, and RTP handles the voice/video transmission over the connection. RTCP, running alongside RTP on a parallel path, feeds back reception quality and participant information, allowing endpoints to monitor and adapt to network conditions.

RTP Packet Structure

Every RTP packet consists of a header followed by a payload (the media data). The fixed RTP header is 12 bytes long and contains fields that enable proper delivery and playback of real-time media. The header may be followed by optional CSRC identifiers and an extension header (if used). The payload carries the media data (audio frames, video packets, etc.), and padding may be added at the end if needed.

RTP Header Fields

**RTP Header Fields Reference**
Field	Bits	Description
Version (V)	2	RTP version number. Always 2 for RFC 3550.
Padding (P)	1	If set, packet contains padding bytes at the end. Last byte indicates padding count. Padding may be used for certain encryption modes or to align packets to fixed block sizes.
Extension (X)	1	If set, header extension follows CSRC list. This allows application or profile-specific extensions to RTP (defined in RFC 3550 §5.3.1 and extended by RFC 5285/8285 for general use).
CSRC Count (CC)	4	Number of CSRC identifiers (0-15). These 32-bit CSRC IDs, if present, follow the SSRC field and list the sources that have contributed to the payload (used when a mixer combines streams).
Marker (M)	1	Profile-specific meaning. Video: last packet of frame. Audio: start of talkspurt. The interpretation is defined by the applicable RTP profile or payload format.
Payload Type (PT)	7	Media format identifier. Static assignments occupy the low range (0-34 in RFC 3551); 96-127 are dynamic. Combined with out-of-band signaling (SDP), ensures sender and receiver agree on the media encoding.
Sequence Number	16	Increments by 1 per packet. Starts random. Wraps at 65535. Allows receivers to detect packet loss and reorder packets that arrive out of order.
Timestamp	32	Sampling instant of first byte. Clock rate depends on payload format (8000 Hz for 8 kHz audio, 90 kHz for video). Essential for jitter calculation and synchronization.
SSRC	32	Synchronization source identifier. Randomly chosen, unique per session. Receivers use SSRC to demultiplex incoming RTP streams and track each source separately.
CSRC List	0-480	Contributing source IDs (used by mixers). Up to 15 × 32-bit IDs. When a mixer creates a combined stream, it includes original source SSRCs here.

Detailed Field Descriptions

Sequence Number: The 16-bit packet sequence number increments by one for each RTP packet sent. It is initialized to a random value for each new RTP session. After reaching 65,535 it wraps around to 0. Using the sequence numbers, a receiver can reconstruct the original packet order even though RTP/UDP does not guarantee delivery order.

Timestamp: The 32-bit timestamp reflects the sampling instant of the first byte in the payload. The timestamp clock rate is dependent on the payload format (for example, 8000 Hz for 8 kHz audio, 90 kHz for video in many profiles). Like the sequence, the timestamp starts from a random base value. Each RTP packet's timestamp is incremented by the amount of time covered by the payload data it carries. When multiple media streams (e.g., audio and video) are in use, their RTP timestamps can be correlated via RTCP so that playback is synchronized.

SSRC: The 32-bit Synchronization Source identifier uniquely labels the source of a stream within an RTP session. Each RTP packet carries a single SSRC indicating its origin. The SSRC is randomly chosen by each sender to be globally unique within the session, so that if multiple participants are sending, their streams can be told apart. If an SSRC collision occurs (two senders accidentally pick the same value), the protocol has rules to resolve it (senders adopt new SSRCs and send RTCP "BYE" for the old ID).
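The wraparound and timestamp rules above can be sketched in a few lines of Python. This is an illustrative sketch that assumes only the field widths given in the text; `seq_newer`, `SeqExtender`, and `next_timestamp` are hypothetical helper names, not part of any RTP library.

```python
def seq_newer(a: int, b: int) -> bool:
    """Return True if 16-bit sequence number a is ahead of b, modulo 65536."""
    return a != b and ((a - b) & 0xFFFF) < 0x8000


class SeqExtender:
    """Track wraparounds so 16-bit sequence numbers map to a growing counter."""

    def __init__(self) -> None:
        self.cycles = 0        # number of times the sequence has wrapped
        self.max_seq = None    # highest sequence number seen so far

    def update(self, seq: int) -> int:
        if self.max_seq is None:
            self.max_seq = seq
            return seq
        if seq_newer(seq, self.max_seq):
            if seq < self.max_seq:   # passed 65535 and restarted at 0
                self.cycles += 1
            self.max_seq = seq
            return self.cycles * 65536 + seq
        # Late or duplicate packet: it may belong to the cycle before a wrap.
        cycles = self.cycles - 1 if seq > self.max_seq else self.cycles
        return cycles * 65536 + seq


def next_timestamp(current: int, clock_rate_hz: int, duration_s: float) -> int:
    """Advance a 32-bit RTP timestamp by the media time one packet covers."""
    ticks = round(clock_rate_hz * duration_s)
    return (current + ticks) & 0xFFFFFFFF


extender = SeqExtender()
extended = [extender.update(s) for s in (65534, 65535, 0, 1)]
audio_step = next_timestamp(0, 8000, 0.020)   # 20 ms of 8 kHz audio
```

With these inputs, `extended` continues counting past the wrap (65534, 65535, 65536, 65537) and `audio_step` is 160 ticks, which is 20 ms at an 8000 Hz clock.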

Example RTP Header:

V=2, P=0, X=0, CC=0, M=1, PT=96, Seq=12345, Timestamp=0x30551980, SSRC=0x1A2B3C4D

This indicates: RTP version 2, no padding/extension, no CSRCs, marker set (end of frame), payload type 96 (dynamic - e.g., H.264), sequence 12345, RTP timestamp 0x30551980, and SSRC 0x1A2B3C4D identifying the stream's source. The payload data would follow, encoded according to the specified format.
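To make the bit layout concrete, here is a minimal Python sketch that decodes the fixed header and any CSRC list using the standard `struct` module. The `RtpHeader` type and `parse_rtp_header` function are hypothetical names chosen for this example, and the sample bytes are built from the values in the example header above.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header plus the optional CSRC list."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    if len(packet) < 12 + 4 * cc:
        raise ValueError("truncated CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:12 + 4 * cc])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )


# V=2, P=0, X=0, CC=0 -> 0x80; M=1 with PT=96 -> 0x80 | 96
example = struct.pack("!BBHII", 0x80, 0x80 | 96, 12345, 0x30551980, 0x1A2B3C4D)
header = parse_rtp_header(example)
```

A real receiver would also validate the version field and handle the header extension and padding before handing the payload to a decoder.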

Header Extensions

In addition to the fixed header fields, RTP allows an optional header extension if the X bit is set. The extension, if present, begins with a 16-bit profile-defined identifier and a 16-bit length field (counting the 32-bit words of extension data that follow), and then carries additional fields that are not covered by the standard header. Originally, RFC 3550 defined only this single-extension mechanism (intended for limited experimental use). It was later generalized by RFC 5285 (since replaced by RFC 8285), which introduced one-byte and two-byte header extension formats allowing multiple identified extension elements to be included.

Header extensions are negotiated via signaling (e.g., using the a=extmap attribute in SDP) and can carry extra metadata like audio levels, video frame metadata, etc., on a per-packet basis.

Extension Type	Description	Negotiation
One-byte header	Element IDs 1-14, each carrying 1-16 bytes of data	SDP `a=extmap`
Two-byte header	Element IDs 1-255, each carrying 0-255 bytes of data	SDP `a=extmap`
Common uses	Audio levels, video orientation, timing info	Signaling agreement
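As an illustration of the one-byte format, the following Python sketch walks an extension block and collects its elements by ID. It assumes the RFC 8285 layout (a 0xBEDE identifier, a length in 32-bit words, then elements whose first byte packs a 4-bit ID and a 4-bit length-minus-one). The function name `parse_one_byte_extensions` and the sample IDs are hypothetical; which ID means what is decided by the `a=extmap` negotiation.

```python
import struct
from typing import Dict


def parse_one_byte_extensions(ext: bytes) -> Dict[int, bytes]:
    """Parse a one-byte-header extension block that starts right after the CSRC list."""
    if len(ext) < 4:
        raise ValueError("extension block shorter than its 4-byte header")
    profile, length_words = struct.unpack("!HH", ext[:4])
    if profile != 0xBEDE:
        raise ValueError("not a one-byte header extension block")
    data = ext[4:4 + 4 * length_words]
    elements: Dict[int, bytes] = {}
    i = 0
    while i < len(data):
        first = data[i]
        if first == 0:              # padding byte between or after elements
            i += 1
            continue
        ext_id = first >> 4
        size = (first & 0x0F) + 1   # stored length is "bytes minus one"
        if ext_id == 15:            # reserved ID: stop processing
            break
        elements[ext_id] = data[i + 1:i + 1 + size]
        i += 1 + size
    return elements


# Two elements (ID 1 with 1 byte, ID 2 with 2 bytes) padded to two 32-bit words.
block = struct.pack("!HH", 0xBEDE, 2) + bytes([0x10, 0x8A, 0x21, 0xAA, 0xBB, 0, 0, 0])
found = parse_one_byte_extensions(block)
```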

Padding

The RTP packet may end with padding bytes (if P flag is 1) which are not part of the media payload. Padding is often used when encryption algorithms require fixed block sizes or to align RTP packets to certain length boundaries. The last byte of padding contains a count of how many padding bytes are appended.
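A short Python sketch of how a receiver might remove padding, based only on the rule just described (P flag set, last byte holds the count, and the count includes that last byte). `strip_padding` is a hypothetical helper; for simplicity it assumes no CSRC list or header extension when sanity-checking the count.

```python
def strip_padding(packet: bytes) -> bytes:
    """Return the packet without trailing padding (the P bit itself is left as-is)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    if not packet[0] & 0x20:        # P flag clear: nothing to remove
        return packet
    pad = packet[-1]                # count of padding bytes, including this one
    if pad == 0 or pad > len(packet) - 12:
        raise ValueError("invalid padding count")
    return packet[:-pad]


fixed_header = bytes([0xA0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])   # V=2 with P=1
padded = fixed_header + b"abcd" + b"\x00\x00\x00\x04"
unpadded = strip_padding(padded)
```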

RTP Control Protocol (RTCP)

RTP's data transport is augmented by the Real-time Transport Control Protocol (RTCP), defined in the same specification (RFC 3550). RTCP packets are sent periodically by each participant in an RTP session to convey control and quality information. Unlike RTP packets (which carry media), RTCP packets do not transport media payload; instead, they carry statistics and metadata about the RTP streams, and perform a few management functions for the session.

RTCP serves several important roles:

Quality of Service Feedback: RTCP provides feedback on network conditions to all session members. Each receiver reports how well it is receiving the RTP stream(s), including metrics like fraction of packets lost, cumulative packets lost, highest sequence number received, interarrival jitter, and the timestamp of the last Sender Report it received. This feedback is contained in Receiver Report (RR) packets or in Sender Report (SR) packets (which include reception stats from the sender's perspective as a receiver of any other streams). These reports allow senders and network managers to gauge the quality (packet loss rates, jitter, round-trip time) and adapt if necessary (e.g., by reducing bitrates or switching codecs on poor links).

Inter-media Synchronization: RTCP Sender Reports include an NTP timestamp and the corresponding RTP timestamp of the sender's stream at a moment in time. This mapping allows receivers to synchronize multiple streams (for example, audio and video) from the same sender. By correlating RTP timestamps of different streams to a common wall-clock time (NTP), the receiver can play audio and video in lockstep. This is crucial in conferences where audio and video come from the same source but are separate RTP streams.

Participant Identification and Session Control: RTCP carries source descriptors (SDES), which include canonical identifiers and optional information for participants. For example, each participant sends a CNAME (Canonical Name) item – a persistent identifier like "user@host" that uniquely identifies the participant across restarts and SSRC changes. CNAME ties together multiple RTP streams from the same person (so that their audio and video streams can be associated). Other SDES items can include a user name, email, phone number, tool name, etc., which can be used to identify participants in loosely controlled sessions. RTCP's BYE packet is used to indicate a participant is leaving the session, helping with member management.

Minimal Session Control: While RTP/RTCP is not a full session signaling protocol, RTCP acts as a keep-alive and light coordination mechanism. It monitors the number of participants and can adjust its reporting interval to scale to large groups. It also provides a way to inform all participants of changes (e.g., a BYE packet signals an endpoint's departure). Any higher-level control (like inviting users, negotiating media formats) is outside RTCP's scope and handled by protocols like SIP or H.323.
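The inter-media synchronization role described above comes down to simple arithmetic: each Sender Report pairs one RTP timestamp with one wall-clock instant, so any later RTP timestamp on that stream can be converted to wall-clock time. The Python sketch below shows the idea; `rtp_to_wallclock` is a hypothetical helper and the Sender Report values are made up for illustration.

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float,
                     clock_rate_hz: int) -> float:
    """Map an RTP timestamp to sender wall-clock seconds using the latest SR pair."""
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:          # treat as a negative offset (packet before the SR)
        delta -= 0x100000000
    return sr_ntp_seconds + delta / clock_rate_hz


# Hypothetical SR data: audio at 8 kHz, video at 90 kHz, same sender clock.
audio_time = rtp_to_wallclock(168000, sr_rtp_ts=160000, sr_ntp_seconds=1000.0,
                              clock_rate_hz=8000)
video_time = rtp_to_wallclock(945000, sr_rtp_ts=900000, sr_ntp_seconds=1000.5,
                              clock_rate_hz=90000)
```

Both packets map to the same wall-clock instant (1001.0 seconds in this example), so a player would present them together even though their RTP timestamps and clock rates are unrelated.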

RTCP Packet Types

RTCP communication is done with compound packets that may contain several RTCP messages. The five basic RTCP packet types defined in RFC 3550 are:

**RTCP Packet Types (RFC 3550)**
Type	Code	Name	Sender	Contents
SR	200	Sender Report	Active senders	NTP/RTP timestamps, packet/byte counts, reception reports
RR	201	Receiver Report	Non-senders	Fraction lost, cumulative loss, jitter, LSR, DLSR
SDES	202	Source Description	All participants	CNAME (required), NAME, EMAIL, PHONE, LOC, TOOL, NOTE
BYE	203	Goodbye	Leaving participant	SSRC of departing stream, optional reason
APP	204	Application-specific	Application-defined	Custom data for experimental features

SR (Sender Report): Packet type 200, sent by active senders at the RTCP interval. It includes an NTP timestamp and the sender's RTP timestamp to enable synchronization, as well as sender's packet count and byte count. Following the sender info, an SR carries a set of reception report blocks (one per RTP stream the sender is receiving) with loss and jitter stats. The SR thus combines two functions: it provides sender info for the sender's own stream, and receiver info about others' streams.

RR (Receiver Report): Type 201, sent by participants that are not active senders. An RR contains one report block per RTP stream received, detailing the fraction of packets lost, cumulative loss, highest sequence number received, interarrival jitter, and timing info for round-trip delay calculation. Receivers send these to give feedback to each sender about the quality of reception. If there are no RTP packets received, the non-sending party still sends an RR with no report blocks as a keep-alive.

SDES (Source Description): Type 202, used to convey descriptive information about sources. The most crucial SDES item is CNAME, which every participant must send, and is included in every compound RTCP packet. CNAME is a unique identifier for the participant that remains constant even if they change SSRC (for example, due to collision or application restart). Other optional SDES items (NAME, EMAIL, PHONE, LOC, etc.) provide human-readable identification and contact info.

BYE (Goodbye): Type 203, indicates an endpoint is leaving the session. It contains the SSRC of the departing stream and may include a short reason for leaving. When a BYE is received, other participants remove that SSRC from their session participant list. If a participant just disappears (no BYE), others will eventually time-out that SSRC after several RTCP intervals of no reception.

APP (Application-specific): Type 204, a packet type earmarked for experimental or application-defined use. It allows organizations to define their own RTCP packet formats for features not covered by the standard types. Newer extensions like feedback messages use different type codes in extended profiles.

RTCP Compound Packet Structure

Per RFC 3550, each RTCP compound packet must:

Start with SR or RR
Include SDES with at least CNAME
Optionally include BYE, APP, or other packets

This means that when an endpoint sends RTCP, it typically transmits: an SR (or RR) + SDES (with CNAME, and optionally NAME/EMAIL/etc.) + any other needed control packets, all concatenated into one UDP packet. This design amortizes overhead and ensures each RTCP report is identifiable.

+--------+--------+--------+--------+
|   SR or RR (required first)       |
+--------+--------+--------+--------+
|   SDES with CNAME (required)      |
+--------+--------+--------+--------+
|   BYE (optional)                  |
+--------+--------+--------+--------+
|   APP (optional)                  |
+--------+--------+--------+--------+
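A receiver can split a compound packet by walking the common RTCP header, whose 16-bit length field gives the packet size in 32-bit words minus one. The following Python sketch does that and checks the two ordering rules listed above. The function names and the sample SSRC and CNAME are assumptions for illustration only.

```python
import struct
from typing import List, Tuple

RTCP_NAMES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}


def split_compound(datagram: bytes) -> List[Tuple[str, bytes]]:
    """Split one UDP payload into its individual RTCP packets."""
    packets = []
    offset = 0
    while offset + 4 <= len(datagram):
        b0, pt, length_words = struct.unpack_from("!BBH", datagram, offset)
        if b0 >> 6 != 2:
            raise ValueError("unexpected RTCP version")
        size = 4 * (length_words + 1)    # length field counts 32-bit words minus one
        if offset + size > len(datagram):
            raise ValueError("truncated RTCP packet")
        packets.append((RTCP_NAMES.get(pt, str(pt)), datagram[offset:offset + size]))
        offset += size
    if offset != len(datagram):
        raise ValueError("trailing bytes after last RTCP packet")
    return packets


def follows_compound_rules(packets: List[Tuple[str, bytes]]) -> bool:
    """Check: first packet is SR or RR, and an SDES packet is present."""
    names = [name for name, _ in packets]
    return bool(names) and names[0] in ("SR", "RR") and "SDES" in names


# Empty RR (no report blocks) followed by an SDES chunk carrying CNAME "a@b".
rr = struct.pack("!BBHI", 0x80, 201, 1, 0x1A2B3C4D)
sdes = struct.pack("!BBHI", 0x81, 202, 3, 0x1A2B3C4D) + bytes([1, 3]) + b"a@b" + b"\x00" * 3
parts = split_compound(rr + sdes)
```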

RTCP Bandwidth Control

The frequency of RTCP packets is controlled to avoid too much overhead: by default, RTCP traffic is limited to ~5% of the session bandwidth. In a typical setting, RTP data consumes 95% and RTCP 5% of the configured bandwidth. Furthermore, the 5% RTCP share is split so that all senders together use about 1.25% (i.e. 25% of 5%) and receivers use 3.75% (the remaining 75% of 5%). This weighting prevents feedback implosion when many receivers and few senders are present.

Parameter	Value	Description
Total RTCP bandwidth	~5% of session bandwidth	Prevents control overhead from dominating
Senders share	25% of RTCP (1.25% total)	For SR packets
Receivers share	75% of RTCP (3.75% total)	For RR packets
Minimum interval	5 seconds (AVP profile)	Between RTCP reports
Scaling	Randomized, participant-based	Adapts to session size

The RTCP interval for each participant is randomized and scaled by the number of participants, so that as a session grows, each node sends RTCP less frequently. This algorithm allows RTCP to scale to large multicast groups (hundreds or thousands of members) without overwhelming the network.
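The scaling behavior can be seen in a simplified Python sketch of the interval calculation. It follows the 5% and 25%/75% split described above and the randomization used in RFC 3550, but it is not a complete implementation: it omits details such as the halved minimum for the first report and the running average of RTCP packet sizes. The function names and the example bandwidth figures are assumptions.

```python
import math
import random


def deterministic_interval(members: int, senders: int, session_bw_bytes: float,
                           avg_rtcp_size: float, we_sent: bool,
                           t_min: float = 5.0) -> float:
    """Seconds between RTCP reports before randomization (simplified)."""
    rtcp_bw = 0.05 * session_bw_bytes         # RTCP gets about 5% of the session
    n = members
    if senders <= 0.25 * members:             # few senders: split 25% / 75%
        if we_sent:
            rtcp_bw *= 0.25
            n = senders
        else:
            rtcp_bw *= 0.75
            n = members - senders
    return max(avg_rtcp_size * n / rtcp_bw, t_min)


def randomized_interval(t: float, rng=random) -> float:
    """Spread reports over [0.5t, 1.5t], then apply the e - 1.5 compensation factor."""
    return t * rng.uniform(0.5, 1.5) / (math.e - 1.5)


# 64 kbit/s session (8000 bytes/s), 100-byte average RTCP packets, one sender.
small = deterministic_interval(4, 1, 8000, 100, we_sent=False)
large = deterministic_interval(1000, 1, 8000, 100, we_sent=False)
```

In the four-member session the computed value falls below the 5-second floor, so `small` is 5.0; with a thousand members each receiver backs off to roughly 333 seconds between reports.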

Receiver Report Fields

**RTCP Receiver Report Block Fields**
Field	Size	Description
SSRC of source	32 bits	Which stream this report is about
Fraction lost	8 bits	Packets lost / packets expected since last RR, as a fixed-point fraction out of 256 (e.g., 64 = 25%)
Cumulative lost	24 bits	Total packets lost since session start
Extended highest seq	32 bits	Highest sequence number received (with rollover)
Interarrival jitter	32 bits	Estimate of the variation in packet inter-arrival time, in RTP timestamp units
Last SR (LSR)	32 bits	Middle 32 bits of NTP timestamp from last SR
Delay since last SR (DLSR)	32 bits	Time between receiving SR and sending this RR, in units of 1/65536 second
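The Python sketch below decodes one 24-byte report block with the field sizes from the table and shows how a sender can derive round-trip time from LSR and DLSR: it subtracts both from the arrival time of the report, all expressed in 1/65536-second units. Function names and sample values are hypothetical.

```python
import struct


def parse_report_block(block: bytes) -> dict:
    """Decode one RTCP reception report block (24 bytes)."""
    if len(block) != 24:
        raise ValueError("a report block is exactly 24 bytes")
    ssrc, fraction, lost3, ext_seq, jitter, lsr, dlsr = struct.unpack("!IB3sIIII", block)
    return {
        "ssrc": ssrc,
        "fraction_lost": fraction,                                    # out of 256
        "cumulative_lost": int.from_bytes(lost3, "big", signed=True),
        "extended_highest_seq": ext_seq,
        "jitter": jitter,                                             # timestamp units
        "lsr": lsr,
        "dlsr": dlsr,
    }


def ntp_middle32(ntp_seconds: float) -> int:
    """Middle 32 bits of an NTP timestamp: 16 bits of seconds, 16 bits of fraction."""
    return int(ntp_seconds * 65536) & 0xFFFFFFFF


def round_trip_seconds(arrival_mid32: int, lsr: int, dlsr: int) -> float:
    """RTT = arrival time of the RR - LSR - DLSR, in units of 1/65536 second."""
    return ((arrival_mid32 - lsr - dlsr) & 0xFFFFFFFF) / 65536


block = struct.pack("!IB3sIIII", 0x1A2B3C4D, 64, (5).to_bytes(3, "big"),
                    70000, 12, ntp_middle32(100.0), 32768)
report = parse_report_block(block)
loss_percent = report["fraction_lost"] * 100 / 256
rtt = round_trip_seconds(ntp_middle32(100.75), report["lsr"], report["dlsr"])
```

Here the SR was sent at t=100.0 s, the receiver held it for 0.5 s (DLSR 32768), and the report arrived at t=100.75 s, leaving 0.25 s of network round-trip time.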

Overall, RTCP is what makes RTP a monitored transport: it provides the necessary feedback loop for adaptive streaming and gives all participants insight into the session. For example, an application can display network statistics (latency, loss) gathered via RTCP, or automatically switch video quality if reports indicate high packet loss. RTCP's design is modular – additional report types (like for video-specific feedback or detailed metrics) have been added in other RFCs (e.g., RFC 4585 defines RTP/AVPF for immediate feedback messages, RFC 3611 defines extended reports).

RTP Profiles and Payload Types

Profiles in RTP define how the protocol is used for specific classes of applications, including default payload type assignments and any header extensions or modifications. RTP itself (RFC 3550) is a general framework; a profile specification fills in the details for an application domain. The most widely used profile is the RTP Audio/Video Profile (AVP) specified in RFC 3551.

**Common RTP Profiles**
Profile	RFC	Description
RTP/AVP	3551	Audio/Video Profile - standard A/V conferencing
RTP/AVPF	4585	AVP with Feedback - immediate RTCP feedback (PLI, NACK)
RTP/SAVP	3711	Secure AVP - SRTP encryption and authentication
RTP/SAVPF	5124	Secure AVPF - SRTP with feedback

RTP/AVP defines a set of static payload type (PT) numbers for common audio and video encodings, and guidelines for using RTP in audio/video conferences. A profile also sets the default clock rates for payload types and might specify how often RTCP packets should be sent. For instance, RFC 3551 (AVP) recommends a minimum RTCP interval of 5 seconds.

The RTP/AVPF profile (RFC 4585) introduces new RTCP feedback message types (like Picture Loss Indication, PLI, or generic NACK) to support more interactive feedback for video streaming. Other feedback messages, such as Receiver Estimated Maximum Bitrate (REMB), were later built on this framework rather than defined in RFC 4585 itself. Secure RTP (RTP/SAVP) from RFC 3711 is essentially the AVP profile with security enhancements (encryption and authentication of RTP/RTCP).

Static Payload Types (AVP Profile)

**Common Static Payload Types**
PT	Encoding	Clock Rate	Type
0	PCMU (G.711 μ-law)	8000 Hz	Audio
3	GSM	8000 Hz	Audio
4	G723	8000 Hz	Audio
8	PCMA (G.711 A-law)	8000 Hz	Audio
9	G722	8000 Hz	Audio
18	G729	8000 Hz	Audio
31	H261	90000 Hz	Video
32	MPV (MPEG-1/2 Video)	90000 Hz	Video
34	H263	90000 Hz	Video
96-127	Dynamic	Negotiated	Any

Dynamic Payload Type Negotiation

RTP allows payload type numbers to be negotiated dynamically since the pool of static types is limited. Dynamic PTs (typically 96 and above) have no predefined encoding and must be defined in the session setup (for example via an SDP offer/answer in SIP). SDP attributes like a=rtpmap: and a=fmtp: describe the codec name, clock rate, and any format parameters for each dynamic PT.

For instance, an SDP might map PT 96 to "H264/90000" indicating H.264 video with a 90 kHz timestamp clock, or PT 101 to "telephone-event/8000" for DTMF tones (RFC 4733 telephony events). RFC 4733 is a payload format that carries telephone keypad tones and other telephony signals over RTP using dynamically assigned PTs, illustrating that RTP payloads are not limited to "media" in the traditional sense.

m=audio 4000 RTP/AVP 96 97
a=rtpmap:96 opus/48000/2
a=rtpmap:97 telephone-event/8000
a=fmtp:97 0-16

m=video 4002 RTP/AVP 98
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42e01f
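A receiver typically turns these attributes into a lookup table keyed by payload type. The Python sketch below parses only `a=rtpmap:` lines and ignores everything else; `parse_rtpmap` is a hypothetical helper, and defaulting the channel count to 1 is an assumption that is only meaningful for audio.

```python
from typing import Dict, Tuple


def parse_rtpmap(sdp: str) -> Dict[int, Tuple[str, int, int]]:
    """Build {payload type: (encoding name, clock rate, channels)} from a=rtpmap lines."""
    table: Dict[int, Tuple[str, int, int]] = {}
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=rtpmap:"):
            continue
        pt_text, encoding = line[len("a=rtpmap:"):].split(" ", 1)
        parts = encoding.strip().split("/")
        name = parts[0]
        clock_rate = int(parts[1])
        channels = int(parts[2]) if len(parts) > 2 else 1   # assumed default
        table[int(pt_text)] = (name, clock_rate, channels)
    return table


SDP_EXAMPLE = """\
m=audio 4000 RTP/AVP 96 97
a=rtpmap:96 opus/48000/2
a=rtpmap:97 telephone-event/8000
a=fmtp:97 0-16
m=video 4002 RTP/AVP 98
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42e01f
"""
payload_table = parse_rtpmap(SDP_EXAMPLE)
```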

There are dozens of RTP payload format specifications (RFCs) for virtually every audio codec (Opus, AAC, etc.), video codec (H.264, VP8, AV1, etc.), text (RFC 4103 for real-time text), even MIDI (RFC 6295) and other data.

Marker Bit Usage

One key aspect defined in profiles is M bit usage. The Marker bit's meaning is profile-dependent:

Media Type	M=1 Meaning	Purpose
Video	Last packet of frame	Frame boundary detection
Audio (silence suppression)	First packet after silence	Talkspurt indication
RFC 4733 DTMF	First packet of a new event (the end is flagged by the E bit in the payload)	Event boundary

Reserved Payload Types

The AVP profile specifies that PT 72–76 are reserved (not to be used for payloads) because those values correspond to RTCP packet type octets (200 through 204) if interpreted as an 8-bit number with the marker bit set, which aids in RTP/RTCP demultiplexing. This reservation was made so that if RTP and RTCP arrive on the same port, a packet whose second octet reads as "payload type" 72 or 73 with the marker bit set would actually be recognized as an RTCP SR or RR.

In summary, RTP profiles (like AVP, AVPF, SAVP, etc.) build on the base RTP/RTCP specification to tailor it to specific use cases, while payload format specs define how to carry particular media formats in RTP. The existence of profiles means that RTP can accommodate new media types and features without redesigning the core protocol – one simply adds a new profile or payload format that all endpoints in a session agree to use. Despite this flexibility, all RTP variants retain the same fundamental packet structure and semantics at a high level, so that generic RTP monitoring and mixing tools can operate across profiles.

Session Setup and Signaling

It's important to understand that RTP itself does not provide session establishment, codec negotiation, or media session coordination – it only handles the transport of media once the session is agreed upon. In typical deployments (e.g., Voice over IP), these functions are handled by a separate signaling protocol such as SIP (Session Initiation Protocol) or H.323. SIP is used to set up, modify, and tear down calls or conferences, and it uses the Session Description Protocol (SDP) to negotiate media parameters between endpoints.

SDP conveys the details of the media streams: what codec(s) to use, what RTP payload type numbers correspond to those codecs, the IP addresses and UDP ports where each side will send/receive RTP and RTCP, whether media is sent or recv-only, etc.

A typical call flow goes like this: SIP INVITE messages are exchanged between endpoints (possibly through proxy servers) to agree on a session. The INVITE and 200 OK messages carry an SDP offer and answer. For example, Alice's phone might offer to send/receive audio with RTP at IP 10.1.1.5 port 4000 (RTP) and 4001 (RTCP) using codecs PCMU (payload type 0) and Opus (dynamic payload 98), and video at port 4002/4003 with H.264 (payload 102). Bob's phone replies with an answer selecting one audio codec (say PCMU) and one video codec, and provides his IP/port where he will receive RTP/RTCP.

Once this exchange is done, both sides know each other's RTP transport addresses (IP + port for RTP, and typically port+1 for RTCP unless multiplexing is signaled), the codec to use, and the payload type numbers that map to that codec. They then begin sending RTP packets to each other directly (and RTCP periodically), carrying the agreed media. From this point, SIP is out of the picture until something like a call hold or termination happens; the media flows independently of the SIP signaling.

Transport Ports

Convention	RTP Port	RTCP Port	Notes
Traditional	Even number (e.g., 4000)	RTP + 1 (e.g., 4001)	Separate ports
RTCP Mux	Same as RTP	Same as RTP	SDP: `a=rtcp-mux`
Demultiplexing	PT 0-127	PT >= 200 (RTCP types)	By second byte (the one holding the payload/packet type)

Traditionally, RTP uses an even-numbered UDP port and the next odd-numbered port for RTCP. SDP indicates this in the "m=" lines (e.g., m=audio 4000 RTP/AVP 0 98 means audio via RTP/AVP on port 4000, and implicitly RTCP on 4001 unless otherwise specified).

Modern usage can negotiate RTP/RTCP multiplexing on the same port (called "RTCP mux"), which is signaled by an SDP attribute a=rtcp-mux. If both sides agree, RTP and RTCP packets are sent to the same UDP port. RFC 5761 updates RTP to describe this multiplexing and recommends using it to ease NAT traversal and simplify port management.
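When both protocols share a port, the receiver has to tell them apart from the packet bytes alone. The Python sketch below shows one common convention: look at the second byte, mask off the top bit, and treat values 64-95 as RTCP, which relies on RTP payload types in that range not being used when multiplexing. `classify` is a hypothetical helper, not an API from any library.

```python
def classify(datagram: bytes) -> str:
    """Guess whether a datagram on a multiplexed port is RTP or RTCP."""
    if len(datagram) < 2 or datagram[0] >> 6 != 2:
        return "unknown"             # not version 2, so neither RTP nor RTCP
    pt = datagram[1] & 0x7F          # RTP: payload type; RTCP: packet type minus 128
    return "rtcp" if 64 <= pt <= 95 else "rtp"


kinds = [
    classify(bytes([0x80, 200, 0x00, 0x06])),   # RTCP SR (packet type 200)
    classify(bytes([0x80, 0xE0, 0x30, 0x39])),  # RTP, marker set, PT 96
    classify(bytes([0x80, 0x00, 0x00, 0x01])),  # RTP, PT 0 (PCMU)
]
```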

Codec and Payload Negotiation

The mapping of dynamic payload type numbers to actual codec formats is achieved through signaling (SDP rtpmap and fmtp attributes). Both ends maintain a table of PT → codec (with clock rate, channels, etc.). During the session, the RTP packets simply refer to the codec via the PT in each packet.

If a sender needs to switch codecs mid-session (to adapt to bandwidth or due to a participant's capabilities), it can do so by sending RTP with a different payload type (one that was agreed on in SDP). For example, an RTP stream might switch from PT 0 (PCMU audio) to PT 8 (PCMA) on the fly, and the receiver will decode accordingly since those were agreed. However, introducing a codec not agreed on is not allowed without renegotiation.

SDP Media Description

v=0
o=alice 2890844526 2890844526 IN IP4 10.1.1.5
s=VoIP Call
c=IN IP4 10.1.1.5
t=0 0
m=audio 4000 RTP/AVP 0 8 96
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:96 opus/48000/2
a=rtcp-mux
a=sendrecv

Session Management

SIP (or other signaling such as Jingle over XMPP, application-defined signaling in WebRTC, or RTSP for streaming) can manage when media starts/stops, put sessions on hold (by indicating port 0 or sendonly/recvonly in SDP), add new media streams, remove streams, etc. RTP itself will just start or stop receiving packets accordingly. RTCP can signal some things like BYE when an endpoint is done, but higher-level coordination (like a SIP BYE to end the call) usually precedes or accompanies the end of RTP flow.

In multicast scenarios or conference bridges, a session may not involve SIP at all – sometimes static configuration or other means (like an announcement that "send RTP to multicast 224.x.y.z:5000") is used. But the prevalent usage in VoIP and unified communications is RTP with SIP/SDP. This combination is so common that people loosely talk about "SIP calls" carrying voice, but technically SIP does the call setup, RTP does the heavy lifting of media transport.

A practical takeaway is: if troubleshooting a call, one can examine SIP messages to see if a call was established and codec agreed, and then examine RTP packets to see if media flowed. The separation also means quality issues (choppy audio) are often due to RTP path problems (packet loss, jitter) rather than SIP issues.

Mixers and Translators

RTP is designed to support more complex network topologies than simple point-to-point streams. It introduces the concepts of mixers and translators – intermediate nodes that participate in an RTP session to facilitate scenarios like multi-party mixing, media format conversion, or firewall traversal. These devices operate at the RTP layer (not just IP routing), so they understand RTP headers and can modify or generate RTP packets on behalf of other participants.

**Mixer vs Translator Comparison**
Feature	Translator	Mixer
SSRC handling	Preserves original SSRCs	Creates new SSRC, lists originals in CSRC
Stream output	Separate streams forwarded	Single combined stream
Use cases	Relay, transcoding, firewall traversal	Audio conferencing, video compositing
Timing	Maintains original timing	Becomes timing master
Bandwidth	Sum of all streams	Single stream (reduced)
RTCP	Forwards or adjusts	Generates own SR, sends RR upstream

Translator

An RTP translator is an intermediate system that forwards RTP packets, possibly with some changes, while preserving the original sender's SSRC identity. Translators are typically used to bridge disjoint network domains or to convert encodings. Examples include:

A translator that takes a multicast RTP stream and replicates it as unicast RTP streams to individual clients (multicast-to-unicast relay)
An encryption translator that decrypts incoming secure RTP and re-encrypts it with a different key for a different domain
A translator that transcodes audio (e.g., from a high-bandwidth codec to a low-bandwidth codec) without mixing multiple sources together

The key is that a translator does not mix streams – it passes through each source stream separately, maintaining the original SSRC. If it changes the media encoding, it must adjust RTP sequence numbers and timestamps accordingly to not break the receiver's processing. Simple translators might not modify RTP at all, just forward packets (e.g., an RTP packet reflector or an IPv4-to-IPv6 RTP gateway).

Notably, a translator doesn't generate its own RTP stream – it's relaying someone else's stream. Therefore, translators do not have their own SSRC unless they need to send their own RTCP or act as a partial participant.

Mixer

An RTP mixer receives RTP streams from one or more sources, combines or processes them, and outputs a new RTP stream with its own SSRC that represents the mixture. Mixers are used when combining multiple media streams is desired, such as in audio conference bridges (summing audio from several participants into one stream) or video mixers (compositing several video feeds into one layout).

A mixer is an active participant in the RTP session; it assigns a new SSRC for the mixed stream it generates (and it will send RTCP reports for that SSRC as a sender). To preserve information about the contributors, a mixer uses the CSRC list in the RTP header of each mixed packet to list the SSRCs of the sources that went into the mix.

For instance, if a mixer M is combining audio from sources S1 and S2, the RTP packets sent by M will have SSRC=M, and in each packet's header CSRC list it might include S1 and S2 (the contributing sources for that packet's audio). This way, downstream receivers can display indicators of which participants are currently speaking.
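A minimal Python sketch of the header a mixer might emit for such a packet: the mixer's own SSRC goes in the SSRC field, the CC field counts the contributors, and their SSRCs follow as the CSRC list. `build_mixed_header` and the SSRC values standing in for M, S1, and S2 are hypothetical.

```python
import struct
from typing import Sequence


def build_mixed_header(mixer_ssrc: int, contributors: Sequence[int], seq: int,
                       timestamp: int, payload_type: int, marker: bool = False) -> bytes:
    """Pack an RTP header whose CSRC list names the sources in the mix."""
    csrcs = list(contributors)[:15]              # CC is 4 bits, so at most 15 entries
    b0 = (2 << 6) | len(csrcs)                   # V=2, P=0, X=0, CC=len(csrcs)
    b1 = (0x80 if marker else 0) | (payload_type & 0x7F)
    return struct.pack(f"!BBHII{len(csrcs)}I", b0, b1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, mixer_ssrc, *csrcs)


M, S1, S2 = 0xAAAA0001, 0x11111111, 0x22222222
mixed = build_mixed_header(M, [S1, S2], seq=1, timestamp=160, payload_type=0)
```

The resulting header is 20 bytes long: the 12-byte fixed part plus two 32-bit CSRC entries.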

A mixer also needs to adjust timing – inputs may have different RTP timestamps and clock drift. The mixer will generate new timestamps for its outgoing packets (it is the timing master for its stream). By generating its own synchronized stream, a mixer breaks end-to-end timing: receivers cannot directly synchronize an original source's stream with another if a mixer is in between.

The advantages of mixers include bandwidth efficiency and simplicity for receivers: e.g., in a 10-party audio conference, instead of each endpoint sending 9 separate streams to others, a central audio mixer can combine and send one stream to each with everyone's voices mixed. The main disadvantage is reduced flexibility (clients can't individually control volume or selection of participants in the mix).

Topology Considerations

Both mixers and translators connect multiple RTP "clouds" or groups. All participants linked by mixers/translators form a single RTP session logically, sharing a common SSRC space. For example, in a mixed conference with one mixer, all end systems plus the mixer share the same session and all SSRCs must be unique across them.

To illustrate: imagine a scenario with two local networks connected by an RTP translator (maybe due to a firewall). All senders on Net A and Net B choose unique SSRCs. The translator passes their RTP packets across, unchanged except possibly IP/port and maybe payload conversion. The session is still one RTP session – everyone's SSRC is visible to everyone.

Alternatively, consider an MCU (multipoint control unit) acting as a central mixer for video: each participant sends their video to the MCU. The MCU (mixer) composes a grid of videos and sends out a single video stream from SSRC=MCU to all receivers. That stream's CSRC list might include the SSRCs of everyone currently visible in the grid.

In summary, mixers and translators allow RTP to be used flexibly in multi-party or complex networks. They are explicitly supported by the protocol (RFC 3550 devotes Section 7 to their behavior). A translator keeps streams separate and unchanged in identity, whereas a mixer merges streams and becomes the new source.

Security Considerations (SRTP)

The RTP/RTCP specification initially included an optional encryption method (specified in RFC 3550 Section 9.1) that could encrypt RTP payloads and RTCP packets for confidentiality. However, the built-in method was not very strong by modern standards. Recognizing the need for robust security, the IETF developed Secure RTP (SRTP), defined in RFC 3711, as a profile of RTP that provides encryption, message authentication, and integrity for RTP and RTCP.

SRTP is nowadays the de facto way to secure RTP in applications like SIP calls (when using ZRTP or SDES key exchange) and is mandatory in WebRTC (via DTLS-SRTP).

**SRTP Packet Structure**
Component	Encryption	Purpose
RTP Header	Cleartext	SSRC, seq, timestamp visible for routing
Payload	AES Encrypted	Media content protected
Auth Tag	HMAC-SHA1	Integrity verification (typically 10 bytes)

SRTP basics: SRTP is essentially an additional layer that processes RTP packets before transmission and after reception. It uses strong cryptographic algorithms (the default is AES) to encrypt the RTP payload (and/or header extensions) and uses an authentication algorithm (such as HMAC-SHA1) to ensure integrity – so that packets cannot be tampered with undetected.

The RTP header is left in the clear so that intermediaries can still parse SSRCs, sequence numbers, etc., and so that header compression or mixers can function if needed. The header is still covered by the authentication tag, and header extension contents can optionally be encrypted (RFC 6904) depending on policy. Each SRTP packet carries a cryptographic auth tag (authentication code) after the RTP payload. Similarly, SRTCP secures RTCP packets.

SRTP itself defines the cipher framework; it is signaled as an RTP profile (called RTP/SAVP – Secure Audio/Video Profile) in protocols like SDP. For instance, an SDP can indicate m=audio 4000 RTP/SAVP 0 to signal that RTP will be run with SRTP.

SRTP Key Exchange Methods

SRTP does not define how keys are established; the master keys must come from a separate key-management protocol:

Method	RFC	Description	Usage
SDES	4568	Keys in SDP (deprecated)	Legacy SIP
DTLS-SRTP	5764	In-band DTLS handshake	WebRTC (mandatory)
ZRTP	6189	In-call DH exchange	Opportunistic encryption
MIKEY	3830	Multimedia Internet KEYing	Group scenarios

DTLS-SRTP

DTLS-SRTP (RFC 5764) defines a standard way to perform the SRTP key exchange using a DTLS handshake carried in-band on the same ports as RTP/RTCP. This is what WebRTC and many modern systems use: keys are agreed securely, and perfect forward secrecy is available when an ephemeral Diffie-Hellman cipher suite is negotiated. With DTLS-SRTP, RTP packets can share a port with DTLS and other traffic; the receiver examines the first byte of each datagram to tell whether it is a DTLS record, RTP/RTCP, or STUN (used by ICE).
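The first-byte check can be sketched as follows. The value ranges are the ones given in RFC 5764 Section 5.1.2 (later documents refine them further); the function name and return labels are assumptions made for this example.

```python
def classify_packet(datagram: bytes) -> str:
    """Guess the protocol of a datagram from its first byte."""
    if not datagram:
        return "unknown"
    first = datagram[0]
    if first <= 1:
        return "stun"          # STUN messages start with 0 or 1
    if 20 <= first <= 63:
        return "dtls"          # DTLS record content types
    if 128 <= first <= 191:
        return "rtp_or_rtcp"   # RTP version 2 sets the top two bits to 10
    return "unknown"
```

RTP and RTCP land in the same bucket here; separating them requires looking at the payload type / packet type byte, as discussed for RTCP multiplexing.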

DTLS-SRTP provides:

Perfect forward secrecy (DH key exchange)
Certificate fingerprint verification (via SDP)
Multiplexing on same port as RTP/RTCP/STUN

SDP attributes for DTLS-SRTP:

a=fingerprint:sha-256 AB:CD:EF:...
a=setup:actpass
a=rtcp-mux

Security Benefits

By using SRTP, we gain:

Confidentiality – media content is not exposed on the wire
Integrity/Authentication – packets cannot be forged or altered without detection
Replay protection – SRTP has a replay window and sequence counter to drop old or duplicated packets
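To make the replay window idea concrete, here is a minimal sliding-window sketch. It is not the RFC 3711 reference algorithm: the class name and the 64-entry size are illustrative, `index` stands for the SRTP packet index, and a real implementation would only update the window after the packet has passed authentication.

```python
class ReplayWindow:
    """Sliding bitmap of recently seen packet indices."""

    def __init__(self, size=64):
        self.size = size
        self.highest = -1   # highest packet index accepted so far
        self.bitmap = 0     # bit i set => index (highest - i) was seen

    def check_and_update(self, index):
        """Return True if the packet is fresh, False if replayed or too old."""
        if index > self.highest:
            shift = index - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False                    # fell out of the window
        if self.bitmap & (1 << offset):
            return False                    # already seen: replay
        self.bitmap |= 1 << offset          # late but new packet
        return True
```

A late packet inside the window is accepted once; a second copy of it, or anything older than the window, is dropped.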

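The overhead of SRTP is minimal: with the default transforms it is typically just the authentication tag (10 bytes by default) plus an optional MKI field. No IV (initialization vector) is sent on the wire, since it is derived from the SSRC, packet index, and session salt. SRTP was designed to be bandwidth efficient so that it can be used even on low-bandwidth calls.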

Additional Security Considerations

Note that using SRTP doesn't change how RTP looks to the networking layer (it's still UDP packets, just with encrypted payloads). However, some features like mixers or translators need to be aware: a mixer can't decode and mix an SRTP stream unless it has the keys, so typically mixing is done either after decryption (if the mixer is trusted with keys) or at endpoints.

Encryption aside, RTP/RTCP sessions can be a vector for denial of service (flooding a target with RTP, or misbehaving by sending excessive RTCP). Implementations should verify that the RTP/RTCP packets they process come from expected addresses and possibly rate-limit certain reports. Also, participants should choose strong random SSRCs and CNAMEs to avoid fingerprinting.

Practical note: If you see a=crypto lines in SDP or a=fingerprint (for DTLS), those are indicators of SRTP being used. If using Wireshark to debug, SRTP packets won't decode as audio unless you supply the keys (since they're encrypted), whereas plain RTP will.
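The indicators mentioned above can be checked mechanically. The sketch below scans an SDP body for them; the function name, the dictionary keys, and the sample SDP (including the placeholder key material) are made up for illustration.

```python
def srtp_keying_hints(sdp: str) -> dict:
    """Report which SRTP-related indicators appear in an SDP body."""
    lines = [ln.strip() for ln in sdp.splitlines()]
    return {
        # RTP/SAVP, RTP/SAVPF, UDP/TLS/RTP/SAVPF all contain "SAVP"
        "secure_profile": any(ln.startswith("m=") and "SAVP" in ln
                              for ln in lines),
        "sdes_crypto": any(ln.startswith("a=crypto:") for ln in lines),
        "dtls_fingerprint": any(ln.startswith("a=fingerprint:")
                                for ln in lines),
    }

sample_sdp = "\n".join([
    "m=audio 4000 RTP/SAVP 0",
    "a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:<key-material>",
])
hints = srtp_keying_hints(sample_sdp)
```

For the sample above, `secure_profile` and `sdes_crypto` are true while `dtls_fingerprint` is false, which points to SDES keying rather than DTLS-SRTP.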

Monitoring RTP with VoIPmonitor

Understanding RTP protocol theory is essential, but in production environments you need tools to monitor, analyze, and troubleshoot RTP streams in real time. VoIPmonitor is an open-source network packet sniffer with a commercial frontend, designed specifically for VoIP quality monitoring and troubleshooting. It passively captures and analyzes SIP signaling, RTP/RTCP streams, and WebRTC traffic to provide comprehensive visibility into call quality.

Real-Time RTP Stream Analysis

VoIPmonitor performs deep inspection of every RTP packet, extracting and correlating all the header fields discussed in this article:

**VoIPmonitor RTP Analysis Capabilities**
RTP Element	What VoIPmonitor Monitors	Practical Benefit
Sequence Numbers	Detects gaps, duplicates, and out-of-order packets	Pinpoint exact moment of packet loss
Timestamps	Calculates inter-arrival jitter per RFC 3550	Measure network timing stability
SSRC/CSRC	Tracks individual streams and mixer sources	Correlate quality per participant
Payload Type	Identifies codec changes during call	Detect codec negotiation issues
Marker Bit	Detects frame boundaries and talkspurts	Analyze silence suppression behavior
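As a generic illustration of sequence-number analysis (not VoIPmonitor's actual implementation, which is not described here), the sketch below classifies each packet relative to the previous one while handling the 16-bit wraparound. The function name and labels are assumptions.

```python
def classify_sequence(prev_seq: int, seq: int) -> str:
    """Compare two 16-bit RTP sequence numbers, allowing for wraparound."""
    delta = (seq - prev_seq) & 0xFFFF
    if delta == 0:
        return "duplicate"
    if delta == 1:
        return "in_order"
    if delta < 0x8000:
        return "gap"            # delta - 1 packets are missing (or late)
    return "out_of_order"       # seq is behind prev_seq
```

This simple version only compares against the immediately preceding packet; a fuller analyzer would keep a window of recent sequence numbers so that a late packet can be told apart from a duplicate.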

Quality Metrics and MOS Calculation

VoIPmonitor calculates voice quality metrics in real time from jitter buffer simulation, providing:

MOS (Mean Opinion Score) - Predicts perceived voice quality on a 1-5 scale based on measured network impairments (jitter, packet loss)
Jitter - Packet delay variation measured from RTP timestamps, as described in the RTCP Receiver Report Fields section
Packet Loss - Both total and burst loss patterns that severely impact quality
PDV (Packet Delay Variation) - One-way delay measurements when possible
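For reference, the interarrival jitter estimator defined in RFC 3550 can be sketched as below. This shows the RFC formula only, not VoIPmonitor's internal code; the function name is illustrative and arrival times are assumed to be already converted to RTP timestamp units.

```python
def update_jitter(jitter, prev_transit, arrival_ts, rtp_ts):
    """One step of the RFC 3550 estimator: J += (|D| - J) / 16."""
    transit = arrival_ts - rtp_ts            # relative transit time
    if prev_transit is None:                 # first packet: nothing to compare
        return jitter, transit
    d = abs(transit - prev_transit)          # change in transit time
    jitter += (d - jitter) / 16.0            # smoothed with gain 1/16
    return jitter, transit

# Packets sent every 160 ticks; the third one arrives 10 ticks late.
jitter, prev = 0.0, None
sent = [0, 160, 320, 480]
arrived = [0, 160, 330, 480]
for a, s in zip(arrived, sent):
    jitter, prev = update_jitter(jitter, prev, a, s)
```

The 1/16 gain smooths out single outliers while still letting the estimate follow sustained changes in delay variation.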

Note: VoIPmonitor does not calculate the R-Factor metric. R-Factor is considered redundant to the VoIPmonitor MOS score, as monitoring MOS provides equivalent information. See the R-Factor definition for details.

The MOS calculation incorporates all these factors plus codec-specific impairment factors (Ie) to predict how users would rate the call quality:

**MOS Score Interpretation**
MOS Range	Quality	User Perception
4.3 - 5.0	Excellent	Toll quality, no perceivable degradation
4.0 - 4.3	Good	Slight degradation, acceptable for business
3.6 - 4.0	Fair	Some users dissatisfied
3.1 - 3.6	Poor	Many users dissatisfied
1.0 - 3.1	Bad	Nearly all users dissatisfied

RTCP Data Integration

VoIPmonitor leverages both computed metrics (from RTP packet analysis) and RTCP reports sent by endpoints. The Receiver Report data provides endpoint-verified quality measurements, giving you two independent views:

Sender-side metrics - Computed from captured RTP packets
Receiver-side metrics - Reported via RTCP from the actual receiving endpoint

Comparing these helps identify whether quality issues are in the network path or at the endpoints.

Packet-Level Deep Inspection

For detailed troubleshooting, VoIPmonitor allows drill-down to individual RTP packets with Wireshark-like detail views. You can:

Examine exact timing and sequence numbers
Identify specific packets causing quality degradation
View full packet captures (PCAP) for any call
Analyze RTP stream graphs showing jitter and loss over time
Detect one-way audio and silence/clipping issues

Advanced Detection Features

Beyond basic metrics, VoIPmonitor provides specialized detection:

One-way audio detection - Automatically identifies asymmetric audio flow with visual comparison
Silence detection - Finds calls with unusual silence patterns
Audio clipping - Detects signal overload causing distortion
SRTP decryption - Analyze encrypted RTP when keys are available

Deployment Options

VoIPmonitor operates as a passive network sniffer, meaning it monitors traffic without affecting call quality:

SPAN/Mirror ports - Connect to switch mirror port
Network TAPs - Hardware TAPs for high-volume environments
RSPAN/ERSPAN - Remote monitoring across network segments
SBC integration - Native packet duplication from AudioCodes, Ribbon, Oracle SBC, Cisco CUBE

This passive approach is ideal for production monitoring where you need visibility into RTP quality without impacting the actual media path.

For more information, visit voipmonitor.org or see the VoIPmonitor documentation.

Troubleshooting RTP

**Common RTP Issues and Diagnostics**
Issue	Symptoms	RTCP Indicator	Solution
Packet loss	Choppy audio, video artifacts	High fraction lost in RR	Check network path, QoS
High jitter	Audio gaps, video stuttering	High jitter value in RR	Increase jitter buffer, check network
One-way audio	Only one party hears	No RTP received	Check NAT, firewall, SDP IPs
No media	Complete silence	No RR/SR packets	Verify signaling, check ports
Codec mismatch	Garbled audio	PT doesn't match expected	Verify SDP negotiation
Clock drift	A/V desync over time	Compare SR timestamps	Use RTCP for sync
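To read the RTCP indicators in the table from a raw capture, a Receiver Report block can be decoded as sketched below. The field layout follows the RFC 3550 report block (24 bytes); the function name and dictionary keys are assumptions for this example, and `clock_rate` must be supplied by the caller from the negotiated codec.

```python
import struct

def parse_report_block(block: bytes, clock_rate: int) -> dict:
    """Decode one 24-byte RTCP report block into readable values."""
    if len(block) < 24:
        raise ValueError("report block must be at least 24 bytes")
    ssrc, loss_word, ext_seq, jitter, lsr, dlsr = struct.unpack(
        "!IIIIII", block[:24])
    cumulative = loss_word & 0xFFFFFF
    if cumulative & 0x800000:                # 24-bit signed value
        cumulative -= 1 << 24
    return {
        "ssrc": ssrc,
        "fraction_lost": (loss_word >> 24) / 256.0,  # 8-bit fixed point
        "cumulative_lost": cumulative,
        "highest_seq": ext_seq,
        "jitter_ms": jitter * 1000.0 / clock_rate,   # timestamp units -> ms
        "lsr": lsr,
        "dlsr_s": dlsr / 65536.0,                    # units of 1/65536 s
    }
```

A fraction-lost byte of 64, for example, means 25% of packets were lost since the previous report.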

Wireshark RTP Analysis

Key filters for RTP analysis:

rtp                          # All RTP packets
rtcp                         # All RTCP packets
rtp.ssrc == 0x1234abcd       # Specific stream
rtp.marker == 1              # Frame boundaries
rtcp.pt == 200               # Sender Reports
rtcp.pt == 201               # Receiver Reports

Telephony > RTP > RTP Streams - shows all streams with statistics

VoIPmonitor for Production Environments

While Wireshark is excellent for ad-hoc packet analysis, production VoIP environments benefit from dedicated monitoring solutions. VoIPmonitor provides:

Automatic call correlation - Links RTP streams to SIP calls without manual filtering
Historical analysis - Search and analyze past calls by quality metrics
Alerting - Real-time notifications when MOS drops below threshold
Scalability - Handle 100,000+ concurrent calls
PCAP export - Generate Wireshark-compatible captures for any call

This makes it ideal for ongoing quality assurance and rapid troubleshooting in carrier and enterprise environments.

Quick Reference Tables

RTP Header Bit Layout

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             SSRC                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          CSRC list                            |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
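The layout above maps directly onto a few shifts and masks. The following is a minimal parsing sketch; the function name and the returned dictionary keys are illustrative, and header extensions and padding are not processed.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed RTP header and CSRC list from a raw packet."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                           # CSRC count
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = list(struct.unpack("!%dI" % cc, packet[12:end]))
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "cc": cc,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }
```

A packet whose `version` is not 2 is usually not RTP at all, which makes this field a cheap first sanity check.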

Clock Rates by Media Type

Media Type	Typical Clock Rate	Examples
Narrowband audio	8000 Hz	G.711, G.729, GSM
Wideband audio	16000 Hz	G.722.1, AMR-WB
Full-band audio	48000 Hz	Opus
Video	90000 Hz	H.264, VP8, H.265
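The clock rate determines how far the RTP timestamp advances per packet. A small sketch of the arithmetic (function names are illustrative):

```python
def timestamp_increment(clock_rate_hz: int, packet_ms: int) -> int:
    """RTP timestamp ticks covered by one packet of the given duration."""
    return clock_rate_hz * packet_ms // 1000

def ticks_to_ms(ticks: int, clock_rate_hz: int) -> float:
    """Convert a timestamp difference back into milliseconds."""
    return ticks * 1000.0 / clock_rate_hz

g711_step = timestamp_increment(8000, 20)    # 20 ms of narrowband audio
opus_step = timestamp_increment(48000, 20)   # 20 ms of full-band audio
video_step = 90000 // 30                     # one frame at 30 fps
```

So a 20 ms G.711 packet advances the timestamp by 160, a 20 ms Opus packet by 960, and a 30 fps video frame by 3000.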

Conclusion

We have covered the essential aspects of RTP as specified in RFC 3550 and related RFCs, including its packet format, companion RTCP protocol, usage with SIP/SDP, profiles, mixers/translators, and security enhancements. RTP (v2) has proven to be a flexible, scalable framework for real-time communication, standing the test of time since its first standardization (RFC 1889 in 1996, obsoleted by RFC 3550 in 2003).

Its design separates concerns: media content delivery is handled efficiently at the transport layer, while details of session setup and control are left to higher-level protocols. RTP's extensions and profiles (from AVP to AVPF to SAVP and beyond) demonstrate how it can evolve: new codecs, feedback mechanisms, and security features have been incorporated without altering the fundamental protocol. This makes RTP a cornerstone of VoIP, streaming, and telepresence systems – from traditional SIP phone calls to WebRTC peer-to-peer video chats, chances are RTP is carrying the media under the hood.

For those looking to deepen their understanding beyond this summary, the full RFC 3550 is recommended reading (it includes many details like jitter calculation, RTCP scheduling algorithms, etc.). Additionally, RFC 3551 (A/V Profile) provides insight into payload type mappings and considerations for audio/video usage.

In summary, RTP provides the real-time delivery capabilities – sequencing, timing, mixing, feedback – that enable interactive voice, video, and other media applications to work over the unpredictable Internet. With the help of RTCP, it can adapt to network conditions and provide monitoring. Through profiles and signaling, it can support any media format securely and efficiently. This makes RTP a powerful and indispensable tool in the Internet protocol suite for real-time communication.

References

Primary Standards

RFC 3550 - RTP: A Transport Protocol for Real-Time Applications
RFC 3551 - RTP Profile for Audio and Video Conferences (AVP)
RFC 3711 - The Secure Real-time Transport Protocol (SRTP)
RFC 4585 - Extended RTP Profile for RTCP-Based Feedback (AVPF)

Extensions and Updates

RFC 5761 - Multiplexing RTP Data and Control Packets on a Single Port
RFC 5764 - DTLS Extension to Establish Keys for SRTP
RFC 6051 - Rapid Synchronisation of RTP Flows
RFC 6222 - Guidelines for Choosing RTCP Canonical Names (CNAMEs) (obsoleted by RFC 7022)
RFC 8285 - A General Mechanism for RTP Header Extensions
RFC 3611 - RTP Control Protocol Extended Reports (RTCP XR)
RFC 5104 - Codec Control Messages in AVPF

Payload Formats

RFC 4733 - RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals
RFC 6184 - RTP Payload Format for H.264 Video
RFC 7587 - RTP Payload Format for Opus Speech and Audio Codec
RFC 4103 - RTP Payload for Text Conversation

Related Protocols

RFC 4566 - SDP: Session Description Protocol (obsoleted by RFC 8866)
RFC 3261 - SIP: Session Initiation Protocol

AI Summary for RAG

Summary: Comprehensive guide to the Real-time Transport Protocol (RTP) as defined in RFC 3550. Covers RTP packet structure (12-byte fixed header with V/P/X/CC/M/PT fields, sequence number, timestamp, SSRC), payload types (static 0-95 and dynamic 96-127), and header extensions (RFC 5285/8285). Explains RTCP companion protocol with packet types (SR, RR, SDES, BYE, APP), quality feedback mechanisms (fraction lost, cumulative loss, jitter, LSR, DLSR), and bandwidth control (~5% of session bandwidth). Details RTP profiles (AVP, AVPF, SAVP, SAVPF), session setup via SIP/SDP, transport port conventions (even RTP, odd RTCP, or RTCP-mux), and dynamic payload negotiation using rtpmap/fmtp SDP attributes. Covers mixers (combine streams, new SSRC with CSRC list) vs translators (preserve original SSRCs). Security section explains SRTP (RFC 3711) with AES encryption, key exchange methods (SDES, DTLS-SRTP, ZRTP, MIKEY), and DTLS-SRTP mandatory for WebRTC. Includes VoIPmonitor integration for production RTP monitoring with MOS calculation, jitter buffer simulation, and RTCP data correlation.

Keywords: RTP, Real-time Transport Protocol, RFC 3550, RTCP, RTP header, sequence number, timestamp, SSRC, CSRC, payload type, jitter, packet loss, MOS, Mean Opinion Score, codec, SIP, SDP, RTP/AVP, RTP/AVPF, SRTP, DTLS-SRTP, encryption, mixer, translator, Sender Report, Receiver Report, VoIPmonitor, voice quality, video streaming, WebRTC, G.711, Opus, H.264

Key Questions:

What is the RTP packet structure and header format?
How does RTP sequence numbering work for packet loss detection?
What is the difference between static and dynamic payload types?
How does RTCP provide quality feedback for RTP streams?
What metrics are in an RTCP Receiver Report (fraction lost, jitter, DLSR)?
How do SIP and SDP negotiate RTP media sessions?
What is the difference between an RTP mixer and translator?
How does SRTP encrypt RTP media streams?
What is DTLS-SRTP and why is it mandatory in WebRTC?
How does VoIPmonitor calculate MOS from RTP stream analysis?
What causes one-way audio in RTP calls?
How to troubleshoot RTP packet loss and jitter issues?