Something I end up explaining relatively often has to do with all the various ways you can stream video encapsulated in the Real-time Transport Protocol, or RTP, and still claim to be standards compliant.
Some background: RTP is used primarily to stream either H.264 or MPEG-4 video. RTP is a system protocol that provides mechanisms to synchronize the presentation different streams – for instance audio and video. As such, it performs some of the same functions as an MPEG-2 transport or program stream.
RTP – which you can read about in great detail in RFC 3550 – is codec-agnostic. This means it is possible to carry a large number of codec types inside RTP; for each protocol, the IETF defines an RTP profile that specifies any codec-specific details of mapping data from the codec into RTP packets. Profiles are defined for H.264, MPEG-4 video and audio, and many more. Even VC-1 – the “standardized” form of Windows Media Video – has an RTP profile.
In my opinion, the standards are a mess in this area. It should be possible to meet all the various requirements on streaming video with one or at most two different methods for streaming. But ultimately standards bodies are committees: each person puts in a pretty color, and the result comes out grey.
In fact, the original standards situation around MPEG-4 video was so confused that a group of large companies formed the Internet Streaming Media Alliance, or ISMA. ISMA’s role is basically to wade into all the different options presented in the standards and create a meta-standard – currently ISMA 2.0 – that ties a number of other standards documents together and tells you how to build a working system that will interoperate with other systems.
In any event, there are a number of predominant ways to send MPEG-4 or H.264 video using RTP, all of which follow some relevant standards. If you’re writing a decoder, you’ll normally need to address all of them, so here’s a quick overview.
In an environment where there is one source of a video stream and many viewers, ideally each frame of video and audio would only transit the network once. This is how multicast delivery works. In a multicast network, each viewer must retrieve an SDP file through some unspecified mechanism, which in practice is usually HTTP. Once retrieved, the SDP file gives enough information for the viewer to find the multicast streams on the network and begin playback.
In the Multicast delivery scenario, each individual stream is sent on a pair of different UDP ports – one for data and the second for the related RTP Control Protocol or RTCP. That means for a video program consisting of a video stream and two audio streams, you’ll actually see packets being delivered to six UDP ports:
Timestamps in the RTP headers can be used to synchronize the presentation of the various streams.
As a side note, RTCP is almost vestigial for most applications. It’s specified in RFC 3550 along with RTP. If you’re implementing a decoder you’ll need to listen on the RTCP ports, but you can almost ignore any data sent to you. The exceptions are the sender report, which you’ll need in order to match up the timestamps between the streams, and the BYE, which some sources will send as they tear down a stream.
Multicast video delivery works best for live content. Because each viewer is viewing the same stream, it’s not possible for individual viewers to be able to pause, seek, rewind or fast-forward the stream.
It’s also possible to send unicast video over UDP, with one copy of the video transiting the network for each client. Unicast delivery can be used for both live and stored content. In the stored content case, additional control commands can be used to pause, seek, and enter fast forward and rewind modes.
Normally in this case, the player first establishes a control connection to a server using the Real Time Streaming Protocol, or RTSP. In theory RTSP can be used over either UDP or TCP, but in practice it is almost always used over TCP.
The player is normally started with an rtsp:// URL, and this causes it to connect over TCP to the RTSP server. After some back-and-forth between the player and the RTSP server, during which the server sends the client an SDP file describing the stream, the server begins sending video to the client over UDP. As with the multicast delivery case, a pair of UDP ports is used for each of the elementary streams.
For seekable streams, once the video is playing, the player has additional control using RTSP: It can cause playback to pause, or seek to a different position, or enter fast forward or rewind mode.
I’m not a fan of streaming video over TCP. In the event a packet is lost in the network, it’s usually worse to wait for a retransmission (which is what happens with TCP’s guaranteed delivery) than it is just to allow the resulting video glitch to pass through to the user (which is what happens with UDP).
However, there are a handful of different networking configurations that would block UDP video; in particular, firewalls historically have interacted badly with the two modes of UDP delivery summarized above.
So the RTSP RFC, in section 10.12, briefly outlines a mode of interleaving the RTP and RTCP packets onto the existing TCP connection being used for RTSP. Each RTP and RTCP packet is given a four-byte prefix and dropped onto the TCP stream. The result is that the player connects to the RTSP server, and all communication flows over a single TCP connection between the two.
You would think RTSP Interleaved mode, being designed to transmit video across firewalls, would be the end, but it turns out that many firewalls aren’t configured to allow connections to TCP port 554, the well-known port for an RTSP server.
So Apple invented a method of mapping the entire RTSP Interleaved communication on top of HTTP, meaning the video ultimately flows across TCP port 80. To my knowledge, this HTTP Tunneled mode is not standardized in any official RFC, but it’s so widely implemented that it has become a de-facto standard.