/** \page page_rtp_module_internals RTP sink and source module internals

This document explains the architecture of PipeWire's RTP module.

\tableofcontents

# Introduction {#rtp-module-internals-introduction}

The "RTP module" actually refers to a set of three modules which share source code:

- \ref page_module_rtp_sink "RTP sink module" : Creates an RTP sink node and exposes it to
  the graph. This sink node places PCM audio into an internal ring buffer. This ring buffer
  is the source for the data of outgoing packets. The RTP timestamps may be synchronized
  against PTP time, depending on what buffer mode is used. This module also has a special
  "separate PTP sender" mode, where the actual send portion is done by an internal mini
  graph that runs on a special PTP based graph driver.
- \ref page_module_rtp_source "RTP source module" : Creates an RTP source node and exposes
  it to the graph. This source node receives RTP packets and places their PCM data into an
  internal ring buffer. The node's process callback reads from that ring buffer and outputs
  that data to the graph. Depending on what mode is used, the position that the ring buffer
  is read from may be synchronized against a PTP time source.
- \ref page_module_rtp_sap "SAP module" : Announces SAP sessions via multicast, and also
  listens for SAP sessions. If it discovers another SAP session, it instantiates the RTP
  source module, which in turn creates and exposes its RTP source node. See RFC 2974 for
  more about SAP.

For notes about the configuration, see the individual module documentation.

# RTP stream details {#rtp-module-internals-stream-details}

The core of the RTP sink and source modules is the `rtp_stream`. This is built around a
\ref pw_stream "PipeWire stream". This stream can operate in the `PW_DIRECTION_INPUT`
direction (used by the RTP sink module) or in the `PW_DIRECTION_OUTPUT` direction (used by
the RTP source module). The `rtp_stream` is implemented in `stream.c` and `stream.h`.
`stream.c` includes `audio.c`, `midi.c`, `opus.c`. These handle media subtype specific
setups, teardowns, and data processing:

- `audio.c` corresponds to `SPA_MEDIA_SUBTYPE_raw` and handles PCM audio.
- `midi.c` corresponds to `SPA_MEDIA_SUBTYPE_control` and handles MIDI.
- `opus.c` is similar to `audio.c`, but corresponds to `SPA_MEDIA_SUBTYPE_opus`, and
  encodes PCM audio to Opus prior to sending out RTP packets and decodes Opus encoded
  audio from incoming RTP packets.

The process callback in `rtp_stream` is set by these sources depending on the media
subtype. Other `rtp_stream` specific callbacks, such as a flush timeout handler, are also
set by these sources, since they are media subtype specific.

The RTP sink and source modules are configured via properties, represented by
`pw_properties`. Both support `stream.props` values inside their properties. These values
in turn are child `pw_properties` instances that are passed directly to their `rtp_stream`
instances. The modules also copy some of the values of their own properties into that
child `pw_properties` instance. The exact list of values that are copied over depends on
the module. This means that some values can be set either directly in the module
properties or inside the `stream.props` properties. One example of this would be
`sess.ts-direct`.

\note This document refers to this as "copying to the stream properties". Actually, a
value is copied from the module's properties to the stream properties if and only if that
value is not already set in the stream properties. If it is, the already existing value
takes priority.

`audio.c` is by far the most complex of the media subtype handlers. All three handlers
have some notion of the direct timestamp and constant latency modes, but `audio.c` is
(currently) the only one with the fully reworked implementation that this document
describes (the `impl->actual_max_buffer_size` modulo scheme, `impl->ts_align`, device
delay compensation, and the exact over/underrun thresholds).
`midi.c` and `opus.c` still carry their own, simpler direct-vs-constant-latency handling
and a `TODO` to converge on the `audio.c` approach. `audio.c` also features the separate
PTP sender mode, which the other two do not have at all.

## Ring buffer and wrap-around behavior {#rtp-module-internals-ring-buffer-behavior}

The `rtp_stream` sets up a fixed-size ring buffer. Its size is derived from the
`sess.buffer-size` property, in bytes. Note that this is a *stream* property: it is read
by `rtp_stream_new()` from the properties it is handed, and - unlike e.g. `sess.ts-direct` -
neither the sink nor the source module copies it over from its own properties, so in
practice it can only be set inside `stream.props`.

The `sess.buffer-size` value is not used verbatim. `rtp_stream_new()` derives two
quantities from it:

- `impl->buffer_size` is `sess.buffer-size` rounded *up* to the next power of two (via
  `SPA_ROUND_UP_POW2_32()`), and is the size of the actual allocation (that is, of
  `impl->buffer`). It is a power of two because the `midi.c` and `opus.c` handlers wrap
  their indices with a bit mask (`impl->buffer_mask`, and `impl->buffer_mask2` against the
  half-sized `impl->buffer_size2`) rather than a modulo, and masking only wraps correctly
  for power-of-two sizes. `impl->buffer_size` is generally *not* an integer multiple of
  the stride.
- `impl->actual_max_buffer_size` is `impl->buffer_size` rounded *down* to an integer
  multiple of the stride (via `SPA_ROUND_DOWN()`). This is used by `audio.c`, which -
  unlike `midi.c` and `opus.c` - wraps via a modulo against this value. `audio.c` was
  reworked to do this to fix the stride-alignment problem described below; `midi.c` and
  `opus.c` still use the mask scheme and carry a `TODO` to converge on it.

The actual, allocated buffer is present as `impl->buffer`. This is the pure data storage
buffer, without any read or write index.

\note `impl->buffer` and `impl->target_buffer` are not to be confused.
The former is the actual buffer, while the latter is the session latency, converted to RTP
samples. Furthermore, `sess.buffer-size` and the session latency must be picked such that
`impl->target_buffer` worth of samples fits within the buffer. Since `impl->target_buffer`
is in samples while `impl->actual_max_buffer_size` is in bytes, this means
`impl->target_buffer * stride` must not exceed `impl->actual_max_buffer_size`
(equivalently, `impl->target_buffer` must not exceed
`impl->actual_max_buffer_size / stride`).

The stride value depends on the media subtype, and is set internally by
`rtp_stream_new()`. The buffer contents are always interleaved when the number of channels
is greater than 1 and the data is raw audio (so, this does not apply to MIDI for example).
The stride value specifies the unit size inside the buffer that contains audio data for
all channels, played at the exact same time. In the PCM case, the stride is
`num_channels * bytes_per_pcm_sample`.

\note It is important to keep in mind that the way the read and write index are handled in
this ring buffer deviates somewhat from standard ring buffer usage in typical
producer-consumer schemes, especially in the direct timestamp mode (more on that further
below).

The read and write index logic is handled by `impl->ring`. Both read and write indices
increase monotonically (as free-running values) unless they are resynchronized. Because
they are free-running rather than being wrapped at the buffer boundary, the fill level is
simply their difference, and that is what removes the usual ambiguity about whether the
ring buffer is empty or full. When accessing the actual buffer contents, an index is first
turned into a byte offset (see below), and that offset is then reduced to the buffer
bounds - in `audio.c` by taking it modulo `impl->actual_max_buffer_size`, and in `midi.c`
and `opus.c` by masking it with `impl->buffer_mask` / `impl->buffer_mask2`.
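
To make the size derivation and the index-to-offset reduction described in this section
more concrete, here is a minimal sketch in plain C. It is not taken from `stream.c` or
`audio.c`: the helper names (`round_up_pow2`, `derive_sizes`, `offset_modulo`,
`frame_fits`) and the `ring_sizes` struct are hypothetical, and only the arithmetic follows
the description in this document.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: round v up to the next power of two (v >= 1).
static uint32_t round_up_pow2(uint32_t v)
{
	uint32_t p = 1;
	while (p < v)
		p <<= 1;
	return p;
}

// Hypothetical stand-in for the two sizes derived from sess.buffer-size.
struct ring_sizes {
	uint32_t buffer_size;            // allocation size in bytes, a power of two
	uint32_t actual_max_buffer_size; // largest multiple of the stride that fits
};

static struct ring_sizes derive_sizes(uint32_t requested_bytes, uint32_t stride)
{
	struct ring_sizes s;
	assert(stride > 0 && requested_bytes >= stride);
	s.buffer_size = round_up_pow2(requested_bytes);
	s.actual_max_buffer_size = s.buffer_size - (s.buffer_size % stride);
	return s;
}

// Turn a sample index into a byte offset and reduce it by a modulo.
static uint32_t offset_modulo(uint32_t index, uint32_t stride, uint32_t wrap_size)
{
	return (index * stride) % wrap_size;
}

// Returns 1 if a whole frame (stride bytes) fits between offset and the wrap point.
static int frame_fits(uint32_t offset, uint32_t stride, uint32_t wrap_size)
{
	return offset + stride <= wrap_size;
}
```

With a requested size of 100 bytes and a stride of 6, `derive_sizes()` yields an allocation
of 128 bytes and a wrap size of 126 bytes. Wrapping at 126, index 20 maps to offset 120
(a whole frame still fits) and index 21 wraps cleanly to offset 0. Wrapping at a size that
is not a multiple of the stride, such as 100, leaves index 16 at offset 96 with only 4
bytes before the wrap point.
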
Reducing modulo `impl->actual_max_buffer_size` (rather than the raw `impl->buffer_size`)
is essential for the buffer modes to work properly (explained further below).

The read and write indices are given in RTP sample units. To access data in the buffer,
the indices are multiplied by the stride to get a byte offset. This also means that the
size at which the byte offsets wrap around (which is given in bytes) must be an integer
multiple of the stride size - otherwise, the read and write indices may refer to places in
the buffer that cannot contain a full data set for all channels. For example, if the
stride is 6, and the buffer size is 100, then when the read index is 16, the byte offset
would be 16 * 6 = 96 - but there, only 4 bytes could be read, not 6. For this reason,
`audio.c` wraps at `impl->actual_max_buffer_size`, which is the buffer size rounded down to
an integer multiple of the stride size, as mentioned above.

In the RTP sink module, the `rtp_stream` appends data to the ring buffer at its write
index, except for when a resynchronization happens - the write index is then reset to
match the `spa_io_clock.position` value (scaled to RTP sample units). One resynchronization
always happens at startup. The RTP timestamps of outgoing packets are derived from the
ring buffer's read index.

In the RTP source module, `rtp_stream` reads data from the ring buffer depending on the
buffer mode. More on that further below.

## Threading model and data processing {#rtp-module-internals-threading-model}

Most of the code in `stream.c` runs in the stream's main loop, while most of the code in
the media subtype handlers (`audio.c` etc.) runs in the stream's data loop.

`stream_start()` is called by `on_stream_state_changed()` when the stream's state changes
to `PW_STREAM_STATE_STREAMING`. At that stage, the stream's data loop is running, but the
stream's PipeWire graph node is not yet attached to the data loop, so no data processing
takes place at this time. The attachment happens after `on_stream_state_changed()` has
finished.
This means that while `stream_start()` is run from the main loop, it is safe to set
internal states that are accessed and modified by other functions that run in the data
loop.

Similarly, `stream_stop()` is called by `on_stream_state_changed()` when the stream's
state changes to `PW_STREAM_STATE_PAUSED`. (It is not called, however, if
`node.always-process` in the `stream.props` properties of the RTP source module is set to
true.) At that stage, the stream's graph node has already been detached from the data
loop. It therefore is safe for `stream_stop()` to touch internal states that normally
would be accessed by functions that run in the data loop.

The media subtype handlers each have an init function, like `rtp_audio_init()`. This is
one of the functions from these handlers that runs in the main loop, since these init
functions are called by `rtp_stream_new()`. The other functions are:

- `stop_timer()` (called by `stream_start()`)
- `resend_packets()` (RAOP specific - not used by the RTP sink or source modules)
- `deinit()` (called by `rtp_stream_destroy()`)

Everything else in the media subtype handlers runs in the data loop, with the exception of
`ptp_sender_process()` in `audio.c`, which runs under the separate PTP sender's own driver
and may have a separate data loop.

`audio.c` has two extra specialties:

1. It aggregates the contents of the ring buffer such that it can split it up into RTP
   packets with the specified packet time (see `rtp.ptime` in the module and stream
   properties). Depending on how full the ring buffer is, it may decide to send out some
   of its contents within the current graph cycle, and may use a timer (which runs in the
   data loop) to schedule the output of the remaining data later, to not risk an xrun by
   blocking the data loop in the current graph cycle for too long.
2. The separate PTP sender mode is driven by its own driver. More on that mode is
   documented further below.

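
The packet splitting described in the first item above can be illustrated with a small
sketch. It is deliberately simplified and does not mirror the actual `audio.c` logic: the
function and struct names are hypothetical, the packet time is assumed to be given in
milliseconds, and the `max_now` budget (how many packets are sent in the current graph
cycle before the rest is deferred to the timer) is an invented parameter used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

// Number of samples in one packet, assuming the packet time is given in
// milliseconds, rounded to the nearest whole sample.
static uint32_t samples_per_packet(double ptime_ms, uint32_t rate)
{
	assert(ptime_ms > 0.0 && rate > 0);
	return (uint32_t)(ptime_ms * rate / 1000.0 + 0.5);
}

// Hypothetical result of splitting the ring buffer contents into packets.
struct packet_plan {
	uint32_t send_now;   // packets sent within the current graph cycle
	uint32_t send_later; // packets deferred to the timer
	uint32_t leftover;   // samples that do not fill a whole packet yet
};

static struct packet_plan plan_packets(uint32_t fill_level, uint32_t psamples,
		uint32_t max_now)
{
	struct packet_plan p;
	uint32_t complete;

	assert(psamples > 0);
	complete = fill_level / psamples;
	p.send_now = complete < max_now ? complete : max_now;
	p.send_later = complete - p.send_now;
	p.leftover = fill_level % psamples;
	return p;
}
```

For example, with a packet time of 1 ms at 48000 Hz a packet holds 48 samples, so a fill
level of 1024 samples contains 21 complete packets and 16 leftover samples. With a budget
of 4 packets per cycle, 17 packets would be left for the timer.
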

# Buffer modes {#rtp-module-internals-buffer-modes}

\note Read the buffer modes documentation in \ref page_module_rtp_source first if not
already done. Also, this section specifically describes how the buffer modes in `audio.c`
are handled. `midi.c` and `opus.c` do branch on `impl->direct_timestamp` too, but with
their own, simpler handling (and aligning those with what `audio.c` does is an open
`TODO`); the detailed behavior described here is `audio.c` specific.

The buffer mode only has a minor influence on the RTP sink module. In the constant latency
mode, `impl->ts_align` is used in resynchronization cases to avoid a discontinuity in the
outgoing RTP timestamps. In the direct timestamp mode, `impl->ts_align` is not used.

The rest of the buffer mode documentation is about the behavior on the receiving side,
that is, how the RTP source module uses the `rtp_stream`.

In both modes, received data is inserted into the ring buffer according to the RTP
timestamp. This timestamp is first shifted into the future by the value of
`impl->target_buffer`. Then, the ring buffer's write index is advanced. It is expected by
the code that the sender produces continuous timestamps; that is,
`rtp_timestamp_of_packet_2 = rtp_timestamp_of_packet_1 + rtp_samples_per_packet`. In
certain cases, resynchronization may take place. The read and write indices are then
reset: the read index is set to the timestamp of the next incoming RTP packet, while the
write index is set to that packet timestamp + `impl->target_buffer`. That is, the write
index is set to be ahead of the read index by the session latency in samples.

The write index is advanced in `rtp_audio_receive()`, and the read index is advanced in
`rtp_audio_process_playback()`.

## Constant latency mode {#rtp-module-internals-constant-latency-mode}

As mentioned in the RTP source module documentation, this is the default mode, where the
fill level is kept at a steady value, which is `impl->target_buffer`.
If the fill level is above or below this, a DLL is used to compute an error rate, which
then is fed into the ASRC of the `pw_stream` the `rtp_stream` is based on. The estimated
number of samples that are "in-flight" (that is, samples that already were sent out but
not yet received or which arrived right after the last graph cycle) is also factored into
this computation. This establishes a control loop that resamples the audio data as needed
to maintain the fill level at `impl->target_buffer`. Should the difference between the
target and the actual fill level exceed a threshold, the ring buffer indices are
resynchronized.

More concretely, the thresholds work as follows. An *underrun* is detected when fewer
samples are available than the current graph cycle needs (`avail < wanted`); the missing
samples are filled with silence and the sync state is dropped. An *overrun* on the read
side is detected when the fill level exceeds
`SPA_MIN(target_buffer * 8, impl->buffer_size / stride)`; the excess is dropped by
advancing the read index so that only `target_buffer` worth of data remains (a soft
correction, not a full resync). Here `target_buffer` is the device-delay-adjusted target
(see below), i.e. `impl->target_buffer` minus the device delay - the two coincide only when
the device delay is zero. On the write side (`rtp_audio_receive()`), a fill level exceeding
the ring capacity `impl->buffer_size / stride` sets `impl->have_sync` to false, forcing a
full resync.

\note The factor of 8 in `target_buffer * 8` is an arbitrarily / empirically chosen
headroom multiplier: it sets how far the fill level may run above the target before the
buffered data is treated as stale. It is *not* a unit conversion - in particular, it is
unrelated to the eight bits in a byte, despite the superficial resemblance. The
`impl->buffer_size / stride` term merely caps this bound at the physical ring capacity, in
samples.

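
The threshold checks described above can be written down as a short sketch. This is an
illustration of the documented conditions only, not a copy of `audio.c`: the enum, the
function names, and the local `min_u32()` helper (standing in for `SPA_MIN()`) are
hypothetical, and `target_buffer` is assumed to already be the device-delay-adjusted
target, in samples.

```c
#include <assert.h>
#include <stdint.h>

enum fill_state { FILL_OK, FILL_UNDERRUN, FILL_OVERRUN };

// Local stand-in for SPA_MIN(), so that the sketch is self-contained.
static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

// Read side: classify the fill level (avail, in samples) for one graph cycle
// that needs `wanted` samples.
static enum fill_state classify_read_side(int32_t avail, uint32_t wanted,
		uint32_t target_buffer, uint32_t buffer_size, uint32_t stride)
{
	assert(stride > 0);
	if (avail < (int32_t)wanted)
		return FILL_UNDERRUN;
	if (avail > (int32_t)min_u32(target_buffer * 8, buffer_size / stride))
		return FILL_OVERRUN;
	return FILL_OK;
}

// Soft overrun correction: how far the read index advances so that only
// target_buffer samples remain.
static uint32_t samples_to_skip(int32_t avail, uint32_t target_buffer)
{
	assert(avail >= 0 && (uint32_t)avail >= target_buffer);
	return (uint32_t)avail - target_buffer;
}

// Write side: a fill level above the ring capacity forces a full resync.
static int write_side_loses_sync(int32_t filled, uint32_t buffer_size, uint32_t stride)
{
	assert(stride > 0);
	return filled > (int32_t)(buffer_size / stride);
}
```

As an example, with a 65536 byte buffer and a stride of 8 the ring capacity is 8192
samples. For a target of 480 samples the overrun bound is 3840 samples, so a fill level of
4000 samples is an overrun and 3520 samples are skipped. For a target of 2000 samples the
bound `2000 * 8` exceeds the capacity and is capped at 8192 samples.
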

If the device delay (specified by the `pw_time.delay` value) is nonzero, then it is
subtracted from `impl->target_buffer`, and the result is then used as the target fill
level instead of `impl->target_buffer` directly.

## Direct timestamp mode {#rtp-module-internals-direct-timestamp-mode}

Since this mode requires that the graph drivers of sender and receiver are somehow
synchronized, it implies that, if the sender's and the receiver's
\ref spa_io_clock::position values are sampled at the exact same moment, they are
identical. In practice, they usually deviate a bit. This deviation is the time sync error,
and the time synchronization mechanism that is used tries to keep this sync error as small
as possible. The aforementioned incoming RTP timestamp shift by `impl->target_buffer`
plays a crucial role here, since it makes sure the transport delay (which is what the
session latency specifies in this mode) is accounted for.

This mode is called "direct timestamp" mode since, unlike in the constant latency mode,
the `rtp_audio_process_playback()` function directly reads from the ring buffer at an
index that is derived from \ref spa_io_clock::position , even if this position jumps
around. There is some logic to detect underruns and substitute missing data with silence,
but discontinuities otherwise have no lasting effect.

The driver must ensure that the \ref spa_io_clock::position value increases steadily
(except in major discontinuity cases); clock drift compensation is done by the driver by
adjusting the graph invocation timings. See \ref page_driver for more.

In this mode, the `rtp_stream` DLL is not used.

# Separate PTP sender {#rtp-module-internals-separate-ptp-sender}

This section covers the *internals* of the separate PTP sender. Its user-facing behavior -
what it is for, how it is activated via `aes67.driver-group`, and its benefits and
trade-offs - is documented in \ref page_module_rtp_sink .

Only the `audio.c` media subtype handler supports this mode.
When it is enabled, `rtp_audio_init()` in `audio.c` creates an internal `pw_filter` node
that is kept isolated from the graph and is driven by the driver from the
`aes67.driver-group` node group.

When this separate PTP sender is active, `rtp_audio_process_capture()` behaves
differently. Rather than computing a drift itself, it stores the sink driver's timing
information (`impl->sink_nsec`, `impl->sink_next_nsec`, `impl->sink_resamp_delay`,
`impl->sink_quantum`) for the sender to use. From that information, `ptp_sender_process()`
estimates the current total delay and computes the error between it and the target. That
error is fed into a separate dedicated DLL (`impl->ptp_dll`), which outputs a rate. That
rate (`impl->ptp_corr`) is then applied as the ASRC's rate at the start of
`rtp_audio_process_capture()`. The ASRC then produces larger or smaller amounts of data,
filling the ring buffer to a larger or smaller degree, thus forming a control loop that
keeps the fill level at a certain target (see below), similar to what the constant latency
mode does.

After a resynchronization, the sender is in a refilling state. During the refilling state,
no packets are sent out. The refilling state ends once the estimated total delay reaches
`impl->target_buffer` (which is also what the control loop mentioned above targets). That
estimated total delay is the sum of the current ring buffer fill level, the delay of the
ASRC, and the estimated number of samples that are "in-flight" (that is, samples that
already were sent out but not yet received or which arrived right after the last graph
cycle).

Additionally, the sender contains code for checking for excessive deviations between the
send progress and the current PTP time. The tolerance range is 2x the quantum size. If the
deviation goes beyond that, a resynchronization (and consequently, another refilling) is
performed. This catches cases where the separate sender is starved of data (that is, the
main graph is lagging behind), and also cases when PTP discontinuities occur.
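
The delay estimate and the tolerance check described above can be summarized in a short
sketch. The function names and parameters are hypothetical and are not taken from
`audio.c`; all quantities are assumed to be in samples, and the deviation check is assumed
to be symmetric around the PTP time.

```c
#include <assert.h>
#include <stdint.h>

// Estimated total delay: ring buffer fill level, plus the ASRC delay, plus the
// estimated number of in-flight samples.
static int64_t estimated_total_delay(int64_t fill_level, int64_t resamp_delay,
		int64_t in_flight)
{
	return fill_level + resamp_delay + in_flight;
}

// The refilling state ends once the estimate reaches the target.
static int refilling_done(int64_t total_delay, int64_t target_buffer)
{
	return total_delay >= target_buffer;
}

// A deviation between send progress and PTP time beyond 2x the quantum size
// triggers a resynchronization.
static int needs_resync(int64_t send_position, int64_t ptp_position, int64_t quantum)
{
	int64_t deviation = send_position - ptp_position;

	assert(quantum > 0);
	if (deviation < 0)
		deviation = -deviation;
	return deviation > 2 * quantum;
}
```

For instance, a fill level of 400 samples, an ASRC delay of 32 samples and 48 in-flight
samples add up to 480 samples, which ends the refilling state for a target of 480. With a
quantum of 1024 samples, a deviation of 2048 samples is still tolerated, while 2049 samples
would trigger a resynchronization.
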

A similar check exists for the node wake up times. The filter node is scheduled by its own
driver, independently of the sink node, so their wake ups are not inherently aligned. It
is therefore important to check that the filter wakes up within the bounds of the sink
node's wake up times (with some tolerance); if it does not, a resynchronization is
performed.

*/