Make sure we only make the buffer for the follower larger when we
downsample because then we need to ask for more data from the follower
to fill up a quantum.
Never try to make the follower buffer smaller than the quantum limit.
The reason is that the graph rate could be decreased dynamically and
then we would end up with too small buffers.
See #4490
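A minimal sketch of the sizing rule described above, assuming the buffer size is counted in samples and that "downsample" means the follower runs at a higher rate than the graph. The helper name `follower_buffer_samples` and its parameters are hypothetical and are not taken from the actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many follower samples are needed to fill
 * one quantum of quantum_limit samples at the graph rate. */
static uint32_t follower_buffer_samples(uint32_t quantum_limit,
		uint32_t follower_rate, uint32_t graph_rate)
{
	uint64_t size = quantum_limit;

	assert(graph_rate > 0);
	/* Only grow when downsampling; round up so a full quantum fits. */
	if (follower_rate > graph_rate)
		size = ((uint64_t)quantum_limit * follower_rate + graph_rate - 1) / graph_rate;
	/* Never go below quantum_limit: the graph rate may drop later. */
	return (uint32_t)size;
}
```

With these assumptions, a follower at 96000 Hz in a 48000 Hz graph gets twice the quantum limit, while a follower at 44100 Hz keeps the quantum limit unchanged.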
Load multiple graphs with audioconvert.filter-graph.N where N is the
order in which the graph is inserted/replaced. Run the graphs before the
channelmixer.
Graphs can be added and removed at runtime.
Instead of recalculating what to do every cycle, we can prepare a
static schedule and just run that. We only need to reevaluate it when
something changes.
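The idea of a static schedule can be sketched as follows. This is an illustration only: the `pipeline` struct, the `dirty` flag and the two toy stages are hypothetical and do not mirror the real converter stages:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STAGES 4

typedef void (*stage_fn)(float *buf, size_t n);

/* Two toy stages standing in for real processing steps. */
static void stage_gain(float *buf, size_t n)
{
	for (size_t i = 0; i < n; i++)
		buf[i] *= 0.5f;
}

static void stage_invert(float *buf, size_t n)
{
	for (size_t i = 0; i < n; i++)
		buf[i] = -buf[i];
}

struct pipeline {
	int use_gain, use_invert;	/* hypothetical settings */
	int dirty;			/* set whenever a setting changes */
	stage_fn schedule[MAX_STAGES];
	size_t n_stages;
};

/* Evaluate the settings once and store the resulting stage list. */
static void pipeline_rebuild(struct pipeline *p)
{
	p->n_stages = 0;
	if (p->use_gain)
		p->schedule[p->n_stages++] = stage_gain;
	if (p->use_invert)
		p->schedule[p->n_stages++] = stage_invert;
	assert(p->n_stages <= MAX_STAGES);
	p->dirty = 0;
}

/* Per cycle: only run the prepared schedule, rebuild it if needed. */
static void pipeline_process(struct pipeline *p, float *buf, size_t n)
{
	if (p->dirty)
		pipeline_rebuild(p);
	for (size_t i = 0; i < p->n_stages; i++)
		p->schedule[i](buf, n);
}
```

The per-cycle path then contains no decisions beyond the `dirty` check.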
For input streams, first run the resampler and then the channelmix. This
ensures that the channelmix is run with the rate of the graph instead
of the rate of the input. This is nicer because rate and quantum align
with the graph and the sample accurate volume ramps will work as
intended.
For output streams, leave the resampler after the channelmix for the same
reasons.
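The ordering rule for both directions can be summarized in a small sketch. The enum names and `build_order` are hypothetical, chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

enum direction { DIR_INPUT, DIR_OUTPUT };
enum stage { STAGE_RESAMPLE, STAGE_CHANNELMIX };

/* Fill out[] with the stage order so that the channelmix always
 * runs at the graph rate. Returns the number of stages. */
static size_t build_order(enum direction dir, enum stage out[2])
{
	assert(dir == DIR_INPUT || dir == DIR_OUTPUT);
	if (dir == DIR_INPUT) {
		/* stream rate -> graph rate first, then mix */
		out[0] = STAGE_RESAMPLE;
		out[1] = STAGE_CHANNELMIX;
	} else {
		/* mix at graph rate first, then graph rate -> stream rate */
		out[0] = STAGE_CHANNELMIX;
		out[1] = STAGE_RESAMPLE;
	}
	return 2;
}
```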
The current biquad calculations are based on RBJ's cookbook [1],
except for low-/highpass. Since the filter configuration is also
based on using the definition of Q, it makes sense to also align
the remaining calculations to use the same filter cookbook instead
of using resonance which doesn't result in the same coefficients
as when using Q.
[1] = https://www.w3.org/TR/audio-eq-cookbook/
Iterate the channels in the inner loop instead of the outer loop. This
makes it handle 0 channels better but also does the more
complicated phase increment code only once for all channels. Also the
filters might stay in the cache for each channel now.
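The loop structure can be illustrated with a much simpler converter than the real resampler. The sketch below assumes planar buffers and nearest-neighbour selection and uses a hypothetical function name. It only shows where the channel loop and the phase increment sit:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical nearest-neighbour rate converter on planar data.
 * Returns the number of output frames written. */
static size_t resample_nearest(const float *const in[], float *const out[],
		size_t n_channels, size_t n_in, size_t n_out,
		unsigned in_rate, unsigned out_rate)
{
	size_t index = 0;	/* integer input position */
	unsigned phase = 0;	/* fractional position in 1/out_rate units */
	size_t o;

	assert(out_rate > 0);
	for (o = 0; o < n_out && index < n_in; o++) {
		/* Inner loop over channels: runs zero times for 0 channels. */
		for (size_t c = 0; c < n_channels; c++)
			out[c][o] = in[c][index];
		/* Phase increment done once per frame for all channels. */
		phase += in_rate;
		index += phase / out_rate;
		phase %= out_rate;
	}
	return o;
}
```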
Add some padding to the delay buffer. If we wrap around, copy the
spilled samples to the front of the buffer. This makes it possible to
use the more optimized sse delay function in more cases.
Use a wrap around delay ringbuffer. We can then avoid some modulo
arithmetic and read more efficiently.
Also handle the delay convolver case better by reversing the taps and
reading the taps and delay buffer without extra overhead.
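One common way to get contiguous reads from a ring buffer is to mirror every write into a second half, sketched below. This is an assumption about the general technique, not a copy of the actual delay code. `DELAY_LEN` and the `delay_ring` names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_LEN 4u	/* hypothetical delay length in samples */

struct delay_ring {
	float buf[2 * DELAY_LEN];	/* second half mirrors the first */
	size_t w;			/* next write position, < DELAY_LEN */
};

/* Store one sample in both halves so that reads never need to wrap. */
static void delay_ring_push(struct delay_ring *r, float s)
{
	assert(r->w < DELAY_LEN);
	r->buf[r->w] = s;
	r->buf[r->w + DELAY_LEN] = s;
	r->w = (r->w + 1 == DELAY_LEN) ? 0 : r->w + 1;
}

/* The last DELAY_LEN samples, oldest first, as one contiguous run. */
static const float *delay_ring_window(const struct delay_ring *r)
{
	return &r->buf[r->w];
}
```

A convolution over the window can then walk the taps and the samples linearly, without a modulo per access.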
When the follower doesn't produce enough data for this many attempts,
bail and cause an xrun to avoid an infinite loop.
The limit of 8 causes real-life problems and should be larger. It should
probably depend on the expected size per cycle (node.latency) and the
current quantum but we don't always have this information.
See #4334
Use the helper instead of duplicating the same code.
Also add some helpers to parse a json array of uint32_t
Move some functions to convert between type name and id.
This gets the next key and value from an object. This function is better
because it will skip key/value pairs that don't fit in the array to hold
the key.
The previous code pattern would stop parsing the object as soon as a key
larger than the available space was found.
Add spa_json_begin_array/object to replace
spa_json_init+spa_json_begin_array/object
This function is better because it does not waste a useless spa_json
structure as an iterator. The relaxed versions also error out when the
container is mismatched because parsing a mismatched container is not
going to give any results anyway.
First try to pass the format of the converter directly into the
follower. This allows us to avoid conversion when it can be avoided.
Iterate all follower formats (not just the first one) to find something
that intersects with the converter formats.
We don't need to use the raw audio format parsing functions, we can use
the more generic audio ones. This avoids some extra parsing for the
media type and subtype and will support compressed audio formats
as well when the converter handles this.
Move the check for the follower==target to the negotiate functions.
Refer to the target when doing operations. The converter reference
is just some internal element that may or may not be active at the
moment. If we have multiple converter elements, the current active
one will be in target.
While this is quite fast on x86 (order of a few microseconds), the
computation can take a few milliseconds on ARM (measured at 1.9ms (32000
-> 48000) and 3.3ms (32000 -> 44100) on a Cortex A53).
Let's precompute some common rates so that we can avoid this overhead on
each stream (or any other audioconvert) instantiation. The approach
taken here is to write a little program to create the resampler
instance, and run that on the host at compile-time to generate some
common rate conversions.
The IO_Buffers is used in the data thread to check if the port should be
scheduled or not. Make sure it is only set after we set buffers on the
port and cleared before the buffers are cleared.
Make sure we sync the port->io with the data thread.
See #4094
This provides access to GNU C library-style endian and byteswap functions.
Windows doesn't provide pre-processor defines for endianness, but
all current Windows architectures (X32, X64, ARM) are little-endian.
This is somewhat similar to the S32->F32 conversion improvements,
but here things are a bit more tricky...
The main consideration is that the limits to which we clamp
must be valid 32-bit signed integers, but not all such integers
are exactly losslessly representable in `float32_t`.
For example, if we'd clamp to `2147483647`,
that is actually a `2147483648.0f`,
and `2147483648` is not a valid 32-bit signed integer,
so the post-clamp conversion would basically be UB.
We don't have this problem for the negative bound, though.
But as we know, any 25-bit signed integer is losslessly
round-trippable through float32_t, and since multiplying by 2
only changes the float's exponent, we can clamp to `2147483520`!
The algorithm of selection of the pre-clamping scale is unaffected.
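The clamping bounds can be checked with a small standalone sketch. The macro and function names are hypothetical, the input is assumed to be already scaled to the S32 range, and plain truncation is assumed for the final conversion:

```c
#include <assert.h>
#include <stdint.h>

#define S32_MIN_F	(-2147483648.0f)	/* -2^31, exactly representable */
#define S32_MAX_F	(2147483520.0f)		/* (2^24 - 1) * 2^7, largest float below 2^31 */

/* Clamp a pre-scaled float to bounds that are both valid int32_t
 * values and exact floats, then convert. */
static int32_t f32_to_s32_clamped(float v)
{
	assert(v == v);		/* NaN is not handled in this sketch */
	if (v < S32_MIN_F)
		v = S32_MIN_F;
	if (v > S32_MAX_F)
		v = S32_MAX_F;
	return (int32_t)v;
}
```

Because `(float)INT32_MAX` rounds up to `2147483648.0f`, clamping against it would still let an out-of-range value through; clamping to `2147483520.0f` keeps the conversion defined.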
This additionally avoids right-shift, and thus is even faster.
As `test_lossless_s32_lossless_subset` shows,
if the integer is in the form of s25+shift,
the maximal absolute error is finally zero.
Without going through `float`->`double`->`int`,
i'm not sure if the `float`->`int` conversion
can be improved further.
There's really no point in doing that s25_32 intermediate step,
to be honest i don't have a clue why the original implementation
did that ¯\_(ツ)_/¯.
Both `S25_SCALE` and `S32_SCALE` are powers of two,
and thus are both exactly representable as floats,
and the reciprocal of a power-of-two is also exactly representable,
so it's not like that rescaling results in precision loss.
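A direct S32->F32 scaling along these lines might look as follows. The value of `S32_SCALE` (2^31) and the function name are assumptions for this sketch and may differ from the real definitions:

```c
#include <stdint.h>

#define S32_SCALE	2147483648.0f	/* assumed to be 2^31 */

/* Scale straight from S32 to [-1.0, 1.0) without an S25 intermediate.
 * 1.0f / S32_SCALE is 2^-31, which is exact, so the multiply only
 * adjusts the exponent of the converted value. */
static float s32_to_f32(int32_t v)
{
	return (float)v * (1.0f / S32_SCALE);
}
```

Any rounding happens in the `(float)v` conversion only, which is why integers of the form s25+shift survive the S32->F32 direction unchanged.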
This additionally avoids right-shift, and thus is even faster.
As `test_lossless_s32_lossless_subset` shows,
if the integer is in the form of s25+shift,
the maximal absolute error became even lower,
but not zero, because F32->S32 still goes through S25 intermediate.
I think we could theoretically do better,
but then the clamping becomes pretty finicky,
so i don't feel like touching that here.