Iterate the channels in the inner loop instead of the outer loop. This
makes it handle with 0 channels better but also does the more
complicated phase increment code only once for all channels. Also the
filters might stay in the cache for each channel now.
Add some padding to the delay buffer. If we wrap around, copy the
spilled samples to the front of the buffer. This makes it possible to
use the more optimized sse delay function in more cases.
Use a wrap around delay ringbuffer. We can then avoid some modulo
arithmetic and read more efficiently.
Also handle the delay convolver case better by reversing the taps and
reading the taps and delay buffer without extra overhead.
When the follower doesn't produce enough data for this many attempts,
bail and cause an xrun to avoid an infinite loop.
The limit of 8 cause real-life problems and should be larger. It should
probably depend on the expected size per cycle (node.latency) and the
current quantum but we don't always have this information.
See #4334
Use the helper instead of duplicating the same code.
Also add some helpers to parse a json array of uint32_t
Move some functions to convert between type name and id.
This gets the next key and value from an object. This function is better
because it will skip key/value pairs that don't fit in the array to hold
the key.
The previous code patter would stop parsing the object as soon as a key
larger than the available space was found.
Add spa_json_begin_array/object to replace
spa_json_init+spa_json_begin_array/object
This function is better because it does not waste a useless spa_json
structure as an iterator. The relaxed versions also error out when the
container is mismatched because parsing a mismatched container is not
going to give any results anyway.
First try to pass the format of the converter directly into the
follower. This allows us to avoid conversion when it can be avoided.
Iterate all follower formats (not just the first one) to find something
that intersects with the converter formats.
We don't need to use the raw audio format parsing functions, we can use
the more generic audio ones. This avoids some extra parsing for the
media type and subtype and will support compressed audio formats
as well when the converter handles this.
Move the check for the follower==target to the negotiate functions.
Refer to the target when doing operations. The converter reference
is just some internal element that may or may not be active at the
moment. If we have multiple converter elements, the current active
one will be in target.
While this is quite fast on x86 (order of a few microseconds), the
computation can take a few milliseconds on ARM (measured at 1.9ms (32000
-> 48000) and 3.3ms (32000 -> 44100) on a Cortex A53).
Let's precompute some common rates so that we can avoid this overhead on
each stream (or any other audioconvert) instantiation. The approach
taken here is to write a little program to create the resampler
instance, and run that on the host at compile-time to generate some
common rate conversions.
The IO_Buffers is used in the data thread to check if the port should be
scheduled or not. Make sure it is only set after we set buffers on the
port and cleared before the buffers are cleared.
Make sure we sync the port->io with the data thread.
See #4094
This provides access to GNU C library-style endian and byteswap functions.
Windows doesn't provide pre-processor defines for endianness, but
all current Windows architectures (X32, X64, ARM) are little-endian.
This is somewhat similar to the S32->F32 conversion improvements,
but here things a bit more tricky...
The main consideration is that the limits to which we clamp
must be valid 32-bit signed integers, but not all such integers
are exactly losslessly representable in `float32_t`.
For example it we'd clamp to `2147483647`,
that is actually a `2147483648.0f`,
and `2147483648` is not a valid 32-bit signed integer,
so the post-clamp conversion would basically be UB.
We don't have this problem for negative bound, though.
But as we know, any 25-bit signed integer is losslessly
round-trippable through float32_t, and since multiplying by 2
only changes the float's exponent, we can clamp to `2147483520`!
The algorithm of selection of the pre-clamping scale is unaffected.
This additionally avoids right-shift, and thus is even faster.
As `test_lossless_s32_lossless_subset` shows,
if the integer is in the form of s25+shift,
the maximal absolute error is finally zero.
Without going through `float`->`double`->`int`,
i'm not sure if the `float`->`int` conversion
can be improved further.
There's really no point in doing that s25_32 intermediate step,
to be honest i don't have a clue why the original implementation
did that \_(ツ)_/¯.
Both `S25_SCALE` and `S32_SCALE` are powers of two,
and thus are both exactly representable as floats,
and reprocial of power-of-two is also exactly representable,
so it's not like that rescaling results in precision loss.
This additionally avoids right-shift, and thus is even faster.
As `test_lossless_s32_lossless_subset` shows,
if the integer is in the form of s25+shift,
the maximal absolute error became even lower,
but not zero, because F32->S32 still goes through S25 intermediate.
I think we could theoretically do better,
but then the clamping becomes pretty finicky,
so i don't feel like touching that here.
At the very least, we should go through s25_32 intermediate
instead of s24_32, to avoid needlessly loosing 1 LSB precision bit.
That being said, i suspect it's still not doing the right thing.
Why are we silently dropping those 7 LSB bits?
Is that really the way to do it?
At the very least, we should go through s25_32 intermediate
instead of s24_32, to avoid needlessly loosing 1 LSB precision bit.
FIXME: the noise codepath is not covered with tests.
The largest integer that 32-bit floating point can exactly represent
is actually `(2^24)-1`, not`(2^23)-1` like the code assumes.
This means, whenever we use s24 as an intermediate step
to go between f32 and s32, we lose a bit of precision.
s25_32 is really a i32 with highest byte always being a sign byte.
Printing was done by adding
```
for(int e = 0; e != 13; ++e)
fprintf(stderr, "%16.32e,", ((float*)m1)[e]);
```
to `compare_mem`. I don't like how these tests work.
https://godbolt.org/z/abe94sedT
Can be used to group ports together. Mostly because they are all from
the same stream and split into multiple ports by audioconvert/adapter.
Also useful for the alsa sequence to group client ports together.
Also interesting when pw-filter would be able to handle streams in the
future to find out what ports belong to what streams.
We can remove most of the special async handling in adapter, filter and
stream because this is now handled in the core.
Add a node.data-loop property to assign the node to a named data-loop.
Assign the non-rt stream and filter to the main loop. This means that
the node fd will be added to the main-loop and will be woken up directly
without having to wake up the RT thread and invoke the process callback
in the main-loop first. Because non-RT implies async, we can do all of
this like we do our rt processing because the output will only be used
in the next cycle.