As Peter Meerwald <p.meerwald@bct-electronic.com> discovered, our ARM
svolume code performance is quite terrible when the incoming samples are
not word-aligned. This can very easily be the case, since the
architecture only requires that the samples be 16-bit aligned, and we
might end up running the innermost loop after processing modulo-4
samples. The performance degradation was ~50x on a Cortex A9
(Pandaboard).
This reworks the svolume logic to first consume enough samples to make
sure the rest is word aligned, and reordering the processing to work
with 4 samples at a time first, and then finally deal with the
remainder.
With this, performance is comparable for arbitrary alignments (~3x
faster than the C code).
CC libpulsecore_2.98_la-svolume_arm.lo
pulsecore/svolume_arm.c: In function 'pa_volume_s16ne_arm':
pulsecore/svolume_arm.c:50:8: warning: assignment discards 'const' qualifier from pointer target type [enabled by default]
Signed-off-by: Peter Meerwald <p.meerwald@bct-electronic.com>
assuming RAND_MAX is around 1<<31, rand() >> 1 generates large numbers as
random volume data; these likely causes saturated sample values after
applying the volume function -- not a good test
I was looking at a log, and noticed the following lines:
I [pulseaudio] svolume_mmx.c: Initialising MMX optimized functions.
I [pulseaudio] remap_mmx.c: Initialising MMX optimized remappers.
I [pulseaudio] svolume_sse.c: Initialising SSE2 optimized functions.
I [pulseaudio] remap_sse.c: Initialising SSE2 optimized remappers.
I [pulseaudio] sconv_sse.c: Initialising SSE2 optimized conversions.
It seemed odd that some messages were somewhat precise in
what functionality was initialized, while the svolume
messages told me that they had initialized just "functions".
So I made the svolume log messages more precise to match the
sconv and remap messages.
This makes the volume tests run in two loops and print the minimum,
maximum and standard deviation of readings from the inner loop. This
makes it easier to reason out performance drops (i.e. algorithmic
problems vs. other system issues such as processor contention).
Somewhere in the history of the MMX tests, the number of channels was
changed from 1 to 2, but the number of samples was not increased to make
it even (multiple of the frame size).
This ensures that the build does not fail if the ssat and pkhbt
instructions are not available (armv5te and below).
Fixes: http://www.pulseaudio.org/ticket/790
In the assembly optimized versions of SSE, a noise could occur when the
number of channels were 3,5,6 or 7. For MMX and ARM, this could occur
when the number of channels were 3.
Signed-off-by: David Henningsson <david.henningsson@canonical.com>