The two-pass blend image is created with VK_IMAGE_LAYOUT_UNDEFINED, so
on its first use loadOp=LOAD loads uninitialized memory. This oughtn't
be an issue, as we render onto it before we read it. These renders are
blends, so even opaque content is rendered with reference to an uninit
dst. This too ought to be fine: src*1 + dst*0 = src for all finite dst.
But the blend image pixfmt is VK_FORMAT_R16G16B16A16_SFLOAT, so uninit
pixels can be NaN, inf, or -inf, and now src*1 + dst*0 = NaN/inf/-inf.
This is bad enough assuming the uninitialized blend image holds random
bytes (2048/65536 values are not finite), even worse on any driver/GPU
with a framebuffer compression scheme that so happens to reliably read
NaNs from any uninitialized compressed image...
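The arithmetic above can be checked on the CPU. This is a minimal sketch
that assumes strict IEEE 754 float behaviour, which GPU blend hardware
is not guaranteed to match exactly; the function names are made up for
illustration and are not part of wlroots. Under these rules inf, -inf
and NaN in dst all end up as NaN, because inf*0 is itself NaN.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* An IEEE 754 binary16 value is non-finite (inf or NaN) exactly when
 * all five exponent bits are set. */
bool half_is_finite(uint16_t bits) {
	return (bits & 0x7C00) != 0x7C00;
}

/* Count the non-finite bit patterns among all 65536 half floats. */
unsigned count_non_finite_halves(void) {
	unsigned count = 0;
	for (uint32_t bits = 0; bits <= 0xFFFF; bits++) {
		if (!half_is_finite((uint16_t)bits)) {
			count++;
		}
	}
	return count;
}

/* The "opaque" blend from the text: src factor 1, dst factor 0. */
float blend_opaque(float src, float dst) {
	return src * 1.0f + dst * 0.0f;
}
```

count_non_finite_halves() covers two signs times 1024 mantissa values
with the exponent field all ones, which is where 2048/65536 comes from.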
Most Mesa drivers happen not to do this perfectly valid thing, so this
is only reliably a problem (afaict) for honeykrisp i.e. AGX i.e. Asahi
Linux i.e. Apple Silicon, where after an upgrade to wlroots 0.20, sway
renders a black screen forever, unless you get quite lucky spamming VT
switches, in which case there's flickery garbage on exactly one of the
two swapchain buffers.
The blend image persists across frames, so it suffices to clear before
first real use. Rather than clear by hand, make a loadOp=CLEAR variant
of the render pass and use it for that first frame only. Adding a pass
sounds heavy, but render pass compatibility ignores loadOp and layouts
such that the new pass reuses the pipelines and framebuffer, and costs
one VkRenderPass object but not the usual pipeline/shader (re)compile.
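The "first frame only" selection can be sketched as below. This is a
simplified model, not the actual change: the enum and struct are
hypothetical stand-ins, and real code would choose between two
VkRenderPass handles rather than return an enum.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two render pass variants. */
enum blend_load_op { BLEND_LOAD_OP_LOAD, BLEND_LOAD_OP_CLEAR };

struct blend_image_state {
	bool initialized; /* false until the first pass has been chosen */
};

/* Pick the load op for the next pass on this blend image: CLEAR the
 * first time, LOAD on every later frame. */
enum blend_load_op blend_image_next_load_op(struct blend_image_state *img) {
	if (!img->initialized) {
		img->initialized = true;
		return BLEND_LOAD_OP_CLEAR;
	}
	return BLEND_LOAD_OP_LOAD;
}
```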
rect_union_add takes a pixman_box32_t by value, and passes it along by
value to internal helpers. render_pass_mark_box_updated, which is the
only caller, receives the pixman_box32_t by reference, so just plumb it
through that way.
Results in a 13% improvement in CPU time when using the Vulkan renderer
on the stacked/clip200/1024 benchmarks on my machine.
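A rough illustration of the by-reference plumbing, assuming a stand-in
struct with the same four fields as pixman_box32_t. The helper names
here are hypothetical and do not mirror the real rect_union internals.

```c
#include <stdint.h>

/* Stand-in with the same four int32_t fields as pixman_box32_t. */
struct box32 {
	int32_t x1, y1, x2, y2;
};

/* Hypothetical internal helper: grow *extents so it also covers *box. */
void box32_extend(struct box32 *extents, const struct box32 *box) {
	if (box->x1 < extents->x1) extents->x1 = box->x1;
	if (box->y1 < extents->y1) extents->y1 = box->y1;
	if (box->x2 > extents->x2) extents->x2 = box->x2;
	if (box->y2 > extents->y2) extents->y2 = box->y2;
}

/* Hypothetical entry point: the caller's pointer is handed straight to
 * the helper, so the 16-byte struct is never copied along the way. */
void union_add(struct box32 *extents, const struct box32 *box) {
	box32_extend(extents, box);
}
```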
Signed-off-by: Kenny Levinsen <kl@kl.wtf>
Similar to what we have already done for gles2. To simplify things we
use the staging ring buffer for the vertex buffers by extending the
usage bits, rather than introducing a separate pool.
Signed-off-by: Kenny Levinsen <kl@kl.wtf>
We are spending quite significant CPU time walking through the clip
rects, taking a pixman box, converting it to a wlr box, intersecting it
and ultimately converting it back to a pixman box before adding it to
the rect union.
Just intersect the clip region once ahead of time, and use pixman boxes
the entire way. This also makes it easy to bail early if nothing
intersects.
Gives a 97.95% reduction in CPU time for the Vulkan renderer in
the grid/clip200/1024 benchmark.
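The idea can be sketched without pixman as follows. The struct and
function names are assumptions for illustration; the real code would
presumably lean on pixman's region API instead of a hand-written loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same layout idea as pixman_box32_t. */
struct clip_box {
	int32_t x1, y1, x2, y2;
};

/* Intersect a and b into *out; returns false if the result is empty. */
bool clip_box_intersect(struct clip_box *out,
		const struct clip_box *a, const struct clip_box *b) {
	out->x1 = a->x1 > b->x1 ? a->x1 : b->x1;
	out->y1 = a->y1 > b->y1 ? a->y1 : b->y1;
	out->x2 = a->x2 < b->x2 ? a->x2 : b->x2;
	out->y2 = a->y2 < b->y2 ? a->y2 : b->y2;
	return out->x1 < out->x2 && out->y1 < out->y2;
}

/* Intersect every clip box with bounds once, up front. Non-empty
 * results go to out (which must have room for n boxes); the return
 * value is how many were written. */
size_t clip_boxes_intersect(const struct clip_box *clips, size_t n,
		const struct clip_box *bounds, struct clip_box *out) {
	size_t count = 0;
	for (size_t i = 0; i < n; i++) {
		struct clip_box r;
		if (clip_box_intersect(&r, &clips[i], bounds)) {
			out[count++] = r;
		}
	}
	return count;
}
```

A caller can return immediately when the count is 0, which is the early
bail mentioned above.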
Signed-off-by: Kenny Levinsen <kl@kl.wtf>
Implement a ring-buffer that uses timeline points to track and release
allocated spans. We add a larger ring-buffer when the current one fills,
and remove ring-buffers that have been unused for many collection cycles.
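A minimal single-buffer sketch of the span tracking, assuming timeline
points are handed out in non-decreasing order. All names, the fixed
span limit and the layout are hypothetical; growing and retiring whole
buffers is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_MAX_SPANS 16

struct ring_span {
	uint64_t end;   /* head position right after this span */
	uint64_t point; /* timeline point that releases the span */
};

struct ring {
	uint64_t size;       /* capacity in bytes */
	uint64_t head, tail; /* monotonically increasing byte positions */
	struct ring_span spans[RING_MAX_SPANS]; /* FIFO of live spans */
	size_t first, count;
};

/* Release every span whose timeline point has been reached. */
void ring_collect(struct ring *r, uint64_t completed_point) {
	while (r->count > 0 && r->spans[r->first].point <= completed_point) {
		r->tail = r->spans[r->first].end;
		r->first = (r->first + 1) % RING_MAX_SPANS;
		r->count--;
	}
}

/* Allocate len contiguous bytes that stay live until `point` is
 * reached. On success stores the byte offset and returns true. */
bool ring_alloc(struct ring *r, uint64_t len, uint64_t point,
		uint64_t *offset) {
	if (len == 0 || len > r->size || r->count == RING_MAX_SPANS) {
		return false;
	}
	uint64_t start = r->head;
	uint64_t off = start % r->size;
	if (off + len > r->size) {
		/* Skip the remainder of the buffer to stay contiguous. */
		start += r->size - off;
	}
	if (start + len - r->tail > r->size) {
		return false; /* would overwrite a span still in flight */
	}
	size_t slot = (r->first + r->count) % RING_MAX_SPANS;
	r->spans[slot] = (struct ring_span){ .end = start + len, .point = point };
	r->count++;
	r->head = start + len;
	*offset = start % r->size;
	return true;
}
```

When ring_alloc fails, the scheme described above would fall back to
adding a larger buffer rather than waiting.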
Signed-off-by: Kenny Levinsen <kl@kl.wtf>
We'll need to grab textures from there in the next commit.
Also rename it to better reflect what it does: synchronize release
fences after a render pass has been submitted.
The important bit here is whether this uses one or two sub-passes.
The flag isn't used for anything else.
Preparation for an upcoming one-subpass codepath.
Converting the LCMS2 transform to a 3D LUT early causes issues:
- It's a lossy process: the consumer will not be able to pick a
  3D LUT size on their own.
- It requires unnecessary conversions and allocations: an intermediate
  3D LUT is allocated, but the renderer already allocates one.
- It makes it harder to support arbitrary color transforms in the
  renderer, because each type needs to be handled differently.
Instead, expose a function to evaluate a color transform, and use
that to build the 3D LUT in the renderer.
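Building the LUT from an evaluation function could look roughly like
this. The callback type, the function names and the memory layout (red
varying fastest, three floats per entry) are assumptions made for the
sketch, not the actual wlroots API.

```c
#include <stddef.h>

/* Hypothetical evaluation callback: maps one RGB triple to another. */
typedef void (*color_eval_fn)(const float in[3], float out[3]);

/* Example transform used only for illustration: inverts each channel. */
void example_invert(const float in[3], float out[3]) {
	for (size_t i = 0; i < 3; i++) {
		out[i] = 1.0f - in[i];
	}
}

/* Fill a dim*dim*dim LUT by sampling eval on a regular grid over
 * [0, 1]. lut must hold 3*dim*dim*dim floats and dim must be >= 2.
 * The consumer picks dim, so no intermediate LUT is needed. */
void build_lut_3d(color_eval_fn eval, size_t dim, float *lut) {
	float step = 1.0f / (float)(dim - 1);
	for (size_t b = 0; b < dim; b++) {
		for (size_t g = 0; g < dim; g++) {
			for (size_t r = 0; r < dim; r++) {
				float in[3] = { r * step, g * step, b * step };
				size_t idx = 3 * (r + dim * (g + dim * b));
				eval(in, &lut[idx]);
			}
		}
	}
}
```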