| Commit message | Author | Age | Files | Lines |
| |
Instead of stepping down to 4 pixels, then 2 pixels, then 1, with
essentially the same code each time, let's use maskload and maskstore
(instructions new in AVX2) to load and store only the effective
portions. The secondary loop runs at most twice, since there can be at
most 7 pixels left.
This fixes an off-by-4 bug in the previous implementation (lines 1041
and 1186 should have had 7 instead of 3).
Change-Id: I4d4dadb709f1482fa8ccfffd157862e77ac508f6
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
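The masked-tail idea can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper name invertTail and the per-pixel operation (inverting the colour channels) are made up for the example, and it handles the whole tail with one 8-lane masked access rather than the patch's secondary loop. The intrinsic path is only compiled when __AVX2__ is defined; otherwise a scalar loop with the same effect is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#  include <immintrin.h>
#endif

// Hypothetical helper: inverts the colour channels of the last `count`
// (fewer than 8) 32-bit pixels without touching memory past the tail.
inline void invertTail(std::uint32_t *px, std::size_t count)
{
    assert(count < 8);
#if defined(__AVX2__)
    // Lane i is enabled when i < count. The comparison sets every bit of
    // an enabled lane, including the sign bit that maskload/maskstore test.
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask =
        _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(count)), lane);
    // Disabled lanes are not read (they load as zero) and not written.
    __m256i v = _mm256_maskload_epi32(reinterpret_cast<const int *>(px), mask);
    v = _mm256_xor_si256(v, _mm256_set1_epi32(0x00ffffff));
    _mm256_maskstore_epi32(reinterpret_cast<int *>(px), mask, v);
#else
    // Portable fallback with the same observable behaviour.
    for (std::size_t i = 0; i < count; ++i)
        px[i] ^= 0x00ffffffu;
#endif
}
```

Calling invertTail(buf, 3) changes only buf[0] to buf[2]; the remaining elements are neither loaded nor stored.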
| |
Commit c8c5ff19de1c34a99b8315e59015d115957b3584 introduced the solution
as a simple scaling up of the code in qdrawhelper_sse4.cpp, but that is
wrong because of the way the 256-bit unpack instructions work: the
unpack-low instruction unpacks the lower half of each 128-bit half of
the 256-bit register.
So we fix it up by inserting a permute4x64 that swaps the middle two
quarters of the 256-bit register (permute8x32 requires a __m256i
parameter instead of an immediate).
This introduces an instruction that costs 3 cycles in each loop
iteration, but since the AVX2 code has double the throughput compared to
the SSE4 code, it should still be faster.
This problem does not affect the ARGB->ARGB32 code because that repacks
at the end.
Change-Id: I4d4dadb709f1482fa8ccfffd1578620b45166a4f
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
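The lane behaviour described above can be modelled in plain C++. This is a scalar sketch, not the patch itself: Reg, unpackLoSelf and swapMiddleQuarters are hypothetical names that mimic what _mm256_unpacklo_epi8(a, a) and _mm256_permute4x64_epi64(a, _MM_SHUFFLE(3, 1, 2, 0)) do to the bytes of a 256-bit register.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Reg = std::array<std::uint8_t, 32>;   // models one 256-bit register

// Model of _mm256_unpacklo_epi8(a, a): each 128-bit lane independently
// interleaves (here: duplicates) its own low 8 bytes.
inline Reg unpackLoSelf(const Reg &a)
{
    Reg r{};
    for (int lane = 0; lane < 2; ++lane) {
        for (int i = 0; i < 8; ++i) {
            r[lane * 16 + 2 * i] = a[lane * 16 + i];
            r[lane * 16 + 2 * i + 1] = a[lane * 16 + i];
        }
    }
    return r;
}

// Model of _mm256_permute4x64_epi64(a, _MM_SHUFFLE(3, 1, 2, 0)):
// swaps the two middle 64-bit quarters of the register.
inline Reg swapMiddleQuarters(const Reg &a)
{
    Reg r = a;
    for (int i = 0; i < 8; ++i) {
        r[8 + i] = a[16 + i];
        r[16 + i] = a[8 + i];
    }
    return r;
}
```

With input bytes 0..31, unpackLoSelf alone expands bytes 0..7 and then 16..23, skipping 8..15. Applying swapMiddleQuarters first makes the result cover bytes 0..15 in order, which is what a straight scale-up of the 128-bit code expects.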
| |
Similar to the previous commit. This also removes the SSE4
implementations from Qt builds that use AVX2 throughout.
Change-Id: I251f00d706d646ed87b4fffd1577f96ed52a4cf4
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
| |
Change-Id: I251f00d706d646ed87b4fffd1577f84854e358a4
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
| |
We know what code we want the compiler to generate, so I just replaced
the _mm_set1_epi64x() with that code. Except that GCC sees through it
and tries to "optimize" my code... so an asm() statement makes it keep
the two operations separate.
This generates optimal code for both 32- and 64-bit. 64-bit:
    vmovq         %rdi, %xmm0
    vpbroadcastq  %xmm0, %ymm0
32-bit:
    vmovq         8(%esp), %xmm0
    vpbroadcastq  %xmm0, %ymm0
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976
Change-Id: I42a48bd64ccc41aebf84fffd15664109b97fe42b
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
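A sketch of the load-then-broadcast pattern with an empty asm() barrier between the two steps. The helper name broadcast64 and the choice of _mm_loadl_epi64 for the 64-bit load are assumptions for illustration; the exact intrinsics in the patch are not shown in this log. The intrinsic path needs GCC or Clang with AVX2 enabled, and a portable fallback is provided otherwise. No claim is made that a given compiler emits exactly the two instructions quoted above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__) && defined(__GNUC__)
#  include <immintrin.h>
#endif

// Hypothetical helper: replicates a 64-bit value into all four quarters
// of a 256-bit vector and returns the result as four 64-bit integers.
inline std::array<std::uint64_t, 4> broadcast64(std::uint64_t value)
{
    std::array<std::uint64_t, 4> out;
#if defined(__AVX2__) && defined(__GNUC__)
    // Step 1: move the 64 bits into the low half of an XMM register.
    __m128i low = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(&value));
    // Empty asm with an in/out register operand: the compiler must treat
    // `low` as modified here, so it cannot fuse or rewrite the two steps.
    asm("" : "+x"(low));
    // Step 2: broadcast the low 64 bits to the whole YMM register.
    const __m256i wide = _mm256_broadcastq_epi64(low);
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(out.data()), wide);
#else
    out.fill(value);   // portable fallback with the same result
#endif
    return out;
}
```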
| |
A few functions took vectors as parameters but were not marked as
vectorcall.
I've taken the opportunity to de-macroify one macro, but I'm not going
to do it for the rest.
Change-Id: I42a48bd64ccc41aebf84fffd1564bfc21faa2a14
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
| |
The implementation is almost the same 4-way-unrolled loop, but because
of the wider registers, we fill 128 bytes per loop. Unlike the SSE2
implementation, the AVX2 version uses unaligned stores and won't try to
align in the prologue, matching glibc's __memset_avx2 (also unaligned).
Change-Id: Iba4b5c183776497d8ee1fffd15637ccb2a7b83bc
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
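A minimal sketch of a 4-way-unrolled fill that writes 128 bytes per iteration with unaligned stores and no alignment prologue. The function name fill32 and the scalar tail are assumptions for the example, not the code from the patch; the vector loop is compiled only when __AVX2__ is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#  include <immintrin.h>
#endif

// Hypothetical fill routine: writes `count` copies of `value` to `dest`.
inline void fill32(std::uint32_t *dest, std::uint32_t value, std::size_t count)
{
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i v = _mm256_set1_epi32(static_cast<int>(value));
    // Four unaligned 32-byte stores per iteration: 128 bytes, 32 pixels.
    for (; i + 32 <= count; i += 32) {
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dest + i), v);
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dest + i + 8), v);
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dest + i + 16), v);
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dest + i + 24), v);
    }
#endif
    // Scalar tail (and the whole job when AVX2 is not enabled).
    for (; i < count; ++i)
        dest[i] = value;
}
```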
| |
Use 16-bit multiplication as it is twice as fast as 32-bit
multiplication.
Change-Id: I64b529eaaed4ce2c59c64a0120e93cd132724156
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
The simple scaling, which samples every input pixel only once, can also
be used when downscaling by less than 2x if we just handle the case
where the input can't be in the intermediate buffer.
At the same time, the handling of the intermediate buffer has been moved
out of the simple-scale helper functions so the code can be shared and
the AVX2 optimizations can also be used for non-argb32pm formats.
Change-Id: I98d225ef8d4f2978480d09110c959b556c563b57
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
| |
Two small changes late in the review process were flawed.
Change-Id: I4b1f6e3fdb8e17000a2e11bc30aae1b29d9f43a9
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
Speeds up RGB30 and ARGB32-unpremul painting.
Change-Id: I419afdf5c26ceffc0f7557b8f196035056178c9a
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
During the container BoF session at the Qt Contributor Summit 2017 the
name of the signed size type became a subject of discussion in the
context of readability of code using this type and the intention of
using it for all length, size and count properties throughout the entire
framework in future versions of Qt.
This change proposes qsizetype as new name for qssize_t to emphasize the
readability of code over POSIX compatibility, the former being
potentially more relevant than the latter to the majority of users of
Qt.
Change-Id: Idb99cb4a8782703c054fa463a9e5af23a918e7f3
Reviewed-by: Samuel Gaist <samuel.gaist@edeltech.ch>
Reviewed-by: David Faure <david.faure@kdab.com>
| |
Calculates the correct offsets and coordinate transforms for the
intermediate buffer. This means we can conceptually simplify our
path switches instead of having downscale routines handling mirrored
upscaling.
Change-Id: I60efa7feaba80165672ca0ce064515fdf620869d
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
| |
Changes internal data-size and pointer calculations
to qssize_t.
Adds new sizeInBytes() accessor to read byte size, and
marks the old one deprecated.
Task-number: QTBUG-50912
Change-Id: Idf0c2010542b0ec1c9abef8afd02d6db07f43e6d
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
Implement AVX2 versions of the three optimized paths of bilinear
texture transform.
Change-Id: Ie7199ef7dcce1e3457535fee35822d76afc0e8ba
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
Manually vectorizing is significantly faster because we can optimize
for common cases like long stretches of opaque or transparent pixels.
This is both smaller and faster than the auto-vectorized version. It is
also much faster than the auto-vectorized version for AVX2, which can
therefore be removed.
Change-Id: I0fa80ce273a8387cc6cd084879822ad9bade385c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
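The fast-path idea can be illustrated with a scalar sketch. It assumes the operation is ARGB32 premultiplication, which the message above does not state explicitly, and the names mulDiv255, premultiplyPixel and premultiplyBlocks are hypothetical. Each block of 8 pixels is classified first: a fully opaque block needs no work, a fully transparent block becomes zero, and only mixed blocks pay for the per-pixel arithmetic. A vectorized version would do the same classification on a whole register.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounded x * a / 255 for x, a in [0, 255].
inline std::uint32_t mulDiv255(std::uint32_t x, std::uint32_t a)
{
    const std::uint32_t t = x * a + 0x80u;
    return (t + (t >> 8)) >> 8;
}

inline std::uint32_t premultiplyPixel(std::uint32_t p)
{
    const std::uint32_t a = p >> 24;
    return (a << 24)
         | (mulDiv255((p >> 16) & 0xffu, a) << 16)
         | (mulDiv255((p >> 8) & 0xffu, a) << 8)
         | mulDiv255(p & 0xffu, a);
}

// Hypothetical block-wise premultiply with opaque/transparent shortcuts.
inline void premultiplyBlocks(std::uint32_t *px, std::size_t count)
{
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        std::uint32_t andAll = 0xffffffffu, orAll = 0;
        for (int j = 0; j < 8; ++j) {
            andAll &= px[i + j];
            orAll |= px[i + j];
        }
        if ((andAll >> 24) == 0xffu)
            continue;                       // all opaque: nothing to do
        if ((orAll >> 24) == 0u) {
            for (int j = 0; j < 8; ++j)     // all transparent: result is 0
                px[i + j] = 0;
            continue;
        }
        for (int j = 0; j < 8; ++j)         // mixed block: full computation
            px[i + j] = premultiplyPixel(px[i + j]);
    }
    for (; i < count; ++i)                  // leftover pixels
        px[i] = premultiplyPixel(px[i]);
}
```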
| |
The alpha channel of an RGB32 image was not properly ignored when doing
blending with partial opacity.
Now the alpha value is properly ignored, which is both more correct
and faster. This also makes the SSE2 and AVX2 implementations match
NEON, which was already doing the right thing (though it had dead code
for doing it wrong).
Change-Id: I4613b8d70ed8c2e36ced10baaa7a4a55bd36a940
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
| |
Defines a structure that tells the compiler in no uncertain terms the
maximum number of times a loop can be run.
This reduces the size of qdrawhelper_avx2.o from 22 kbytes to 11 kbytes.
Change-Id: Ie3d6281b04b4be3332497c15f3dfe9f185e20507
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
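One way to express such a bound is sketched below. This is a hypothetical helper, not the structure the patch defines: boundedTail carries the maximum iteration count as a template parameter, so the compiler sees a constant trip-count limit and has no reason to vectorize or heavily unroll the tail loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: calls body(i) for i in [begin, end), but never
// more than MaxIterations times. The constant bound is visible to the
// optimizer, which can then emit a short, fully unrolled tail.
template <int MaxIterations, typename Body>
inline void boundedTail(std::size_t begin, std::size_t end, Body body)
{
    for (int n = 0; n < MaxIterations && begin < end; ++n, ++begin)
        body(begin);
}
```

A tail after a 4-pixel vector loop would use boundedTail<3>(i, length, ...), since at most 3 pixels can remain.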
| |
The order of the arguments to testc was wrong; it should have been the
other way around. Replaced with testz to also get rid of setzero.
Change-Id: Iff968c140f9ca34c6bd7c7f04a3623fd8ec42e1c
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
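The difference between the two tests can be shown with a scalar model over 64 bits. The functions testz and testc below are stand-ins that return the same flag as _mm256_testz_si256 and _mm256_testc_si256 would for 256-bit operands; the exact expressions used in the patch are not shown in this log, so the alpha-mask scenario in the comments is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// _mm256_testz_si256(a, b) returns 1 when (a & b) is all zero.
constexpr bool testz(std::uint64_t a, std::uint64_t b)
{
    return (a & b) == 0;
}

// _mm256_testc_si256(a, b) returns 1 when (~a & b) is all zero,
// so the result depends on the argument order.
constexpr bool testc(std::uint64_t a, std::uint64_t b)
{
    return (~a & b) == 0;
}

// Checking "is x all zero?":
//   testc(0, x)  works, but needs a zeroed register as first operand.
//   testc(x, 0)  is always true, so it never detects a non-zero x.
//   testz(x, x)  or testz(x, mask) works without any zero operand.
```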
| |
This patch adds AVX2 versions of the fast blending functions that we
already have SSE2 versions of.
Change-Id: Ifd1a22f7891b6208cb74929ad26095d12c5a1efb
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
| |
Removes the now unused QPixelLayout parameter, simplifies the colorTable
passing and prepares for adding dithering.
Change-Id: Iaf7698b248b857804d8921bf118e7cfabbabff87
Reviewed-by: Gunnar Sletta <gunnar@sletta.org>
| |
As of Qt 5.7, LGPL v2.1 is no longer an option; see
http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/
Updated the license headers to use the new LGPL header instead of the
LGPL21 one (in those files which will be under LGPL v3).
Change-Id: I046ec3e47b1876cd7b4b0353a576b352e3a946d9
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
|
Following up on using GCC's autovectorization for faster SSE4.1
premultiply, this patch adds specialized autovectorized versions
of premultiply for AVX2, giving almost another doubling in speed.
To make the speed-up for AVX2 and SSE4.1 also available to non-GCC
compilers, the target-specific methods have been moved to separate
files.
Change-Id: I97ce05be67f4adeeb9a096eef80fd5fb662099f3
Reviewed-by: Gunnar Sletta <gunnar@sletta.org>
|