| Commit message | Author | Age | Files | Lines |
|
Pick-to: 6.2
Change-Id: I2404fdfd43d3b4553760ad2f605175121cd31446
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
Useful for some HDR representations and HDR rendering.
Change-Id: If6e8a661faa3d2afdf17b6ed4d8ff5c5b2aeb30e
Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
Reviewed-by: Tor Arne Vestbø <tor.arne.vestbo@qt.io>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
And remove the direct conversion so we can get both the SIMD
optimization and threading applied.
Change-Id: Id032ea91cc40c1cbf1c8a1da0386de35aa36cfb5
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Task-number: QTBUG-84469
Change-Id: I366e845249203d80d640355a7780ac2f91a762f1
Reviewed-by: Tor Arne Vestbø <tor.arne.vestbo@qt.io>
Reviewed-by: Friedemann Kleint <Friedemann.Kleint@qt.io>
|
Change-Id: I0beafa39d92550ea78e78a07b25ce1253cc6668d
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
|
Can be used to make smaller binaries, and possibly speed up ARGB32
rendering on some platforms.
Change-Id: I7647b197ba7a6582187cc9736b7e0d752bd5bee5
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
Improves the precision so 255 values map to 65535 exactly.
Change-Id: I366f408e8c6047d52acbed35e9d665249bbaba2b
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
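A minimal sketch of the arithmetic behind the exact mapping, assuming an 8-bit to 16-bit per-channel widening. The helper names are hypothetical and not Qt API. Multiplying by 257 replicates the byte into both halves of the 16-bit value, so 255 becomes 65535, whereas a plain left shift would stop at 65280.

```cpp
#include <cstdint>

// Hypothetical helper, not Qt API: widen an 8-bit channel to 16 bits.
// v * 257 == (v << 8) | v, so 0 -> 0 and 255 -> 65535 exactly.
constexpr std::uint16_t widen8to16(std::uint8_t v)
{
    return static_cast<std::uint16_t>(v * 257u);
}

// For comparison only: a plain shift leaves the low byte empty,
// so full intensity (255) maps to 65280 rather than 65535.
constexpr std::uint16_t widenByShift(std::uint8_t v)
{
    return static_cast<std::uint16_t>(v << 8);
}
```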
|
Not only is it fewer instructions but all the logic except for load and
store can be identical to the main loop.
Change-Id: I2caac0c7504d94e404bd8cfe5080aff07ba2d465
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
The tails have been off since commit f370410097f8cb8d8fdf6174b799497fe7fe0adf.
Fixes: QTBUG-73440
Change-Id: If86178c6cad3f87d9b5f0f89e90354d49cd386a4
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Instead of stepping down to 4 pixels, then 2 px, then 1, with
essentially the same code, let's use maskload and maskstore to only load
and store the effective portions (instructions new in AVX2). The
secondary loop gets run at most twice, since there can be at most 7
pixels left.
This fixes an off-by-4 bug in the previous implementation (lines 1041
and 1186 should have had 7 instead of 3).
Change-Id: I4d4dadb709f1482fa8ccfffd157862e77ac508f6
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
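The masked tail can be modelled in portable scalar code. This is only a sketch under assumptions: the function names and the sliding-window table are illustrative, not taken from the Qt sources. It shows how a per-lane mask for the remaining pixels can be derived, and how a masked store writes only the lanes whose sign bit is set, which is the rule the AVX2 maskload and maskstore instructions apply to each 32-bit lane.

```cpp
#include <array>
#include <cstdint>

// Scalar model of a per-lane mask for the last `remaining` pixels (0..8).
// Lane i is all ones (-1) when i < remaining and zero otherwise.
std::array<std::int32_t, 8> tailMask(int remaining)
{
    // Sliding window over eight -1 entries followed by eight 0 entries.
    static const std::int32_t table[16] = {
        -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0
    };
    std::array<std::int32_t, 8> mask{};
    for (int i = 0; i < 8; ++i)
        mask[i] = table[8 - remaining + i];
    return mask;
}

// Scalar stand-in for a masked store: only lanes whose mask has the
// sign bit set are written; the other destination lanes stay untouched.
void maskedStore(std::uint32_t *dst, const std::array<std::int32_t, 8> &mask,
                 const std::array<std::uint32_t, 8> &value)
{
    for (int i = 0; i < 8; ++i) {
        if (mask[i] < 0)
            dst[i] = value[i];
    }
}
```

With a mask like this, a single code path can cover any tail length, which is what removes the separate 4, 2 and 1 pixel steps.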
|
Commit c8c5ff19de1c34a99b8315e59015d115957b3584 introduced the solution
as a simple scaling up of the code in qdrawhelper_sse4.cpp, but it's bad
due to the way that the 256-bit unpack instructions work: the unpack-low
instruction unpacks the lower half of each half of the 256-bit register.
So we fix it up by inserting a permute4x64 that swaps the middle two
quarters of the 256-bit register (permute8x32 requires a __m256i
parameter, instead of an immediate).
This introduces an instruction that costs 3 cycles in each loop, but
since the AVX2 code has double the throughput compared to SSE4 code, it
should still be faster.
This problem does not affect the ARGB->ARGB32 code because that repacks
at the end.
Change-Id: I4d4dadb709f1482fa8ccfffd1578620b45166a4f
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
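The lane behaviour described above can be checked with a scalar model. This is a sketch with hypothetical function names, not the actual Qt code: it models byte unpacking against zero on a 256-bit register as two independent 128-bit lanes, and shows that swapping the two middle 64-bit quarters first (quarter order 0, 2, 1, 3) makes unpack-low and unpack-high yield the first and second 16 bytes in sequential order.

```cpp
#include <array>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;
using Words16 = std::array<std::uint16_t, 16>;

// Model of a 256-bit unpack-low of bytes against zero: each 128-bit
// lane contributes its own low 8 bytes, widened to 16 bits.
Words16 unpackLoModel(const Bytes32 &v)
{
    Words16 out{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 8; ++i)
            out[lane * 8 + i] = v[lane * 16 + i];
    return out;
}

// Model of unpack-high: each 128-bit lane contributes its high 8 bytes.
Words16 unpackHiModel(const Bytes32 &v)
{
    Words16 out{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 8; ++i)
            out[lane * 8 + i] = v[lane * 16 + 8 + i];
    return out;
}

// Model of a permute over 64-bit quarters that swaps the two middle
// quarters (quarter order 0, 2, 1, 3).
Bytes32 swapMiddleQuarters(const Bytes32 &v)
{
    static const int order[4] = {0, 2, 1, 3};
    Bytes32 out{};
    for (int q = 0; q < 4; ++q)
        for (int i = 0; i < 8; ++i)
            out[q * 8 + i] = v[order[q] * 8 + i];
    return out;
}
```

Without the swap, element 8 of the unpack-low result comes from byte 16 of the input rather than byte 8, which is the mismatch with the scaled-up SSE4 code.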
|
Similar to the previous commit. This also removes the SSE4
implementations from Qt builds that use AVX2 throughout.
Change-Id: I251f00d706d646ed87b4fffd1577f96ed52a4cf4
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
|
Change-Id: I251f00d706d646ed87b4fffd1577f84854e358a4
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
|
We know what code we want it to generate, so I just replaced the
_mm_set1_epi64x() with the code we want it to generate. Except that GCC
sees through this and tries to "optimize" the code, so an asm() statement
is used to keep the two operations separate.
This generates optimal code for both 32- and 64-bit. 64-bit:
    vmovq %rdi, %xmm0
    vpbroadcastq %xmm0, %ymm0
32-bit:
    vmovq 8(%esp), %xmm0
    vpbroadcastq %xmm0, %ymm0
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976
Change-Id: I42a48bd64ccc41aebf84fffd15664109b97fe42b
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
|
A few functions took vectors as parameters but were not marked as
vectorcall.
I've taken the opportunity to de-macroify one macro, but I'm not going
to do it for the rest.
Change-Id: I42a48bd64ccc41aebf84fffd1564bfc21faa2a14
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
|
The implementation is almost the same 4-way-unrolled loop, but because
of the wider registers, we fill 128 bytes per loop. Unlike the SSE2
implementation, the AVX2 version uses unaligned stores and won't try to
align in the prologue, matching glibc's __memset_avx2 (also unaligned).
Change-Id: Iba4b5c183776497d8ee1fffd15637ccb2a7b83bc
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
|
Use 16-bit multiplication as it is twice as fast as 32-bit
multiplication.
Change-Id: I64b529eaaed4ce2c59c64a0120e93cd132724156
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
The simple scaling, which samples every input pixel only once, can be
used with downscaling < 2x as well, if we just handle the case where the
input can't be in the intermediate buffer.
At the same time, the handling of the intermediate buffer has been moved
out of the simple scale helper functions so the code can be shared and
the AVX2 optimizations can also be used for non-argb32pm formats.
Change-Id: I98d225ef8d4f2978480d09110c959b556c563b57
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
Two small changes late in the review process were flawed.
Change-Id: I4b1f6e3fdb8e17000a2e11bc30aae1b29d9f43a9
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Speeds up RGB30 and ARGB32-unpremul painting.
Change-Id: I419afdf5c26ceffc0f7557b8f196035056178c9a
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
During the container BoF session at the Qt Contributor Summit 2017 the
name of the signed size type became a subject of discussion in the
context of readability of code using this type and the intention of
using it for all length, size and count properties throughout the entire
framework in future versions of Qt.
This change proposes qsizetype as new name for qssize_t to emphasize the
readability of code over POSIX compatibility, the former being
potentially more relevant than the latter to the majority of users of
Qt.
Change-Id: Idb99cb4a8782703c054fa463a9e5af23a918e7f3
Reviewed-by: Samuel Gaist <samuel.gaist@edeltech.ch>
Reviewed-by: David Faure <david.faure@kdab.com>
|
Calculates the correct offsets and coordinate transforms for the
intermediate buffer. This means we can conceptually simplify our
path switches instead of having downscale routines handling mirrored
upscaling.
Change-Id: I60efa7feaba80165672ca0ce064515fdf620869d
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
|
Changes internal data-size and pointer calculations
to qssize_t.
Adds new sizeInBytes() accessor to read byte size, and
marks the old one deprecated.
Task-number: QTBUG-50912
Change-Id: Idf0c2010542b0ec1c9abef8afd02d6db07f43e6d
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Implement AVX2 versions of the three optimized paths of bilinear
texture transform.
Change-Id: Ie7199ef7dcce1e3457535fee35822d76afc0e8ba
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Manually vectorizing is significantly faster because we can optimize
for common cases like long stretches of opaque or transparent pixels.
This is both smaller and faster than the auto-vectorized version. It is
also much faster than the auto-vectorized version for AVX2, which can
then be removed.
Change-Id: I0fa80ce273a8387cc6cd084879822ad9bade385c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
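A scalar sketch of the fast-path idea, assuming the operation is an alpha premultiply of ARGB32 pixels (the message does not name the operation, and all names here are hypothetical). Each block of eight pixels is classified first: fully opaque blocks are left alone, fully transparent blocks are zeroed, and only mixed blocks pay for the per-pixel arithmetic. A vectorized version would do the classification with a single test per register.

```cpp
#include <cstddef>
#include <cstdint>

// Rounded division by 255, valid for x in [0, 255 * 255].
static inline std::uint32_t div255(std::uint32_t x)
{
    return (x + 128 + ((x + 128) >> 8)) >> 8;
}

// Premultiply one ARGB32 pixel (illustrative formula, not Qt's own).
static inline std::uint32_t premultiplyPixel(std::uint32_t p)
{
    const std::uint32_t a = p >> 24;
    const std::uint32_t r = div255(((p >> 16) & 0xff) * a);
    const std::uint32_t g = div255(((p >> 8) & 0xff) * a);
    const std::uint32_t b = div255((p & 0xff) * a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

void premultiplyRun(std::uint32_t *px, std::size_t count)
{
    constexpr std::size_t Block = 8;
    std::size_t i = 0;
    for (; i + Block <= count; i += Block) {
        std::uint32_t andBits = 0xffffffffu;
        std::uint32_t orBits = 0;
        for (std::size_t k = 0; k < Block; ++k) {
            andBits &= px[i + k];
            orBits |= px[i + k];
        }
        if ((andBits >> 24) == 0xff)
            continue; // every alpha is 255: already premultiplied
        if ((orBits >> 24) == 0) {
            for (std::size_t k = 0; k < Block; ++k)
                px[i + k] = 0; // every alpha is 0: result is zero
            continue;
        }
        for (std::size_t k = 0; k < Block; ++k)
            px[i + k] = premultiplyPixel(px[i + k]);
    }
    for (; i < count; ++i) // leftover pixels
        px[i] = premultiplyPixel(px[i]);
}
```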
|
The alpha channel of an RGB32 image was not properly ignored when doing
blending with partial opacity.
Now the alpha value is properly ignored, which is both more correct
and faster. This also makes SSE2 and AVX2 implementations match NEON
which was already doing the right thing (though had dead code for
doing it wrong).
Change-Id: I4613b8d70ed8c2e36ced10baaa7a4a55bd36a940
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
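A scalar sketch of what ignoring the alpha channel means here, with hypothetical names and a generic rounding formula rather than Qt's own helpers. The alpha byte of an RGB32 source carries no meaning, so it is forced to 0xff before interpolating with the constant opacity. The result then no longer depends on whatever happens to be stored in that byte.

```cpp
#include <cstdint>

// Rounded division by 255, valid for x in [0, 255 * 255].
static inline std::uint32_t div255(std::uint32_t x)
{
    return (x + 128 + ((x + 128) >> 8)) >> 8;
}

// Blend an RGB32 source over dst with a constant opacity in 0..255.
// Illustrative only: a per-channel linear interpolation.
std::uint32_t blendRgb32(std::uint32_t src, std::uint32_t dst,
                         std::uint32_t opacity)
{
    src |= 0xff000000u; // RGB32: the alpha byte is undefined, treat as opaque
    const std::uint32_t inverse = 255 - opacity;
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t s = (src >> shift) & 0xff;
        const std::uint32_t d = (dst >> shift) & 0xff;
        out |= div255(s * opacity + d * inverse) << shift;
    }
    return out;
}
```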
|
Defines a structure that tells the compiler in no uncertain terms the
maximum number of times a loop can be run.
This reduces the size of qdrawhelper_avx2.o from 22 kbytes to 11 kbytes.
Change-Id: Ie3d6281b04b4be3332497c15f3dfe9f185e20507
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
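One way to express such a bound, as a sketch only: the message describes a structure, while this hypothetical helper is a function template with invented names. The loop condition carries a compile-time maximum, so the compiler can see that the epilogue runs at most that many times and has no reason to vectorize or heavily unroll it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: runs body(i) for the remaining elements, but never
// more than MaxIterations times, and advances i past what it processed.
template <int MaxIterations, typename Body>
inline void boundedEpilogue(std::size_t &i, std::size_t length, Body body)
{
    for (int k = 0; k < MaxIterations && i < length; ++k, ++i)
        body(i);
}

// Example use: the main loop handles 8 elements at a time (standing in
// for a vectorized loop), so at most 7 elements can remain afterwards.
void addOne(std::uint32_t *data, std::size_t length)
{
    std::size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        for (std::size_t k = 0; k < 8; ++k)
            data[i + k] += 1;
    }
    boundedEpilogue<7>(i, length, [&](std::size_t j) { data[j] += 1; });
}
```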
|
The order of the arguments to testc was wrong; it should have been the
other way around. Replaced with testz to also get rid of setzero.
Change-Id: Iff968c140f9ca34c6bd7c7f04a3623fd8ec42e1c
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
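A scalar model of the two tests may help show why the argument order matters. The function names are hypothetical and the model uses 64-bit integers instead of vector registers; the message does not show the actual operands. testz reports whether (a AND b) is zero, while testc reports whether ((NOT a) AND b) is zero, so testc with zero as its second argument is always true regardless of the first.

```cpp
#include <cstdint>

// Model of testz: true when a AND b has no bits set.
constexpr bool testzModel(std::uint64_t a, std::uint64_t b)
{
    return (a & b) == 0;
}

// Model of testc: true when (NOT a) AND b has no bits set.
constexpr bool testcModel(std::uint64_t a, std::uint64_t b)
{
    return (~a & b) == 0;
}

// Illustrative mask selecting the alpha bytes of two packed ARGB32 pixels.
constexpr std::uint64_t alphaMask = 0xff000000ff000000ull;
```

In this model, `testcModel(value, 0)` cannot distinguish any inputs, `testcModel(0, value)` checks for an all-zero value but needs a zero operand, and `testzModel(value, alphaMask)` checks the masked bits directly without one.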
|
This patch adds AVX2 versions of the fast blending functions that we
already have SSE2 versions of.
Change-Id: Ifd1a22f7891b6208cb74929ad26095d12c5a1efb
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Removes the now unused QPixelLayout parameter, simplifies the colorTable
passing and prepares for adding dithering.
Change-Id: Iaf7698b248b857804d8921bf118e7cfabbabff87
Reviewed-by: Gunnar Sletta <gunnar@sletta.org>
|
From Qt 5.7 onward, LGPL v2.1 isn't an option anymore; see
http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/
Updated license headers to use the new LGPL header instead of the LGPL21
one (in those files which will be under LGPL v3).
Change-Id: I046ec3e47b1876cd7b4b0353a576b352e3a946d9
Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
|
Following up on using GCC's autovectorization for faster SSE4.1
premultiply, this patch adds specialized autovectorized versions
of premultiply for AVX2, almost doubling the speed again.
To make the speedup for AVX2 and SSE4_1 also available to non-GCC
compilers, the target-specific methods have been moved to separate
files.
Change-Id: I97ce05be67f4adeeb9a096eef80fd5fb662099f3
Reviewed-by: Gunnar Sletta <gunnar@sletta.org>
|