path: root/src/gui/painting/qdrawhelper_avx2.cpp
Commit message | Author | Age | Files | Lines
* Make the 64 bit raster backend an optional feature | Allan Sandfeld Jensen | 2019-04-09 | 1 | -0/+6
    Can be used to make smaller binaries, and possibly speed up ARGB32 rendering on some platforms.
    Change-Id: I7647b197ba7a6582187cc9736b7e0d752bd5bee5
    Reviewed-by: Lars Knoll <lars.knoll@qt.io>
* Improve ARGB32ToRGBA64 conversions | Allan Sandfeld Jensen | 2019-02-07 | 1 | -16/+12
    Improves the precision so that a value of 255 maps to exactly 65535.
    Change-Id: I366f408e8c6047d52acbed35e9d665249bbaba2b
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
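To make the precision point concrete, here is a minimal scalar sketch. It assumes the usual byte-replication trick (multiplying by 257); the helper names are hypothetical and the commit itself may implement the conversion differently.

```cpp
#include <cstdint>

// Hypothetical helper, not Qt's code: widen one 8-bit channel to 16 bits.
// Multiplying by 257 copies the byte into both halves of the result,
// so 0 stays 0 and 255 becomes exactly 65535.
constexpr std::uint16_t expand8to16(std::uint8_t v)
{
    return static_cast<std::uint16_t>(v * 257u);
}

// For comparison: a plain shift never reaches full scale (255 -> 65280).
constexpr std::uint16_t shift8to16(std::uint8_t v)
{
    return static_cast<std::uint16_t>(v << 8);
}
```

The shift variant is shown only to illustrate the kind of precision loss that an exact 255 to 65535 mapping avoids.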
* Switch epilogues of AVX2 conversions to single step | Allan Sandfeld Jensen | 2019-02-06 | 1 | -79/+79
    Not only is it fewer instructions but all the logic except for load and store can be identical to the main loop.
    Change-Id: I2caac0c7504d94e404bd8cfe5080aff07ba2d465
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Fix convertARGBToARGB32PM_avx2 and convertARGBToRGBA64PM_avx2 | Allan Sandfeld Jensen | 2019-02-05 | 1 | -2/+2
    The tails have been off since commit f370410097f8cb8d8fdf6174b799497fe7fe0adf.
    Fixes: QTBUG-73440
    Change-Id: If86178c6cad3f87d9b5f0f89e90354d49cd386a4
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Use VPMASKMOV in the ARGB->ARGB{32,64} AVX2 epilogues | Thiago Macieira | 2019-01-23 | 1 | -97/+47
    Instead of stepping down to 4 pixels, then 2 px, then 1, with essentially the same code, let's use maskload and maskstore to only load and store the effective portions (instructions new in AVX2).

    The secondary loop gets run at most twice, since there can be at most 7 pixels left.

    This fixes an off-by-4 bug in the previous implementation (lines 1041 and 1186 should have had 7 instead of 3).
    Change-Id: I4d4dadb709f1482fa8ccfffd157862e77ac508f6
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
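The masked load/store idea can be sketched as follows. This is an illustration only: `copy_tail` is a hypothetical helper, it handles the whole tail in one 8-lane step rather than the two-pass loop described above, and the pixel conversion itself is left out. A scalar fallback is included so the sketch also builds without AVX2 enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: process the last `count` (0..7) 32-bit pixels
// without touching memory past the end of either buffer.
inline void copy_tail(std::uint32_t *dst, const std::uint32_t *src, std::size_t count)
{
    assert(count < 8);
#if defined(__AVX2__)
    // Lane i is enabled when i < count (all bits set in that lane).
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(count)), lane);
    // Disabled lanes are read as zero and are not written back.
    const __m256i v = _mm256_maskload_epi32(reinterpret_cast<const int *>(src), mask);
    // ... a real epilogue would run the same conversion as the main loop here ...
    _mm256_maskstore_epi32(reinterpret_cast<int *>(dst), mask, v);
#else
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = src[i];
#endif
}
```

Because disabled lanes are neither loaded nor stored, the same arithmetic as the main loop can be reused for the tail.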
* Fix the AVX2 ARGB->ARGB64 conversion code | Thiago Macieira | 2019-01-15 | 1 | -5/+14
    Commit c8c5ff19de1c34a99b8315e59015d115957b3584 introduced the solution as a simple scaling up of the code in qdrawhelper_sse4.cpp, but it's bad due to the way that the 256-bit unpack instructions work: the unpack-low instruction unpacks the lower half of each half of the 256-bit register.

    So we fix it up by inserting a permute4x64 that swaps the middle two quarters of the 256-bit register (permute8x32 requires a __m256i parameter, instead of an immediate). This introduces an instruction that costs 3 cycles in each loop, but since the AVX2 code has double the throughput compared to SSE4 code, it should still be faster.

    This problem does not affect the ARGB->ARGB32 code because that repacks at the end.
    Change-Id: I4d4dadb709f1482fa8ccfffd1578620b45166a4f
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
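The lane behaviour described above can be illustrated with a small widening routine. This is a sketch under assumptions, not the code from the commit: `widen_in_order` is a hypothetical helper that zero-extends 32 bytes to 16 bits each, and a scalar fallback is used when AVX2 is not enabled at compile time.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: zero-extend 32 bytes to 32 16-bit values, in order.
inline void widen_in_order(const std::uint8_t *src, std::uint16_t *dst)
{
    assert(src && dst);
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src));
    // The register holds quarters q0 q1 q2 q3. The unpack instructions work
    // per 128-bit lane, so without a fix-up "unpack low" would yield q0 and q2.
    // Swapping the middle quarters gives q0 q2 | q1 q3.
    v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
    const __m256i zero = _mm256_setzero_si256();
    const __m256i lo = _mm256_unpacklo_epi8(v, zero); // q0, q1 -> bytes 0..15
    const __m256i hi = _mm256_unpackhi_epi8(v, zero); // q2, q3 -> bytes 16..31
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst), lo);
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst + 16), hi);
#else
    for (int i = 0; i < 32; ++i)
        dst[i] = src[i];
#endif
}
```

Dropping the permute in the AVX2 path would interleave the output as bytes 0..7, 16..23, 8..15, 24..31, which is the kind of ordering error the commit message describes.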
* Add AVX2 version of ARGB->ARGB32PM | Thiago Macieira | 2019-01-09 | 1 | -0/+138
    Similar to the previous commit. This also removes the SSE4 implementations from Qt builds that use AVX2 throughout.
    Change-Id: I251f00d706d646ed87b4fffd1577f96ed52a4cf4
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
* Add AVX2 version of the ARGB32->RGBA64PM code | Thiago Macieira | 2019-01-09 | 1 | -0/+134
    Change-Id: I251f00d706d646ed87b4fffd1577f84854e358a4
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
* Work around GCC bug in generating 64-bit population of SSE register | Thiago Macieira | 2018-12-12 | 1 | -1/+12
    We know what code we want it to generate, so I just replaced the _mm_set1_epi64x() with the code we want it to generate. Except that GCC sees through and tries to "optimize" my code... so that asm() statement makes it separate the two operations.

    This generates optimal code for both 32- and 64-bit.

    64-bit:
        vmovq %rdi, %xmm0
        vpbroadcastq %xmm0, %ymm0
    32-bit:
        vmovq 8(%esp), %xmm0
        vpbroadcastq %xmm0, %ymm0

    See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976
    Change-Id: I42a48bd64ccc41aebf84fffd15664109b97fe42b
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
* Use Q_DECL_VECTORCALL in a few more places | Thiago Macieira | 2018-12-11 | 1 | -6/+11
    There were a few functions that passed vectors in parameters but were not marked as vectorcall. I've taken the opportunity to de-macroify one macro, but I'm not going to do it for the rest.
    Change-Id: I42a48bd64ccc41aebf84fffd1564bfc21faa2a14
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
* Add AVX2 versions of qt_memfill32 and qt_memfill64 | Thiago Macieira | 2018-12-11 | 1 | -1/+52
    The implementation is almost the same 4-way-unrolled loop, but because of the wider registers, we fill 128 bytes per loop. Unlike the SSE2 implementation, the AVX2 version uses unaligned stores and won't try to align in the prologue, matching glibc's __memset_avx2 (also unaligned).
    Change-Id: Iba4b5c183776497d8ee1fffd15637ccb2a7b83bc
    Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
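A rough sketch of such a fill loop is shown below. It is not Qt's implementation: `memfill32_sketch` is a hypothetical name, the tail is handled with a simple scalar loop for brevity, and a scalar-only path is used when AVX2 is not enabled at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical sketch: fill `count` 32-bit values with `value`.
inline void memfill32_sketch(std::uint32_t *dest, std::uint32_t value, std::size_t count)
{
    assert(dest || count == 0);
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i v = _mm256_set1_epi32(static_cast<int>(value));
    // 4-way unrolled: four 32-byte unaligned stores = 128 bytes per iteration.
    // No alignment prologue; the stores tolerate any address.
    for (; i + 32 <= count; i += 32) {
        __m256i *p = reinterpret_cast<__m256i *>(dest + i);
        _mm256_storeu_si256(p, v);
        _mm256_storeu_si256(p + 1, v);
        _mm256_storeu_si256(p + 2, v);
        _mm256_storeu_si256(p + 3, v);
    }
#endif
    // Simplified tail: fewer than 32 values remain when the AVX2 loop ran.
    for (; i < count; ++i)
        dest[i] = value;
}
```

Four 256-bit stores per iteration account for the 128 bytes per loop mentioned in the commit message.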
* Optimize intermediate_adder_avx2 | Allan Sandfeld Jensen | 2018-05-07 | 1 | -6/+7
    Use 16-bit multiplication as it is twice as fast as 32-bit multiplication.
    Change-Id: I64b529eaaed4ce2c59c64a0120e93cd132724156
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Use simple scaling for downscaling less than 2x | Allan Sandfeld Jensen | 2018-03-07 | 1 | -27/+29
    The simple scaling that only samples every input pixel once can be used with downscaling < 2x as well if we just handle the case where the input can't be in the intermediate buffer.

    At the same time the handling of the intermediate buffer has been moved out of simple scale helper functions so the code can be shared and the AVX2 optimizations also used for non-argb32pm formats.
    Change-Id: I98d225ef8d4f2978480d09110c959b556c563b57
    Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
    Reviewed-by: Lars Knoll <lars.knoll@qt.io>
* Fix broken rendering of RGB30 and ARGB32 on machines with AVX2 | Allan Sandfeld Jensen | 2018-01-27 | 1 | -2/+2
    Two small changes late in the review process were flawed.
    Change-Id: I4b1f6e3fdb8e17000a2e11bc30aae1b29d9f43a9
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Add AVX2 optimized versions of the most basic RGB64 compositions | Allan Sandfeld Jensen | 2018-01-04 | 1 | -0/+165
    Speeds up RGB30 and ARGB32-unpremul painting.
    Change-Id: I419afdf5c26ceffc0f7557b8f196035056178c9a
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Improve readability of code that uses the Qt signed size type [v5.10.0-rc2] | Simon Hausmann | 2017-11-28 | 1 | -1/+1
    During the container BoF session at the Qt Contributor Summit 2017 the name of the signed size type became a subject of discussion in the context of readability of code using this type and the intention of using it for all length, size and count properties throughout the entire framework in future versions of Qt.

    This change proposes qsizetype as new name for qssize_t to emphasize the readability of code over POSIX compatibility, the former being potentially more relevant than the latter to the majority of users of Qt.
    Change-Id: Idb99cb4a8782703c054fa463a9e5af23a918e7f3
    Reviewed-by: Samuel Gaist <samuel.gaist@edeltech.ch>
    Reviewed-by: David Faure <david.faure@kdab.com>
* Fix handling of mirroring upscaling in simple bilinear upscaler | Allan Sandfeld Jensen | 2017-08-10 | 1 | -7/+10
    Calculates the correct offsets and coordinate transforms for the intermediate buffer. This means we can conceptually simplify our path switches instead of having downscale routines handling mirrored upscaling.
    Change-Id: I60efa7feaba80165672ca0ce064515fdf620869d
    Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
* Allow QImage with more than 2GByte of image data | Allan Sandfeld Jensen | 2017-07-08 | 1 | -1/+1
    Changes internal data-size and pointer calculations to qssize_t. Adds new sizeInBytes() accessor to read byte size, and marks the old one deprecated.
    Task-number: QTBUG-50912
    Change-Id: Idf0c2010542b0ec1c9abef8afd02d6db07f43e6d
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Add AVX2 optimized bilinear texture transform | Allan Sandfeld Jensen | 2017-02-28 | 1 | -0/+414
    Implement AVX2 versions of the three optimized paths of bilinear texture transform.
    Change-Id: Ie7199ef7dcce1e3457535fee35822d76afc0e8ba
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Manually vectorize ARGB32toARGB32PM for SSE4.1 and NEON | Allan Sandfeld Jensen | 2017-01-31 | 1 | -13/+0
    Manually vectorizing is significantly faster because we can optimize for common cases like long stretches of opaque or transparent pixels.

    This is both smaller and faster than the auto-vectorized version; it is also much faster than the auto-vectorized version for AVX2, which can then be removed.
    Change-Id: I0fa80ce273a8387cc6cd084879822ad9bade385c
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Fix blending of RGB32 on RGB32 with partial opacity | Allan Sandfeld Jensen | 2016-12-03 | 1 | -5/+3
    The alpha channel of an RGB32 image was not properly ignored when doing blending with partial opacity. Now the alpha value is properly ignored, which is both more correct and faster.

    This also makes SSE2 and AVX2 implementations match NEON which was already doing the right thing (though had dead code for doing it wrong).
    Change-Id: I4613b8d70ed8c2e36ced10baaa7a4a55bd36a940
    Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
* Avoid auto-vectorization of epilogues of manual vectorization | Allan Sandfeld Jensen | 2016-10-11 | 1 | -4/+4
    Defines a structure that tells the compiler in no uncertain terms the maximum number of times a loop can be run.

    This reduces the size of qdrawhelper_avx2.o from 22 kbytes to 11 kbytes.
    Change-Id: Ie3d6281b04b4be3332497c15f3dfe9f185e20507
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
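One way to express such a bound is a second counter with a compile-time limit, so the compiler can see that the epilogue runs only a handful of times and is not worth vectorizing. The sketch below is an assumption about the general technique; `bounded_epilogue` is a hypothetical name and is not claimed to match the structure the commit introduced.

```cpp
#include <cassert>

// Hypothetical helper: run `body(i)` for the remaining indices, with an
// explicit compile-time cap `Max` on the trip count.
template <int Max, typename Body>
inline void bounded_epilogue(int &i, int length, Body body)
{
    static_assert(Max > 0, "the epilogue bound must be positive");
    // The caller promises that at most Max items remain after the main loop.
    assert(length - i <= Max);
    for (int n = 0; n < Max && i < length; ++n, ++i)
        body(i);
}
```

After an 8-pixels-per-iteration main loop, a caller would use `bounded_epilogue<7>(i, length, ...)`, since at most 7 pixels can remain.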
* Fix qt_blend_rgb32_on_rgb32_avx2 | Allan Sandfeld Jensen | 2016-09-30 | 1 | -1/+1
    The order of the arguments to testc was wrong; it should have been the other way around. Replaced with testz to also get rid of setzero.
    Change-Id: Iff968c140f9ca34c6bd7c7f04a3623fd8ec42e1c
    Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
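For context, `_mm256_testz_si256(a, b)` returns 1 when `a & b` is all zero, whereas `_mm256_testc_si256(a, b)` returns 1 when `~a & b` is all zero, so only the latter depends on argument order. The sketch below shows a testz-based check on an illustrative condition (eight pixels with zero alpha); `all_transparent` is a hypothetical helper, not the check used in the blend function, and a scalar fallback is used when AVX2 is not enabled at compile time.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: true when all 8 ARGB32 pixels have alpha == 0.
inline bool all_transparent(const std::uint32_t *px)
{
    assert(px);
#if defined(__AVX2__)
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(px));
    const __m256i alphaMask = _mm256_set1_epi32(static_cast<int>(0xff000000u));
    // testz is symmetric in its arguments and needs no zeroed register.
    return _mm256_testz_si256(v, alphaMask) != 0;
#else
    for (int i = 0; i < 8; ++i) {
        if (px[i] & 0xff000000u)
            return false;
    }
    return true;
#endif
}
```

Since testz compares against zero implicitly, no separate setzero operand is needed, which matches the clean-up the commit message mentions.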
* Add AVX2 versions of the fast blending functions | Allan Sandfeld Jensen | 2016-09-18 | 1 | -1/+304
    This patch adds AVX2 versions of the fast blending functions that we already have SSE2 versions of.
    Change-Id: Ifd1a22f7891b6208cb74929ad26095d12c5a1efb
    Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Cleanup conversion parameters | Allan Sandfeld Jensen | 2016-04-11 | 1 | -2/+2
    Removes the now unused QPixelLayout parameter, simplifies the colorTable passing and prepares for adding dithering.
    Change-Id: Iaf7698b248b857804d8921bf118e7cfabbabff87
    Reviewed-by: Gunnar Sletta <gunnar@sletta.org>
* Updated license headers | Jani Heikkinen | 2016-01-15 | 1 | -14/+20
    From Qt 5.7 -> LGPL v2.1 isn't an option anymore, see http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/

    Updated license headers to use new LGPL header instead of LGPL21 one (in those files which will be under LGPL v3)
    Change-Id: I046ec3e47b1876cd7b4b0353a576b352e3a946d9
    Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
* Add AVX2 autovectorized versions of premultiply | Allan Sandfeld Jensen | 2015-03-10 | 1 | -0/+54
    Following up on using GCC's autovectorizing for faster SSE4.1 premultiply, this patch adds specialized autovectorized versions of premultiply for AVX2, almost doubling the speed again.

    To make the speed-up for AVX2 and also SSE4_1 available to non-GCC compilers, the target-specific methods have been moved to separate files.
    Change-Id: I97ce05be67f4adeeb9a096eef80fd5fb662099f3
    Reviewed-by: Gunnar Sletta <gunnar@sletta.org>