24 commits

Author SHA1 Message Date
Iconoclast
cecd9976e8 optimized _mm_cmplt_epu16() composite
method A:  #define _mm_cmplt_epu16(m, n) _mm_cmpgt_epu16(n, m)
#define _mm_cmpgt_epu16(m, n) _mm_andnot_si128(\
    _mm_cmpeq_epi16(m, n), _mm_cmple_epu16(n, m)\
)
#define _mm_cmple_epu16(m, n) _mm_cmpeq_epi16(\
    _mm_subs_epu16(m, n), _mm_setzero_si128()\
)
multiply.o:  3,524 bytes; multiply.s:  9,883 bytes

method B:  #define _mm_cmplt_epu16(m, n) _mm_cmplt_epi16(\
    _mm_xor_si128(m, _mm_setmin_epi16()), _mm_xor_si128(n, _mm_setmin_epi16())\
)
#define _mm_setmin_epi16() _mm_slli_epi16(_mm_allones_si128(), 15)
multiply.o:  3,504 bytes; multiply.s:  9,732 bytes
2018-03-18 20:57:51 -04:00
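
The two composites in commit cecd9976e8 can be compared side by side. The following is a minimal sketch, not the repository's code: the function names are hypothetical, and `_mm_cmpeq_epi16(m, m)` is assumed as a stand-in for the project's `_mm_allones_si128()` helper, whose definition does not appear in this log. Method A tests `m <= n` with an unsigned saturating subtract and then masks out equality; method B flips the sign bit of both operands so that the signed compare orders them as unsigned values.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Method A: (m < n) == (m != n) && (m <= n), where m <= n is tested as
 * "saturating m - n is zero". */
static __m128i cmplt_epu16_a(__m128i m, __m128i n)
{
    __m128i le = _mm_cmpeq_epi16(_mm_subs_epu16(m, n), _mm_setzero_si128());
    return _mm_andnot_si128(_mm_cmpeq_epi16(n, m), le);
}

/* Method B: XOR both operands with 0x8000, then use the signed compare. */
static __m128i cmplt_epu16_b(__m128i m, __m128i n)
{
    __m128i ones = _mm_cmpeq_epi16(m, m);   /* assumed stand-in for _mm_allones_si128() */
    __m128i min = _mm_slli_epi16(ones, 15); /* 0x8000 in every lane */
    return _mm_cmplt_epi16(_mm_xor_si128(m, min), _mm_xor_si128(n, min));
}

/* Counts lanes where either method disagrees with the scalar a < b,
 * over all 64 ordered pairs drawn from a table of edge values. */
unsigned int cmplt_epu16_mismatches(void)
{
    static const uint16_t edge[8] = {
        0x0000, 0x0001, 0x7FFF, 0x8000, 0x8001, 0xFFFE, 0xFFFF, 0x1234
    };
    uint16_t a[8], b[8], ra[8], rb[8];
    unsigned int i, j, bad = 0;

    for (i = 0; i < 8; i++) {
        __m128i m, n;
        /* Rotate the table by i so each value meets every other value. */
        for (j = 0; j < 8; j++) {
            a[j] = edge[j];
            b[j] = edge[(i + j) % 8];
        }
        m = _mm_loadu_si128((const __m128i *)a);
        n = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)ra, cmplt_epu16_a(m, n));
        _mm_storeu_si128((__m128i *)rb, cmplt_epu16_b(m, n));
        for (j = 0; j < 8; j++) {
            uint16_t expected = (a[j] < b[j]) ? 0xFFFF : 0x0000;
            bad += (ra[j] != expected) + (rb[j] != expected);
        }
    }
    return bad;
}

/* Aborts if the two methods ever disagree with the scalar reference. */
void cmplt_epu16_selfcheck(void)
{
    assert(cmplt_epu16_mismatches() == 0);
}
```

Both functions return the same mask for every input, so the choice between them comes down to the generated code size reported in the commit message.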
Iconoclast
ec3b55b48b syntactical nits 2018-03-18 18:47:07 -04:00
Iconoclast
8857d37876 Count loop iterations with unsigned int, not int.
Although functionally there is no difference (when just looping vector elements from 0 to 7) between using a signed int or an unsigned int, repeatedly seeing an inconsistent mix in usage between the two across different vector functions has been an ongoing distraction for years.  It should be the same everywhere, and between signed int and unsigned int, unsigned int is the type which always fits within size_t from stddef.h, the safe type for memory pointers and dereference indices.
2018-03-18 18:19:02 -04:00
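
The convention described in commit 8857d37876 might look like the sketch below. The macro `VECTOR_LANES` and the function `vector_copy` are hypothetical names chosen for illustration, not identifiers from the repository. The static assertion turns the commit message's claim about `size_t` into a compile-time check on the target; it is expected to hold on mainstream ABIs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_LANES 8  /* hypothetical name: 16-bit elements per vector */

/* Every unsigned int index is representable as a size_t on this target. */
_Static_assert(UINT_MAX <= SIZE_MAX, "unsigned int must fit in size_t");

/* Copies one vector, counting lanes with an unsigned int as the commit
 * prescribes for every vector function. */
void vector_copy(int16_t *dst, const int16_t *src)
{
    unsigned int i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < VECTOR_LANES; i++)
        dst[i] = src[i];
}
```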
Iconoclast
af06eddbdd fixed paste fail 2018-03-18 17:39:35 -04:00
Iconoclast
81c6bd1652 optimized MAC overflow carry when subtracting by -1 2018-03-18 17:15:29 -04:00
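
One plausible reading of commit 81c6bd1652, offered as an assumption because the log does not show the change itself: SSE2 compares produce a mask of all ones (-1) in each true lane, so subtracting the carry mask from the upper accumulator half adds 1 exactly where the lower half wrapped. This avoids first converting the mask to 0 or 1. The function name and the lo/hi accumulator split below are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Adds 16-bit lanes of `addend` to the low accumulator half and
 * propagates each lane's carry into the high half. */
void acc_add_lo_with_carry(__m128i *acc_lo, __m128i *acc_hi, __m128i addend)
{
    const __m128i bias = _mm_set1_epi16(-32768); /* 0x8000 per lane */
    __m128i sum, carry;

    assert(acc_lo != NULL && acc_hi != NULL);
    sum = _mm_add_epi16(*acc_lo, addend);
    /* Unsigned sum < addend means the add wrapped; the sign-bit flip lets
     * the signed compare act as an unsigned one. True lanes hold -1. */
    carry = _mm_cmplt_epi16(_mm_xor_si128(sum, bias),
                            _mm_xor_si128(addend, bias));
    *acc_lo = sum;
    *acc_hi = _mm_sub_epi16(*acc_hi, carry);     /* hi - (-1) == hi + 1 */
}
```

Compared with shifting the mask down to 1 and then adding, the subtract form needs one instruction instead of two for the carry step.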
Iconoclast
b9e6b43ce5 vectorized VMACU
vmacu_old.asm function has 99 instructions.

vmacu_new.asm function has 50 instructions.
2018-03-18 10:28:13 -04:00
Iconoclast
cc4a5bb619 vectorized VMACF
New VMACF with manually written SSE2 is 45 instructions.

Old VMACF with auto-vectorized C code was 91 instructions.
2018-03-17 21:56:20 -04:00
unknown
71356b752a deleted VMACQ from the function table 2015-11-30 23:15:06 -05:00
unknown
7fb9850b68 fixed possible PIC linkage faults by moving merge to static 2015-08-16 09:51:04 -04:00
unknown
a4a7f4bd8e forgot to modernize a few types 2015-01-18 16:39:59 -05:00
unknown
bfd74741f9 force vectorization of unsigned multiply, overflow and VMADL clamp 2014-10-28 20:50:10 -04:00
unknown
55ad9ad9d8 optimized VMADN with static overflow, carry and multiply-add 2014-10-28 15:35:05 -04:00
unknown
6d17d19dc6 correspond VMUDM intrinsics to multiply-accumulate variation 2014-10-26 23:58:18 -04:00
unknown
5cce9f457e new algorithm for mixed signed * unsigned factorization 2014-10-23 16:45:01 -04:00
unknown
ef09b4eb5d redesign VMUDN with carry and overflow/underflow SSE logic 2014-10-22 22:45:34 -04:00
unknown
f810a85e31 refer unsigned overflow to `negative' mask 2014-10-21 19:58:10 -04:00
unknown
c7a468e3d7 corresponding optimizations to VMUDL (same multiply, diff. clamp) 2014-10-21 00:05:26 -04:00
unknown
9dbdcc490c restyled some optimization and fix 48-bit MADD sign-extension 2014-10-20 22:25:01 -04:00
unknown
b832e39a92 merged bi-arch VMULF template into optimized SIMD mulf 2014-10-20 00:57:40 -04:00
unknown
d768f51077 more direct multiply-add high operation without bi-arch template 2014-10-18 22:27:08 -04:00
unknown
79c5aa0cf4 removed bi-arch template for VMUDH 2014-10-17 22:43:43 -04:00
unknown
291e7fb10b remove bi-arch template for VMUDL as mudl was greatly simplified in SSE 2014-10-17 19:02:02 -04:00
unknown
158a4d0b60 pass only 2 XMM operands, w/ no return slot ifndef ARCH_MIN_SSE2 2014-10-16 00:43:37 -04:00
unknown
f1481dd39b restructured modular layout of the source, dropped some optional features 2014-10-09 16:45:55 -04:00