Iconoclast
cecd9976e8
optimized _mm_cmplt_epu16() composite
method A: #define _mm_cmplt_epu16(m, n) _mm_cmpgt_epu16(n, m)
#define _mm_cmpgt_epu16(m, n) _mm_andnot_si128(\
_mm_cmpeq_epi16(m, n), _mm_cmple_epu16(n, m)\
)
#define _mm_cmple_epu16(m, n) _mm_cmpeq_epi16(\
_mm_subs_epu16(m, n), _mm_setzero_si128()\
)
multiply.o: 3,524 bytes; multiply.s: 9,883 bytes
method B: #define _mm_cmplt_epu16(m, n) _mm_cmplt_epi16(\
_mm_xor_si128(m, _mm_setmin_epi16()), _mm_xor_si128(n, _mm_setmin_epi16())\
)
#define _mm_setmin_epi16() _mm_slli_epi16(_mm_allones_si128(), 15)
multiply.o: 3,504 bytes; multiply.s: 9,732 bytes
2018-03-18 20:57:51 -04:00
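The two `_mm_cmplt_epu16()` composites in the commit above can be checked against each other with a scalar model of a single 16-bit lane. The sketch below is only an illustration under assumptions: the helper names (`subs_u16`, `as_signed16`, `cmplt_u16_method_a`, `cmplt_u16_method_b`, `check_boundary_lanes`) are hypothetical and do not come from the repository, and plain C arithmetic stands in for the SSE2 intrinsics. Method A tests "not equal, and the saturating difference m - n is zero"; method B flips the sign bit of both operands and compares them as signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one lane of _mm_subs_epu16: unsigned saturating subtract. */
static uint16_t subs_u16(uint16_t m, uint16_t n)
{
    return (uint16_t)((m > n) ? (m - n) : 0);
}

/* Reinterpret a 16-bit pattern as a two's-complement signed value. */
static int as_signed16(uint16_t x)
{
    return (x & 0x8000u) ? (int)x - 0x10000 : (int)x;
}

/* Method A: (m < n) == NOT(n == m) AND (subs(m, n) == 0). */
static uint16_t cmplt_u16_method_a(uint16_t m, uint16_t n)
{
    uint16_t eq = (uint16_t)((n == m) ? 0xFFFFu : 0x0000u);
    uint16_t le = (uint16_t)((subs_u16(m, n) == 0) ? 0xFFFFu : 0x0000u);

    return (uint16_t)(~eq & le);
}

/* Method B: flip the sign bit of both operands, then compare as signed. */
static uint16_t cmplt_u16_method_b(uint16_t m, uint16_t n)
{
    int sm = as_signed16((uint16_t)(m ^ 0x8000u));
    int sn = as_signed16((uint16_t)(n ^ 0x8000u));

    return (uint16_t)((sm < sn) ? 0xFFFFu : 0x0000u);
}

/* Compare both methods with the plain `<` operator on boundary values.
 * Returns the number of operand pairs that were checked. */
static unsigned int check_boundary_lanes(void)
{
    static const uint16_t edge[8] = {
        0x0000u, 0x0001u, 0x7FFEu, 0x7FFFu,
        0x8000u, 0x8001u, 0xFFFEu, 0xFFFFu
    };
    unsigned int i, j, checked = 0;

    for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
            uint16_t expect = (uint16_t)((edge[i] < edge[j]) ? 0xFFFFu : 0u);

            assert(cmplt_u16_method_a(edge[i], edge[j]) == expect);
            assert(cmplt_u16_method_b(edge[i], edge[j]) == expect);
            checked++;
        }
    }
    return checked;
}
```

Calling `check_boundary_lanes()` from a test harness exercises the values around 0x0000, 0x8000 and 0xFFFF, where an unsigned comparison built from signed instructions is most likely to go wrong.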
Iconoclast
ec3b55b48b
syntactical nits
2018-03-18 18:47:07 -04:00
Iconoclast
8857d37876
Count loop iterations with unsigned int, not int.
Although there is no functional difference between a signed int and an unsigned int when just looping over vector elements 0 to 7, repeatedly seeing an inconsistent mix of the two across different vector functions has been an ongoing distraction for years. It should be the same everywhere. Of the two, unsigned int is the type which always fits within size_t from stddef.h, the safe type for memory pointers and dereference indices.
2018-03-18 18:19:02 -04:00
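A minimal sketch of the loop convention described in the commit above, assuming eight 16-bit elements per vector. The macro `N` and the function `sum_lanes` are hypothetical names chosen for illustration, not identifiers taken from the repository.

```c
#include <stdint.h>

#define N 8 /* assumed element count of one vector register */

/* Sum all elements of a vector, indexing with an unsigned counter
 * so the subscript can never be negative. */
static int32_t sum_lanes(const int16_t v[N])
{
    int32_t total = 0;
    unsigned int i;

    for (i = 0; i < N; i++)
        total += v[i];
    return total;
}
```

Using the same counter type in every vector function means the loop headers read identically, which is the consistency the commit message asks for.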
Iconoclast
af06eddbdd
fixed paste fail
2018-03-18 17:39:35 -04:00
Iconoclast
81c6bd1652
optimized MAC overflow carry when subtracting by -1
2018-03-18 17:15:29 -04:00
Iconoclast
b9e6b43ce5
vectorized VMACU
vmacu_old.asm function has 99 instructions.
vmacu_new.asm function has 50 instructions.
2018-03-18 10:28:13 -04:00
Iconoclast
cc4a5bb619
vectorized VMACF
New VMACF with manually written SSE2 is 45 instructions.
Old VMACF with auto-vectorized C code was 91 instructions.
2018-03-17 21:56:20 -04:00
unknown
71356b752a
deleted VMACQ from the function table
2015-11-30 23:15:06 -05:00
unknown
7fb9850b68
fixed possible PIC linkage faults by moving merge to static
2015-08-16 09:51:04 -04:00
unknown
a4a7f4bd8e
forgot to modernize a few types
2015-01-18 16:39:59 -05:00
unknown
bfd74741f9
force vectorization of unsigned multiply, overflow and VMADL clamp
2014-10-28 20:50:10 -04:00
unknown
55ad9ad9d8
optimized VMADN with static overflow, carry and multiply-add
2014-10-28 15:35:05 -04:00
unknown
6d17d19dc6
correspond VMUDM intrinsics to multiply-accumulate variation
2014-10-26 23:58:18 -04:00
unknown
5cce9f457e
new algorithm for mixed signed * unsigned factorization
2014-10-23 16:45:01 -04:00
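The commit above does not spell out its algorithm, so the following is not a reconstruction of it. It is only a sketch of one well-known way to get the high 16 bits of a signed * unsigned product when only an unsigned high multiply (such as `_mm_mulhi_epu16`) is available: multiply the raw bit patterns as unsigned, then subtract the unsigned factor wherever the signed factor was negative. All function names below are hypothetical, and scalar C stands in for one SSE2 lane.

```c
#include <stdint.h>

/* Scalar stand-in for one lane of _mm_mulhi_epu16. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* High 16 bits of (signed s) * (unsigned u).
 * Reading a negative s as unsigned adds 65536 to it, which adds
 * exactly u to the high half of the product, so u is subtracted back. */
static uint16_t mulhi_s16_u16(int16_t s, uint16_t u)
{
    uint16_t hi  = mulhi_u16((uint16_t)s, u);
    uint16_t fix = (uint16_t)((s < 0) ? u : 0);

    return (uint16_t)(hi - fix);
}

/* Reference: full-width product, floor-divided by 65536, low 16 bits kept. */
static uint16_t mulhi_s16_u16_ref(int16_t s, uint16_t u)
{
    int64_t p = (int64_t)s * (int64_t)u;
    int64_t q = p / 65536;

    if (p % 65536 < 0)
        q -= 1; /* C division truncates toward zero; adjust to floor */
    return (uint16_t)(q & 0xFFFF);
}
```

In vector form the correction term would typically be built from a sign mask of the signed operand ANDed with the unsigned operand, but that mapping is an assumption here rather than something the commit states.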
unknown
ef09b4eb5d
redesign VMUDN with carry and overflow/underflow SSE logic
2014-10-22 22:45:34 -04:00
unknown
f810a85e31
refer unsigned overflow to `negative' mask
2014-10-21 19:58:10 -04:00
unknown
c7a468e3d7
corresponding optimizations to VMUDL (same multiply, diff. clamp)
2014-10-21 00:05:26 -04:00
unknown
9dbdcc490c
restyled some optimizations and fixed 48-bit MADD sign-extension
2014-10-20 22:25:01 -04:00
unknown
b832e39a92
merged bi-arch VMULF template into optimized SIMD mulf
2014-10-20 00:57:40 -04:00
unknown
d768f51077
more direct multiply-add high operation without bi-arch template
2014-10-18 22:27:08 -04:00
unknown
79c5aa0cf4
removed bi-arch template for VMUDH
2014-10-17 22:43:43 -04:00
unknown
291e7fb10b
remove bi-arch template for VMUDL as mudl was greatly simplified in SSE
2014-10-17 19:02:02 -04:00
unknown
158a4d0b60
pass only 2 XMM operands, w/ no return slot ifndef ARCH_MIN_SSE2
2014-10-16 00:43:37 -04:00
unknown
f1481dd39b
restructured modular layout of the source, dropped some optional features
2014-10-09 16:45:55 -04:00