Functionally, it makes no difference whether a loop over vector elements 0 to 7 uses a signed int or an unsigned int, but repeatedly seeing the two mixed inconsistently across different vector functions has been an ongoing distraction for years. It should be the same everywhere. Between signed int and unsigned int, unsigned int is the one whose values always fit within size_t from stddef.h, the unsigned type meant for object sizes and array indices.
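For illustration only, here is roughly what the consistent loop convention looks like. The function name vector_add, the uint16_t lane type, and the N macro are made up for this sketch and are not the actual identifiers in the source tree.

```c
#include <stdint.h>

#define N 8 /* assumed number of elements per vector */

/* Hypothetical helper: lane-wise 16-bit add with wrap-around. */
static void vector_add(uint16_t *vd, const uint16_t *vs, const uint16_t *vt)
{
    unsigned int i; /* same index type in every vector function */

    for (i = 0; i < N; i++)
        vd[i] = (uint16_t)(vs[i] + vt[i]);
}
```

Since i only ever runs from 0 to 7, the generated code should be the same either way. The point is just to have one spelling everywhere.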
Also got rid of the SSE2 code for shuffling. It takes up too many extra code bytes in the main interpreter's instruction cache and requires an extra branch anyway, and an SSSE3 solution would still need at least 3 such large SIMD instructions. So let's see if we can't safely overhaul this without a speed drop.
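As a rough idea of the kind of plain scalar shuffle that can stand in for the SIMD path, here is a minimal sketch. The name vector_shuffle, the order[] table, and the int16_t lane type are assumptions for the example, not the code that was actually committed.

```c
#include <stdint.h>

#define N 8 /* assumed number of elements per vector */

/* Hypothetical: vd[i] = vs[order[i]] for each of the 8 lanes. */
static void vector_shuffle(int16_t *vd, const int16_t *vs,
                           const uint8_t order[N])
{
    int16_t tmp[N]; /* staging copy so vd may alias vs */
    unsigned int i;

    for (i = 0; i < N; i++)
        tmp[i] = vs[order[i] & 7]; /* mask keeps the index in 0..7 */
    for (i = 0; i < N; i++)
        vd[i] = tmp[i];
}
```

Whether a compiler turns this into something as fast as the hand-written SSE2 is exactly what needs to be measured.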