Fixes #21.
Contrary to other sources, which indicate that the four-bit shuffling element specifier is recycled as a selector for the source element from VT, the only way to pass krom's hardware tests on the VMOV operation with operands illegal to the standard RSP assembler was to replace this notion with the seemingly oversimplified read from `de` instead of `e`. This holds even though `de` is already in use as the selector for which destination slice to write to, so it now selects both the slice read from and the slice written to.
Despite being removed from any references in the corresponding translation unit's functional implementation, the four-bit element shuffling mask is still in use, as with all other vector operations, for pre-shuffling VT[] before jumping into the vector operation interpreter function pointer table.
In addition, the MovIn register is also half-emulated. It is not maintained as a global state machine attribute and only stores the final, hardware-accurate result that was already going to be copied into VD[] anyway rather than the preconceived result of a direct copy from VT[e].
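A minimal sketch of the VMOV behavior described above, assuming VT[] has already been pre-shuffled by the dispatcher. The names `vmov_sketch`, `vt_shuffled` and `mov_in` are hypothetical and not taken from the actual source, and masking `de` to three bits is an assumption of this sketch.

```c
#include <stdint.h>

#define N 8 /* lanes per RSP vector register */

/*
 * Hypothetical sketch: `vt_shuffled` is VT[] after the dispatcher's
 * element pre-shuffle.  `de` selects the lane to read AND the lane to write.
 */
static void vmov_sketch(int16_t vd[N], const int16_t vt_shuffled[N],
                        unsigned int de)
{
    /* Local stand-in for MovIn: not kept as global machine state. */
    const int16_t mov_in = vt_shuffled[de & 7];

    vd[de & 7] = mov_in; /* all other lanes of VD are left untouched */
}
```

Only one lane of VD changes per call; the `e` specifier never appears here because its effect was already applied when VT[] was shuffled.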
Fixes #19.
Disabling the optimized code is perhaps a temporary measure, but the more readable code under the #else clause should absolutely be kept. The optimized version for 2's complement machines has however also been patched with a fix in case it becomes desirable to go back to enabling it for substantial speed gains.
Sign-extension is correct, but only for single-precision reciprocal calculations. Double-precision divides should continue to mask in the zero-extended low 16 bits of the determined vector register slice if the previously executed divide instruction prepared a double-precision result rather than defining a single-precision one.
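The distinction can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's actual code: `divide_input`, `div_in` and `double_precision` are hypothetical names, and `div_in` is assumed to already hold the high half in its upper 16 bits.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of choosing the 32-bit divide input.
 * double_precision != 0 means the previous divide instruction
 * prepared the high half of a double-precision operand.
 */
static int32_t divide_input(int32_t div_in, int16_t vt_lane,
                            int double_precision)
{
    if (double_precision) {
        /* Keep the prepared high 16 bits, mask in the low 16 bits
         * of the lane zero-extended (no sign bits leak upward). */
        return (div_in & ~0xFFFF) | (int32_t)(uint16_t)vt_lane;
    }
    /* Single precision: the lane is simply sign-extended. */
    return (int32_t)vt_lane;
}
```

With a lane value of 0xFFFF, the single-precision path yields -1, while the double-precision path leaves the prepared high half intact and only fills the low 16 bits.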
Although functionally there is no difference (when just looping vector elements from 0 to 7) between using a signed int or an unsigned int, repeatedly seeing an inconsistent mix in usage between the two across different vector functions has been an ongoing distraction for years. It should be the same everywhere, and between signed int and unsigned int, unsigned int is the type which always fits within size_t from stddef.h, the safe type for memory pointers and dereference indices.
This is either for good or just temporary. It depends how much performance is lost from having to call the NOINLINE function, but as this is the actual source of speed hits for the divide operations I find it all that much easier to benchmark it when it is not getting in-lined.
Furthermore, it's usually way down at the bottom of the function hot-spot lists, so I'd rather save my 1 KB of DLL file size than worry about premature optimization for a function that needs more thorough benchmark testing anyway.
This code dates back to when I wanted a central function for shuffling the vectors, at a time when all the COP2 vector op-codes had the shuffling of VR[vt] done locally within them. Since shuffling is now done within the COP2 dispatch before the function call table (global to all the vector instructions), there is no more need to have this prototype for a central function. That code was probably removed over 2 years ago.
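For illustration, a rough sketch of shuffling once in the dispatcher before indexing a function table. All names here (`shuffle_vt`, `toy_and`, `op_table`, `dispatch`) are hypothetical, and `toy_and` is a stand-in operation rather than a faithful RSP op-code; the lane patterns follow the usual RSP element specifier encoding (vector, quarters, halves, whole-lane broadcast).

```c
#include <stdint.h>

#define N 8 /* lanes per RSP vector register */

typedef void (*vector_op)(int16_t vd[N], const int16_t vs[N],
                          const int16_t vt[N]);

/* Expand VT[] according to the four-bit element specifier e. */
static void shuffle_vt(int16_t out[N], const int16_t vt[N], unsigned int e)
{
    unsigned int i;

    for (i = 0; i < N; i++) {
        unsigned int src;

        if (e & 8)
            src = e & 7;             /* broadcast one lane to all 8 */
        else if (e & 4)
            src = (i & 4) | (e & 3); /* halves: one lane per group of 4 */
        else if (e & 2)
            src = (i & 6) | (e & 1); /* quarters: one lane per pair */
        else
            src = i;                 /* plain vector, no shuffle */
        out[i] = vt[src];
    }
}

/* Toy operation: lane-wise AND, standing in for a real op-code. */
static void toy_and(int16_t vd[N], const int16_t vs[N], const int16_t vt[N])
{
    unsigned int i;

    for (i = 0; i < N; i++)
        vd[i] = (int16_t)(vs[i] & vt[i]);
}

static const vector_op op_table[] = { toy_and };

/* The dispatcher shuffles once; the ops only ever see shuffled VT[]. */
static void dispatch(unsigned int op, int16_t vd[N], const int16_t vs[N],
                     const int16_t vt[N], unsigned int e)
{
    int16_t shuffled[N];

    shuffle_vt(shuffled, vt, e & 15);
    op_table[op](vd, vs, shuffled);
}
```

Because the shuffle lives in `dispatch`, no individual operation needs its own shuffling helper or prototype.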
This also fixes some surviving 64-bit warnings with PIC linkage on Windows, reported by tony971, that I missed.
Also got rid of the SSE2 code for shuffling. It takes too much extra byte code in the main interpreter instruction cache and requires an extra branch, and an SSSE3 solution would still require at least 3 such large SIMD instructions. So let's see if we can't safely overhaul this without a speed drop.