Thanks to Tiny Tiger and AIO for helping to point this out.
One of the arguments was being overwritten before it was
used, which caused an issue with the SSE2 codepath (while
the SSE4.1 one was fine).
tl;dr: Using LUTs to shift and byteswap all in one x86
instruction is awesome for performance, but makes things
absolutely horrendous to debug.
With this commit, audio mixing on the RSP works properly.
I seriously screwed up the TLB lookup logic so bad that
only the first 8 TLB entries were being probed. Fix that.
This fixes (at least) Paper Mario and Mario Tennis.
Replaced all references to simulation with emulation
Updated copyright year
Updated .gitignore to reduce chances of random files being uploaded to
the repo
Added .gitattributes to normalize all text files, and to ignore binary
files (which includes the logo and the NEC PDF)
No need to separate all these functions when they contain so
much common code, so start combining things for the sake of
locality and predictor effectiveness (and size). In addition
to these benefits, the CPU backend is usually busy during the
execution of these functions, so suffering a misprediction
isn't as painful (especially seeing as we can potentially
improve the prediction from the indirect branch).
On Windows, acc_lo (%xmm5) was clashing with the x64 calling
convention, which states %xmm5 is a volatile register and is
the caller's responsibility to save. We need the register
preserved across calls, so until we have a better solution to
the problem, pick registers that are not volatile according to
the calling convention.
Since Microsoft decided to totally bork their x86_64 calling
convention, defer all Windows builds to non-optimized RSP
routines. When MinGW supports __vectorcall, this change can
be reverted.