Up until now, the RSP was storing instruction words in big-
endian format. Thus, each fetch on an x86 host requires a
byteswap. This is wasteful, so use host byte ordering for
the ICACHE (as the VR4300 does now).
izy managed to remove another LUT used in add/sub-related
instructions. The devil is in the details (see commit).
<new>:
00000000004006b0 <rsp_addsub_mask>:
4006b0: c1 ef 02 shr $0x2,%edi
4006b3: 19 c0 sbb %eax,%eax
4006b5: c3 retq
<old>:
00000000004006d0 <rsp_addsub_mask>:
4006d0: 83 e7 02 and $0x2,%edi
4006d3: 8b 04 bd 80 07 40 00 mov 0x400780(,%rdi,4),%eax
4006da: c3 retq
"You see that this patch doesn't increase the amount of
instructions. They are always two/three/four instructions
and with automatic register selection. This is always the
case with a MOV from memory... you can load to any register,
but the same will happen with a SBB over itself. That is
also the reason why when the function is inlined it won't
require any special register (such as a the EAX:EDX pair,
the "cltd" instruction you see in the 32 bit code is only
a coincidence caused by the optimizations done by the gcc
and isn't mandatory).
The System V AMD64 calling convention puts the input
parameter in rdi, but wherever the selector is placed
nothing changes. The output parameter is in rax, but
MOV/SBB can work with any register when inlined.
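The trick above can be sketched in C. This is a reconstruction from the disassembly, not the actual source: the function name matches the listing, but the body and bit position are assumptions. In the MIPS SPECIAL funct field, bit 1 is what separates ADD/ADDU (0x20/0x21) from SUB/SUBU (0x22/0x23), so negating that single bit yields an all-ones mask for subtracts and zero for adds. gcc is free to compile the negation of a shifted-out bit as SHR + SBB, which is what the `<new>` listing shows.

```c
#include <stdint.h>

/* 0x00000000 for add-class funct codes, 0xFFFFFFFF for sub-class.
 * gcc can lower the negated bit to SHR (bit 1 falls into CF) followed
 * by SBB %eax,%eax, with no table load. */
static inline uint32_t rsp_addsub_mask(uint32_t iw) {
  return (uint32_t) 0 - ((iw >> 1) & 0x1);
}
```
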
izy noticed that the branch LUT was generating memory moves
and could be replaced with an inlined function that coerces
gcc into generating a lea in its place:
4005ac: 8d 1c 00 lea (%rax,%rax,1),%ebx
4005af: c1 fb 1f sar $0x1f,%ebx
4005b2: f7 d3 not %ebx
(no memory access)
4005b9: c1 e8 1e shr $0x1e,%eax
4005bc: 83 e0 01 and $0x1,%eax
4005bf: 44 8b 24 85 90 07 40 mov 0x400790(,%rax,4),%r12d
(original has memory access)
This ends up optimizing branch instructions quite nicely:
"You see that when you use "mask" you execute "~mask". The
compiler understands that ~(~(partial_mask)) = partial_mask
and removes both "NOTs". So in this case my version uses 2
instructions and no memory access/cache pollution."
Oftentimes, many of our controllers are just doing a
simple countdown and don't perform any real work for the
cycle. Pull those parts out into headers so that the
compiler can 'see' that and optimize accordingly.
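A hypothetical sketch of what moving the countdown into a header looks like; the struct and function names here are invented for illustration. Once the body lives in a header as `static inline`, the compiler can see that the common case is just a decrement-and-test and fold it directly into the caller's loop, leaving the cold "real work" path out of line.

```c
#include <stdbool.h>
#include <stdint.h>

struct countdown_controller {
  uint32_t cycles_remaining;
};

/* Returns true only when the countdown expires and real work is due.
 * In a header, the fast path (decrement, return false) inlines away. */
static inline bool countdown_controller_cycle(struct countdown_controller *c) {
  if (c->cycles_remaining > 0) {
    c->cycles_remaining--;
    return false;
  }
  return true;
}
```
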
Replaced all references to simulation with emulation
Updated copyright year
Updated .gitignore to reduce chances of random files being uploaded to
the repo
Added .gitattributes to normalize all text files, and to ignore binary
files (which includes the logo and the NEC PDF)
No need to separate all these functions when they contain so
much common code, so start combining things for the sake of
locality and predictor effectiveness (and size). In addition
to these benefits, the CPU backend is usually busy during the
execution of these functions, so suffering a misprediction
isn't as painful (especially seeing as we can potentially
improve the prediction from the indirect branch).