izy managed to remove another LUT used in add/sub-related
instructions. The devil is in the details (see commit).
<new>:
00000000004006b0 <rsp_addsub_mask>:
4006b0: c1 ef 02 shr $0x2,%edi
4006b3: 19 c0 sbb %eax,%eax
4006b5: c3 retq
<old>:
00000000004006d0 <rsp_addsub_mask>:
4006d0: 83 e7 02 and $0x2,%edi
4006d3: 8b 04 bd 80 07 40 00 mov 0x400780(,%rdi,4),%eax
4006da: c3 retq
"You see that this patch doesn't increase the amount of
instructions. They are always two/three/four instructions
and with automatic register selection. This is always the
case with a MOV from memory... you can load to any register,
but the same will happen with a SBB over itself. That is
also the reason why when the function is inlined it won't
require any special register (such as a the EAX:EDX pair,
the "cltd" instruction you see in the 32 bit code is only
a coincidence caused by the optimizations done by the gcc
and isn't mandatory).
The System V AMD64 calling convention puts the input
parameter in rdi, but nothing changes no matter where the
selector is placed. The return value lands in rax, but
MOV/SBB can work with any register when inlined.
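In C, the table-free mask can be sketched along these lines (a
minimal sketch inferred from the listings above, not the verbatim
source; treating instruction bit 1 as the add/sub selector is an
assumption):

#include <stdint.h>

/* Derive the all-ones/all-zeros mask arithmetically instead of
 * loading it from a table: subtracting the selector bit from zero
 * yields 0xFFFFFFFF when it is set and 0 otherwise. GCC may lower
 * this to the shr/sbb pair shown in <new>, though the exact
 * codegen is compiler-dependent. */
static inline uint32_t rsp_addsub_mask(uint32_t iw) {
  return (uint32_t) 0 - ((iw >> 1) & 0x1);
}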
izy noticed that the branch LUT was generating memory moves
and could be replaced with an inlined function that coerces
gcc into generating a lea in its place:
4005ac: 8d 1c 00 lea (%rax,%rax,1),%ebx
4005af: c1 fb 1f sar $0x1f,%ebx
4005b2: f7 d3 not %ebx
(no memory access)
4005b9: c1 e8 1e shr $0x1e,%eax
4005bc: 83 e0 01 and $0x1,%eax
4005bf: 44 8b 24 85 90 07 40 mov 0x400790(,%rax,4),%r12d
(original has memory access)
This ends up optimizing branch instructions quite nicely:
"You see that when you use "mask" you execute "~mask". The
compiler understands that ~(~(partial_mask)) = partial_mask
and removes both "NOTs". So in this case my version uses 2
instructions and no memory access/cache pollution."
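In C, the lea/sar/not sequence corresponds to something like this
(a hedged sketch based on the listing, not the actual source;
rsp_branch_mask is an illustrative name):

#include <stdint.h>

/* Shift instruction bit 30 up to the sign bit (the lea doubles the
 * word, i.e. iw << 1), smear it across the register with an
 * arithmetic shift, then invert. No memory access; the caller's
 * own ~mask cancels the NOT, exactly as izy describes above. The
 * signed right shift relies on gcc's sign-extending behavior. */
static inline uint32_t rsp_branch_mask(uint32_t iw) {
  return ~(uint32_t) ((int32_t) (iw << 1) >> 31);
}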
Replaced all references to simulation with emulation
Updated copyright year
Updated .gitignore to reduce chances of random files being uploaded to
the repo
Added .gitattributes to normalize all text files, and to ignore binary
files (which includes the logo and the NEC PDF)
Up until now, the simulator assumed that DMEM accesses had to be
aligned (as they must be on the VR4300). This is not actually the
case, so allow scalar memory accesses to arbitrary DMEM addresses.
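For illustration, a byte-wise scalar load needs no alignment at
all (a minimal sketch, not the simulator's actual code;
dmem_read_word is an illustrative name):

#include <stdint.h>

/* Assemble a big-endian 32-bit word one byte at a time so any DMEM
 * address works; the & 0xFFF keeps each access inside the RSP's
 * 4 KB data memory. */
static uint32_t dmem_read_word(const uint8_t *dmem, uint32_t addr) {
  uint32_t word = 0;
  unsigned i;

  for (i = 0; i < 4; i++)
    word = (word << 8) | dmem[(addr + i) & 0xFFF];

  return word;
}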
Tell GCC to optimize cold functions for size and stash them away in
a separate part of the binary. Put the simulator core, meanwhile, on
the hot path. Also, bump optimization to -O3 as we can now "afford"
to do so.
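Concretely, GCC exposes function attributes for this; the function
names below are illustrative, not the simulator's real symbols:

/* cold: optimize for size and place in .text.unlikely;
 * hot: favor speed and keep on the fast path. */
__attribute__((cold)) void rsp_report_fault(const char *message);
__attribute__((hot))  void rsp_execute_cycle(void);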
If we think about how the assembler forms 32-bit immediates, it
usually generates a lui and addiu pair. Well, if we can craft the
simulation such that lui and addiu share the same indirect target
when branching to execution functions, we reduce the chance that
we'll mispredict and suffer a resulting pipeline flush on the
host.
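As a rough sketch of the idea (the struct and names are
illustrative, not the project's actual code), a single handler can
serve both opcodes, so the indirect branch through the dispatch
table always hits the same, well-predicted target:

#include <stdint.h>

struct rsp { uint32_t regs[32]; };  /* stand-in for the real state */

/* LUI and ADDIU share one execution function; the dispatch table
 * maps both opcodes here. The short direct branch inside is far
 * cheaper on the host than an indirect-branch mispredict. */
static void rsp_ex_lui_addiu(struct rsp *rsp, uint32_t iw) {
  uint32_t rt = (iw >> 16) & 0x1F;

  if ((iw >> 26) == 0x0F)  /* LUI: immediate fills the upper half */
    rsp->regs[rt] = (iw & 0xFFFF) << 16;
  else                     /* ADDIU: rs plus sign-extended imm */
    rsp->regs[rt] = rsp->regs[(iw >> 21) & 0x1F]
                  + (uint32_t) (int16_t) iw;
}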
Every cycle counts!