1) Setting SP_PC was not resetting the pipeline. As a result, changing
the PC within a HALT/UNHALT sequence still caused previous instructions
in the pipeline (at the old address) to be executed. This is not how
the hardware works: SP_PC takes effect immediately and discards the
whole pipeline.
2) BREAK did not correctly halt the processor at the right instruction,
which in turn caused resumption after HALT to execute the wrong set of
instructions. The problem was that the SP_STATUS change was written
into the EXDF latch, which takes 3 cycles to reach completion. Instead,
we now use the DFWB latch, and we make it abort the RSP cycle if the
processor is halted. This happens at the beginning of the next cycle,
which is the correct moment.
2bis) While we are at it, use rsp_status_write to modify the RSP in
this case, rather than writing the register directly. This change
fixes a race condition: SP_STATUS must be accessed atomically when
cen64 runs in multithreaded mode. To use rsp_status_write, we need to
introduce a nonexistent SP_SET_BROKE bit: we use the MSB, but then
mask it out in MTC0 so that no code can inadvertently set that bit
(see the sketch after this list).
3) When unhalting after BREAK, it's important to keep the correct PC,
which comes from the EX stage (the instruction that was about to be
executed had BREAK not occurred). Before, the code used the IF (fetch)
PC, which is farther in the future.
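
A minimal sketch of the 2bis idea, assuming cen64's existing
rsp_status_write entry point and the SP_SET_HALT write command bit;
the helper functions shown here are illustrative:

#include <stdint.h>

struct rsp;                                             /* RSP context */
void rsp_status_write(struct rsp *rsp, uint32_t value); /* atomic path */

#define SP_SET_HALT  (1U << 1)   /* real SP_STATUS write command bit */
#define SP_SET_BROKE (1U << 31)  /* invented bit: MSB, internal only */

/* Guest MTC0 writes mask the MSB out, so no guest code can ever set
 * the fake SP_SET_BROKE bit inadvertently. */
static void rsp_mtc0_status(struct rsp *rsp, uint32_t value) {
  rsp_status_write(rsp, value & ~SP_SET_BROKE);
}

/* Internal BREAK path: HALT and BROKE are raised through the same
 * atomic rsp_status_write used for guest writes, avoiding the race
 * on SP_STATUS in multithreaded mode. */
static void rsp_do_break(struct rsp *rsp) {
  rsp_status_write(rsp, SP_SET_HALT | SP_SET_BROKE);
}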
Fixes #155
Up until now, the RSP was storing instruction words in big-endian
format. Thus, each fetch on an x86 host required a byteswap. This is
wasteful, so use host byte ordering for the ICACHE (as the VR4300
does now).
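
The idea, sketched (the helper name is illustrative, and
__builtin_bswap32 stands in for cen64's byteswap helper):

#include <stdint.h>

/* Convert once at ICACHE fill time: words are stored in host byte
 * order, so the hot fetch path needs no swap at all. */
static inline uint32_t icache_word_from_be(uint32_t be_word) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return be_word;                     /* host is already big-endian */
#else
  return __builtin_bswap32(be_word);  /* swap once, at line-fill time */
#endif
}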
izy managed to remove another LUT used in add/sub related
instructions. The devil is in the details (see commit).
<new>:
00000000004006b0 <rsp_addsub_mask>:
4006b0: c1 ef 02 shr $0x2,%edi
4006b3: 19 c0 sbb %eax,%eax
4006b5: c3 retq
<old>:
00000000004006d0 <rsp_addsub_mask>:
4006d0: 83 e7 02 and $0x2,%edi
4006d3: 8b 04 bd 80 07 40 00 mov 0x400780(,%rdi,4),%eax
4006da: c3 retq
"You see that this patch doesn't increase the amount of
instructions. They are always two/three/four instructions
and with automatic register selection. This is always the
case with a MOV from memory... you can load to any register,
but the same will happen with a SBB over itself. That is
also the reason why when the function is inlined it won't
require any special register (such as a the EAX:EDX pair,
the "cltd" instruction you see in the 32 bit code is only
a coincidence caused by the optimizations done by the gcc
and isn't mandatory).
The System V AMD64 calling convention puts the input parameter in rdi,
but wherever the selector is placed, nothing changes. The output lands
in rax, but MOV/SBB can work with any register when inlined.
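
For reference, a portable C shape that gcc can lower to the shr/sbb
pair above might look like this (a sketch; that bit 1 of the opcode
word selects the subtract variant is inferred from the disassembly):

#include <stdint.h>

/* All-ones for the SUB(U) variants, all-zeroes for ADD(U): negating
 * the selector bit spreads it across the register, so no two-entry
 * table load is needed. */
static inline uint32_t rsp_addsub_mask(uint32_t iw) {
  return (uint32_t) -(int32_t) ((iw >> 1) & 0x1);
}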
izy noticed that the branch LUT was generating memory moves
and could be replaced with an inlined function that coerces
gcc into generating a lea in its place:
<new>:
4005ac: 8d 1c 00                lea (%rax,%rax,1),%ebx
4005af: c1 fb 1f                sar $0x1f,%ebx
4005b2: f7 d3                   not %ebx
(no memory access)
<old>:
4005b9: c1 e8 1e                shr $0x1e,%eax
4005bc: 83 e0 01                and $0x1,%eax
4005bf: 44 8b 24 85 90 07 40    mov 0x400790(,%rax,4),%r12d
(has memory access)
This ends up optimizing branch instructions quite nicely:
"You see that when you use "mask" you execute "~mask". The
compiler understands that ~(~(partial_mask)) = partial_mask
and removes both "NOTs". So in this case my version uses 2
instructions and no memory access/cache pollution."
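
The inlined helper presumably has this shape (reconstructed from the
disassembly above; the name and the index parameter are assumptions):

#include <stdint.h>

/* All-ones when bit `index' of the instruction word is clear,
 * all-zeroes when it is set: the shift pair replicates the selected
 * bit across the register, and the trailing NOT is the one that
 * cancels against the ~mask at the use site. */
static inline uint32_t rsp_branch_mask(uint32_t iw, unsigned index) {
  return ~(uint32_t) (((int32_t) (iw << (31 - index))) >> 31);
}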
Oftentimes, many of our controllers are just doing a simple countdown
and don't perform any real work for the cycle. Pull those parts out
into headers so that the compiler can 'see' that and optimize
accordingly.
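
Roughly the pattern (all names here are hypothetical):

/* Header-visible fast path: when the controller is only counting
 * down, the compiler can inline this into the main loop and never
 * emit a call. */
struct controller { long cycles_until_work; /* ...device state... */ };

void controller_do_work(struct controller *c);  /* out-of-line path */

static inline void controller_cycle(struct controller *c) {
  if (__builtin_expect(c->cycles_until_work-- > 0, 1))
    return;                  /* nothing to do this cycle */
  controller_do_work(c);     /* real work, defined in the .c file */
}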
Replaced all references to 'simulation' with 'emulation'
Updated copyright year
Updated .gitignore to reduce chances of random files being uploaded to
the repo
Added .gitattributes to normalize all text files, and to ignore binary
files (which includes the logo and the NEC PDF)
No need to keep all these functions separate when they contain so much
common code, so start combining things for the sake of locality,
predictor effectiveness, and code size. In addition to these benefits,
the CPU backend is usually busy during the execution of these
functions, so suffering a misprediction isn't as painful (especially
since we can potentially improve the prediction of the indirect
branch).
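
As an illustration only (hypothetical, reusing the rsp_addsub_mask
sketch from earlier): separate ADD/ADDU and SUB/SUBU handlers can
collapse into one function, so the indirect dispatch always lands on
the same target and the remaining variant selection is branch-free:

#include <stdint.h>

static inline uint32_t rsp_addsub_mask(uint32_t iw) {
  return (uint32_t) -(int32_t) ((iw >> 1) & 0x1);
}

/* One body for ADD/ADDU/SUB/SUBU: (rt ^ mask) - mask conditionally
 * negates rt, so both variants share every remaining instruction. */
static uint32_t rsp_add_sub(uint32_t rs, uint32_t rt, uint32_t iw) {
  uint32_t mask = rsp_addsub_mask(iw);  /* ~0 for SUB(U), 0 for ADD(U) */
  return rs + ((rt ^ mask) - mask);
}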