First, since the internal register is kept in CPU cycles (not RCP cycles),
we need to double the value written via MTC0/DMTC0.
Second, writing a count equal to compare would cause an infinite loop
because the fault would be triggered while PC was on the instruction
doing MTC0 itself, which would then be re-executed at the end of the
exception. On real hardware, in general, when COUNT==COMPARE, the
interrupt happens a few cycles later, enough for PC to move to other
opcodes. Instead of trying to implement this delay, I've simply made
sure that the interrupt is raised after the opcode executes rather
than before. Also, since the internal counter ticks in CPU cycles and
thus matches COMPARE on two consecutive cycles, we make sure to raise
the CAUSE bit only once.
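A minimal sketch of the write/read paths described above, with hypothetical names (the emulator's actual API differs):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: the internal counter is kept in CPU cycles,
 * which tick at twice the rate of the architected Count register,
 * so a value written via MTC0/DMTC0 is doubled on the way in and
 * halved on the way out. */
static uint64_t internal_count;  /* CPU cycles */

static void write_cp0_count(uint32_t value) {
  internal_count = (uint64_t)value << 1;  /* double on write */
}

static uint32_t read_cp0_count(void) {
  return (uint32_t)(internal_count >> 1);  /* halve on read */
}
```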
Currently, all load/store opcodes (with the exception of LWL/LWR)
mask the low-order bits of an address that causes a TLB exception
before storing it in the BADVADDR COP0 register. This is wrong: the
VR4300 reports the exact faulting address in that register, since
the exception handler may require it.
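As a sketch (hypothetical names), the fix amounts to latching the untouched address:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: record the exact faulting virtual address in
 * BADVADDR instead of a copy with its low bits masked off. */
static uint64_t cp0_badvaddr;

static void raise_tlb_exception(uint64_t vaddr) {
  cp0_badvaddr = vaddr;  /* exact address; no masking of low bits */
  /* ... set EPC and Cause, vector to the handler, etc. ... */
}
```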
simer/sp1187 pointed out that undefined CP0 registers all
share a common value (that is, a write to any undefined CP0
register effectively acts as a write to *all* undefined CP0
registers).
This commit implements the specified behaviour.
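A sketch of that behaviour, backing all undefined registers with one shared cell. The set of undefined register numbers below (7, 21-25, 31 on the VR4300) and all names are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t cp0_regs[32];
static uint32_t cp0_undefined_latch;  /* shared by all undefined regs */

static bool cp0_is_defined(unsigned reg) {
  /* Illustrative: treat 7, 21-25 and 31 as undefined. */
  return !(reg == 7 || (reg >= 21 && reg <= 25) || reg == 31);
}

static void write_cp0(unsigned reg, uint32_t value) {
  if (cp0_is_defined(reg))
    cp0_regs[reg] = value;
  else
    cp0_undefined_latch = value;  /* a write to any hits all of them */
}

static uint32_t read_cp0(unsigned reg) {
  return cp0_is_defined(reg) ? cp0_regs[reg] : cp0_undefined_latch;
}
```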
I seriously screwed up the TLB lookup logic so badly that
only the first 8 TLB entries were being probed. Fix that.
This fixes (at least) Paper Mario and Mario Tennis.
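For illustration, a probe loop over all 32 entries (the bug effectively clamped this to the first 8). The structure and matching rules here are a simplified sketch, not the emulator's actual TLB code:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TLB_ENTRIES 32  /* the VR4300 has 32 entries, not 8 */

struct tlb_entry {
  uint32_t vpn2, page_mask;
  uint8_t asid, global;
};

static struct tlb_entry tlb[NUM_TLB_ENTRIES];

/* Return the index of the matching entry, or -1 on a miss. */
static int tlb_probe(uint32_t vaddr, uint8_t asid) {
  for (unsigned i = 0; i < NUM_TLB_ENTRIES; i++) {
    const struct tlb_entry *e = &tlb[i];
    uint32_t mask = ~(e->page_mask | 0x1FFF);  /* 4KB base page */
    if ((vaddr & mask) == (e->vpn2 & mask) &&
        (e->global || e->asid == asid))
      return (int)i;
  }
  return -1;
}
```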
The action taken for (D) Index_Write_Back_Invalidate was
wrong. As it turns out, the VR4300 manual has an extremely
serious typo in the operation section.
According to the manual, this cache operation should use
the virtual address to index a block (line) in the cache.
If that line is not in the INVALID state, it should be
unconditionally flushed out to memory and the line should
then be invalidated.
The hardware, however, seems to only write back the block
(line) in the event that the line is VALID and DIRTY. It
does, however, invalidate the line regardless of whether
or not the line was DIRTY. That is to say, CLEAN lines get
invalidated as well.
This commit fixes the erroneous behavior.
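The corrected behavior can be sketched like this (line layout and names are hypothetical; only the valid/dirty logic reflects the commit):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dcache_line {
  bool valid, dirty;
  uint32_t tag;
  uint8_t data[16];
};

static unsigned writebacks;  /* counts lines flushed to memory */

static void write_back(struct dcache_line *line) {
  (void)line;  /* would copy line->data out to memory here */
  writebacks++;
}

static void index_write_back_invalidate(struct dcache_line *line) {
  /* Manual (erroneously): write back whenever the line is not
   * INVALID. Hardware: write back only if VALID *and* DIRTY... */
  if (line->valid && line->dirty)
    write_back(line);

  /* ...but invalidate unconditionally, clean lines included. */
  line->valid = false;
  line->dirty = false;
}
```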
This optimization removes the LUT in LWL/LWR:
At the moment, when the LUT is used inline, this code is generated:
OR LUTAddr(offset), dqm
That is something like:
OR 0x400760(,%rdi,8),dqm
The code equivalent to "mov %edi,%edi" from the function above can be
removed. I'll assume that accessing the LUT and updating the "dqm"
variable generates a single instruction with a memory access.
With the patch the generated code is:
add $0xfffffffd,%edi
sbb %rax,%rax
OR %rax, dqm
Thus my patch increases the instruction count by two.
The LUT has 3 advantages on its side:
- The function VR4300_LWL_LWR() will use the value read from the LUT only once
and only for a logic-OR.
- On x86 a logic-OR can take its source operand directly from memory.
- The "offset" variable is pre-calculated and can be used "as is" by the LUT.
The code with my patch (without the LUT) has only one advantage on its side:
- The LUT (memory access) is removed
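The add/sbb pair above computes -(offset >= 3). A C sketch of the equivalent expression (the function name and the exact threshold's meaning in LWL/LWR are assumptions read off the disassembly):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: derive the all-ones/all-zeroes mask
 * arithmetically instead of loading it from a table. gcc compiles
 * a comparison like this to the add/sbb pair shown above:
 *   add $0xfffffffd,%edi   ; CF set iff offset >= 3
 *   sbb %rax,%rax          ; rax = -CF
 * and the result is then OR-ed into "dqm". */
static uint64_t lwl_lwr_mask(uint32_t offset) {
  return -(uint64_t)(offset >= 3);  /* ~0 if offset >= 3, else 0 */
}
```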
izy managed to remove another LUT used in add/sub related
instructions. The devil is in the details (see commit).
<new>:
00000000004006b0 <rsp_addsub_mask>:
4006b0: c1 ef 02 shr $0x2,%edi
4006b3: 19 c0 sbb %eax,%eax
4006b5: c3 retq
<old>:
00000000004006d0 <rsp_addsub_mask>:
4006d0: 83 e7 02 and $0x2,%edi
4006d3: 8b 04 bd 80 07 40 00 mov 0x400780(,%rdi,4),%eax
4006da: c3 retq
"You see that this patch doesn't increase the amount of
instructions. They are always two/three/four instructions
and with automatic register selection. This is always the
case with a MOV from memory... you can load to any register,
but the same will happen with a SBB over itself. That is
also the reason why when the function is inlined it won't
require any special register (such as the EAX:EDX pair;
the "cltd" instruction you see in the 32-bit code is only
a coincidence caused by the optimizations done by gcc
and isn't mandatory).
The System V AMD64 calling convention puts the input
parameter in rdi, but wherever the selector is placed
nothing changes. The output parameter is in rax, but
MOV/SBB can work with any register when inlined.
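For reference, a C sketch that gcc compiles to the shr/sbb pair in the `<new>` listing. The selector being bit 1 of the input word is read off the disassembly (`and $0x2` in the old code, `shr $0x2` feeding CF in the new one); the exact semantics inside the RSP core are an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: broadcast bit 1 of the opcode word into an
 * all-ones/all-zeroes mask, replacing the two-entry LUT:
 *   shr $0x2,%edi   ; CF = bit 1 of edi
 *   sbb %eax,%eax   ; eax = -CF */
static uint32_t rsp_addsub_mask(uint32_t iw) {
  return -((iw >> 1) & 1);  /* all ones if bit 1 set, else zero */
}
```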
izy noticed that the branch LUT was generating memory moves
and could be replaced with an inlined function that coerces
gcc into generating a lea in its place:
4005ac: 8d 1c 00 lea (%rax,%rax,1),%ebx
4005af: c1 fb 1f sar $0x1f,%ebx
4005b2: f7 d3 not %ebx
(no memory access)
4005b9: c1 e8 1e shr $0x1e,%eax
4005bc: 83 e0 01 and $0x1,%eax
4005bf: 44 8b 24 85 90 07 40 mov 0x400790(,%rax,4),%r12d
(original has memory access)
This ends up optimizing branch instructions quite nicely:
"You see that when you use "mask" you execute "~mask". The
compiler understands that ~(~(partial_mask)) = partial_mask
and removes both "NOTs". So in this case my version uses 2
instructions and no memory access/cache pollution."
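A C sketch of the broadcast in the lea/sar pair above: shift bit 30 up into the sign position, then arithmetic-shift it back across the word. The name is hypothetical, and the version in the listing returns the complement (the `not`), which the quote explains cancels against the caller's `~mask`. This assumes gcc's arithmetic right shift for signed integers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: broadcast bit 30 of the opcode word into an
 * all-ones/all-zeroes mask with no memory access:
 *   lea (%rax,%rax,1),%ebx   ; ebx = iw << 1
 *   sar $0x1f,%ebx           ; broadcast the (old) bit 30 */
static uint32_t branch_mask(uint32_t iw) {
  return (uint32_t)((int32_t)(iw << 1) >> 31);
}
```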
Also slightly tighten the emulated memory delays. With
this commit, WDC boots (but crashes shortly after). Seems
like memory timings are coming into play, among other
things.
Replaced all references to simulation with emulation
Updated copyright year
Updated .gitignore to reduce chances of random files being uploaded to
the repo
Added .gitattributes to normalize all text files, and to ignore binary
files (which includes the logo and the NEC PDF)
When a CACHE instruction uses a mapped virtual address,
and a TLB miss results... just ignore it! Clearly, this
isn't the right thing to do, but all documentation is
ambiguous and this seems to float the boat for now.
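In sketch form (hypothetical names, with a stub translation function standing in for the real TLB walk):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static unsigned cache_ops_performed;

/* Stub standing in for the real address translation; for this
 * example only kseg0-range addresses "translate" successfully. */
static bool translate_vaddr(uint32_t vaddr, uint32_t *paddr) {
  if (vaddr >= 0x80000000u && vaddr < 0xA0000000u) {
    *paddr = vaddr - 0x80000000u;
    return true;
  }
  return false;  /* miss */
}

static void cache_op(uint32_t vaddr) {
  uint32_t paddr;

  /* TLB miss on a CACHE op: silently ignore it for now, rather
   * than raising an exception (documentation is ambiguous). */
  if (!translate_vaddr(vaddr, &paddr))
    return;

  cache_ops_performed++;  /* would index/operate on the line here */
}
```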