Originally these functions were intended to work around
problems experienced by RSP vector register caching,
but we don't use it anymore so we can just nix them.
Fixes #41
Add a "running" boolean to the master device struct, and set it to false
when the main window is closed. All the tight inner while (1) loops now
become while (running).
Closes #24
Thanks to simer and Happy for pointing out something that
also cropped up in MAME:
http://forums.bannister.org/ubbthreads.php?ubb=showflat&Number=94626#Post94626
This hack fixes Banjo-Kazooie.
Fixes a bug introduced in a4f0d72. read_pif_rom and read_pif_ram were
replaced with a unified read_pif_rom_and_ram, but the excess si instance
remained in the instance mapping.
This field is required in order to distinguish between regional
versions where the game ID is the same but the save types differ,
such as the Castlevania games.
Also added more Japanese-specific game IDs and edited some descriptions.
Up until now, the RSP was storing instruction words in big-
endian format. Thus, each fetch on an x86 host required a
byteswap. This is wasteful, so use host byte ordering for
the ICACHE (as the VR4300 does now).
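The idea can be sketched in C: swap each word once when the cache line is filled, so every later fetch is a plain host-order load. Function names and signatures here are illustrative, not the emulator's actual API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Swap the bytes of a 32-bit word; compilers typically lower this to bswap. */
static uint32_t byteswap_32(uint32_t word) {
  return (word << 24) | ((word << 8) & 0x00FF0000u) |
         ((word >> 8) & 0x0000FF00u) | (word >> 24);
}

/* Fill an ICACHE line from big-endian guest memory, storing the words
 * in host byte order so each later fetch is just a load. A little-endian
 * host is assumed here; on a big-endian host the swap would be a no-op. */
static void icache_fill_line(uint32_t *line, const uint8_t *mem, unsigned words) {
  for (unsigned i = 0; i < words; i++) {
    uint32_t raw;
    memcpy(&raw, mem + 4 * i, sizeof(raw));
    line[i] = byteswap_32(raw);
  }
}
```

The swap cost is paid once per cache fill instead of once per fetch, which is the whole point of the change.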
izy noticed that the audio buffers are usually >= 64 bytes
in size and aligned to 16 bytes. This makes them very good
candidates for SSE (instead of swapping a word at a time).
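A minimal sketch of that approach, assuming SSSE3 and the alignment/size guarantees noted above (the function name and signature are hypothetical):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* Byteswap four 32-bit words per iteration with PSHUFB. Requires a
 * 16-byte-aligned buffer whose length is a multiple of four words. */
__attribute__((target("ssse3")))
static void byteswap_buffer_sse(uint32_t *buf, size_t words) {
  /* Shuffle control that reverses the bytes within each 32-bit lane. */
  const __m128i key = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                   4, 5, 6, 7, 0, 1, 2, 3);
  for (size_t i = 0; i < words; i += 4) {
    __m128i v = _mm_load_si128((const __m128i *) (buf + i));
    _mm_store_si128((__m128i *) (buf + i), _mm_shuffle_epi8(v, key));
  }
}
```

One PSHUFB handles 16 bytes at a time, versus four separate bswap operations in the scalar path.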
simer suggested (and implemented) the use of ROM IDs instead
of titles: "I've also found that the header name in some cases
is too imprecise; for example, "TOP GEAR RALLY" has EEPROM 4K
for the Japanese and European versions, but not for the American."
This optimization removes the LUT in LWL/LWR.
At the moment, when the LUT is used inline, this code is generated:
OR LUTAddr(offset), dqm
That is something like:
OR 0x400760(,%rdi,8),dqm
The code equivalent to "mov %edi,%edi" from the function above can be removed.
I will assume, anyway, that accessing the LUT and updating the "dqm" variable
generates a single instruction with a memory access.
With the patch the generated code is:
add $0xfffffffd,%edi
sbb %rax,%rax
OR %rax, dqm
Thus my patch increases the instruction count by two.
The LUT has 3 advantages on its side:
- The function VR4300_LWL_LWR() will use the value read from the LUT only once
and only for a logic-OR.
- On x86 a logic-OR is an operation that can work with the source operand read
  from memory.
- The "offset" variable is pre-calculated and can be used "as is" by the LUT.
The code with my patch (without the LUT) has only one advantage on its side:
- The LUT (memory access) is removed
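The trade-off can be sketched in C. The LUT contents and names below are illustrative assumptions; only the shape of the code matters. A compiler can lower the comparison in the second version to the ADD/SBB pair shown above:

```c
#include <assert.h>
#include <stdint.h>

/* LUT version: the mask comes from a table load (contents illustrative). */
static const uint64_t lwl_lwr_lut[4] = {0, 0, 0, ~0ULL};

static uint64_t dqm_with_lut(unsigned offset) {
  return lwl_lwr_lut[offset & 0x3];
}

/* Patched version: the comparison can be lowered to ADD/SBB, producing
 * an all-ones/all-zeros mask without any memory access. */
static uint64_t dqm_without_lut(unsigned offset) {
  return -(uint64_t) ((offset & 0x3) >= 3);
}
```

Both produce the same mask to OR into "dqm"; the difference is purely whether a table load sits on the hot path.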
izy managed to remove another LUT used in add/sub-related
instructions. The devil is in the details (see commit).
<new>:
00000000004006b0 <rsp_addsub_mask>:
4006b0: c1 ef 02 shr $0x2,%edi
4006b3: 19 c0 sbb %eax,%eax
4006b5: c3 retq
<old>:
00000000004006d0 <rsp_addsub_mask>:
4006d0: 83 e7 02 and $0x2,%edi
4006d3: 8b 04 bd 80 07 40 00 mov 0x400780(,%rdi,4),%eax
4006da: c3 retq
"You see that this patch doesn't increase the amount of
instructions. They are always two/three/four instructions
and with automatic register selection. This is always the
case with a MOV from memory... you can load to any register,
but the same will happen with a SBB over itself. That is
also the reason why when the function is inlined it won't
require any special register (such as the EAX:EDX pair,
the "cltd" instruction you see in the 32 bit code is only
a coincidence caused by the optimizations done by the gcc
and isn't mandatory).
The System V AMD64 calling convention puts the input
parameter in rdi, but wherever the selector is placed
nothing changes. The output parameter is in rax, but
MOV/SBB can work with any register when inlined.
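In C, the two versions of the mask computation might look like this (the table contents and the exact mask semantics are assumptions inferred from the disassembly above):

```c
#include <assert.h>
#include <stdint.h>

/* <old>: AND reduces the index to 0 or 2, then the table load (with
 * the MOV's built-in scale-by-4) produces the mask. Contents illustrative. */
static const uint32_t addsub_lut[4] = {0, 0, ~0u, 0};

static uint32_t addsub_mask_old(uint32_t iw) {
  return addsub_lut[iw & 0x2];
}

/* <new>: SHR shifts bit 1 of the word into the carry flag, and
 * SBB of a register with itself turns the flag into an
 * all-ones/all-zeros mask -- no table, no load. */
static uint32_t addsub_mask_new(uint32_t iw) {
  return -((iw >> 1) & 0x1);
}
```

As the quote notes, neither MOV-from-memory nor SBB pins the result to a particular register, so inlining stays cheap either way; the win is dropping the load.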
izy noticed that the branch LUT was generating memory moves
and could be replaced with an inlined function that coerces
gcc into generating a lea in its place:
4005ac: 8d 1c 00 lea (%rax,%rax,1),%ebx
4005af: c1 fb 1f sar $0x1f,%ebx
4005b2: f7 d3 not %ebx
(no memory access)
4005b9: c1 e8 1e shr $0x1e,%eax
4005bc: 83 e0 01 and $0x1,%eax
4005bf: 44 8b 24 85 90 07 40 mov 0x400790(,%rax,4),%r12d
(original has memory access)
This ends up optimizing branch instructions quite nicely:
"You see that when you use "mask" you execute "~mask". The
compiler understands that ~(~(partial_mask)) = partial_mask
and removes both "NOTs". So in this case my version uses 2
instructions and no memory access/cache pollution."
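A C sketch of the same idea (the LUT contents are assumed from the disassembly; the shift-and-invert form is what coaxes the compiler into the LEA/SAR/NOT sequence with no load):

```c
#include <assert.h>
#include <stdint.h>

/* Original: extract bit 30 and load the mask from a table. */
static const uint32_t branch_lut[2] = {~0u, 0};

static uint32_t branch_mask_lut(uint32_t iw) {
  return branch_lut[(iw >> 30) & 0x1];
}

/* Inlined replacement: LEA doubles the word (moving bit 30 into the
 * sign bit), SAR smears it across the register, and NOT inverts it.
 * When a caller then computes ~mask, the compiler cancels both NOTs,
 * leaving just two instructions and no memory access. */
static uint32_t branch_mask_inline(uint32_t iw) {
  return ~(uint32_t) ((int32_t) (iw << 1) >> 31);
}
```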