Not sure why GCC was optimizing out these global register variable
writes when LTO (-flto) was enabled, but ensure that it does not by
using an inline assembly block.
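
A minimal sketch of the idea, not the actual CEN64 change. It assumes GCC or Clang extended asm. The variable is an ordinary static standing in for a true global register variable (binding one to a host register is target-specific), and the names `pinned_state`, `write_pinned_state`, and `read_pinned_state` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a global register variable. */
static uint32_t pinned_state;

/* Perform the write, then feed the variable to an empty asm statement.
 * Because the asm is volatile and names the variable as an input, the
 * compiler must have the freshly written value available at this point
 * and cannot treat the write as dead. */
static inline void write_pinned_state(uint32_t value) {
  pinned_state = value;
  __asm__ __volatile__("" : : "r"(pinned_state));
}

static inline uint32_t read_pinned_state(void) {
  return pinned_state;
}
```

The asm template is empty, so no instruction is emitted; only the constraint matters.
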
Periodically (~1000x) poll for input instead of waiting for a frame
boundary. Also relinquish the render_lock more aggressively in an
attempt to step out of the way of the simulator.
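
A rough sketch of counter-based polling, under stated assumptions: the message does not say what "~1000x" is measured against, so the interval of 1000 loop iterations is a guess, and `input_poller`, `input_poller_init`, and `input_poller_tick` are hypothetical names. Locking is left out entirely.

```c
#include <stdint.h>

/* Assumed interval: the message only says "~1000x". */
#define POLL_INTERVAL 1000u

struct input_poller {
  uint32_t countdown;  /* iterations left until the next poll */
  uint32_t polls;      /* how many polls have happened so far */
};

static void input_poller_init(struct input_poller *p) {
  p->countdown = POLL_INTERVAL;
  p->polls = 0;
}

/* Call once per loop iteration. Returns 1 on the iterations where
 * input should be polled, 0 otherwise. */
static int input_poller_tick(struct input_poller *p) {
  if (--p->countdown != 0)
    return 0;

  /* A real frontend would poll window events here. */
  p->countdown = POLL_INTERVAL;
  p->polls++;
  return 1;
}
```
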
Since the CEN64 core now runs in its own thread (and doesn't use
the FPU), we can steal the host's FPU state register and not have
to worry about preserving it.
Along with that major overhaul, don't force "extra" features like
simulation statistics and debugging if the user doesn't want them.
Including that code, even when it is not run, seems to perturb
register allocation ever so slightly.
GCC (and probably other compilers) doesn't like working with 16-bit
types and will zero-extend where needed. Save some overhead and
just store the state as a 32-bit type.
Since we have to convert to an integer, as well as round in some
direction, these intrinsics (_mm_ceil_*, _mm_floor_*, _mm_round_*)
aren't of much use to us.
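
One way to get the conversion and the rounding in a single step on x86, shown as a sketch rather than as what the commit actually does: `_mm_cvtss_si32` converts using whatever rounding mode is currently set in MXCSR. The helper name `convert_with_rounding` is hypothetical, and the save/restore of MXCSR is only there to keep the example self-contained.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Convert a float to int32 using the given SSE rounding mode
 * (_MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP or
 * _MM_ROUND_TOWARD_ZERO). */
static int32_t convert_with_rounding(float value, unsigned rounding_mode) {
  unsigned saved = _mm_getcsr();          /* remember the host MXCSR */
  int32_t result;

  _MM_SET_ROUNDING_MODE(rounding_mode);   /* select the direction */
  result = _mm_cvtss_si32(_mm_set_ss(value)); /* round + convert at once */

  _mm_setcsr(saved);                      /* restore the host MXCSR */
  return result;
}
```

By contrast, `_mm_floor_*` and friends return a floating-point value, so a separate conversion would still be needed afterwards.
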
We will likely only hit a couple of the slow_cycle functions in
the VR4300 code when we interrupt. Because of this, push everything
just before what will be hit after a data cache fault into the cold
section.
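
A sketch of how a rarely taken path can be marked cold, assuming the GCC `cold` function attribute is the mechanism in use (the message does not say). `handle_rare_fault` and `step_address` are hypothetical names and do not correspond to the real slow_cycle functions.

```c
#include <stdint.h>

/* GCC documents that, on many targets, functions marked cold are grouped
 * into a special subsection of the text section. Other compilers get a
 * no-op fallback here. */
#if defined(__GNUC__)
#define COLD_FUNC __attribute__((cold, noinline))
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define COLD_FUNC
#define UNLIKELY(x) (x)
#endif

/* Hypothetical rare path: align the faulting address to a 16-byte line. */
COLD_FUNC static uint64_t handle_rare_fault(uint64_t address) {
  return address & ~UINT64_C(0xF);
}

/* Hypothetical hot path: the fault case is hinted as unlikely. */
static uint64_t step_address(uint64_t address, int faulted) {
  if (UNLIKELY(faulted))
    return handle_rare_fault(address);

  return address + 4;
}
```
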
Perf reported a window where the backend was busy, and the frontend
was idle. Take advantage of the situation by inserting a branch that
has the potential to filter out (a lot of) instructions from the
backend when it's clogged. This works to our advantage, because more
often than not we aren't executing FPU instructions, or we execute
the FPU instructions in small batches.
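
A sketch of the kind of filtering branch described, under assumptions: the decode flag `OP_FLAG_FPU`, the `decoded_op` struct, and the functions `fpu_only_work` and `execute_step` are all hypothetical, and the counter exists only to make the effect observable.

```c
#include <stdint.h>

#define OP_FLAG_FPU 0x1u  /* hypothetical decode flag */

struct decoded_op {
  uint32_t flags;
};

/* Counts how often the stand-in FPU-only work actually ran. */
static unsigned fpu_work_count;

static void fpu_only_work(void) {
  fpu_work_count++;
}

/* One pipeline step. The early-out branch keeps the FPU-only work
 * off the common, non-FPU path. */
static void execute_step(const struct decoded_op *op) {
  if (!(op->flags & OP_FLAG_FPU))
    return;

  fpu_only_work();
}
```

The trade-off is one extra, usually well-predicted branch in exchange for fewer instructions reaching the backend on non-FPU steps.
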
Add an option to specify architecture support (SSE2, SSSE3, etc.)
for each supported compiler. Update the UI window title to indicate
the architecture folder and support.
We're going to want to instantiate all possible branch targets
ahead of time to avoid self-modifying code (SMC) penalties, so we
want each target to fit into the smallest block of code possible.