tile.cpp: Optimise the common case of drawing an unclipped but possibly flipped 8x8 tile. Instead of calling WRITE_4PIXELS16 16 times, each performing setup and teardown, move the loop into DrawTile16.
tile.h, tile.cpp, gfx.h, gfx.cpp: End the use of global variable GFX.ScreenColors to pass around the current frame's palette. This saves on memory stores/loads.
With the MIPS instruction cache, this means that two consecutive SNES CPU instructions using e.g. the same addressing style or the same opcode have a chance that the second one will use the first one's code and that it will be cached.