gcc (and probably other compilers) doesn't like working with 16-bit
types and zero-extends them where needed. Save some overhead and
just store the state as a 32-bit type.
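
As a rough illustration of the idea (not the actual code touched by
this change), the sketch below compares a hypothetical 16-bit state
with the same state held in a 32-bit field. The struct names, the
step functions and the multiply-add update are all made up for the
example. In C, a uint16_t operand is promoted to int before
arithmetic, so the narrow version has to widen on load and truncate
on store, while the wide version only truncates where the 16-bit
value is actually consumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state kept in a 16-bit field. */
struct state16 { uint16_t s; };

/* Same logical state, stored in a 32-bit field instead. */
struct state32 { uint32_t s; };

/* Narrow version: s is promoted for the arithmetic, then the
 * result is truncated back to 16 bits on every store. */
static uint16_t step16(struct state16 *st, uint8_t in)
{
    st->s = (uint16_t)(st->s * 31u + in);
    return st->s;
}

/* Wide version: the arithmetic stays in a native 32-bit value.
 * Unsigned wrap-around is well defined, and the low 16 bits match
 * the narrow version, so truncation happens only on output. */
static uint16_t step32(struct state32 *st, uint8_t in)
{
    st->s = st->s * 31u + in;
    return (uint16_t)st->s;
}
```

Both functions return the same 16-bit values for the same input.
Whether the wide form actually saves instructions depends on the
compiler and target, so it is worth checking the generated code.
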
Much of the architecture-specific code uses compiler-agnostic
intrinsics. For this reason, split it out into an arch/ folder,
leaving only the compiler- and environment-specific code in os/.