Both Windows and Linux seem to do a poor job of scheduling the device threads. Until I can think of something more clever, set thread affinity manually.
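
For illustration only, here is a minimal sketch of what pinning a thread by hand can look like. It is not the actual change: the helper names `pin_current_thread` and `run_pinned` are hypothetical, and it assumes Linux, where Python's `os.sched_setaffinity` with a pid of 0 applies to the calling thread. On Windows the equivalent would go through the Win32 `SetThreadAffinityMask` call, which this sketch does not attempt and simply reports as "not pinned".

```python
import os
import threading


def pin_current_thread(cpu):
    """Try to restrict the calling thread to a single CPU.

    Returns True if the affinity was applied, False otherwise.
    Hypothetical helper: Linux-only, no-op on other platforms.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. Windows or macOS: not handled in this sketch
    try:
        # Assumption: on Linux, pid 0 means "the calling thread".
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return False  # CPU not available to this process, or not permitted
    return True


def run_pinned(cpu, work):
    """Run work() on a new thread pinned to cpu; return (pinned, result)."""
    out = {}

    def body():
        # Pin first, so all of work() runs under the requested affinity.
        out["pinned"] = pin_current_thread(cpu)
        out["result"] = work()

    t = threading.Thread(target=body)
    t.start()
    t.join()
    return out["pinned"], out["result"]
```

A device thread would call `pin_current_thread` once at startup, with a CPU chosen from `os.sched_getaffinity(0)` rather than a hard-coded index, since the process may itself be restricted to a subset of cores.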