Add option for asynchronous GPU processing; add test implementation with explicit SSE for CPU (currently disabled)