The 64-bit load and store code was generating pretty bad output with
my compiler, so I extracted the code from csiphash and used that instead.
Close ticket 21737
* SHA-3/SHAKE use little endian for certain things, so byteswap as
needed.
* The code was written under the assumption that unaligned access to
quadwords is allowed, which isn't true particularly on non-Intel.
The digest routines use init/update/sum, where sum will automatically
copy the internal state to support calculating running digests.
The XOF routines use init/absorb/squeeze, which behave exactly as stated
on the tin.