# SHA encoding

In several places we use a truncated, URL-safe Base 64 encoding of a SHA-512 checksum.

# SHA-512

SHA-512 is a [cryptographic hash function](https://en.wikipedia.org/wiki/Cryptographic_hash_function) specified in [NIST FIPS 180-4](http://dx.doi.org/10.6028/NIST.FIPS.180-4).

A cryptographic hash function can be computed quickly and produces a relatively short result. Collisions (i.e. two different arguments that give the same value) are not only unlikely to occur by chance; because the function is cryptographic, they are also difficult to construct deliberately.

For this reason a SHA-512 checksum can safely be used to identify a long value (for example, the content of a file) by the short generated checksum alone.
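
As a minimal sketch of this idea, the Python snippet below computes the SHA-512 checksum of a byte string and of a file using the standard `hashlib` module. The function names `content_checksum` and `file_checksum` and the chunk size are illustrative assumptions, not names used by this project.

```python
import hashlib


def content_checksum(data: bytes) -> bytes:
    """Return the raw 64-byte SHA-512 digest of `data` (hypothetical helper)."""
    return hashlib.sha512(data).digest()


def file_checksum(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Return the raw SHA-512 digest of a file's content (hypothetical helper)."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()
```

Two files with identical content give the same checksum regardless of their names or locations, which is what makes the checksum usable as an identifier.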

We use SHA-512 as the basis because on 64-bit hardware it is faster to compute than other alternatives, while still being safe.

As the name implies, SHA-512 generates 512-bit checksums, which are relatively long, so we often truncate them. Truncation increases the probability of collisions; a lower bound for random input can be calculated from the [birthday problem](https://en.wikipedia.org/wiki/Birthday_problem).
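
To get a feeling for the numbers, the following sketch evaluates the usual birthday approximation `1 - exp(-n(n-1) / 2^(b+1))` for `n` uniformly random `b`-bit values. The function name `collision_probability` is an assumption made for this example, and the result is only an estimate under the random-input assumption.

```python
import math


def collision_probability(num_items: int, bits: int) -> float:
    """Approximate chance that at least two of `num_items` uniformly random
    `bits`-bit values coincide (birthday approximation)."""
    pairs = num_items * (num_items - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed with expm1 so that very small
    # probabilities are not rounded to zero.
    return -math.expm1(-pairs / 2.0 ** bits)
```

For example, `collision_probability(10**12, 168)` is roughly on the order of 1e-27, while the same number of items with a 64-bit truncation would almost certainly collide.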

Luckily, the construction of SHA-512 allows for easy truncation while maintaining its good cryptographic properties, as discussed in detail in section 5.1, "Truncated Message Digest", of [SP 800-107](http://csrc.nist.gov/publications/nistpubs/800-107-rev1/sp800-107-rev1.pdf).

## Base 64

The SHA-512 checksum is a binary sequence, which can contain bytes that are not valid characters, so an encoding is needed to represent it as a string. [Hex encoding](https://en.wikipedia.org/wiki/Hexadecimal) is often used, but it needs 2 characters (16 bits) to represent 8 bits, making the encoded value twice as long.

Because we would like to truncate as much as possible, we use the URL-safe version of the [Base 64](https://en.wikipedia.org/wiki/Base64) encoding, which uses alphanumeric characters plus '-' and '\_' to encode the values. It encodes 6 bits in each 8-bit character, making for shorter values.
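
The difference in length can be checked with a short Python sketch that encodes the same full, untruncated digest both ways. The input `b"example"` is arbitrary and only serves as an illustration.

```python
import base64
import hashlib

digest = hashlib.sha512(b"example").digest()  # 64 raw bytes (512 bits)

hex_text = digest.hex()
b64_text = base64.urlsafe_b64encode(digest).decode("ascii")

print(len(hex_text))  # 128 characters: 4 bits per character
print(len(b64_text))  # 88 characters, the last 2 of which are '=' padding
```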

For internal Gids we use 28 characters, which correspond to 168 bits of the checksum.
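
Putting the pieces together, a 28-character value could be produced along the following lines. This is only a sketch: the helper name `make_gid` is hypothetical, and it assumes that the leftmost bits of the digest are the ones kept, which the text above does not state explicitly. Because 168 bits are exactly 21 bytes and exactly 28 Base 64 characters, truncating the raw digest before encoding gives the same result as truncating the encoded string.

```python
import base64
import hashlib

GID_CHARS = 28                  # length stated in the text
GID_BYTES = GID_CHARS * 6 // 8  # 168 bits = 21 bytes


def make_gid(data: bytes) -> str:
    """Hypothetical helper: 28-character URL-safe Base 64 id from SHA-512."""
    digest = hashlib.sha512(data).digest()
    truncated = digest[:GID_BYTES]  # assumption: keep the leftmost 168 bits
    # 21 bytes is a multiple of 3, so the encoding needs no '=' padding.
    return base64.urlsafe_b64encode(truncated).decode("ascii")
```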

# Conclusion

A truncated Base 64 encoding of a SHA-512 checksum can be an effective way to create short unique values that depend only on longer values. The values are reproducible, and their generation can be distributed because there is no need for a central authority. Depending on the truncation length, the collision probability can vary from unlikely to effectively impossible.