Friday, 24 May 2019

String hashes

Strings are evil... but they are human friendly.
The common implementation of strings relies heavily on dynamic memory and is slow, especially for simple comparisons. There are several alternatives, such as pooled strings and string hashes.

With modern C++ you can calculate the hashes of string literals at compile time, which gives the string hash solution a lot of potential. All string literals get converted to hashes, memory stays under control, and comparisons have a fixed cost.
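
As a minimal sketch of the compile-time part (C++17): the hash function here is 32-bit FNV-1a, picked only because it is short, and the names HashString and _sh are made up for this example, not taken from my engine code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// 32-bit FNV-1a. Assumption: any constexpr-friendly hash would do here.
constexpr std::uint32_t HashString(std::string_view text)
{
    std::uint32_t hash = 2166136261u;          // FNV offset basis
    for (char c : text)
    {
        hash ^= static_cast<std::uint8_t>(c);  // mix in one byte
        hash *= 16777619u;                     // FNV prime, wraps modulo 2^32
    }
    return hash;
}

// Hypothetical literal suffix: "Opaque"_sh is an integer, not a string.
constexpr std::uint32_t operator""_sh(const char* text, std::size_t size)
{
    return HashString(std::string_view(text, size));
}

// Both checks are evaluated by the compiler; nothing runs at runtime.
static_assert("a"_sh == 0xe40c292cu, "known FNV-1a value for \"a\"");
static_assert("Opaque"_sh == HashString("Opaque"), "literal and function agree");
```

Comparing two of these values is a single integer comparison, whatever the length of the original string.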

But it is not perfect:
- Collisions: you need a way to detect them, as an undetected collision in your implementation could be disastrous.
- Logging: you need to convert hashes back to strings.
- Debugging: you need to be able to see the string value during debugging.

All these issues can be fixed if you maintain a map that converts from hash to string. It allows you to detect collisions and to convert hashes back to strings. It also solves the debugging issue if you access the map from a natvis file. In the release configuration you can strip out this map and, as a bonus, there are no more literal strings inside the executable.
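
A rough sketch of such a map, to show how one container covers both collision detection and the conversion back to strings. The class and method names are hypothetical and this is not the code from my engine; a real version would also need to be thread safe and compiled out in release.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical debug-only registry: hash -> original string.
class StringHashRegistry
{
public:
    // Returns false when a *different* string is already stored under
    // this hash, which means two strings collided.
    bool Register(std::uint32_t hash, std::string_view text)
    {
        auto [it, inserted] = m_strings.try_emplace(hash, text);
        return inserted || it->second == text;
    }

    // Converts a hash back to its string, for logging.
    std::string_view Lookup(std::uint32_t hash) const
    {
        auto it = m_strings.find(hash);
        return it != m_strings.end() ? std::string_view(it->second)
                                     : std::string_view("<unknown hash>");
    }

private:
    std::unordered_map<std::uint32_t, std::string> m_strings;
};
```

Registering the same string twice is fine; only a second, different string under the same hash is reported, and that is the moment to assert.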

String hashes also solve a pattern that I always fight against in big projects.
Usually there is a group of tools for processing data, for example shaders, materials and meshes. Those tools are (usually) independent executables, but there are some dependencies between them. For example, if the shader defines the pass where it needs to be rendered and the pass is a value in an enum, you usually create a common include file shared between the tools so that each one can serialise the correct index for the enum. But this pattern has a lot of issues: a change to this include file invalidates all the tools and the data, wasting a lot of cached data.

So why not serialise the string instead of an integer? That would break the dependency, making all the serialised data more data driven and not based on a fixed enum. Thanks to string hashes you can just serialise the hash, as the same string produces the same hash in every executable.
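
To make the idea concrete, here is a small sketch where a tool writes the hash of a pass name and the runtime reads it back, with no shared enum header. The function names and the little-endian 4-byte layout are assumptions for this example, and the FNV-1a hash is included only so the snippet stands alone.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// 32-bit FNV-1a; every tool must use the same function for this to work.
constexpr std::uint32_t HashString(std::string_view text)
{
    std::uint32_t hash = 2166136261u;
    for (char c : text)
    {
        hash ^= static_cast<std::uint8_t>(c);
        hash *= 16777619u;
    }
    return hash;
}

// Tool side (hypothetical): append the pass name as 4 little-endian bytes.
void WritePassName(std::vector<std::uint8_t>& out, std::string_view pass_name)
{
    const std::uint32_t hash = HashString(pass_name);
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>(hash >> shift));
}

// Runtime side (hypothetical): rebuild the hash from those 4 bytes.
std::uint32_t ReadPassName(const std::uint8_t* data)
{
    std::uint32_t hash = 0;
    for (int i = 0; i < 4; ++i)
        hash |= static_cast<std::uint32_t>(data[i]) << (8 * i);
    return hash;
}
```

The runtime can then compare the value it read against the hash of the pass name it knows. Adding a new pass only touches the data and the code that uses it, not every tool.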

Extra:

My string hash base class has two template parameters:
- Namespace: it allows hashes to collide across different namespaces and blocks assignments between string hashes from different namespaces (strongly typed string hashes).
- Size: thanks to the namespace, you can reduce the size of the hash; for example, for the pass name you can use 16-bit integers (or even 8-bit).
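
The two parameters above could look roughly like this. It is a simplified sketch, not the class from my repository: the namespace is an empty tag type, the size is given as an unsigned storage type, and the hash is FNV-1a simply truncated to that type. All names are invented for the example.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>

template <typename Namespace, typename Storage>
class StringHash
{
    static_assert(std::is_unsigned_v<Storage>, "Storage must be an unsigned integer");

public:
    constexpr explicit StringHash(std::string_view text) : m_value(Compute(text)) {}

    constexpr Storage Value() const { return m_value; }

    friend constexpr bool operator==(StringHash a, StringHash b) { return a.m_value == b.m_value; }
    friend constexpr bool operator!=(StringHash a, StringHash b) { return a.m_value != b.m_value; }

private:
    // 32-bit FNV-1a, truncated to Storage (assumption: good enough for a sketch).
    static constexpr Storage Compute(std::string_view text)
    {
        std::uint32_t hash = 2166136261u;
        for (char c : text)
        {
            hash ^= static_cast<std::uint8_t>(c);
            hash *= 16777619u;
        }
        return static_cast<Storage>(hash);
    }

    Storage m_value;
};

// Hypothetical namespaces: each tag creates an unrelated type.
struct PassNamespace {};
struct MaterialNamespace {};
using PassName = StringHash<PassNamespace, std::uint16_t>;
using MaterialName = StringHash<MaterialNamespace, std::uint32_t>;

static_assert(sizeof(PassName) == sizeof(std::uint16_t), "pass names take 2 bytes");
static_assert(!std::is_assignable_v<MaterialName&, PassName>, "namespaces do not mix");
```

Because PassName and MaterialName are different types, assigning or comparing one with the other does not compile, and a collision only matters between strings of the same namespace.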


If your project is big, with a lot of different tools using string hashes, you can use an external database instead of a global map inside each tool.
Benefits:
- Better collision detection: you will detect the collision in the tool that first introduces the colliding hash.
- Once in shipping configuration, you can convert hashes from the logs back to strings without leaking the strings, and you can still use natvis for debugging (you need to implement a custom natvis visualizer; in VS2019 this changed to a UIVisualizers plugin).

Code: https://github.com/JlSanchezB/Cute/blob/master/engine/core/string_hash.h
