Making a Parallel Hash Map. Should you even try?

A lot of people have tread this ground before, is there really anything left to add?

There are a lot of open source hash map implementations out there already. For example, here’s a fairly comprehensive benchmark covering a bunch of them: Comprehensive Hash Map Benchmark. That benchmark compares insert, erase, and lookup performance across 15 different hash map implementations written in both C and C++. I’ll be honest: if you’re thinking about implementing your own non-threadsafe hash map, you really shouldn’t bother unless you have a very specific requirement that isn’t handled by an open source project. But if there’s one thing you can definitely glean from that benchmark, it’s that std::unordered_map is significantly slower than the top contenders.

But when it comes to Parallel Hash Maps, there’s less tread ground for benchmarking.

One article I found on parallel hash map performance is here: PHMap Performance. The hash map from that article is compared against another parallel hash map here: Parlay Hash Map, which in turn compares itself against yet another parallel hash map: Folly Concurrent Hash Map. Parlay Hash Map claims it’s the best, but is that really true? Many times your particular constraints allow you to optimize in ways that something released to a general audience can’t. Or you simply have project requirements that a particular implementation breaks.

That being said, I was still inspired to try writing my own parallel hash map implementation to see if I could build something better than the open source projects for my particular needs. Which meant I needed to create my own benchmark and do my own testing.

Benchmarking setup

Requirements

When benchmarking, it’s always best to start by figuring out the requirements of the system you’re working with. In my case, here are my requirements.

The hash map must allow concurrent inserts, deletes, and lookups to all be possible in parallel.

It’s easier to write code that does manipulations at a per-element level (like “read value X”, “do an operation that stores value Y”) than it is to process everything in batches. While batching is generally more efficient, it can make code harder to follow. It’s important that code is easy to write and read, with optimization reserved for the things that actually need it.

Iterating across the hash map does not need to run in parallel with other operations.

Most data structures don’t support modifications during iteration even in single-threaded circumstances. To do so here would be a pretty hefty lift and would probably never be needed.

The memory address of the stored value must never change.

It just makes stuff easier to work with if you don’t have to consider the possibility that a future addition to the hash map would invalidate a stored reference. In general, I like to avoid long-lived pointers, but there are times when we want to cache or hold onto a reference for efficiency, and we don’t want to perform a lookup every time we need to see a value that never changes.

There must be an interface for a custom allocator

While memory allocation strategies are often part of how a given hash map gets its efficiency, there needs to be a way to override that behavior. The most direct need, in my mind, is that I want a place to label my memory allocations so I can optimize memory usage later on. Sure, there are memory profilers to manage that, but they often work purely from stack traces, and it can be very difficult to parse a trace to answer questions like “how much memory is System A using?”. But that’s a topic for another time. Also, if I can benchmark all the hash maps against the same allocation strategy, it removes a variable from the comparison, since I can be assured that piece of the benchmark is the same across all implementations.
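To make that concrete, here’s a rough sketch of the kind of labeled allocator interface I mean. This is purely illustrative; the names (IAllocator, AllocLabel, MallocAllocator) are hypothetical and not from the benchmark code.

#include <cstddef>
#include <cstdlib>

// Hypothetical labels for tracking which system owns an allocation.
enum class AllocLabel { HashMapNodes, HashMapBuckets, General };

// Hypothetical allocator interface the hash map would call into.
struct IAllocator
{
    virtual ~IAllocator() = default;
    virtual void* Allocate(std::size_t bytes, AllocLabel label) = 0;
    virtual void Deallocate(void* ptr, AllocLabel label) = 0;
};

// Trivial pass-through implementation; a real one could tally bytes per label.
struct MallocAllocator final : IAllocator
{
    void* Allocate(std::size_t bytes, AllocLabel label) override
    {
        (void)label; // a real version would add `bytes` to a per-label counter
        return std::malloc(bytes);
    }
    void Deallocate(void* ptr, AllocLabel label) override
    {
        (void)label;
        std::free(ptr);
    }
};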

There must be an interface for a custom hash function

There are other systems that will need to hash things and compare hashes, and it helps for the sake of consistency if the hash in the hash map matches those other systems. It also eliminates another point of noise when benchmarking different hash map implementations.

Need good performance at low thread counts and smaller map sizes too

Some parallel hash map implementations focus on handling maps with millions of elements and hundreds of threads better than anyone else. But we’re working in an environment full of smaller lists and sets, and many times only 2-4 threads will be working with those maps. So good performance at lower thread counts is critical.

You’ll notice that I didn’t mention low memory usage as a requirement

For the purpose of this test, I’m just interested in something fast. Memory usage is an important consideration in a lot of cases, but in games, which this blog is focused on, speed tends to trump memory requirements since there’s often plenty of memory to spare. If this were something that needed to run on a GPU, that wouldn’t be the case. Also, when an implementation starts to run into memory consumption issues, the solutions tend to involve higher-level strategies like culling or dynamic loading and unloading, or they get really fine-grained and focus on bit packing, dense arrays, or precomputed lookup tables. The size of a general-use hash map tends not to be the concern, at least in my experience.

Tests

All tests were performed with 64-bit integer keys, and the values were either 64-bit integers or a struct containing several different primitive types and an std::string. For many of the hash maps tested, I also included variations with no internal locking; instead, a lock was simply placed around a traditional synchronous hash map. The reason is that I wanted to test whether spending all this time on a concurrent hash map makes sense, or whether the operations are small enough that it’s more efficient to just lock the entire operation.

Insert Performance

Two types of tests were performed. In one, the keys were all sequential integers, testing the scenario where there are no double-inserts. The other used randomly generated keys within a fixed range so that there would be some conflicts during insertion at random times.

Lookup Performance

Three types of tests were performed. In one, the keys were inserted and looked up as sequential integers, so there were no double-inserts and every lookup succeeded. The second used random keys within a fixed range, producing some double-inserts and failed lookups, with approximately 10-20% collisions. The final one only inserted numbers between 0 and 100, which resulted in almost entirely key conflicts and failed lookups.

Erase Performance

Only one type of test was performed. All keys were sequential integers so there were no failed erases.

Mixed Operation Performance

There were three types of tests performed in this category. All of them simulated attempting to read and write to the hash map at the same time. The tests had the following mixtures:

  1. 90% Reads 10% Inserts
  2. 50% Reads 50% Inserts
  3. 50% Reads 40% Inserts 10% Erases

Rekey Performance

There was one type of test performed where all keys inserted were sequential integers and each key was changed to another key that was a fixed offset from the original with no possibility of key conflicts or failures. The goal here was to simulate structures like spatial hashes where a key can be a position, and oftentimes we want to adjust the key without reallocating the value element.

Iteration Performance

These are the only purely single threaded tests. They test the performance of iterating across the hash map synchronously. It’s not uncommon to use a hash map for getting a set of unique items to iterate across, so this helps simulate that scenario.

Batched Operation Performance

While this doesn’t necessarily meet my primary use case of a general purpose, easy to use, threadsafe hash map, a lot of these hash map implementations are built around batch-processed tasks. So I also wanted to check the performance of batched inserts and lookups.

What was my implementation strategy

Keep it simple

Open source hash map implementations can sometimes have practically unreadable code because they employ a lot of tricks to get the compiler to emit faster code. They also generally have complex interfaces since they need to manage things like custom hash functions, custom allocators, multiple growth strategies, etc. I want to know if the compiler tricks actually make a difference for my use case, and whether those complexities lead to reduced performance that could be avoided in a targeted case. Also, one of the main advantages of writing a data structure yourself is that you can maintain and extend it when needed. If the code is complex, that maintenance is expensive, and it becomes more efficient from a development standpoint to just use the open source code.

What does that mean in practice

It means the basic bones of the hash map are implemented a lot like this one: Designing a Hash Map from Scratch. That is, there’s a table of buckets, each bucket is a linked list, and adding and removing elements is mostly bucket index lookups and linked list operations.
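As a rough sketch (simplified and illustrative, not the actual benchmark code), the shape of that structure looks something like this, with the hash computed externally and nodes coming from a pool:

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified sketch: a table of buckets where each bucket is a singly
// linked list of nodes. The hash is computed by the caller, and nodes are
// allocated from a pool elsewhere.
template <typename Key, typename Value>
class SimpleHashMap
{
public:
    struct Node
    {
        Key key;
        Value value;
        Node* next = nullptr;
    };

    explicit SimpleHashMap(std::size_t bucketCount) : buckets(bucketCount, nullptr) {}

    Value* Find(const Key& key, uint64_t hash)
    {
        for (Node* node = buckets[BucketIndex(hash)]; node != nullptr; node = node->next)
        {
            if (node->key == key)
            {
                return &node->value; // stored values never move, so this stays valid
            }
        }
        return nullptr;
    }

    void Insert(Node* node, uint64_t hash) // node comes from the pool
    {
        Node*& head = buckets[BucketIndex(hash)];
        node->next = head;
        head = node;
    }

private:
    std::size_t BucketIndex(uint64_t hash) const
    {
        return hash % buckets.size(); // replaced with a fast power-of-two mod later
    }

    std::vector<Node*> buckets;
};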

No configuration necessary

An extension of ‘keep it simple’: if we’re writing code for ourselves, then we can strip out complexity by making something easy to use. Now, it’s typical to wrap a complex hash map type in a ‘using’ or ‘typedef’ to keep things simple. But the more template parameters you abstract away, the more likely you’ll end up with a variable in the debugger whose typename is unreadable because it’s templates in templates in templates. If we’re making our own data structure, we can give the type only the few configuration variables we actually need.

What does that mean in practice

It means the lock type we’re using isn’t a template parameter, nor is the allocation strategy, nor is the hash function. For the other hash maps, I needed to make wrappers around allocating and deallocating nodes or I needed to specify the hash function in a template parameter.

Employ high level optimizations that the other implementations use

Many open source hash map implementations have articles or papers on why they’re so fast. So, if there’s a high-level description of an optimization and it doesn’t violate the ‘keep it simple’ rule, I’ll apply it.

What does that mean in practice

I’ve added the submap concept described in Parallel Hash Map. I’ve also made the size of the bucket lists a power of two so that I can do a fast modulus; there’s a description of the fast power-of-two modulus in this article on Fibonacci Hashing. The reason I’m not using a Fibonacci hash is that I’m already using a hash algorithm that’s common across the whole benchmark.
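For reference, the fast modulus itself is just a bitwise AND, which is only valid when the bucket count is a power of two:

#include <cassert>
#include <cstdint>

// When bucketCount is a power of two, (hash % bucketCount) reduces to a
// single AND, avoiding an integer division on the lookup path.
inline uint64_t FastMod(uint64_t hash, uint64_t bucketCount)
{
    assert((bucketCount & (bucketCount - 1)) == 0 && "bucketCount must be a power of two");
    return hash & (bucketCount - 1);
}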

Test quickly to abandon bad ideas quickly

This endeavor became a bit of a rabbit hole as I had more and more optimization ideas. To make sure I wasn’t spending a bunch of time on pointless things, I decided to just test things to see if they made sense, and abandoned ones that didn’t.

What I learned from these ideas

There’s little to gain from trying to improve parallelism of a linked list of 1 or 2 elements

I tried using the lock-free parallel linked list created by Timothy Harris. I also tried a simpler linked list that used atomic compare-and-swap operations. But what I discovered was that, since a requirement is that insert, erase, and lookup can all happen in parallel, we always needed a lock of some kind on the linked list. Even the Timothy Harris algorithm needs a lock, because it can logically remove items from the list in parallel but it can’t physically destroy them in parallel. My benchmark testing showed that per-bucket locks generally add more cost than using a global lock and doing the basic synchronous pointer operations, because the cost of taking a read-only global lock and then another read or write lock on the bucket ended up higher than a single global write lock followed by the operation.

How you manage your memory has a major impact on the efficiency of the hash map

This might seem like an obvious comment, but in my tests I benchmarked each map’s performance both with and without an allocator I wrote. The reason for the custom allocator was to allow the use of pages of elements to reduce fragmentation, and it also gave me a consistent frame of reference in the benchmark. All that being said, the tests demonstrated that my custom allocator was significantly slower than the hash maps’ own internal memory management, though in some cases it led to better performance in my custom hash map than in the open source ones.

Locks are very expensive and atomic operations are fairly slow

This is another thing that probably seems obvious, but I found I got better performance from having one lock and doing the operation synchronously than from having a hierarchy of locks and using atomic operations. Part of the performance benefit is probably that the logic gets much simpler and the optimizer can do a lot more work. I noticed this when I attempted per-bucket locking: while it can potentially increase parallelism, the overhead of those operations trumps whatever gains you may get.

What did I end up testing

For available hash maps, I chose to test the following:

  1. std::unordered_map since it’s a good baseline
  2. Parallel Hash Map
    1. Chose this because the write-up it had here seemed interesting.
    2. Tested both the node and flat hash map
  3. Parlay Hash Map
    1. It has its own benchmarks claiming superior performance to all its competitors, so it’s a good thing to test.
    2. Of note is that it doesn’t fulfill our requirement for an allocator interface, but we’ll allocate our nodes from a separate pool type.
    3. As a heads up, I actually had to abandon this hash map because it had issues with Clang at -O3 optimizations, which I’ll cover later.
  4. Abseil’s Hash Map
    1. Chose this because it was used as the baseline for “Parallel Hash Map”. It’s built by Google, and it stands to reason that they’d have a good hash map implementation. There are other options from other companies (like Meta and Microsoft), but what I liked about Abseil is that it had a clean CMake interface and didn’t require additional external dependencies.
    2. Tested both the node and flat hash map

Results

Issues with Parlay Hash Map

Parlay Hash Map simply did not work when I tested it. My suspicion is that it’s because I was using Clang with -O3 optimizations turned on. One specific issue I saw is that the code has places with while loops that check for modifications to variables that happen outside the function scope. Clang has a habit of optimizing those loops to not re-read memory. The fix is often to mark the variables as _Atomic or to use atomic instructions; here’s a Stack Overflow post about it. Parlay Hash Map uses this pattern in several places for thread management logic, and it was leading to infinite loops in some of the tests. So you won’t see it in any of the actual benchmark results, because it didn’t work in my setup.
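To illustrate the class of problem (this is a minimal example of the pattern, not Parlay’s actual code):

#include <atomic>

// With optimizations on, the compiler may prove that nothing inside the
// loop can change `done`, read it once, and spin forever.
bool done = false;
void WaitBroken()
{
    while (!done)
    {
        // spins forever once the load is hoisted out of the loop
    }
}

// Making the flag atomic forces a fresh read on every iteration.
std::atomic<bool> doneAtomic{false};
void WaitFixed()
{
    while (!doneAtomic.load(std::memory_order_acquire))
    {
        // spins correctly; another thread can store `true` and release us
    }
}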

Use Case Conflict with Parallel Hash Map

Parallel Hash Map is built for batching actions only, which means you can’t read and write in parallel. If you do, as I attempted, you get bad memory errors that you spend a lot of time digging into with the address sanitizer, only to realize you were misusing the data structure the whole time. So I needed to put a lock around it to meet my needs. The lock supported multiple readers and multiple writers, but it had a major impact on the performance of the data structure.

PklE Hash Map

This is the custom hash map I wrote; the code is available on GitHub here. It’s a relatively simple implementation, but it has a few key optimizations: nodes are allocated from a pool, and the bucket list size is always a power of two. There is also code to split it into separate inner maps, but as you’ll see in the testing, that actually decreased performance. I also attempted other optimizations using atomic operations and per-bucket locking, but those also reduced performance. So, overall, we were left with a simple hash map where each bucket is a linked list of nodes and the bucket list size is always a power of two.

Data

Tested Hash Maps

  • AbseilFlatHashMapLocked – Default Abseil flat hash map with an external lock to manage concurrency
    • To be consistent with our requirements, the values are all first allocated from a pool, the same one PklE Hash Map uses, then the pointers are stored.
  • AbseilNodeHashMapLocked – Default Abseil node hash map with an external lock to manage concurrency
  • AbseilNodeHashMapPagingAllocator – Abseil node hash map with an external lock. Allocator is a custom allocator that reserves fixed-size pages that can get reused.
  • PhmapParallelFlatHashMapSpinlock – Default parallel flat hash map
    • To be consistent with our requirements, the values are all first allocated from a pool, the same one PklE Hash Map uses, then the pointers are stored.
  • PhmapParallelNodeHashMapPagingAllocator – Parallel node hash map using a custom allocator that reserves fixed-size pages that can get reused.
  • PhmapParallelNodeHashMapSpinlock – Default parallel node hash map.
  • PklEHashMap – The simple custom hash map described above, but split into two separate inner maps that each have their own lock.
  • PklEHashMapLocked – Simple custom hash map described above, but there’s only 1 map, and all operations are done synchronously with a lock around it.
  • StdUnorderedMapLocked – std::unordered_map with an external lock.
    • To be consistent with our requirements, the values are all first allocated from a pool, the same one PklE Hash Map uses, then the pointers are stored.

Insert

Test Structure

An empty hash map was populated with a large number of elements. Keys were chosen either sequentially or randomly; in the random tests, ~20% of the keys would collide with existing values. Every test was run both storing a 64-bit integer and storing a larger struct that contains several primitive types and an std::string.

Analysis

Overall, the default parallel node hash map showed itself to be significantly better than all the others at both low and high thread counts. But interestingly, PklEHashMapLocked, which is just a basic hash map with an external lock, is on par with Google’s Abseil hash map in this test.

Batch Insert

Test Structure

This test starts by reserving space in the map and then inserting entries. Similar to the other Insert test, it uses both sequential and random keys in different tests. The main difference is that memory is reserved ahead of time before inserting, and there is no external locking on parallel hash maps, since this is a supported use case for that data structure.

Analysis

Again, the default parallel node hash map outpaced all the other hash maps. Interestingly, though, the simple PklEHashMap actually beat the Abseil node hash map in this case.

Contended Insert

Test Structure

This test was an insert test on an empty hash map, but in this case 99% of the keys were conflicts. The goal was to measure the impact of inserting without doing a find first on certain bad data sets.

Analysis

Of note is that all the hash maps with external storage via the pool ran terribly. That’s because they were still allocating nodes, attempting to insert them, and then deallocating them when the insert failed. So I made a separate graph that eliminates those. In the actual results, it looks like the Abseil hash map handles collisions better than all the others, although it looks like Parallel Hash Map would overtake it if the thread counts continued to climb.

Erase

Test Structure

In this test, the hash map was pre-populated with a large number of nodes, and then they were erased sequentially or randomly based on the test.

Analysis

The default Parallel Hash Map continues to be the best performer at high thread counts, but what’s interesting is that at low thread counts, a lock around std::unordered_map or the simple PklEHashMap ends up winning out. They’re still fairly comparable until reaching 16 threads, though.

Lookup

Test Structure

In this test, the hash map was prepopulated with either sequential or random keys. Then lookups were performed both sequentially and randomly to measure lookup times. All lookups took a read-only lock, though, because we’re testing the general-case scenario where we still need to be safe from concurrent inserts.

Analysis

In this test, the simple PklEHashMap ran faster than all the others. This is likely because it has no real extra metadata to manage internally; it’s basically a fast power-of-two mod and then potentially a small linked-list traversal. Parallel Hash Map looks like it lost a lot of performance because of the additional external lock, but there may also be some internal metadata management that takes extra time.

Batch Lookup

Test Structure

Similar to the lookup tests; the only difference is that there’s no locking, since these are entirely read-only operations.

Analysis

In this version, we see std::unordered_map’s main advantage: pure lookup speed. But even it doesn’t keep up with PklEHashMap’s lookup times at the lower thread counts, and even at the high thread counts, it’s a toss-up which is actually faster. Of note is that, while Parallel Hash Map ran faster than in the standard lookup test, it’s still slower than the basic PklEHashMap. Which is fair to expect; it likely has extra metadata and tracking to parse through, which impacts lookup times.

90% Reads 10% Writes

Test Structure

In this test, the hash map is partially preloaded with some keys, then we randomly perform reads and writes to the hash map. This case is 90% lookups and 10% inserts.

Analysis

Overall, for low thread counts, it would seem that the simple PklEHashMap is outperforming Parallel Hash Map. It stops performing best at 16 threads, but even then it’s at a comparable number. This case also shows that an std::unordered_map with an external lock performs pretty well if you have a low thread count and are mostly performing read operations.

50% Reads 50% Writes

Test Structure

In this test, the hash map is partially preloaded with some keys, then we randomly perform reads and writes to the hash map. This case is 50% lookups and 50% inserts.

Analysis

Similar to the “90% Reads 10% Writes” test, PklEHashMap actually performed best, although once we hit 16 threads, it started to lose out again. It shows that Parallel Hash Map seems to perform best at large thread counts, but it does not win at low thread counts or in single-threaded situations.

50% Reads 40% Inserts 10% Erases

Test Structure

In this test, the hash map is partially preloaded with some keys, then we randomly perform reads and writes to the hash map. This case is 50% lookups, 40% inserts, and 10% erases.

Analysis

In this test, Parallel Hash Map is the clear winner. It has ways to safely delete nodes in parallel, which allows it to run much faster than the other solutions. Although, again, at low thread counts the simple PklEHashMap runs at comparable performance.

Rekey

Test Structure

In this test, the map is preloaded with a large number of elements, and then all the elements have their keys changed to another unique key. This kind of operation is helpful for data structures like Spatial Hash Maps where the key might be a location, and the value is the data, and a moving element may change keys a lot.

Analysis

The first thing to note is that the Abseil hash maps did not perform well in this test at all. I’m not sure why they did so badly, but it could be that this hits a particularly bad case in how those maps work. What’s interesting is that PklEHashMap did far better than Parallel Hash Map, and that’s because I wrote an internal rekey function for it, whereas Parallel Hash Map has to do a find, erase, and insert. PklEHashMap simply moves the pointer containing the value from one bucket to another.
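Continuing the simplified sketch from earlier, a rekey can be little more than a pointer move between buckets. This is illustrative, not the actual benchmark code:

// Belongs inside the SimpleHashMap sketch from earlier. The value is never
// copied or reallocated; the node is unlinked from the old bucket and
// relinked into the new one.
bool Rekey(const Key& oldKey, uint64_t oldHash, const Key& newKey, uint64_t newHash)
{
    Node** link = &buckets[BucketIndex(oldHash)];
    while (*link != nullptr && (*link)->key != oldKey)
    {
        link = &(*link)->next;
    }
    if (*link == nullptr)
    {
        return false; // key not present
    }

    Node* node = *link;
    *link = node->next; // unlink from the old bucket
    node->key = newKey;
    Node*& newHead = buckets[BucketIndex(newHash)];
    node->next = newHead; // relink into the new bucket
    newHead = node;
    return true;
}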

Iterator

Test Structure

In this test we pre-populate the hash map with a large number of elements, then we use the C++ iterator pattern to iterate over all the elements in the hash map. The iteration passes references to the key and value into a callback to simulate some work being done.

Analysis

In this case, PklEHashMap is faster than all the others. This is likely because PklEHashMap keeps its elements in a pool, and the iterator iterates over the pool instead of the hash map, which seems to produce faster iteration, potentially because of slightly better memory locality. That being said, I’m surprised the Abseil hash map performs worse than Parallel Hash Map here; Parallel Hash Map must have some optimizations that Abseil does not, leading to better iteration times.

Final Analysis

Two implementations seemed to be the top contenders: PklEHashMapLocked and PhmapParallelNodeHashMap. But they excelled in different situations, so it boils down a lot to what use case you have. My overall read of the data is: if you don’t need to worry about memory fragmentation or a custom allocator, then use Parallel Hash Map. If you have strict memory management needs, a simple custom map may serve you better, as broken down in the use cases below.

Use Case Analysis: General Purpose Shared Hash Map

All around, I’d say that the default Parallel Node Hash Map would be the best option if you do not need any kind of custom allocator. But if you need to manage your own allocation in a way to avoid fragmentation or some other use case, then you may be better off writing your own simple hash map like PklEHashMap. The Parallel Node Hash Map generally was either first or second place, and it trended towards better performance with higher thread counts. The simpler PklEHashMap showed high performance in lookups, iteration, and rekeying, but did not scale as well with inserts and deletes.

Use Case Analysis: Global Data Store

Oftentimes, in games especially, you will have a global data store that will need to be accessed by systems at runtime. It’s an immutable store and the lookups are often done via some kind of hash key. In this case, a simple hash map like PklEHashMap is best because it has the best lookup times. Also, in some cases you’ll need to iterate over an entire data table to do runtime processing, and the iteration performance was best in this case as well.

Use Case Analysis: Spatial Hash Map

A Spatial Hash Map is a data structure where entities are stored with a key that is their location and a value of the data for that entity. Since these entities are moving, those keys will regularly need to get modified, but the values will not. In this case, PklEHashMap is ideal since it has the fastest rekey times as well as the fastest lookup times.

Use Case Analysis: General Purpose Local Hash Map

There are still many cases where a hash map is used in a purely single-threaded environment, and while these benchmarks are not the best thing to reference for that decision, I’d recommend Parallel Hash Map based on this data. It seems to have a lot of optimizations in place that make insert and erase times much faster than simpler solutions like PklEHashMap, which matters for transient hash maps like a local map on the stack. That being said, I’d still point at the comprehensive benchmark linked at the top for making a real decision on a general purpose single-threaded hash map.

Should you write your own thread safe hash map?

In general, probably not unless you have a specific case to tune against. In my testing, I could create a hash map that was faster than Parallel Hash Map for lookups and rekey operations, but not really by that much. My recommendation is that most people should start by making a wrapper class around an existing open source hash map like Parallel Hash Map, and if they find that they have a bottleneck, then consider a more specialized version for that use case.

Where’s the code

The code is available on GitHub. It’s not packaged in a way where it can just be downloaded and used. But it’s packaged in a way where you could probably point Copilot or Claude or something at it to turn it into useful code. The reason for that is that I did this analysis for a personal project, and making a well-formed, clean library was more work than I wished to embark on.

GitHub Repo is here: https://github.com/PickledMonkey/ParallelHashMapBenchmark

Making a Spin Lock, should you even try?

Why shouldn’t you write your own spin lock implementation

There’s a really good post by Linus Torvalds on why you shouldn’t do it, you can find that here: Link.

But if you don’t really want to read that post, it can be summarized as follows:

  • The OS doesn’t understand there’s a spin lock going on, so a spin lock implementation is going to be working against the OS.
  • If you have more threads than CPUs, then this gets especially bad, because the OS could be context switching from one blocked thread to another.
  • Yield calls don’t fix this because you can be flooding the OS with scheduling requests
  • Locks are a hard problem that people have spent decades researching, what makes you so clever?
  • Your lock implementation will almost inevitably have scheduling downsides, for itself and for other processes, so it’s better to just use the standard locks, which people have refined over decades to address those downsides.

Why we’re doing it anyway

Because we don’t need a lock that is optimal in all cases. We need a lock that is optimal in our specific case and our application.

We are not working in a general-purpose environment

This blog is about game development, so let’s break down what are things specific to games.

  • The application will run in a single-process with multiple threads
  • The number of threads is limited to the number of cores on the machine (or less)
  • The application is likely to be the only piece of software the user wants to interact with, and may even be the only front-end application running.
  • No tasks on threads are ever long-running. All work needs to be done in a single frame, so an architectural expectation is that any critical sections will have a small footprint.

What concerns do we not care about

Because a game is the user’s focus point, it is a greedy application. It eats up RAM, VRAM, CPU time, GPU time, disk space, everything. It’s a special kind of application whose users are okay with it eating up all the OS’s scheduling time, because it’s the thing they’re focused on. All this means we’re allowed to write a lock that messes with the OS’s scheduling, so long as it’s in the name of our application’s best interests. Video games are not good neighbors to other processes. We don’t really need the OS to understand we’re locked so it could theoretically schedule another process; we want it to think we’re the top-dog process.

We also don’t care about performance for long-running threads with large critical sections. This is not a common thing for games to have, and if you’re going to need those, then I definitely don’t recommend a spin-lock. I’d recommend an event-based architecture that relies on things like std::condition_variable.

What concerns do we care about

We need our lock to have the best throughput for our application, which means the lowest latency to acquire and release the lock. It also needs to actually work. For very small critical sections, we want to avoid expensive context switches. Basically, we want high throughput while avoiding expensive OS calls where possible.

What’s a typical spin-lock implementation

Using std::atomic

Here’s an article on what I think is the most standard spin-lock implementation I’ve seen: Link. What it boils down to is an std::atomic variable and the use of exchange() or compare_exchange(). This is a very reasonable approach and is the standard way to write a lock. It uses the standard library, so it’s portable, and it’s going to work; if it doesn’t, the compiler probably has a bug.

Using std::mutex

But why write a lock at all when the standard library already has multiple lock implementations behind std::scoped_lock, std::unique_lock, and std::shared_lock? Those are robust, portable, feature-rich, and standard. This seems even more reasonable than implementing something with std::atomic. Writing our own lock only makes sense if it’s faster, and that is indeed the question: can we write a lock faster than the std locks for our use case? That’s the important point. We are only benchmarking our own use case, not all possible use cases.

Using atomic increments, decrements, adds, and subtracts

There’s an article here where a spinlock was implemented with std::atomic but used fetch_add(), fetch_or(), and fetch_sub() to implement the lock. It’s an interesting article, and the thing that stands out is that std::atomic’s fetch_sub() was significantly faster than compare_exchange(). But I wasn’t quite satisfied with the benchmark it performed. It didn’t make sense to me that fetch_add performed so differently from fetch_sub, because the underlying instructions shouldn’t be that different. So my thought was to build a spin lock around atomic increments, decrements, adds, and subtracts myself and see what we’re looking at.

The spin lock we’re going to benchmark

Atomics will be managed with their underlying instructions

This might be controversial to some folks, but I’ve never liked std::atomic. It has its advantages, but honestly, I just prefer some good old-fashioned helper functions around the actual atomic instructions. I think std::atomic is a good interface for creating portable threadsafe variables, but it’s fairly complex to use, and I don’t really know what it’s doing behind the scenes. And when we’re talking about optimizing a spin lock, we kind of need to know what’s going on.

So here is the interface I’m using for our atomic instructions:

#include <cstdint>

// Thin wrappers around the GCC/Clang atomic builtins. Everything uses
// relaxed memory ordering; why that's safe here is discussed near the end
// of the article.
uint32_t AtomicIncrementU32(uint32_t& value)
{
    return __atomic_add_fetch(&value, 1, __ATOMIC_RELAXED);
}
uint32_t AtomicDecrementU32(uint32_t& value)
{
    return __atomic_sub_fetch(&value, 1, __ATOMIC_RELAXED);
}
uint32_t AtomicAddU32(uint32_t& value, uint32_t addend)
{
    return __atomic_add_fetch(&value, addend, __ATOMIC_RELAXED);
}
uint32_t AtomicSubtractU32(uint32_t& value, uint32_t subtrahend)
{
    return __atomic_sub_fetch(&value, subtrahend, __ATOMIC_RELAXED);
}
uint32_t AtomicAndU32(uint32_t& value, uint32_t mask)
{
    return __atomic_fetch_and(&value, mask, __ATOMIC_RELAXED);
}
uint32_t AtomicOrU32(uint32_t& value, uint32_t mask)
{
    return __atomic_fetch_or(&value, mask, __ATOMIC_RELAXED);
}

You’ll notice one thing different about these calls: we’re using add_fetch and sub_fetch instead of the fetch_add and fetch_sub that std::atomic tends to use.
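The difference is which value comes back: fetch_add returns the value before the operation, while add_fetch returns the value after it. The counting locks below check the post-operation value, whereas the OR/AND wrappers return the prior value (which the write-priority lock checks as prevVal). A quick illustration:

#include <cstdint>

void AddFetchVsFetchAdd()
{
    uint32_t value = 0;
    // add_fetch returns the value *after* the addition...
    uint32_t after = __atomic_add_fetch(&value, 1, __ATOMIC_RELAXED);  // after == 1
    // ...while fetch_add returns the value *before* it.
    uint32_t before = __atomic_fetch_add(&value, 1, __ATOMIC_RELAXED); // before == 1, value == 2
    (void)after;
    (void)before;
}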

Lock Implementations

There are 3 separate implementations being looked at:

  1. CountingSpinLock – Prioritizes readers over writers. So the writers must wait for all readers to finish before they can secure the lock. If a reader attempts to join while a writer is working, then they must wait for that writer to finish, but they have a reservation which blocks any other writers from continuing afterwards.
  2. WritePriorityCountingSpinLock – Prioritizes writers over readers. So the readers do not make reservations while the writer is running, but the first writer will make a reservation that blocks other new readers from starting until the writer is finished.
  3. MultiReaderWriterCountingSpinLock – A version that allows multiple readers or multiple writers. It’s a weird lock that probably wouldn’t be used in many scenarios, but the idea is that there are some situations where a bunch of writers can be allowed to operate so long as there are no readers, and vice-versa. Read locks are prioritized in this implementation.

One thing you’ve probably noticed is that none of these locks are fair: either writers are prioritized or readers are. Making a lock fair wouldn’t be too difficult, but doing so would increase the number of write operations needed for the lock to work. The trade-off is the potential for starvation, but in the context of this article (games), that realistically won’t happen, because all the work happens within a frame. The application architecture does not allow long-running tasks or large critical sections; if that’s your situation, then don’t use spin locks, as it’s a bad place for them.

Lock Implementation #1 : CountingSpinLock

void CountingSpinlock::AcquireReadOnlyAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do
    {
        uint32_t nextVal = Util::AtomicIncrementU32(lockValue);
        if ((nextVal & c_multiReaderWriter_WriteMask) == 0)
        {
            //No write lock is held, so we can proceed
            break;
        }
        else
        {
            //A write lock is held, so we need to wait for it to be released
            while ((lockValue & c_multiReaderWriter_WriteMask) != 0)
            {
                //Spin until the write lock is released
                std::this_thread::yield();
            }
            break; //We got it, so break out.
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}
void CountingSpinlock::ReleaseReadOnlyAccess()
{
    Util::AtomicDecrementU32(lockValue);
}

void CountingSpinlock::AcquireReadAndWriteAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do 
    {
        uint32_t nextVal = Util::AtomicAddU32(lockValue, c_multiReaderWriter_WriteIncrement);
        if (nextVal == c_multiReaderWriter_WriteIncrement)
        {
            // No read or write locks are held, so we can proceed
            break;
        }
        else 
        {
            // Undo the write-lock increment
            Util::AtomicSubtractU32(lockValue, c_multiReaderWriter_WriteIncrement);

            // A read or write lock is held, so we need to wait for them to be released
            while (lockValue != 0)
            {
                // Spin until the read and write locks are released
                std::this_thread::yield();
            }

            // Try grabbing it again since there aren't any read or write locks held now
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}

void CountingSpinlock::ReleaseReadAndWriteAccess()
{
    Util::AtomicSubtractU32(lockValue, c_multiReaderWriter_WriteIncrement);
}

Lock Implementation #2 : WritePriorityCountingSpinLock

void CountingSpinlock::AcquireWritePriorityReadOnlyAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do
    {
        uint32_t nextVal = Util::AtomicIncrementU32(lockValue);
        if ((nextVal & c_writeLockBit) == 0)
        {
            //No write lock is held, so we can proceed
            break;
        }
        else
        {
            //Undo the read-lock increment
            Util::AtomicDecrementU32(lockValue);

            //A write lock is held, so we need to wait for it to be released
            while ((lockValue & c_writeLockBit) != 0)
            {
                //Spin until the write lock is released
                std::this_thread::yield();
            }
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}
void CountingSpinlock::ReleaseWritePriorityReadOnlyAccess()
{
    Util::AtomicDecrementU32(lockValue);
}

void CountingSpinlock::AcquireWritePriorityReadAndWriteAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do 
    {
        uint32_t prevVal = Util::AtomicOrU32(lockValue, c_writeLockBit);
        if (prevVal == 0)
        {
            //No read or write locks are held, so we can proceed
            break;
        }
        else if ((prevVal & c_writeLockBit) == 0)
        {
            //No write lock is held, but there are read locks held, so we need to wait for them to be released
            while (lockValue != c_writeLockBit)
            {
                //Spin until the read locks are released
                std::this_thread::yield();
            }

            //All read locks have been released, so we can proceed
            break;
        }
        else if ((prevVal & c_writeLockBit) != 0)
        {
            //A write lock is held, so we need to wait for it to be released
            while ((lockValue & c_writeLockBit) != 0)
            {
                //Spin until the write lock is released
                std::this_thread::yield();
            }
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}

void CountingSpinlock::ReleaseWritePriorityReadAndWriteAccess()
{
    Util::AtomicAndU32(lockValue, ~c_writeLockBit);
}

Lock Implementation #3 : MultiReaderWriterCountingSpinLock

void CountingSpinlock::AcquireMultiReaderWriterReadAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do
    {
        uint32_t nextVal = Util::AtomicIncrementU32(lockValue);
        if ((nextVal & c_multiReaderWriter_WriteMask) == 0)
        {
            //No write locks are held, so we can proceed
            break;
        }
        else
        {
            //A write lock is held, so we need to wait for it to be released
            while ((lockValue & c_multiReaderWriter_WriteMask) != 0)
            {
                //Spin until the write lock is released
                std::this_thread::yield();
            }

            break; //No write locks are held, so we can proceed
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}

void CountingSpinlock::ReleaseMultiReaderWriterReadAccess()
{
    Util::AtomicDecrementU32(lockValue);
}

void CountingSpinlock::AcquireMultiReaderWriterWriteAccess()
{
    uint32_t numRetriesLeft = 0xFFFFFFFF;

    do
    {
        uint32_t nextVal = Util::AtomicAddU32(lockValue, c_multiReaderWriter_WriteIncrement);
        if ((nextVal & c_multiReaderWriter_ReadMask) == 0)
        {
            //No read locks are held, so we can proceed
            break;
        }
        else
        {
            //Undo the write-lock increment
            Util::AtomicSubtractU32(lockValue, c_multiReaderWriter_WriteIncrement);

            //A read lock is held, so we need to wait for it to be released
            while ((lockValue & c_multiReaderWriter_ReadMask) != 0)
            {
                //Spin until the read lock is released
                std::this_thread::yield();
            }
        }
        --numRetriesLeft;
    } while (numRetriesLeft > 0);
}

void CountingSpinlock::ReleaseMultiReaderWriterWriteAccess()
{
    Util::AtomicSubtractU32(lockValue, c_multiReaderWriter_WriteIncrement);
}

Benchmarking Code

For all the benchmarks, I’m using scoped versions of the locks, since that’s more common in practice than raw Lock/Unlock calls. I’m using GTest as the harness for running all the benchmarking code.
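The scoped wrappers are just RAII guards around the acquire/release calls. As a hedged sketch (the actual wrappers in the repo may differ), the read side might look like:

// Illustrative RAII scope guard for the read side of the CountingSpinlock;
// the write side and the other lock types follow the same pattern.
class ScopedReadLock
{
public:
    explicit ScopedReadLock(CountingSpinlock& lock) : m_lock(lock)
    {
        m_lock.AcquireReadOnlyAccess();
    }
    ~ScopedReadLock()
    {
        m_lock.ReleaseReadOnlyAccess();
    }
    ScopedReadLock(const ScopedReadLock&) = delete;
    ScopedReadLock& operator=(const ScopedReadLock&) = delete;

private:
    CountingSpinlock& m_lock;
};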

Test Code Structure

The tests themselves all are structured something like the following:

TEST_F(SpinLockWriteHeavyTest, StdMutex_AllWrites_MultiThread)
{
    std::mutex mtx;
    uint64_t counter = 0;

    auto writeOperation = [&mtx, &counter](uint32_t) {
        std::scoped_lock lock(mtx);
        SpinLockBenchmarkTest::SimulateWork();
        ++counter;
    };

    RunWithThreadCount<16>("StdMutex_AllWrites_MultiThread", writeOperation, OPERATIONS_PER_THREAD);
    ASSERT_EQ(counter, OPERATIONS_PER_THREAD);
    counter = 0;

    RunWithThreadCount<8>("StdMutex_AllWrites_MultiThread", writeOperation, OPERATIONS_PER_THREAD);
    ASSERT_EQ(counter, OPERATIONS_PER_THREAD);
    counter = 0;

    RunWithThreadCount<4>("StdMutex_AllWrites_MultiThread", writeOperation, OPERATIONS_PER_THREAD);
    ASSERT_EQ(counter, OPERATIONS_PER_THREAD);
    counter = 0;

    RunWithThreadCount<2>("StdMutex_AllWrites_MultiThread", writeOperation, OPERATIONS_PER_THREAD);
    ASSERT_EQ(counter, OPERATIONS_PER_THREAD);
    counter = 0;

    RunWithThreadCount<1>("StdMutex_AllWrites_MultiThread", writeOperation, OPERATIONS_PER_THREAD);
    ASSERT_EQ(counter, OPERATIONS_PER_THREAD);
}

Simulating a workload

The SimulateWork function looks like this:

static void SimulateWork()
{
    volatile uint64_t dummy = 0;
    for (uint32_t i = 0; i < WORK_CYCLES; ++i)
    {
        dummy = dummy + i;
    }
}

Running on Multiple Threads

The RunWithThreadCount function looks like this:

template<uint32_t NUM_THREADS>
static void RunWithThreadCount(const char* testName, auto&& testLogic, uint64_t expectedCount)
{
    using ThreadPoolType = PklE::MultiThreader::WorkerThreadPool<1, NUM_THREADS, 1, 500, 64>;
    ThreadPoolType pool;
    typename ThreadPoolType::ThreadDistribution distribution;
    distribution[0] = 16;
    typename ThreadPoolType::ThreadDistribution balanced = pool.DistributeThreads(distribution);
    pool.StartThreads(balanced);

    auto start = std::chrono::high_resolution_clock::now();

    const uint32_t c_maxTasks = 64;
    const uint32_t c_elementsPerTask = 25;
    const bool bCallingExternally = true;

    pool.template RunParallelForInRangeInQueue<c_maxTasks>(0, OPERATIONS_PER_THREAD, 0, testLogic, c_elementsPerTask, bCallingExternally);

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    char fullTestName[256];
    snprintf(fullTestName, sizeof(fullTestName), "%s_%dThreads", testName, NUM_THREADS);
    auto result = CreateResult(fullTestName, duration, expectedCount);
    result.Print();
}

You’ll see some stuff in there about a custom thread-pooling implementation. You can choose whether or not to believe that it works, but it’s something else I set up that I might write an article about later.

Benchmarking Results

Tests

There were 5 tests run for each lock type. Each test was run with a scaled number of threads from 1 to 16, incrementing in powers of 2. These were all run on a 16-core machine hosted through GitHub Codespaces. I ran each test 5 times and averaged the results to reduce any random error introduced by individual runs.

The tests are as follows:

  • AllReads – The locks were only used in a read-only way
  • AllWrites – The locks were only used to lock a writable critical section
  • 90Read10Write – 90% of the locking operations were read-only
  • 50Read50Write – Half the locking operations were read-only
  • 25Read75Write – 25% of the locking operations were read-only

The locks benchmarked are as follows:

  • StdMutex – a standard std::mutex locked using std::scoped_lock
  • StdUniqueShared – a std::shared_mutex locked using std::shared_lock and std::unique_lock depending on read/write access
  • Rigtorp – the custom spin lock implementation from here.
  • CountingSpinLock – Described above in the lock implementation #1 section
  • WritePriorityCountingSpinLock – Described above in the lock implementation #2 section
  • MultiReaderWriterCountingSpinLock – Described above in the lock implementation #3 section

Result Graphs

All Reads

All Writes

90% Reads, 10% Writes

50% Reads, 50% Writes

25% Reads, 75% Writes

Overall Average

Summary

So, from this test, it looks like the CountingSpinLock ran better than everything else. The only place where it didn’t appear to win was the All Writes test, where the MultiReaderWriter won out, which makes sense since it’s the only lock that actually lets writers run in parallel. I would have expected it to also beat the CountingSpinLock in the 25Read75Write test, but my guess is that, since it’s a reader-prioritized lock, it never got the lock as often as it needed. The other piece of this is that the MultiReaderWriter had to run. Also of note: in the AllWrites test, the WritePriority lock used atomic OR and AND operations instead of incrementing and decrementing like the regular CountingSpinLock, which is probably why its performance is so much lower. I think I’d replace that with an add and subtract like the others to get better performance.

Why does CountingSpinLock do so well in this benchmark compared to the others?

I think there’s a few reasons for this. 

One reason would be that the standard locks are trying to be true general-purpose locks that are guaranteed to work across all platforms. They seek the best performance across a wide variety of benchmarks, and they attempt to work with the operating system to be fair to all processes and threads. For all they need to do, they perform really, really well. Again, I would use those lock implementations for any long-running tasks or large critical sections. But they are sub-optimal for a single process, thread counts close to the CPU count, and small critical sections.

Another reason is that CountingSpinLock uses relaxed memory ordering, while implementations built on compare-exchange need stricter memory ordering, because there the order matters. This matters because it affects how the processor manages the cache. Because we’re using adds and subtracts, the order of the instructions doesn’t matter, so long as the result we get is correct at that moment, and that is guaranteed (only the ordering of the instructions is not). Once we have that result, we’ll keep reading the cached value until another atomic operation invalidates it. This makes the spin-wait cheap, since it doesn’t mark the cache line dirty, and the processor and compiler can optimize cache use as much as possible.

When would I expect CountingSpinLock to perform like garbage

I expect it to perform terribly for any long-running critical section, and terribly if you have way more threads than CPUs. It also is not a reentrant (nested) lock: if a thread takes the lock and then invokes another function that also tries to take it, you get a deadlock. This is meant to be a high-performance way to manage small critical sections.

Where is the code?

It’s on GitHub here:

https://github.com/PickledMonkey/CountingSpinLockBenchmark

Of note: the code is not packaged to compile or anything. It’s mostly just snippets, so that it can be dropped into whatever development environment you’re working in. In this age of AI coding, you could probably ask an AI to pull the repo and integrate the code if you wanted.

References used when making this article

Links:
https://mmomtchev.medium.com/spinlock-vs-mutex-performance-3f73de53a460

https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

https://rigtorp.se/spinlock/

https://dev.to/agutikov/who-is-faster-compareexchange-or-fetchadd-1pjc

Making a Dynamic AI for a VR Game

I was recently playing a neat VR game called Sairento VR. It’s a cool game where you play a kick-butt cyber-ninja-lady, but there was one thing that was a bit off-putting for me: the enemies all executed the same canned animations whenever they attacked you. For shooting a gun, that’s fine, but it was really awkward in melee encounters. Even though you knew they were just repeating the same animation, it was hard to react to. I think the problem is that they never adjusted the animation based on how you, the player, were positioned. So I thought to myself: how can you make an adversary in VR that seems to react to how the player is poised? Then I put together a little experiment.

Image courtesy of Steam store page. It’s a cool game, you can buy it here: https://store.steampowered.com/app/555880/Sairento_VR/

Setting up the Test Environment

The first step was to make a test bed for experimenting with an AI. So I put together a simple game.

Player must hit target with hand; AI paddle must block player

The player just has to touch a small floating block in the center of the room; every time you do, you get a point. However, there is a floating “Paddle” controlled by an AI. If the AI paddle touches your hand, you get “blocked” and the AI gets a point. The goal of the AI is to minimize the points the player earns while maximizing its own. However, the AI’s challenge is that it only has 3 potential movement options: Attacking, Defending, and Stopping.

AI Movement Attacking

The Attacking movement is simple. The Paddle will simply charge toward the player’s hand as fast as it can. The problem with this approach is that the player can just dodge the attack and the AI will take a long time turning around to re-engage.

Attacking behavior in action

AI Movement Defending

The Defending movement is also pretty simple. The AI will simply try to sit between the player and the target. The downside is that it can’t move as fast as Attacking.

Defending behavior in action

Stop Movement

The Stop movement is meant to complement the other two behaviors. To help the AI deal with getting launched far away, it can stop itself at any time to change direction.

Attacking behavior paired with the Stop behavior

When the Stop movement is combined with attacking, the AI becomes extremely aggressive, and the player needs to keep moving to avoid getting hit. However, it’s still really predictable, and as long as you’re moving, you’re not going to get hit.

Defending behavior paired with Stop behavior

When the Stop movement is combined with defending, it helps the AI keep up with the player, but it’s still not enough. The player can still move faster than the AI, and it can be tricked into moving off to the side, giving the player access to the target. For the AI to really be effective, it needs to go on the offensive when it has an opening.

Putting It Together for a Basic AI State Machine

Obviously, each of these actions won’t pose much of a challenge for the player by themselves, so the next course of action is to just put it all together in a simple state machine to see how it acted.

Combining all behaviors in a simple state machine

The state machine is based on the AI’s time to collision (ttc) with the player’s hand and the AI’s distance from the player’s hand. If the ttc is negative, that implies the paddle is moving away from the player’s hand, so it stops. Otherwise, it will defend until the player’s hand is close enough to attempt an attack.
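For reference, one common way to estimate ttc is to project the relative velocity onto the direction between the two objects; a negative result means they are moving apart. This is an illustrative sketch, not necessarily the project’s exact math:

#include <cmath>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Estimated time until the paddle reaches the hand, assuming current
// velocities hold. Negative => moving apart, which triggers the Stop state.
float TimeToCollision(const Vec3& paddlePos, const Vec3& paddleVel,
                      const Vec3& handPos, const Vec3& handVel)
{
    Vec3 relPos{handPos.x - paddlePos.x, handPos.y - paddlePos.y, handPos.z - paddlePos.z};
    Vec3 relVel{handVel.x - paddleVel.x, handVel.y - paddleVel.y, handVel.z - paddleVel.z};

    float distance = std::sqrt(Dot(relPos, relPos));
    if (distance == 0.0f)
    {
        return 0.0f; // already touching
    }

    // Closing speed: positive when the gap is shrinking.
    float closingSpeed = -Dot(relPos, relVel) / distance;
    if (closingSpeed == 0.0f)
    {
        return INFINITY; // neither approaching nor separating
    }
    return distance / closingSpeed;
}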

Simple state machine behavior

This behavior is a lot more interesting than the previous examples. However, it still has the same problem of being very predictable. Once the player figures out the sweet spot to avoid getting hit, they can do the same action every time. What needs to happen is that the AI needs to learn the patterns the player tends to follow.

Adding Weights to State Transitions

The next idea: what if there were some way to supersede the default logic when the player might be trying to trick it? That would allow the AI to react in trickier ways, and the player might find it more engaging.

Adding weights to the state machine AI

However, adapting to the player’s behavior implies some way of recording what’s been happening and what the outcome was. To start, I broke the game down into a state structure that describes what’s going on at a single moment. I chose the state to be a combination of the location and velocity of the player’s hand and the paddle. This gives a state space with plenty of information at each state, but it’s too granular to be useful directly. To solve that, I rounded the values for each state into a set of state “voxels” and used a KD-tree to store the state history.
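The voxelization itself can be as simple as dividing each component by a cell size and rounding, so nearby states collapse into the same voxel. A hedged sketch (the cell sizes and field names here are made up):

#include <cmath>
#include <cstdint>

// Quantized snapshot of the game state; nearby continuous states map to the
// same voxel, which makes them matchable against the recorded history.
struct StateVoxel
{
    int32_t handPos[3];
    int32_t handVel[3];
    int32_t paddlePos[3];
    int32_t paddleVel[3];
};

static int32_t Quantize(float value, float cellSize)
{
    return static_cast<int32_t>(std::lround(value / cellSize));
}
// e.g. with cellSize 0.25f, anything in [1.125, 1.375) quantizes to 5,
// so slightly different hand positions land in the same voxel.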

Voxelizing State Information for Recording

The purpose of storing a state history is that we can also store which action the AI chose in that particular state, and adjust a weight on that action based on whether it led to the AI or the player scoring a point.

Weighted state machine AI

In practice this actually kinda worked okay. The AI responded differently as the player got more points, and it made for a much more challenging game. In this particular example it was interesting because as the game went on, the AI just became more aggressive and would attack more often when you tried to trick it. The AI also isn’t acting with any kind of high level plan or really any anticipation of what the player will do next; it’s simply trying something different because the last time led to failure. So, the next step is to give it some kind of higher level decision making.

Applying Mini-Max to the AI Agent

To give the AI some higher-level planning, I turned to the mini-max algorithm. However, mini-max is normally used in games that have finite state spaces with a finite set of actions, because making the best decision involves a full traversal of the entire action-state space. This is problematic when we're operating in a state space that is essentially infinite. To get around this, I spun the mini-max traversal out to a separate thread and put a timer on it, so it just gives a best-effort decision periodically.
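The shape of that is an "anytime" search: kick the traversal off on a worker thread, and when the timer expires, take the best move found so far. A rough sketch of the pattern follows; minimaxRoot, GameState, and the iterative-deepening loop are hypothetical stand-ins, not the actual implementation.

private volatile AIState bestAction = AIState.DEFEND;
private volatile boolean timeUp = false;

public AIState decideWithBudget(GameState root, long budgetMillis) throws InterruptedException
{
    timeUp = false;
    Thread search = new Thread(() -> {
        // Iterative deepening: every completed depth refines bestAction,
        // so stopping mid-search still leaves a usable answer.
        for (int depth = 1; !timeUp; depth++)
        {
            bestAction = minimaxRoot(root, depth); // hypothetical traversal to 'depth'
        }
    });
    search.setDaemon(true); // don't let the search keep the game alive
    search.start();
    Thread.sleep(budgetMillis); // the decision timer
    timeUp = true;
    return bestAction; // best-effort result at the deadline
}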

Minimax Algorithm applied to AI State machine

Similar to how the AI’s state and actions were recorded into a history, the player’s state and actions were also voxelized. However, since the player isn’t constrained to a specific set of actions, an action for the player is recorded as which state it transitioned to.

Player actions are stored as state transitions

When it all comes together, it creates very different behavior than the previous examples.

Behavior of AI using Minimax for decision making

Rather than fall for the same tricks, this AI developed new behavior and would start creeping up on the player before attacking. This behavior is interesting to observe, but it might be irritating for a player to deal with: this version of the AI is a lot less predictable, which can seem unfair. In the end, though, it could probably be expanded into a really engaging AI opponent.

Applying this back to Animations in VR

While this example applies to a simple game, there's a lot of potential to bring it into a larger one. For example, this kind of agent system could be applied to a hand using IK bones to maneuver a sword for a melee attack. The overall goal would be encounters that feel less stiff and scripted, and more dynamic to the player's interaction.

Code Location

Code available on GitHub here: https://github.com/PickledMonkey/Learning_AI_for_VR

Using Libre-Draw to Make 2D Animations

A long while back, I used to play "APB: Reloaded". It was an MMO made by some of the folks behind Grand Theft Auto 3 back in the day. I never played it when it first came out, but hopped in once it went free-to-play. On the surface, you'd think the game's main hook would be the shooting and driving mechanics, as the core gameplay was a third-person shooter mixed with driving cars. However, the real draw of this game was the character customization. It's the sort of system that lets you draw anything, but you do it by combining shapes together in a creative fashion. When you saw the stuff people made, it was inspiring to try making something cool yourself.

APB Customization Tools.
Image by IRaulDuke. See all the awesome stuff people make at: https://forums.gamersfirst.com/topic/63-showcase-symbol-customization/

I started to dig into the game some more, and I could spend hours making symbols and things to customize my character and car. Eventually it was really the only reason I played the game. It became a way I could make art like I never could before, so I began looking for a tool that would let me draw in a similar way.

Enter: Libre-Office Draw

I made that in Libre-Office Draw. It’s Jamming on Toast!

Libre-Office has a program for drawing diagrams, and in it you can make whatever shape you want and apply all kinds of transformations to get what you're looking for. I spent some time using it to make some stickers to put on my computer case, but then I thought: what if I used this for making a character animation?

I mean, if everything is basic shapes, then it shouldn't be too hard to build the basic parts and shift everything around for each frame. So I figured I'd make a simple 2D character in Libre-Office Draw and adjust it for different animation frames.

Say hello to Boxer

Trying to do it manually

So, my first thought was: I'll just manually adjust the character for each frame. Unfortunately, that was a real pain in the butt. The character I made only has a total of 11 shapes: head, neck, torso, and a pair each of upper arms, lower arms, thighs, and calves. My first goal of making an idle animation ended almost as soon as it started. I made a key frame, then two, then three, then realized I needed to smooth out the transitions between them. That's where I realized it was never going to work. I couldn't smooth out the frames manually; I just can't eyeball how to adjust things to make them work. Not to mention, it's incredibly tedious work. To all y'all artists out there making animation frames by hand: my heart goes out to you.

Two key-frames for an idle animation side by side. Smoothing the transition between them was too tedious for me.

Making a Plugin to Automate it

Turns out Libre-Office has a large framework for building plugins and extending the functionality of its programs. So I figured, why not make a plugin that interpolates the transformations between two animation frames and generates the intermediate frames?

So it’s time for some math

Libre-Office Draw stores each shape as a collection of points with an individual transformation matrix for each point, or, in the case of defined shapes (like a square or circle), one transformation matrix for the entire shape. To get the shapes to interpolate properly, you need to break the transformation matrix down into its individual components (translation, rotation, scale, and shear) and do the interpolation on those.

Cubic Spline Interpolation

For this part, I originally thought I'd do some basic linear interpolation, but that made choppy transitions between key frames, so I switched to using a cubic spline instead. I wish I could say I was a cool dude who wrote his own spline function, but nope, I just grabbed what I could find on Google. Which happened to be the code here: https://gist.github.com/lecho/7627739

It works really well, 10/10 would recommend.
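To give a feel for why a spline beats plain lerp: instead of connecting two key-frame values with a straight line, the curve is shaped by the neighboring key frames too, so motion eases through each one. Here's a minimal Catmull-Rom sketch, a simpler relative of the natural cubic spline in that gist (this is my own illustration, not the gist's code):

// Interpolate between p1 and p2 (t in [0,1]); p0 and p3 are the
// neighboring key-frame values that shape the curve's tangents.
private double catmullRom(double p0, double p1, double p2, double p3, double t)
{
    double t2 = t * t;
    double t3 = t2 * t;
    return 0.5 * ((2.0 * p1)
            + (-p0 + p2) * t
            + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
            + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t3);
}

Either way, you run the interpolation per decomposed component (translation, rotation, scale) rather than on the raw matrix values, which is why the decomposition below matters.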

Breaking down a 2D transformation matrix

I’m writing this article a good while after I made this plugin, and while reviewing what I did, well, I made some mistakes.

Here's where all 2D transformations start. We're trying to decompose the 2×3 matrix into its component pieces

Refer to that transformation matrix layout whenever I mention specific values: rowX holds [a b c] and rowY holds [d e f], so, for example, rowX.Column1 is 'a' and rowY.Column1 is 'd'.

Let’s start with what’s correct. Translation. No problems here. This is the easy one.

private void getTranslationVector(HomogenMatrixLine3 rowX, HomogenMatrixLine3 rowY, HomogenMatrixLine3 rowZ)
{
    // Translation lives directly in the third column: c and f.
    translationVector[x] = rowX.Column3;
    translationVector[y] = rowY.Column3;
}

Next, let's go on to rotation. This also works just fine. Note that this implies a counter-clockwise rotation; if we were measuring a clockwise rotation, you would need to flip the sign of 'd' (rowY.Column1). But this only works if there's no shear perpendicular to the x axis. If there is, this whole thing becomes a lot more complicated.

private void getRotationAngle(HomogenMatrixLine3 rowX, HomogenMatrixLine3 rowY, HomogenMatrixLine3 rowZ)
{
    // With no shear, a = scale(x)*cos(theta) and d = scale(x)*sin(theta),
    // so atan2(d, a) recovers theta.
    rotationAngle = Math.atan2(rowY.Column1, rowX.Column1);
}

Now let's figure out the scaling. This is where things get trickier, and also where my errors started to slip in, because nothing works once the shape has any kind of shear.

Let's start with scalingVector[x]; it's fine as long as there's no shear. Because our rotation angle was atan2(d, a), there's an implication that a = r*cos(theta) and d = r*sin(theta). So the magnitude of the vector [a, d] = sqrt(r^2 * (cos^2(theta) + sin^2(theta))) = r. So what, you ask? Well, 'a' is also the x scaling value multiplied by cos(theta). So, altogether, a = r*cos(theta) = scale(x)*cos(theta), which means r = scale(x).

Now what about scale(y)? Since we're doing a counter-clockwise rotation, e = scale(y)*cos(theta) and b = scale(y)*-sin(theta). So if I multiply e by cos(theta) and b by sin(theta) and subtract, I get e*cos(theta) - b*sin(theta) = scale(y)*(cos^2(theta) + sin^2(theta)) = scale(y).

private void getScalingVector(HomogenMatrixLine3 rowX, HomogenMatrixLine3 rowY, HomogenMatrixLine3 rowZ)
{
    // scale(x) is the magnitude of the first column vector [a, d].
    scalingVector[x] = vectorMagnitude(rowX.Column1, rowY.Column1);
    // scale(y) = e*cos(theta) - b*sin(theta), which collapses to scale(y) when there's no shear.
    scalingVector[y] = (rowY.Column2*Math.cos(rotationAngle)) - (rowX.Column2*Math.sin(rotationAngle));
}

private double vectorMagnitude(double xVal, double yVal)
{
    double magnitude = xVal*xVal + yVal*yVal;
    magnitude = Math.sqrt(magnitude);
    return magnitude;
}

So wait… what about shearing, you ask? Well, this is where I failed when I was working on it. In fact, I noticed a comment in my code saying "Shearing is not working yet." That's because I never bothered to break it all down and see the effect shearing has. If I had, I would have noticed the following:

a = scaling(x) * cos(rotation) + scaling(y) * shear(x) * -sin(rotation)
b = scaling(y) * -sin(rotation) + scaling(x) * shear(y) * cos(rotation)
d = scaling(x) * sin(rotation) + scaling(y) * shear(x) * cos(rotation)
e = scaling(y) * cos(rotation) + scaling(x) * shear(y) * sin(rotation)

At this point, I'm tired of messing around with it, so whatever, who needs shear anyway… There's a decent thread on how to break down a 2D transformation matrix that only shears on one axis here: https://stackoverflow.com/questions/45159314/decompose-2d-transformation-matrix
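I never got shear working in the plugin, but for completeness, here's roughly the one-axis decomposition that thread describes, treating the linear part as Rotation * Shear * Scale with all the shear confined to the x axis. The shearX field is something I'm inventing for this sketch; the rest reuses the fields from the code above.

private void decomposeWithShear(HomogenMatrixLine3 rowX, HomogenMatrixLine3 rowY)
{
    double a = rowX.Column1, b = rowX.Column2;
    double d = rowY.Column1, e = rowY.Column2;

    translationVector[x] = rowX.Column3;
    translationVector[y] = rowY.Column3;

    // Rotation and scale(x) still come from the first column, as before.
    scalingVector[x] = Math.sqrt(a * a + d * d);
    rotationAngle = Math.atan2(d, a);

    // scale(y) is the determinant divided by scale(x), and the shear
    // factor is whatever is left over in the second column.
    double det = a * e - b * d;
    scalingVector[y] = det / scalingVector[x];
    shearX = (a * b + d * e) / det;
}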

Either way, even without shearing, at this point I’m able to test it out and see if it’s even worth continuing the effort.

Here’s how it turned out

Long story short, it really depends on how you make each key frame. For example, if you make a key frame by copying the previous frame and making small adjustments, it works great; but if limbs are rotating 180 degrees and flipping orientation, it gets problematic.

For example, the idle animation I started seems to interpolate fairly nicely.

Left and Right frames were made by hand. The middle frame was interpolated as the midpoint between the two

But then I tried to do some kind of cool spinning punch move…

Left and Right frames were made by hand. The middle frame was interpolated as the midpoint between the two.

That's where everything broke down. The arms and legs all got disconnected, and it's just a mess. There's still hope, though, because the shapes of the limbs seem like they might be right; it's just the positioning that's all broken.

If I rearrange everything to be in the appropriate place then…

Left and Right frames were made by hand. The middle frame was interpolated as the midpoint between the two and adjusted slightly to keep everything in the right place.

It looks halfway decent now. I mean, it's still a bad animation, but maybe this was just too big a jump, and if I put a key frame in between I could get better results. At least the only manual work I needed to do was move some shapes around, which is a lot easier than doing the rotation and point manipulation of the shape itself.

All in all, it was a fun little experiment, and I was actually kinda proud of the animations I made for this simple character. I can't imagine making a complex character with this tool, however; it'd be a lot of work. But then again, so is animation in general.

But anyway, here's how those two animations looked at the end of it all. I added a few more key frames to the punch and interpolated about 4-5 additional frames between key frames using my plugin.

Idle Animation
Spinning Punch Animation

Conclusion

While it was a fun experiment, there's a lot more work to be done if you ever wanted this to be viable. Honestly, it'd probably be easier to make a 2D skeleton and animations in Blender or something like that. The whole idea of bones, joints, and skeletons solves the problem of keeping things aligned, and game engines can take advantage of those to do their own animation blending without needing sprite sheets.

Code and Stuff

Plugin code is available on GitHub here: https://github.com/PickledMonkey/ShapeInterpolator