…but I sure don’t. After promising that I had posted my very last article on the subject of load balancing, I spent another three months tweaking the code I’d put up in production. The gist is now pretty close to the production code (there are a couple more Console.Writes so you can see what’s going on). So, here’s what I learned:
QueueChannel and the nServiceBus distributor are dumb for a very good reason: even small, rare failures in the distribution code can be horribly fatal. This I knew intellectually, but not in my gut. The code now has some amazing defensive code around AssignShardsToPendingQueueItems. Furthermore, it’s got a channel for reporting errors in the sharding. Obviously, you have two choices when such a piece of code fails: you can kill processing or attempt to keep going. ShardingChannel attempts to keep going, but it’s a decision everyone has to make for themselves.
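The production code is C#, but the keep-going policy can be sketched in a few lines of Java. Everything here is illustrative: the injected `Runnable`, the `errors()` accessor, and the use of a queue as the error channel are my assumptions, not the real API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ShardingChannel {
    // Hypothetical error channel: failures in sharding are reported here
    // rather than killing the whole channel.
    private final BlockingQueue<Exception> errorChannel = new LinkedBlockingQueue<>();

    // The real assignment logic is elided; it is injected as a Runnable
    // so that only the defensive wrapper is shown.
    void assignShardsToPendingQueueItems(Runnable assignmentStep) {
        try {
            assignmentStep.run();
        } catch (RuntimeException e) {
            // Report the failure and keep going; the other defensible
            // policy is to rethrow and halt processing entirely.
            errorChannel.offer(e);
        }
    }

    BlockingQueue<Exception> errors() {
        return errorChannel;
    }
}
```

The important design point is that the choice between the two policies is made in exactly one place, so changing your mind later is cheap.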
I had to simplify the locking. Acquiring multiple finer-grained locks turned out to slow the system down. Now there’s one lock on pendingQueue, the list of items that have yet to be assigned.
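The coarse-grained choice looks something like the following Java sketch (the real code is C#, where `lock (pendingQueue)` plays the role of `synchronized`; the class and method names are mine):

```java
import java.util.ArrayDeque;
import java.util.Queue;

class PendingItems<T> {
    // One monitor, the pending queue itself, guards every mutation,
    // replacing several finer-grained locks.
    private final Queue<T> pendingQueue = new ArrayDeque<>();

    void publish(T item) {
        synchronized (pendingQueue) {
            pendingQueue.add(item);
        }
    }

    // Returns the next unassigned item, or null if there is none.
    T takePendingOrNull() {
        synchronized (pendingQueue) {
            return pendingQueue.poll();
        }
    }
}
```

Counter-intuitively, one short critical section taken often can beat several small locks, because every extra lock acquisition has a cost and multiplies the interleavings you have to reason about.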
Receiving whilst processing presents challenges of its own. The wakeup code is appreciably different: publishing a message now wakes at most one thread. Since the consumers now track whether they are sleeping, this can be done without locks. This prevents a situation in which a large number of idle consumers, receiving messages that processed quickly, could choke the channel so badly that it couldn’t even receive new messages.
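A lock-free wake-at-most-one policy can be sketched with an atomic flag per consumer, as in this Java illustration (the flag-per-consumer idea is from the post; the class names and the compare-and-set mechanism are my assumptions about how one might implement it):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class SleepingConsumer {
    // Each consumer owns its own sleeping flag; no shared lock needed.
    final AtomicBoolean sleeping = new AtomicBoolean(false);
    int wakeups = 0;

    // Atomically transitions sleeping -> awake; returns true only if
    // this call performed the transition, so two publishers can never
    // both "wake" the same consumer.
    boolean tryWake() {
        if (sleeping.compareAndSet(true, false)) {
            wakeups++; // in real code this would signal the thread
            return true;
        }
        return false;
    }
}

class WakePolicy {
    // Publishing wakes at most one consumer: stop at the first success.
    static void wakeAtMostOne(List<SleepingConsumer> consumers) {
        for (SleepingConsumer c : consumers) {
            if (c.tryWake()) {
                return;
            }
        }
    }
}
```

Waking every sleeper on every publish causes a thundering herd of consumers that each grab one quick message and go straight back to contending; waking one at a time keeps the receive path clear.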
Sometimes you should use degenerate data representations. This violates DRY and makes your code hard to keep correct, but sometimes that doesn’t matter. The mapping of items to queues is now significantly more complex:
- A dictionary that maps all shards to the list of items for the shard.
- A queue of (shard, list) pairs, in the order the shards were created. Only inactive shards appear in this queue.
- A hashset of active shards.
All three of these were needed to keep the code running fast.
Finally, and most surprisingly, if a consumer finishes with a shard, it now asks the channel if there are any more messages for that shard. This change has produced a phenomenal performance improvement in the target system. In particular, on the test system it went from being IO-bound to CPU-bound. If there’s a lesson to be learned here, it’s that disk caches really do matter.
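The drain-the-shard behaviour can be sketched as a simple loop in Java (again, the production code is C#; `takeMoreForShard` and the other names are my invention for illustration):

```java
import java.util.*;

class Channel<T> {
    private final Map<String, Deque<T>> byShard = new HashMap<>();

    synchronized void publish(String shard, T item) {
        byShard.computeIfAbsent(shard, s -> new ArrayDeque<>()).addLast(item);
    }

    // Returns the next item for the shard, or null once it is drained.
    synchronized T takeMoreForShard(String shard) {
        Deque<T> queue = byShard.get(shard);
        return (queue == null) ? null : queue.pollFirst();
    }
}

class ShardConsumer {
    // When a consumer finishes an item, it asks the channel for another
    // item on the same shard before moving on to a new shard.
    static <T> List<T> drainShard(Channel<T> channel, String shard) {
        List<T> processed = new ArrayList<>();
        for (T item = channel.takeMoreForShard(shard); item != null;
                item = channel.takeMoreForShard(shard)) {
            processed.add(item); // stand-in for the real processing
        }
        return processed;
    }
}
```

Processing a shard’s messages back to back means the rows that shard touches are still hot in the disk cache, which is presumably where the IO-bound-to-CPU-bound shift came from.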
Sometimes, Complexity Wins
There was one other change: the wait-until-empty method is now an Empty event. The interesting thing is that this was the only change that relates to the way we usually discuss code quality: I slightly reduced the responsibilities of the class. Pretty much all of the other changes made the code harder to understand, and pretty much none of the performance improvements were susceptible to naive complexity analysis.
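The method-to-event shift can be illustrated in a few lines of Java (the real code is C#, where this would be an `event`; the `Runnable` callback and counter here are my assumptions):

```java
import java.util.concurrent.atomic.AtomicInteger;

class ObservableChannel {
    private final AtomicInteger count = new AtomicInteger();
    private volatile Runnable onEmpty = () -> {};

    // Callers subscribe to the Empty event and decide for themselves
    // whether to block on it; the channel no longer owns that policy.
    void setOnEmpty(Runnable handler) {
        onEmpty = handler;
    }

    void publish() {
        count.incrementAndGet();
    }

    void consume() {
        if (count.decrementAndGet() == 0) {
            onEmpty.run(); // fire the Empty event
        }
    }
}
```

Replacing a blocking method with an event moves the waiting policy out of the channel, which is exactly the small reduction in responsibility the paragraph describes.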
Tuning code for performance is fascinating. It’s very rarely the same problem twice, and it can be a constant challenge to unlearn your standard “best practice” responses.
- Complexity analysis is fine, but the exact implementation of the algorithm matters.
- SOLID principles are fine, but they’re just part of a larger trade-off.
- Sometimes the more complex version of the code is better than the simple one.
- You’ve got to actually understand the machine upon which the code executes.
One thought on “Chuck Norris Understands Concurrency”
Yeah, SOLID. I’ve corrected it in the main text. It isn’t CPU-bound on live, but then it’s a pretty ninja live box; still, being CPU-bound on a database server does feel quite like winning… :) For reference, the application in question currently runs 64 threads through this code. It’s pretty easy to initiate 64 threads; it’s taken ages to get them to perform appreciably better than 16…