I’ve got about 500 CDs, and they’re in a complete mess. So, I want to sort them. What sort algorithm would you use? Ask a computer science graduate and he’d tell you that heapsort is the single-threaded algorithm of choice. If you had people helping you, you could consider mergesort. Of course, neither of these approaches is actually sensible. Here’s an idea that actually works (sketched in code after the steps):
- First go through the CDs and sort them into buckets according to the first letter.
- Insertion sort the CDs belonging to a particular letter.
- Stick the sorted CDs back on the shelves in order.
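For concreteness, here’s a rough Python sketch of that procedure. The CD fields (`band`, `released`) are made up for illustration; this is a toy, not a claim about how you’d really do it.

```python
from collections import defaultdict

def sort_cds(cds):
    # 1. One pass to bucket CDs by the first letter of the band name.
    buckets = defaultdict(list)
    for cd in cds:
        buckets[cd["band"][0].upper()].append(cd)

    # 2. Insertion sort within each bucket, using the full (expensive) comparison:
    #    band name first, then date of publication.
    def insertion_sort(items, key):
        in_order = []
        for item in items:
            pos = 0
            while pos < len(in_order) and key(in_order[pos]) < key(item):
                pos += 1
            in_order.insert(pos, item)  # the single physical placement of this CD
        return in_order

    # 3. Put the sorted buckets back on the shelf in letter order.
    shelf = []
    for letter in sorted(buckets):
        shelf.extend(insertion_sort(buckets[letter],
                                    key=lambda cd: (cd["band"], cd["released"])))
    return shelf

# Example usage with a couple of made-up CDs.
print(sort_cds([{"band": "Radiohead", "released": 1997},
                {"band": "Blur", "released": 1994}]))
```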
So why does our complexity analysis fail for this problem? The reason is that our assumptions are incorrect. Standard analysis of sorting algorithms assumes that it is effectively free to move the objects (because they can be pointers), and so it optimizes for the number of comparisons. With a physical CD to move around, that is no longer the case. Equally, we assume that all tests on our keys cost the same. When sorting CDs, it’s relatively cheap to determine the first letter of the band name; a full comparison, including ordering by date of publication, is significantly harder.*
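To make those units concrete, here’s a toy counter of my own (not part of the original argument): it runs a list-based insertion sort and counts key comparisons and physical placements separately, and the two numbers grow at very different rates.

```python
import random

def insertion_sort_counted(items):
    comparisons = 0   # the unit classical analysis optimizes for
    placements = 0    # the unit that hurts when the objects are physical CDs
    in_order = []
    for item in items:
        pos = 0
        while pos < len(in_order):
            comparisons += 1
            if in_order[pos] >= item:
                break
            pos += 1
        in_order.insert(pos, item)
        placements += 1   # each CD is picked up and put down exactly once
    return in_order, comparisons, placements

_, comparisons, placements = insertion_sort_counted(random.sample(range(500), 500))
print(comparisons, placements)   # roughly n^2/4 comparisons vs exactly n placements
```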
Units Matter
The point is, although we tend to ignore this, the units of n matter. At different times, the performance of a system can be dominated by different factors:
- Number of simultaneous requests
- Size of the most complex workflows (think sagas if you’re an nServiceBus guy)
- Number of workflows
- Sheer number of messages
- Memory consumption
This isn’t theoretical: I’ve got a single system where I’ve needed to optimize for each of these at one time or another. If you want a system to scale, out or up, you’ve got to understand what units you’re dealing with. Next, you’ve got to track them. Probably the biggest performance secret of all time is good diagnostics, live and historical.
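As a sketch only, with entirely hypothetical metric names, tracking those units can start out as nothing more than a handful of counters and gauges with a snapshot history; real systems would push these into whatever diagnostics console or time-series store they already have.

```python
import threading
import time
from collections import defaultdict

class UnitTracker:
    """Thread-safe counters and gauges for the factors that may dominate performance."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)   # e.g. total messages processed
        self._gauges = {}                    # e.g. current simultaneous requests
        self._history = []                   # (timestamp, snapshot) pairs: the historical view

    def incr(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def set_gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def snapshot(self):
        with self._lock:
            snap = {"counters": dict(self._counters), "gauges": dict(self._gauges)}
            self._history.append((time.time(), snap))
            return snap

# Hypothetical usage, recording the units listed above as the system runs.
tracker = UnitTracker()
tracker.incr("messages_processed")
tracker.set_gauge("simultaneous_requests", 42)
tracker.set_gauge("largest_workflow_steps", 17)
print(tracker.snapshot())
```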
*There is also the problem that doing heapsort by hand is likely to be error-prone, but that’s not the point I’m making.
Well, divide and conquer is the underlying theme of map/reduce, but I think sorting there is achieved by mergesort: you split the problem into cheaply chosen buckets (i.e. chunks of the input), sort them, and then merge them in the reduce step. There’s a point here about parallelism breaking classical complexity analysis, but it’s not the one I was trying to make.

In the CD sorting case, there’s only one worker process: the physical me. Assume for the time being that I never make mistakes, but that I can’t hold the results of a sort in my head. I start with a full pass to create the buckets. That would be regarded as n comparisons. Then I proceed with insertion sort. Classical analysis would call this an n^2 operation, because there are n^2 comparisons. BUT when sorting CDs, it’s probably moving the physical objects around that’s the expensive operation. Insertion sort only moves any given object once, so it’s actually O(n) in the variable that matters.

When doing this on a computer, we assume we can store pointers to objects so the move is cheap, but it’s always worth revisiting your assumptions. And when you’re trying to scale, it’s often a different factor that’s slow than it was the last time you checked. So the best you can do is to build a decent system console that tracks the factors you believe may matter.

If you recall, we once had a load issue that was killing us. Most people looked at threads and processor usage. It turned out to be an extremely unpleasant deadlock in the MSXML libraries. This is a pretty degenerate case, in that it’s hard to see how anyone would have predicted it without pretty precise foreknowledge, but it helps with the general point: you’ve got to track the factors that may lead to performance degradation. Some of these are general (e.g. message queue length, processor usage) but some (e.g. number of executions in the largest order of the day) are incredibly implementation specific.
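If it helps, the map/reduce shape described above looks roughly like this toy sketch of mine: a single process stands in for the parallel workers, the chunk size is arbitrary, and the “reduce” step is just mergesort’s merge.

```python
import heapq

def map_reduce_sort(items, chunk_size=4):
    # "Map": split the input into cheaply chosen chunks and sort each one
    # independently (this is the part that could be farmed out to workers).
    runs = [sorted(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]
    # "Reduce": merge the sorted runs, exactly as mergesort's merge step does.
    return list(heapq.merge(*runs))

print(map_reduce_sort([9, 3, 7, 1, 8, 2, 6, 4, 5]))
```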