I've got about 500 CDs, and they're in a complete mess. So, I want to sort them. What sort algorithm would you use? Ask a computer science graduate and he'd tell you that heapsort is the single threaded algorithm of choice. If you had people helping you, you could consider mergesort. Of course, neither of these approaches is actually sensible. Here's an idea that actually works:
- First go through the CDs and sort them into buckets according to the first letter.
- Insertion sort the CDs belonging to a particular letter.
- Stick the sorted CDs back on the shelves in order.
So why does our complexity analysis fail for this problem? The reason is our assumptions are incorrect. All the analysis of sort algorithms assumes that it is effectively free to move the objects (because they can be pointers). Hence, all analysis of sorting algorithms optimizes for the number of comparisons With a physical CD to move around, this is no longer the case. Equally, we assume all tests on our keys are equal. With sorting CDs, it's relatively low cost to determine the first letter of the band name. A full comparison, including ordering by date of publication, is significantly harder.*
Units Matter
The point is, although we tend to ignore this, the units of n matter. At different times, the performance of a system can be dominated by different factors
- Number of simultaneous requests
- Size of most complex workflows (think saga if you're an nServiceBus guy)
- Number of workflows
- Sheer number of messages
- Memory Consumption
This isn't theoretical, I've got a single system where I've needed to optimize on all of these bases at one time or another. If you want a system to scale, out or up, you've got to understand what units you're dealing with. Next, you've got to track them. Probably the biggest performance secret of all time is good diagnostics, live and historical.
*There is also the problem that doing heapsort by hand is likely to be error prone, but that's not the point I'm making.
Technorati Tags:
Performance