We’ve already talked about how a level of an index looks a lot like a table. Clustered indexes take this to the logical conclusion: they use the actual table as the leaf nodes of the index. This means that we save a whole level of the index, which makes it faster than a non-clustered index. (That’s providing you need columns not in the index. Otherwise index covering is typically faster.) It also means that you’ve actually ordered the table according to the index.
The ordering of the actual table is huge, and the ramifications are large. The simplest, and probably least interesting, is that you can only have the one clustered index. However, the ordering also makes it the ideal candidate for range based searches. If you’re pulling down a large number of columns, it’s your go-to index for unique searches.
Implications of Clustered Indexes for Inserts
The biggest implication of clustered indexes, however, isn’t to do with select performance, it’s to do with inserts. Remember, we’ve just ordered the actual table. This means we’ve just specified onto which page any given row will be inserted. Choose this badly, and you can really kill performance.
Right, I’m not going to go into all of the possible ways that SQL server can arrange a table. I’m just going to give you one simple piece of advice. Every table needs a unique clustered index. Trust me on this. (There will be experts who have counter-examples, but once you know that much, you won’t need to read this article.)
Now, basically you’re about to look at a trade-off. If inserts happen randomly in your table, you’ll get page splits. Page splits are where a page is full and a new row needs adding in the middle. SQL Server splits the page into two, half-empty pages. DBAs talk about “fill factor” as a measure of how full your pages are. The higher the fill factor, the better. If you have a fairly random key for your clustered index, your fill factor will go down with time. This isn’t the end of the world, because you can rebuild your indexes and, by implication, your table. Your DBA will know all about this, and it’s the major reason why he needs to block out time at the weekend to run maintenance scripts. (The same thing happens to non-clustered indexes as well, but the effects aren’t as severe simply because the index is much smaller.)
It might seem like coming up with a “random” clustered index is a bad idea. Actually, it’s a very good idea. As I said, it’s a trade off. Choosing a predictable index spec gives you a higher fill factor, but at the expense of concurrency.
The Clustered Identity Anti-Pattern
So, let’s say that you do what a lot of people do and place a clustered index on the identity field. Usually this is declared as the primary key, but SQL server doesn’t really care about unique constraints or primary keys: they’re just unique indexes. Now you’ve got extremely predictable inserts. In fact, they always get inserted right at the end. Now remember that SQL Server typically locks at a page level. Now imagine twenty people trying to insert at once.
I learned about this the hard way. My very first commercial application was written for Microsoft SQL Server 4.2 (and Microsoft Access, but that’s another story…). You loaded a complex object into the UI, did some work and then it saved it into the database. The performance was cracking.
Well, until we ran a training session and the trainer said “and now could everyone save their orders”. It took half an hour. Every time. I had no idea what on earth was going on. Nothing I did seemed to help. The product went live and was usually okay, but every so often it ground to a halt. Ironically, I was saved by force majeure: the client went bust (not as a consequence of my shoddy software, thankfully). It was only a year later when I finally decided to educate myself about this RDBMS thingy I was using that it clicked what was going on.
So, trust me, you don’t want the clustered index on the identity field in an OLTP scenario. Which is unfortunate, because that’s often the obvious place. Let’s talk about less obvious places to put it. A person’s name is often a good choice, especially since you probably do range-based searches on it.
There’s a small problem with this: often the name isn’t actually unique. In fact, sometimes nothing is except for the identity column. Even then, a clustered index on name followed by id will typically result in a better behaved database structure than an index just on id.
Non-clustered indexes are added to improve select performance at a small cost to insert performance. Clustered indexes, on the other hand, should be chosen on the basis of insert performance first. Improving a couple of selects is nice, but specifying a good insert distribution is much more important.
If you can possibly manage it, what you really want to achieve is to have each thread inserting onto its own page. This is easier said than done (but GUID Comb has potential). Spread your inserts too randomly, and you end up hitting every page in your database. Make them synchronize across threads, and you kill concurrency.
And if you’re not rebuilding you indices once a week, performance will eventually drop through the floor due to low fill factors.