A Note about NHibernate Generators and Clustered Indexes

My previous article emphasized that you shouldn’t mix clustered indexes and identity fields.  However, if you’re using NHibernate you already probably know you shouldn’t be using identity fields.  On the other hand, the points still generalize to some of the NHibernate generators:

  • Increment: just as bad as identity
  • HiLo:  better than identity, but not by much.  Don’t mix with a clustered index.
  • GUID:  Extremely random.  In fact, probably too random.  The inserts get plastered everywhere and can hurt performance.
  • GUID Comb: Better than GUID,

In short: still don’t put a clustered index on an identity field.  Modifying GUID comb to include the thread ID might actually make it viable.

Again, we’re talking OLTP here.  If you read the original article, Jimmy Nilsson measures batch insert performance, and the concurrency implications never come up.

SQL Server: Clustered Indexes

We’ve already talked about how a level of an index looks a lot like a table.  Clustered indexes take this to the logical conclusion: they use the actual table as the leaf nodes of the index.  This means that we save a whole level of the index, which makes it faster than a non-clustered index.  (That’s providing you need columns not in the index.  Otherwise index covering is typically faster.)  It also means that you’ve actually ordered the table according to the index.

The ordering of the actual table is huge, and the ramifications are large.  The simplest, and probably least interesting, is that you can only have the one clustered index.  However, the ordering also makes it the ideal candidate for range based searches.  If you’re pulling down a large number of columns, it’s your go-to index for unique searches.

Implications of Clustered Indexes for Inserts

The biggest implication of clustered indexes, however, isn’t to do with select performance, it’s to do with inserts.  Remember, we’ve just ordered the actual table.  This means we’ve just specified onto which page any given row will be inserted.  Choose this badly, and you can really kill performance.

Right, I’m not going to go into all of the possible ways that SQL server can arrange a table.  I’m just going to give you one simple piece of advice.  Every table needs a unique clustered index.  Trust me on this.  (There will be experts who have counter-examples, but once you know that much, you won’t need to read this article.)

Now, basically you’re about to look at a trade-off.  If inserts happen randomly in your table, you’ll get page splits.  Page splits are where a page is full and a new row needs adding in the middle.  SQL Server splits the page into two, half-empty pages.  DBAs talk about “fill factor” as a measure of how full your pages are.  The higher the fill factor, the better.  If you have a fairly random key for your clustered index, your fill factor will go down with time.  This isn’t the end of the world, because you can rebuild your indexes and, by implication, your table.  Your DBA will know all about this, and it’s the major reason why he needs to block out time at the weekend to run maintenance scripts.  (The same thing happens to non-clustered indexes as well, but the effects aren’t as severe simply because the index is much smaller.)

It might seem like coming up with a “random” clustered index is a bad idea.  Actually, it’s a very good idea.  As I said, it’s a trade off.  Choosing a predictable index spec gives you a higher fill factor, but at the expense of concurrency.

The Clustered Identity Anti-Pattern

So, let’s say that you do what a lot of people do and place a clustered index on the identity field.  Usually this is declared as the primary key, but SQL server doesn’t really care about unique constraints or primary keys: they’re just unique indexes.  Now you’ve got extremely predictable inserts.  In fact, they always get inserted right at the end.  Now remember that SQL Server typically locks at a page level.  Now imagine twenty people trying to insert at once.

I learned about this the hard way.  My very first commercial application was written for Microsoft SQL Server 4.2 (and Microsoft Access, but that’s another story…).  You loaded a complex object into the UI, did some work and then it saved it into the database.  The performance was cracking.

Well, until we ran a training session and the trainer said “and now could everyone save their orders”.  It took half an hour.  Every time.  I had no idea what on earth was going on.  Nothing I did seemed to help.  The product went live and was usually okay, but every so often it ground to a halt.  Ironically, I was saved by force majeure: the client went bust (not as a consequence of my shoddy software, thankfully).  It was only a year later when I finally decided to educate myself about this RDBMS thingy I was using that it clicked what was going on.

So, trust me, you don’t want the clustered index on the identity field in an OLTP scenario.  Which is unfortunate, because that’s often the obvious place.  Let’s talk about less obvious places to put it.  A person’s name is often a good choice, especially since you probably do range-based searches on it.

There’s a small problem with this: often the name isn’t actually unique.  In fact, sometimes nothing is except for the identity column.  Even then, a clustered index on name followed by id will typically result in a better behaved database structure than an index just on id.

Let’s recap

Non-clustered indexes are added to improve select performance at a small cost to insert performance.  Clustered indexes, on the other hand, should be chosen on the basis of insert performance first.  Improving a couple of selects is nice, but specifying a good insert distribution is much more important.

If you can possibly manage it, what you really want to achieve is to have each thread inserting onto its own page.  This is easier said than done (but GUID Comb has potential).  Spread your inserts too randomly, and you end up hitting every page in your database.  Make them synchronize across threads, and you kill concurrency.

And if you’re not rebuilding you indices once a week, performance will eventually drop through the floor due to low fill factors.

SQL Server: Non-clustered Indexes

Okay, first off, there hasn’t been a post in, conservatively forever.  There have been a number of reasons for this: holidays, people visiting during the weekend (when I write this blog) and swine flu are probably the top three.  That and I managed to lose my post on Liskov, so I’m going to have to re-write.  But anyway, I thought that more than one person I know could do with reading a quick guide to SQL Server indexes.

Basics of SQL Server Performance

The first thing you need to know about is pages.  SQL Server puts rows onto pages.  If you access a row, the DB will load the entire page.  Whenever you’re talking about performance, 99% of the time you’re talking about how many pages get accessed.  Pretty much everything else, including processor time, is irrelevant.  Well, unless you decide to do something stupid like a user-defined function or a trigger.  But for straight DB access, pages are what it’s all about.

Non-clustered Indexes

Now, an index is basically a tree.  It’s a tree of pages.  The top level of the index is a page with rows.  Each row points to the page representing the next level of the index.  This works all the way down.  Finally, you get a link to the actual row itself.  This is your basic non-clustered index.

There’s a few things to note:

  • If you’ve got 5 columns in your table, but only two in your index, the index rows will only contain two index columns.
  • The rows in the index will be ordered, typically in ascending order.  (You can do it descending, but it’s rarely useful.)

This can speed up the following operations:

  • Filtering on the basis of columns in the index
  • Ordering using the exact ordering in the index
  • Selecting just the columns of the index.

The second one is highly unlikely to be used, the first is the typical usage, but the last is really rather interesting.  Let’s say that you’ve got two columns in your index, and your query only uses those two columns.  Then, since the index structure is pretty much identical to the structure of the table itself (rows on pages), it can actually query the last level of the non-clustered index and never actually hit the table at all.  If you’ve got some “lite” versions of tables in your database structure, now would be a good time to throw them away and just replace them with indexes.  This is called “index covering” and is a vastly under-used technique.

Insert Performance

Every index you add has to be maintained on every insert and update.  This, obviously, adds to the cost of entering data into the system.  In practice, most systems I encounter could easily handle more non-clustered indexes, since the alternative is dreadful select performance.  However, keep an eye on it and measure it.

Technorati Tags: ,,