A Couple of Useful NHibernate.Linq Extension Methods

I use these a lot.

    using NHibernate.Linq;
    using NHibernate.Transform;

    public static class NHLinqHelper
    {
        // Applies the distinct-root-entity transformer, so eager-fetched
        // collections don't duplicate the root entity in the results.
        public static INHibernateQueryable<TValue> DistinctRoot<TValue>(this INHibernateQueryable<TValue> query) {
            query.QueryOptions.RegisterCustomAction(c => c.SetResultTransformer(new DistinctRootEntityResultTransformer()));
            return query;
        }

        // Marks the query as eligible for the query cache.
        public static INHibernateQueryable<TValue> Cached<TValue>(this INHibernateQueryable<TValue> query) {
            query.QueryOptions.SetCachable(true);
            return query;
        }
    }

This just makes the most common query options available fluently.  It isn’t perfect, in that using standard LINQ operators changes the declared type, which means you need to set these at the start of the query, not the end.  But if you use NHLinq, you’re used to that.
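
For what it’s worth, usage ends up looking something like this (Customer is just a stand-in entity, and this assumes the contrib-era NHibernate.Linq provider, where session.Linq<T>() hands you an INHibernateQueryable<T>):

    // The custom options have to come first, before any standard
    // LINQ operator changes the declared type to plain IQueryable.
    var customers = session.Linq<Customer>()
        .Cached()
        .DistinctRoot()
        .Where(c => c.Name.StartsWith("A"))
        .ToList();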


Dynamic Languages and SOLID Principles

OK, this isn’t actually part of my SOLID Principles series (always a pity when your best content is a YouTube link) but a response to Ryan’s article on Los Techies.  I’ve not really got my head around the way that unit testing works in Python, but I get that everything being overridable on an instance or class basis affects the approach.

Let’s talk about Ryan’s list.  Now, he argues that Python offers alternatives for the interface segregation principle, open/closed principle, and the dependency inversion principle.  I’m going to argue that the principles are actually the same, even if the practice is different.  (Like Ryan’s article, pretty much everything said here applies to any dynamic language, but I’ll talk about Python.)

Python and The Interface Segregation Principle

Well, making everything an interface might seem like a valid solution to the Interface Segregation Principle, but it’s a bit weird.  As I mentioned in my original article, the whole point is the “fine-grained” part.  With Python, the interface that a client consumes is exactly the methods it calls.  In that respect, all Python code respects ISP by default.

The potential interface surface is fundamentally flexible.  Arguably that’s a problem for ISP:  You can always just call another method if you want to.  I don’t honestly think it matters though.

Ultimately, I don’t think ISP is changed by Python; it’s just kind of irrelevant, for better or worse.

Python and The Open Closed Principle

Well, the open closed principle is a goal, not a design practice, but let’s take a look at the danger points:

  • You can’t have non-virtual methods, so Python wins this hands down.
  • Your variables can’t be declared with too specific a type, so you’re safe there.
  • You can still compare against hard-coded values.  It’s just as easy to get this wrong in Python as it is in C#.
  • Same holds true for Law of Demeter violations.  If you pass the wrong object around, your code will be just as fragile in Python as in C#.

Python certainly reduces the scope for some OCP violations, but you’ve still got lots of rope to hang yourself with.  I think you still need to bear the goal in mind.

Python and The Dependency Inversion Principle

Python doesn’t provide an alternative to the dependency inversion principle; it just looks like it.  Now, DI isn’t about using an IoC container (which is a slightly crazy/painful thing to do without a static type system), it’s about decoupling.  In Python you can override any function, including a class function (which a C# developer would describe as a static method), so everything’s alright.

Except it isn’t.

Let me give an example.  You go around your house welding lamps directly into the plug sockets.  This is equivalent to calling Lamp() directly from within the Socket’s code.  Now, let’s assume you wanted to change one Lamp to an XBox.  Well, you can always monkey patch the method so that it behaves like one.  Ugly, but possible.  Let’s try something harder: change every Lamp to an XBox.  Not sure why you’d want to, but it’s your house: you can just change the class to behave like an XBox.  Great.

Until your neighbour comes round and asks why all of his lamps just turned into XBoxes.

Let’s quote Ryan:

Now all calls in this runtime, from any module, that reference the Output class will use XmlOutput or HtmlOutput instead.

Yes, but what if I wanted only half of them?  Maybe there are Python techniques I don’t know about (I’m barely competent in the language) but as I see it, I’m going to need to change the code.  I don’t think that dependencies can “always” be injected.  They can only be injected when it won’t cause damage.  In his case, he’s worrying about testability.  That’s fine, but we all agree there’s more to DI than testability.  You will definitely have fewer obvious problems, but if you don’t pass things in using constructors and abstract factories, you will still run into code fragility, even in a flexible language.
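
To translate the lamp example into C# terms, here’s a minimal sketch of what constructor injection buys you (IAppliance, Lamp, XBox and Socket are all invented for the illustration):

    public interface IAppliance {
        void PowerOn();
    }

    public class Lamp : IAppliance {
        public void PowerOn() { /* light up */ }
    }

    public class XBox : IAppliance {
        public void PowerOn() { /* boot up */ }
    }

    public class Socket {
        private readonly IAppliance appliance;

        // The socket never calls new Lamp() itself; it powers whatever
        // it's given.  You can plug an XBox into one socket without
        // turning your neighbour's lamps into XBoxes.
        public Socket(IAppliance appliance) {
            this.appliance = appliance;
        }

        public void SwitchOn() {
            appliance.PowerOn();
        }
    }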

Python Still Needs SOLID

None of this is to disparage Python.  It’s a cracking language: hugely flexible and extremely productive.  But it’s not the Holy Grail.  It’s still perfectly possible to violate OCP, and it can still find DI useful.

SOA and the Interface Segregation Principle

The interface segregation principle is often regarded as the least important of the SOLID principles.  I think this is a matter of context.  If you’re implementing IGame and don’t need PreviousMoves functionality, you could always just throw a NotImplementedException and not worry about it.  Sure, you’ve violated Liskov quite badly by doing so, but not in the contexts that you actually care about.  The problems will start to develop as the code morphs and your broken abstractions start to matter.  It won’t break you half as fast as not using an abstraction in the first place, but it will matter eventually.
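
To make that concrete, here’s the kind of thing I mean, using the IGame interface from my earlier post (QuickGame is invented for the example):

    public class QuickGame : IGame {
        public IPlayer PlayGame() {
            return null;  // play the game and return the winner (elided)
        }

        // The Liskov violation: we claim to implement IGame, but any
        // client that touches this property blows up at runtime.
        public IEnumerable<IMove> PreviousMoves {
            get { throw new NotImplementedException(); }
        }
    }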

Things get more interesting when we start talking SOA.  Now, the requirements of SOA are actually exactly the same in this context as ordinary code; it’s just that interfaces, once published, are often set in stone.  This makes it much more important to pay attention to the requirements of the client.  The “client” is often a business process.  So, for instance, take an equities trading system.  The way an order looks to a trader is very different from the way it looks to the guy trying to settle it three days later.  The guy trying to report those trades for compliance purposes has another view, and the guy trying to value them for his risk analysis has yet another.  Interface segregation says that you shouldn’t be passing around the same interface to all of those people.  You might still be thinking these should all be facades onto the same object, and they could be, but it’s not necessarily the case.  These could be completely separate systems connected only by a messaging interface.  So, I’ll finish up with my own corollary of the Interface Segregation Principle:

Unified Models are neither sensible nor desirable.

SOLID Principles: I is for Interface Segregation

The interesting thing about the problems we encountered with Liskov is that they lead directly to the next principle.  The basic problem we found was that there are a lot of implicit assumptions that can be, and often are, made by the users of an interface.  We also discussed the use of contracts to make some of those assumptions explicit, and the use of unit tests to help verify them.  However, we ignored the simplest solution of the lot: make fewer assumptions.  That’s what the Interface Segregation Principle is about.

Let’s go back to my naive understanding of object-orientation.  By now, I’d learned what the interface keyword did, so when I created a ChessGame class, I knew that I needed an IChessGame interface.  I was still wrong.  Let’s think about it for a second: imagine I write a tournament class which plays a certain number of games and returns the winner.  There’s nothing chess-specific here.  By using IChessGame, I’m still requiring anyone using this functionality to implement chess.  Which is a pity, because when you look at it, a draughts tournament works in exactly the same way. 

Get to the principle already

Here’s the basic statement of the principle:

Make fine grained interfaces that are client specific.

Read that last bit again: “client specific”.  Let’s say that I look at my previous code and say:

  • Well, ChessGame inherits from Game
  • Game implements IGame
  • So I’ll just change my code to use IGame.

Well, I’ve satisfied dependency inversion there, but I’ve completely missed the point when it comes to interface segregation.  Let’s take a look at the IGame interface:

    public interface IGame {
        IPlayer PlayGame();  // returns the winner
        IEnumerable<IMove> PreviousMoves { get; }
    }

The tournament doesn’t need to know about PreviousMoves.  He actually wants a smaller interface: one that just plays the game and returns the winner.  Call it ITournamentGame if you like.  Does the Game class or the ChessGame class implement this interface?  Doesn’t matter.  What matters is that we’ve reduced the coupling of our code.
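
A minimal sketch of how that might look (the names are mine, and the winner-counting is deliberately naive):

    using System.Collections.Generic;
    using System.Linq;

    public interface ITournamentGame {
        IPlayer PlayGame();  // returns the winner
    }

    public class Tournament {
        // Nothing here cares whether the games are chess, draughts or
        // anything else: the interface is exactly what the client uses.
        public IPlayer PlayTournament(IEnumerable<ITournamentGame> games) {
            var wins = new Dictionary<IPlayer, int>();
            foreach (var game in games) {
                var winner = game.PlayGame();
                int count;
                wins.TryGetValue(winner, out count);
                wins[winner] = count + 1;
            }
            return wins.OrderByDescending(w => w.Value).First().Key;
        }
    }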

SOLID Principles: L is for Liskov Substitution

Okay, this one is a bit easier to express:

If you declare something as taking a type, any instance of that type should be usable there.

Or, to put it another way, a Stock object is always a Stock object; it’s never a cow.  The calculateValue() method should always calculate the value, never fire a nuclear missile.  This is the Liskov Substitution Principle, and is basically an injunction against creating objects that pretend to be one thing for convenience, when they’re actually something else.

There’s a very easy way to violate Liskov without noticing: check the type of an instance.  Nearly always, if you’ve got an interface IPerson and you use “typeof” or “is”, you’ve written code that branches, usually an if statement.  Now take a look at that statement again, and consider what happens when someone writes a new implementation of IPerson.  Which side of the if statement does it fall on?  The answer is that it doesn’t matter: the next implementation might want either side.  Yep, your code’s gonna break.

In this case, what’s happened is that you’ve basically broken encapsulation.  If you move that decision into the implementing classes, either as a boolean property or a virtual method, you’ll solve the problem.  (I’ll add that a boolean property is going to prove a lot more fragile than the virtual method, but it’s massively easier to achieve.)
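
Here’s a small before-and-after sketch (IPerson, Student and the discount rule are all invented for the example):

    // Fragile: every new implementation of IPerson has to hope it
    // lands on the right side of this branch.
    public class PricingService {
        public decimal GetDiscount(IPerson person) {
            return person is Student ? 0.5m : 0m;
        }
    }

    // Better: move the decision into the implementations themselves.
    public interface IPerson {
        decimal Discount { get; }
    }

    public class Student : IPerson {
        public decimal Discount { get { return 0.5m; } }
    }

    public class Adult : IPerson {
        public decimal Discount { get { return 0m; } }
    }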

The Bad News

Unfortunately for the Liskov Substitution Principle, it’s completely impossible to achieve.  Every piece of code you ever write forms an implicit, stateful contract with its dependencies.  Even if you are fully Liskov compliant right now, the next function someone writes may contain an implicit assumption that’s violated in a tiny proportion of cases.  The truth is, types are not a constraint system, and trying to pretend that they are can be positively dangerous.

Bertrand Meyer understood this problem and created Eiffel.  Some of those ideas will make it into C# 4.  James Gosling understood the problem, but for some reason thought that constraining thrown exceptions was the best solution.  The problem with Java’s checked exceptions actually helps us understand the problem with a slavish adherence to Liskov: premature constraints.  The Java exception paradigm expects the interface designer to be able to anticipate all possible implementations of the interface, and punishes the implementor when the designer gets it wrong.
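
For a flavour of where this is heading, the Code Contracts work destined for .NET 4 lets you state the contract in the method body.  This is a sketch based on the preview bits, so the details may shift:

    using System.Diagnostics.Contracts;

    public class Account {
        private decimal balance;

        public void Withdraw(decimal amount) {
            // Preconditions the caller must meet, and a postcondition
            // the implementation promises, stated explicitly.
            Contract.Requires(amount > 0);
            Contract.Requires(amount <= balance);
            Contract.Ensures(balance == Contract.OldValue(balance) - amount);
            balance -= amount;
        }
    }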

A Sensible Approach

Well, design by contract is coming soon and will definitely enable us to improve our code quality, but what can we do about this now?  First, there’s just the basic “use common sense” directive: don’t wilfully violate the behaviour that you’d expect of an implementation of an interface.  Sometimes it’s unavoidable: a read/write interface with an asynchronous implementation could behave quite differently from the synchronous implementation, and for good and valid reasons.  What you can do is implement standard unit tests for implementations of an interface.  Here’s how you do it:

  • Create an abstract test class with a method GetImplementation
  • Make all of the tests use the interface
  • Create multiple classes all of which override GetImplementation

Obviously, this creates a lot of tests, but it’s probably the best way to specify expected behaviour right now.
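
Here’s the shape of it in NUnit terms, reusing the IGame example; ChessGame and DraughtsGame stand in for your real implementations, and the tests themselves are just examples of the sort of assumptions you’d want to pin down:

    using NUnit.Framework;

    public abstract class GameContractTests {
        // Each implementation gets its own fixture by overriding this.
        protected abstract IGame GetImplementation();

        [Test]
        public void PlayGame_ReturnsAWinner() {
            Assert.IsNotNull(GetImplementation().PlayGame());
        }

        [Test]
        public void PreviousMoves_IsNeverNull() {
            Assert.IsNotNull(GetImplementation().PreviousMoves);
        }
    }

    [TestFixture]
    public class ChessGameTests : GameContractTests {
        protected override IGame GetImplementation() {
            return new ChessGame();
        }
    }

    [TestFixture]
    public class DraughtsGameTests : GameContractTests {
        protected override IGame GetImplementation() {
            return new DraughtsGame();
        }
    }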

Finally, you owe it to yourself to take time out and remind yourself that L stands for many other things too.


Evaluating Entity Framework

I don’t tend to post things because they’re funny, but this one was quite special.  I was interviewing someone recently and saw that they were using Entity Framework.  Now, I know very little about it, since I’ve expended my energy on learning NHibernate.  So, I asked “You’re using Entity Framework.  What do you think of it?”.  The reply I got:

Actually, that’s one of the reasons I’m leaving my current employer.

I don’t laugh out loud in interviews very often…


How SQL Server Indexes are Chosen

This is probably the biggest problem most people have with understanding indexes.  How exactly does SQL Server decide when to use an index?  There are two big errors that people make here: assuming it’ll always “just work” (it doesn’t), or that they should just force it to use the indexes they think it should be using (it’s nearly always slower if you do this).  Instead, it’s best to understand the selection process and see if you can structure your query in such a way that it accesses the indexes you want.

First things first.  It’s never going to use two indexes on the same table access.  If you use a table twice in the same query, it might use different indexes, but otherwise you should only ever see one index.  Next, and this is the really important bit: column order matters.  If you’ve got an index on three columns and your query uses two columns but not the first column, it won’t use the index.  (If you use the first and third, it might use it, but it’ll score it the same way as just using the first.)

Other considerations:

  • As discussed previously, it might decide to use index covering.  If the columns match up in order as well, so much the better.
  • If statistics are out of date, it can get its decisions wrong.
  • If you join two columns and they’re not of the same type, it can make the wrong decision.  I’ve seen this happen when the only difference is nullability, but not in recent versions.
  • Inequalities aren’t as selective as equalities.  Typically, given the choice between an index with a greater-than and an index with an equals, it’ll choose the equals.  Again, this is driven by the statistics.
  • If the table is very small, it might be faster to do a table scan.
  • In very rare circumstances, it’ll build an index and then query it.  This almost never happens; you usually have to break up your query and create temporary tables to achieve this effect.

But the most important thing to remember is: it will only score the index based on the first columns in the index that contiguously match your query.  Miss out one column and it’ll just ignore any subsequent ones.  Miss out the first column and it’ll probably never use your index at all, except for index covering.


A Note about NHibernate Generators and Clustered Indexes

My previous article emphasized that you shouldn’t mix clustered indexes and identity fields.  However, if you’re using NHibernate you probably already know you shouldn’t be using identity fields.  On the other hand, the points still generalize to some of the NHibernate generators:

  • Increment: just as bad as identity
  • HiLo:  better than identity, but not by much.  Don’t mix with a clustered index.
  • GUID:  Extremely random.  In fact, probably too random.  The inserts get plastered everywhere and can hurt performance.
  • GUID Comb: better than GUID; the tail of the value is a timestamp, so inserts arrive in roughly sequential order.

In short: still don’t put a clustered index on an identity field.  Modifying GUID Comb to include the thread ID might actually make it viable as a clustered key.

Again, we’re talking OLTP here.  If you read the original article, Jimmy Nilsson measures batch insert performance, and the concurrency implications never come up.
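
For reference, here’s roughly the shape of the comb algorithm: a simplified sketch of what NHibernate’s guid.comb generator does, not its exact code:

    using System;

    public static class CombGuid {
        // A comb overwrites the last six bytes of a random GUID with a
        // timestamp.  SQL Server treats those bytes as the most
        // significant when ordering uniqueidentifiers, so inserts land
        // in roughly sequential order instead of randomly.
        public static Guid NewComb() {
            byte[] bytes = Guid.NewGuid().ToByteArray();
            DateTime baseDate = new DateTime(1900, 1, 1);
            DateTime now = DateTime.Now;

            byte[] days = BitConverter.GetBytes(
                (short)(now - baseDate).Days);
            byte[] msecs = BitConverter.GetBytes(
                (int)(now.TimeOfDay.TotalMilliseconds / 3.333333));

            // Most significant byte first in each case.
            bytes[10] = days[1];  bytes[11] = days[0];
            bytes[12] = msecs[3]; bytes[13] = msecs[2];
            bytes[14] = msecs[1]; bytes[15] = msecs[0];
            // (The thread-ID idea would mean carving a byte or two out
            // of here for a per-thread value.)

            return new Guid(bytes);
        }
    }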

SQL Server: Clustered Indexes

We’ve already talked about how a level of an index looks a lot like a table.  Clustered indexes take this to the logical conclusion: they use the actual table as the leaf nodes of the index.  This means that we save a whole level of the index, which makes it faster than a non-clustered index.  (That’s providing you need columns not in the index.  Otherwise index covering is typically faster.)  It also means that you’ve actually ordered the table according to the index.

The ordering of the actual table is a huge deal, and the ramifications are extensive.  The simplest, and probably least interesting, is that you can only have the one clustered index.  However, the ordering also makes it the ideal candidate for range-based searches.  If you’re pulling down a large number of columns, it’s your go-to index for unique searches.

Implications of Clustered Indexes for Inserts

The biggest implication of clustered indexes, however, isn’t to do with select performance, it’s to do with inserts.  Remember, we’ve just ordered the actual table.  This means we’ve just specified onto which page any given row will be inserted.  Choose this badly, and you can really kill performance.

Right, I’m not going to go into all of the possible ways that SQL Server can arrange a table.  I’m just going to give you one simple piece of advice: every table needs a unique clustered index.  Trust me on this.  (There will be experts who have counter-examples, but once you know that much, you won’t need to read this article.)

Now, basically you’re about to look at a trade-off.  If inserts happen randomly in your table, you’ll get page splits.  Page splits are where a page is full and a new row needs adding in the middle: SQL Server splits the page into two half-empty pages.  DBAs talk about “fill factor” as a measure of how full your pages are.  The higher the fill factor, the better.  If you have a fairly random key for your clustered index, your fill factor will go down over time.  This isn’t the end of the world, because you can rebuild your indexes and, by implication, your table.  Your DBA will know all about this, and it’s the major reason why he needs to block out time at the weekend to run maintenance scripts.  (The same thing happens to non-clustered indexes as well, but the effects aren’t as severe, simply because the index is much smaller.)

It might seem like coming up with a “random” clustered index is a bad idea.  Actually, it’s a very good idea.  As I said, it’s a trade off.  Choosing a predictable index spec gives you a higher fill factor, but at the expense of concurrency.

The Clustered Identity Anti-Pattern

So, let’s say that you do what a lot of people do and place a clustered index on the identity field.  Usually this is declared as the primary key, but SQL Server doesn’t really care about unique constraints or primary keys: they’re just unique indexes.  Now you’ve got extremely predictable inserts.  In fact, they always get inserted right at the end.  Now remember that SQL Server typically locks at the page level.  Now imagine twenty people trying to insert at once.

I learned about this the hard way.  My very first commercial application was written for Microsoft SQL Server 4.2 (and Microsoft Access, but that’s another story…).  You loaded a complex object into the UI, did some work, and then saved it into the database.  The performance was cracking.

Well, until we ran a training session and the trainer said “and now could everyone save their orders”.  It took half an hour.  Every time.  I had no idea what on earth was going on.  Nothing I did seemed to help.  The product went live and was usually okay, but every so often it ground to a halt.  Ironically, I was saved by force majeure: the client went bust (not as a consequence of my shoddy software, thankfully).  It was only a year later when I finally decided to educate myself about this RDBMS thingy I was using that it clicked what was going on.

So, trust me, you don’t want the clustered index on the identity field in an OLTP scenario.  Which is unfortunate, because that’s often the obvious place.  Let’s talk about less obvious places to put it.  A person’s name is often a good choice, especially since you probably do range-based searches on it.

There’s a small problem with this: often the name isn’t actually unique.  In fact, sometimes nothing is except for the identity column.  Even then, a clustered index on name followed by id will typically result in a better-behaved database structure than an index on id alone.

Let’s recap

Non-clustered indexes are added to improve select performance at a small cost to insert performance.  Clustered indexes, on the other hand, should be chosen on the basis of insert performance first.  Improving a couple of selects is nice, but specifying a good insert distribution is much more important.

If you can possibly manage it, what you really want to achieve is to have each thread inserting onto its own page.  This is easier said than done (but GUID Comb has potential).  Spread your inserts too randomly, and you end up hitting every page in your database.  Make them synchronize across threads, and you kill concurrency.

And if you’re not rebuilding your indices once a week, performance will eventually drop through the floor due to low fill factors.

SQL Server: Non-clustered Indexes

Okay, first off, there hasn’t been a post in, conservatively, forever.  There have been a number of reasons for this: holidays, people visiting during the weekend (when I write this blog) and swine flu are probably the top three.  That, and I managed to lose my post on Liskov, so I’m going to have to re-write it.  But anyway, I thought that more than one person I know could do with reading a quick guide to SQL Server indexes.

Basics of SQL Server Performance

The first thing you need to know about is pages.  SQL Server puts rows onto pages.  If you access a row, the DB will load the entire page.  Whenever you’re talking about performance, 99% of the time you’re talking about how many pages get accessed.  Pretty much everything else, including processor time, is irrelevant.  Well, unless you decide to do something stupid like a user-defined function or a trigger.  But for straight DB access, pages are what it’s all about.

Non-clustered Indexes

Now, an index is basically a tree.  It’s a tree of pages.  The top level of the index is a page with rows.  Each row points to the page representing the next level of the index.  This works all the way down.  Finally, you get a link to the actual row itself.  This is your basic non-clustered index.

There are a few things to note:

  • If you’ve got five columns in your table but only two in your index, the index rows will only contain those two columns.
  • The rows in the index will be ordered, typically in ascending order.  (You can do it descending, but it’s rarely useful.)

This can speed up the following operations:

  • Filtering on the basis of columns in the index
  • Ordering using the exact ordering in the index
  • Selecting just the columns of the index.

The second one is rarely used, the first is the typical usage, but the last is really rather interesting.  Let’s say that you’ve got two columns in your index, and your query only uses those two columns.  Then, since the index structure is pretty much identical to the structure of the table itself (rows on pages), SQL Server can query the last level of the non-clustered index and never hit the table at all.  If you’ve got some “lite” versions of tables in your database structure, now would be a good time to throw them away and just replace them with indexes.  This is called “index covering” and is a vastly under-used technique.

Insert Performance

Every index you add has to be maintained on every insert and update.  This, obviously, adds to the cost of entering data into the system.  In practice, most systems I encounter could easily handle more non-clustered indexes, since the alternative is dreadful select performance.  However, keep an eye on it and measure it.
