Shard Balancer: Another Crack at QueueChannel

For those of you bored with this problem, I promise this is the last time I post about this.  That’s because I’ve finally come up with a solution I actually like.  Here, the model is that the message order still matters, but the state is in a database, rather than in memory.  This allows us to change our minds about which thread runs which orders.  I got the idea from reading about sharding and resharding strategies.  One of the approaches to resharding is to assign everything a “virtual shard” where there are a lot of virtual shards and only a few physical shards.  Then, when you need to expand, you can move virtual shards between physical shards.

Here, the “virtual shard” is the identification of the order.  The “physical shard” is the Retlang fiber.  Only one virtual shard is assigned to a physical shard at any given time, so the metaphor breaks down a bit.  Nonetheless, the basic idea is that when a worker is ready, it gets assigned all of the messages for a shard.  Then that shard is locked, so that messages just back up until the worker is finished.  After that point, any new messages can be assigned to a worker, and the worker is available for work again.

This is actually pretty similar to the keyed batch concept in Retlang, which led me to support subscriptions using Action<K, IEnumerable<T>> as well as Action<T>.  The former could, for instance, choose to only commit its changes at the end of a batch.  To support this use case properly, I needed to introduce a maximum batch size as well.  The batch size does matter: the larger the batch, the more efficient the code is.  If you don’t need batching at all, you can always set it to int.MaxValue.
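
The two subscription shapes look roughly like this (a paraphrase of the API; the names are illustrative rather than taken from the gist):

```csharp
using System;
using System.Collections.Generic;

// Per-message and keyed-batch subscription shapes.  A batch subscriber
// receives every queued message for one shard in a single call, capped
// at maxBatchSize; int.MaxValue effectively disables the cap.
public interface IShardSubscriber<TShard, TValue>
{
    void Subscribe(Action<TValue> onMessage);
    void Subscribe(Action<TShard, IEnumerable<TValue>> onBatch, int maxBatchSize);
}
```

A batch subscriber that wants transactional behaviour can open a transaction at the start of the batch and commit once at the end, which is exactly the use case that made the maximum batch size necessary.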

Performance

If you take a look at the test code, you’ll see it calculates a rough “efficiency” number: the time it took compared to the fastest possible.  In my tests, I’m seeing 98% efficiency.  There are a couple of commented-out lines that test QueueChannel in the same manner.  That shows a figure of 99%, so the order preservation and batching do impose some overhead.
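
For what it’s worth, the “efficiency” figure is just the ideal wall-clock time over the measured time; something like this (my reconstruction, not the actual test code):

```csharp
// Rough efficiency figure: the fastest possible elapsed time (the total
// work spread perfectly across all threads) divided by the measured time.
static double Efficiency(double totalWorkMs, int threadCount, double elapsedMs)
{
    double idealMs = totalWorkMs / threadCount;
    return idealMs / elapsedMs;   // 1.0 would mean perfect balancing
}
```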

Going back to the model of the system, we assumed that there was no in-memory state.  In practice, that doesn’t have to be the case.  The shard itself can be mutable.  In fact, if you’re tracking renamed objects (e.g. cancel/replaces in FIX) you’ll be doing this all the time.  You can modify the shard freely during processing, but if you modify state during both processing and assignment you’re going to need a lock.  This gives much more lock-free parallelism than I’d hoped to achieve with this approach, and pretty much renders the previous approach irrelevant.
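
If the shard does carry mutable state that both the worker and the assignment path touch, the locking looks something like this (a sketch with invented names, not the gist code):

```csharp
using System.Collections.Generic;

// Sketch: a shard whose state is mutated both by the balancer (during
// assignment) and by the worker (during processing), so both paths
// serialize through one lock.
public class OrderShard
{
    private readonly object _lock = new object();
    private readonly List<string> _pendingRenames = new List<string>();

    // Called on the balancer's thread while assigning messages.
    public void RecordRename(string newOrderId)
    {
        lock (_lock) { _pendingRenames.Add(newOrderId); }
    }

    // Called on the worker fiber while processing a batch.
    public string[] DrainRenames()
    {
        lock (_lock)
        {
            var result = _pendingRenames.ToArray();
            _pendingRenames.Clear();
            return result;
        }
    }
}
```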

A Walk Around The Code

I’ve decided to try publishing the code as a gist this time.  It’s certainly less effort than producing the usual HTML, but it’s not as flexible: the syntax colouring isn’t up to much and there’s no control over the presentation of the files.  So, the implementation can be found in the following files:

  • ShardBalancer.cs
  • ShardConsumer.cs
  • IConsumer.cs

Arguably, IConsumer doesn’t need to exist.  There’s extremely tight coupling between the balancer and the consumer so the only purpose it serves is to highlight how the balancer calls to the consumer.  I wrote this version from scratch, which explains the change in coding style, but the basic pattern is still Retlang’s hungry consumer implementation.
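
For orientation, the contract looks roughly like this; the member names are my guess at the shape, not the gist verbatim:

```csharp
using System.Collections.Generic;

// A paraphrase of the balancer/consumer contract: the balancer owns the
// queues, and a hungry consumer is handed everything queued for one shard.
public interface IConsumer<TShard, TValue>
{
    // Invoked with a full batch for a single shard.  While this runs, the
    // shard stays locked; when it completes, the shard can be reassigned
    // and the consumer goes back on the hungry list.
    void ConsumeBatch(TShard shard, IEnumerable<TValue> batch);
}
```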

The test harness consists of:

  • Program.cs
  • Execution.cs
  • SummationService.cs
  • OrderDb.cs

Observations

I think I’ve finally come up with a solution to the problem with which I’m happy.  This solution balances very well whilst still observing the necessary message ordering.  If you watch it in action, it behaves in a manner similar to the way an operating system assigns processes to processors.  I’m particularly pleased that it can be used to batch messages, which could enable some workloads to outperform a QueueChannel implementation under the right circumstances.

  • The balancer has two locks.  It’s extremely important that the implementation of Wakeup does no work itself; otherwise the solution would deadlock.  (There’s a sketch of this rule after the list.)
  • If KeyValuePair<T, V> were immutable, the batch dequeue could return KeyValuePair<TShard, IEnumerable<TValue>>.  This highlights a problem that’s going to be with us for a long time: the APIs were never built with the new type system in mind.
  • The DataToActions code is a bit on the unreadable side, but the alternative is two different implementations of ShardConsumer, which didn’t appeal.
  • The mechanism to empty the queue is vastly more elegant than the FlushAndWait implementation in the previous version.  EmptyEvent isn’t threadsafe in .NET 4.
  • The code needs a trivial change to compile with Retlang trunk.
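
To illustrate the Wakeup rule from the first bullet: the callback runs while the balancer holds its locks, so all it may do is enqueue the real work onto the consumer’s fiber and return.  A minimal sketch of the shape (invented names, not the gist code):

```csharp
using System;
using System.Collections.Generic;

// The deadlock-avoidance rule: Wakeup is called under the balancer's
// lock, so it must only hand the batch to the consumer's fiber.
public class SketchConsumer<TShard, TValue>
{
    private readonly Action<Action> _enqueueOnFiber;    // e.g. fiber.Enqueue
    private readonly Action<TShard, IEnumerable<TValue>> _handler;

    public SketchConsumer(Action<Action> enqueueOnFiber,
                          Action<TShard, IEnumerable<TValue>> handler)
    {
        _enqueueOnFiber = enqueueOnFiber;
        _handler = handler;
    }

    // Called by the balancer while it holds its own locks: do NO work here,
    // or a consumer that calls back into the balancer will deadlock.
    public void Wakeup(TShard shard, IEnumerable<TValue> batch)
    {
        _enqueueOnFiber(() => _handler(shard, batch));
    }
}
```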

You could experiment with switching strategies when the number of queued shards is less than the number of available threads: instead of choosing randomly, you could pick the longest queue and throw the entire thing onto a thread.  This is particularly attractive if you’re working with a terminating process, as it would speed up closing down at the end of the day.  Something this code does badly, as does Retlang in general, is report exactly what’s going on internally.  I’m going to see if I can come up with a solution for that.  Don’t hold your breath, though 🙂


A Future for Inversion of Control Containers

.NET has a lot of IoC containers.  This is undoubtedly a good thing: diversification is a sign of strength in the ecosystem.  MEF is showing how specialization can be used to solve specific issues.  I’m going to try to set out one vision of how IoC containers should evolve.  This is not my opinion of how they all should evolve; it’s simply my vision of what I would like from one of the many containers.

I’ve argued for some time that containers are all about configuration.  If you don’t believe that, then you should consider why Yadic isn’t your favourite container.  I’ve also been agitating for some time to get environmental deltas included in Castle Windsor, but actually this doesn’t even go far enough.  Let’s talk about what’s still wrong:

  • There’s no standard way to specify external configuration.  (i.e. you can’t separate configuration from wiring)
  • There’s no standard way to examine the runtime configuration.
  • There’s no standard way to report on the validity of the runtime configuration.
  • There’s no standard way of having a centralised configuration source.
  • There’s no sensible way of changing the configuration other than hacking files around.
  • You can’t change the runtime configuration and have that reflected in the state of the container.  Live.

Binsor achieves the first in a general way, but in a fashion that isn’t really acceptable.  nServiceBus has thought this through a lot, but only solves it for the case of using nServiceBus.  StructureMap provides the second at an API level, but not at an ops level.

Live Diagnostics

StructureMap has, at the time of writing, much better diagnostics than Castle Windsor, which still firmly believes in the “try activating the component and see if it throws an exception” model.  However, why can’t we do the following?

  • Ability to embed a web server in your application.  Seriously, why do we keep writing remoting+winforms applications just to do simple diagnostics?  (There’s a bare-bones sketch after this list.)
  • Ability to see all of the live configuration options.
  • Ability to modify the live configuration options.
  • Ability to see which configuration settings have issues.
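
The first bullet is less exotic than it sounds.  Here’s a bare-bones version using HttpListener, assuming all you want is to dump live configuration as text (a sketch only; a real one would want authentication):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Threading;

// A minimal embedded diagnostics endpoint: GET http://localhost:8123/
// dumps whatever key/value configuration the application can report.
// The port and the snapshot delegate are illustrative choices.
public static class DiagnosticsServer
{
    public static void Start(Func<IDictionary<string, string>> configSnapshot)
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8123/");
        listener.Start();

        new Thread(() =>
        {
            while (listener.IsListening)
            {
                var context = listener.GetContext();   // blocks per request
                var sb = new StringBuilder();
                foreach (var pair in configSnapshot())
                    sb.AppendLine(pair.Key + " = " + pair.Value);
                var bytes = Encoding.UTF8.GetBytes(sb.ToString());
                context.Response.OutputStream.Write(bytes, 0, bytes.Length);
                context.Response.Close();
            }
        }) { IsBackground = true }.Start();
    }
}
```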

You might think I’m talking futurology here, but Sam Newman and his mates have already built it.  (You should watch the entire talk.)  There’s no reason that work can’t be generalised in an opinionated manner that requires developers to think about these issues up front.

A possible objection is that this functionality is pretty much orthogonal to what containers do right now.  That’s correct to a certain extent, but it ignores a fundamental issue: every container I’ve used (let’s ignore MEF for now) works on the basis that, ultimately, the configuration for an application can be expressed as an XML file.  This expresses itself in a number of ways:

  • Things are strings that shouldn’t be.
  • Configuration can’t change during the lifetime of the container.
  • The container can’t recreate objects based on a change in configuration.  There’s no model for even expressing how that lifecycle should work.  (One possible shape of such a contract is sketched after this list.)
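
To make the last two bullets concrete, here’s one possible shape of the missing contract.  This is entirely my invention; no container I know of exposes anything like it:

```csharp
using System;

// A configuration source the container could observe, so that a changed
// setting can drive component lifecycles rather than requiring a restart.
public interface ILiveConfigurationSource
{
    string Get(string key);                  // current value of a setting
    event Action<string> SettingChanged;     // raised with the changed key
}

// The container-side half: declare which component depends on which key,
// and let the container decide what a change means for that component's
// lifecycle (dispose and rebuild, in the simplest model).
public interface IReconfigurableContainer
{
    void BindSetting(string key, Type componentType);
}
```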

Viewed from a certain angle, IoC containers at the moment are basically better XML configuration files.  There’s no reason they can’t be a better WMI.

DR is for Disaster Recovery

I was having a drink with a supplier the other day and he was explaining that during the outage “all of our settings had been replicated to DR, so we couldn’t switch to DR either.”  The implication was that automatically replicating settings to DR was a bad idea, precisely to avoid a scenario such as this.  Personally, I think it’s more about understanding purpose.  What do you want to do with your DR solution?

  • Deal with a complete physical loss of your primary site?
  • Handle hardware failure on one of your machines?
  • Handle a software failure on one of your systems?

Of these, only the first is actually disaster recovery.  The second is known as redundancy, and the third as rollback.  “Switching to DR” is intended for disaster recovery scenarios only.  The problem is that these are all real problems, and all too often we try to solve all three at once under the heading of “DR” without actually thinking about what the letters stand for.  Sadly, this tends to lead to solutions that achieve none of the aims.  It’s not a question of balancing competing priorities here; it’s a question of recognizing that the problems are separate, and so their solutions should be too.


Understand What You Measure

I never really intended to start blogging about my work kitchen, but it keeps providing me with anecdotes.  Today, I was thinking about the milk.  There are six fridges in the office, and typically you’ll find that each has eight jugs of semi-skimmed milk.  There’s also a single jug of skimmed milk.  Sometimes there’s none.  So, someone somewhere has made a decision as to the appropriate amount of skimmed and semi-skimmed milk to provide.  Now we enter the realms of pure speculation, but bear with me.


How do you determine what the mix should be?  Well, the obvious googley answer is: you measure consumption and use that as feedback.  I’m pretty sure that’s what someone at the supplier is doing.  The thing is, there’s an absolutely tiny amount of skimmed milk being supplied, far less than you’d expect given a glance at the supermarket.  So what’s wrong?

Ask anyone who drinks skimmed milk at the office and they can tell you: the skimmed milk tastes bad.  I mean, really bad, so bad you think it’s off.  Typically, the first time you try it you pour it down the sink.  The second time you realize it always tastes like that.  Every so often you see someone new do the exact same thing.

Selling Ice Cream In Winter

Famously, if a government department wants to kill something, they’ll run a study under circumstances that make it look incredibly unpopular.  But even when someone isn’t deliberately trying to manipulate the figures, unexpected factors occur.  The good news is that, for user interfaces, there’s an easy answer: learn a bit about usability testing.  With internal systems the problem is harder.  Worst of all are performance questions: just because you’ve reduced one number doesn’t mean you haven’t increased another.  Ultimately, you need to understand your numbers holistically if you’ve got any hope of doing something other than performance whack-a-mole.


What is n, anyway?

I’ve got about 500 CDs, and they’re in a complete mess.  So, I want to sort them.  What sort algorithm would you use?  Ask a computer science graduate and he’d tell you that heapsort is the single-threaded algorithm of choice.  If you had people helping you, you could consider mergesort.  Of course, neither of these approaches is actually sensible.  Here’s an idea that actually works (sketched in code after the list):

  • First go through the CDs and sort them into buckets according to the first letter.
  • Insertion sort the CDs belonging to a particular letter.
  • Stick the sorted CDs back on the shelves in order.
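
In code, the scheme is just a bucket sort by first letter followed by an insertion sort per bucket (band names standing in for CDs):

```csharp
using System;
using System.Collections.Generic;

static class CdSorter
{
    // Bucket by first letter (a cheap key test), then insertion-sort each
    // bucket.  The expensive full comparison only happens within a bucket,
    // and each CD is physically moved a small number of times.
    public static List<string> Sort(IEnumerable<string> cds)
    {
        var buckets = new SortedDictionary<char, List<string>>();
        foreach (var cd in cds)
        {
            char letter = char.ToUpperInvariant(cd[0]);   // assumes non-empty names
            if (!buckets.TryGetValue(letter, out var bucket))
                buckets[letter] = bucket = new List<string>();
            bucket.Add(cd);
        }

        var result = new List<string>();
        foreach (var bucket in buckets.Values)
        {
            InsertionSort(bucket);
            result.AddRange(bucket);
        }
        return result;
    }

    static void InsertionSort(List<string> items)
    {
        for (int i = 1; i < items.Count; i++)
        {
            string current = items[i];
            int j = i - 1;
            while (j >= 0 && string.Compare(items[j], current,
                       StringComparison.OrdinalIgnoreCase) > 0)
            {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }
}
```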

So why does our complexity analysis fail for this problem?  The reason is that our assumptions are incorrect.  All the analysis of sort algorithms assumes that it is effectively free to move the objects (because they can be pointers); hence, it optimizes for the number of comparisons.  With a physical CD to move around, this is no longer the case.  Equally, we assume all tests on our keys are equal.  With sorting CDs, it’s relatively low cost to determine the first letter of the band name.  A full comparison, including ordering by date of publication, is significantly harder.*

Units Matter

The point is, although we tend to ignore this, the units of n matter.  At different times, the performance of a system can be dominated by different factors:

  • Number of simultaneous requests
  • Size of the most complex workflows (think sagas if you’re an nServiceBus guy)
  • Number of workflows
  • Sheer number of messages
  • Memory consumption

This isn’t theoretical: I’ve got a single system where I’ve needed to optimize on all of these bases at one time or another.  If you want a system to scale, out or up, you’ve got to understand what units you’re dealing with.  Next, you’ve got to track them.  Probably the biggest performance secret of all time is good diagnostics, live and historical.

*There is also the problem that doing heapsort by hand is likely to be error prone, but that’s not the point I’m making.


Tip: Never Orphan a Collection in NHibernate

Some guidelines on how to deal with a collection property in NHibernate (there’s a sketch of the result after the list):

  • Create a protected empty constructor for the use of NHibernate.  There should be no code in this constructor.
  • Create public constructors with parameters to allow you to create objects within your code.  These should initialize the collection properties, typically as empty collections.  (Use a simple List<T> until you’re more sophisticated.)
  • Never, ever, ever set the collection property after construction.
  • Don’t allow it to have a public set.*
  • If you need to replace it, clear it and put the new values in.
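
Putting those guidelines together, an entity ends up looking something like this (a sketch; the class and member names are mine):

```csharp
using System.Collections.Generic;

public class Order
{
    // Protected empty constructor for NHibernate's use only.
    protected Order() { }

    // Public constructor for your code: initializes the collection.
    public Order(string customer)
    {
        Customer = customer;
        Lines = new List<OrderLine>();
    }

    public virtual string Customer { get; protected set; }

    // No public setter, and the reference is never replaced after
    // construction, so NHibernate's tracking collection stays attached.
    public virtual IList<OrderLine> Lines { get; protected set; }

    // To "replace" the collection, clear it and re-add; assigning a new
    // list would orphan the collection NHibernate gave you on load.
    public virtual void ReplaceLines(IEnumerable<OrderLine> newLines)
    {
        Lines.Clear();
        foreach (var line in newLines)
            Lines.Add(line);
    }
}

public class OrderLine { /* details omitted */ }
```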

Why is this so important?  Well, if you load the entity from the database, your collection property won’t be a List<T> anymore: it’ll be an object that NHibernate is using to track the one-to-many relationship.  If that exact object stops being connected to your entity, NHibernate will get upset and you won’t be able to save your objects anymore.

Unfortunately, people starting out tend to get an elevator pitch for NHibernate that suggests you can write your entities how you like and it’ll just work.  That isn’t the case.  If you’re an NHibernate ninja, you’ll be able to break some of these guidelines.  But if you’re having problems, check that your entities follow them.

*Actually, I don’t tend to put public set on any entity.  This is part of my “make methods have names that explain the business process by which the entity changes” strategy.  But that’s a whole different blog post…
