Understanding Inversion of Control Containers

I thought I’d write a bit about how I understand the philosophy of IoC containers.  I realize I’m actually building up to something that I started talking about a while ago.  I’m probably not saying anything that Martin Fowler didn’t say before, but I have my own slant on it.  To start off with, I’d like to just review what we actually mean by various terms:

  • Inversion of Control (IoC) is a general name for the pattern where an object isn’t responsible for managing the lifecycle of the services it uses. 
  • The simplest way to implement this (in .NET) is passing services in through the constructor.  This is termed constructor injection.
  • Typically, services are passed in using interfaces, which eases testability.  However, Inversion of Control is not about testability.

So what is an IoC container?  It’s a configuration tool.  That’s it.  Typically, it implements the constructor injection pattern like so (there’s a code sketch after the list):

  • For each object registered, you usually specify:
    • A name for the component
    • The interface it implements
    • The class that implements it.
  • For primitive values, you just say what the constructor parameter is and what the value should be.
  • For interfaces, you can either leave the implementation unspecified, in which case you get the default, or specify a particular component reference.
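
To make that concrete, here’s a minimal sketch of what such a registration typically looks like, using a Windsor-style fluent API (the component name and types here are purely illustrative):

container.Register(
    Component.For<IDownloader>()           // the interface it implements
             .ImplementedBy<Downloader>()  // the class that implements it
             .Named("downloader")          // a name for the component
             .Parameters(                  // a primitive constructor parameter
                 Parameter.ForKey("timeout").Eq("30")));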

Actually, there is one other thing the container does: it handles lifecycles.  This is a major feature that people often take for granted.  The clue is in the name, really.  Containers are things that hold objects, not produce them.  Containers typically allow you to specify the lifecycle of the object e.g.

  • one instance per process (Singleton)
  • one instance per thread
  • one instance per HttpContext

This lifecycle management is crucial to the use of IoC containers in most environments.  The catch is that it can have side effects you do not expect.  For instance, if you call a parameterized resolve on an object with a singleton lifecycle, the object will only ever have the first set of parameters passed in.  Any others will be ignored (the moral of this story is to always use transient lifecycles when dealing with run-time parameters).
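
Here’s a sketch of that gotcha, assuming a Windsor-style parameterized resolve (the exact overloads vary by container and version):

// Assume IDownloader was registered with a singleton lifestyle.
var first = container.Resolve<IDownloader>(
    new Dictionary<string, object> { { "target", new Uri("http://a.example.com") } });
var second = container.Resolve<IDownloader>(
    new Dictionary<string, object> { { "target", new Uri("http://b.example.com") } });
// first and second are the same instance: the constructor ran once,
// so the second "target" was silently ignored.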

A fundamental part of the philosophy of IoC containers is that they should be extremely low footprint and non-invasive.  The code should not need to know it is running in a container.  Nor should the interfaces.  There are, however, a number of times that you do need to know about the container.  The obvious one is when reasoning about lifecycle management, but there are also a number of places where the abstraction gets broken.  Having the abstraction broken is not as painful as having no abstraction at all, but it can be a distraction.

Evaluation of Containers

There are, of course, a lot of subtleties about containers.  Quite a lot of people come to the conclusion that the libraries out there are too “heavy-weight” and that they would be better off rolling their own.  If you’re one of those people, hopefully after reading this list you will either decide to refocus your efforts on improving the existing libraries, or you will have a USP that merits the duplication of effort.  (Or you just want to have fun, which is always a valid reason for writing code.)  I’ve listed some of those subtleties here:

Most of this is specific to Castle Windsor, since it’s the one I’ve worked with most, but many of these questions are common across implementations and are things you should watch out for when evaluating.  I will reiterate that whilst it is easy to write a simple IoC container, writing a very good one such as Castle is a challenge.

Are Primitives Strings?

My personal bugbear is that IoC containers started out when XML was fashionable.  As a consequence, there’s a tendency in most of them to treat everything as a string.  Since these days there’s a move towards DSLs such as Binsor or fluent configuration, the requirement that parameters be strings is out of date.  There are a number of side effects of this.  Castle Windsor RC3, for instance, fails one of its unit tests in a UK environment due to different date formats.  Equally, adding a primitive type that isn’t easily expressed as a string is painful.  Custom Type Converters are a useful concept for dealing with text-based file formats, but seriously, why can’t you say

Component.For<IDownloader>()
    .ImplementedBy<Downloader>()
    .Parameters(
        Parameter.ForKey("target").Eq(new Uri("http://www.google.com"))
    );

The current way of achieving this is unnecessarily verbose.

How are Lists Handled?

If there is one thing I heartily dislike about Castle, it’s the list handling.  Ironically, in many ways, the list handling is strong: it’s relatively easy to register an array of arrays of strings, for instance.  However, once you leave primitives, it gets more ambitious.  If you create a constructor parameter of IEnumerable<IService>, it will by default pass in a list of all components that are registered with the IService interface.  There are a number of problems with this (see the sketch after this list):

  • The worst is that it gets in the way of the second simplest use case of a list: one where you specified a list of component references yourself.  If you try this, you end up with a type conversion error.
  • It can’t handle super-interfaces; it will only ever do exact matches.
  • You can’t specify that you care about interfaces on the registered implementations.  Thus, requesting IEnumerable<IDisposable> wouldn’t return the “Right Thing” (all registered disposable objects) even if you could specify that you wanted super-interfaces.
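
To illustrate the default behaviour, here’s a minimal sketch with hypothetical types (depending on the version, Windsor may need a collection sub-resolver enabled for this to work at all):

public class ServicePipeline {
    private readonly IEnumerable<IService> services;

    // By default, the container supplies every component registered
    // against IService, not a list of references you chose yourself.
    public ServicePipeline(IEnumerable<IService> services) {
        this.services = services;
    }
}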

I would advise anyone evaluating a container to pay particular attention to how you can specify lists of components, because it comes up a lot in real use cases.

What Issues are there with Open/Closed Generics?

There are always a couple of bugs to do with open and closed generics.  Castle recently fixed another one.  In March of this year, it wasn’t possible to express this concept in StructureMap:

Component.For<IMessageHandler<string>>()
    .ImplementedBy<MessageHandler>();

Indeed, this issue was pretty much why I moved to Castle in the first place.  These days you’ve got to come up with something fairly involved to run into a problem (e.g. an open generic type relying on a closed one).  However, if you’re using one of the many less-popular frameworks, or rolling your own, you need to watch out for this.

How does the Container Deal with Multiple Interfaces?

If you register the same class as the implementation of multiple interfaces, typically you will end up with multiple instances.  It’s possible to mitigate this by using explicit component references, but that’s not a perfect solution.  Sometimes you want a service that exposes different interfaces to different consumers.  Castle Windsor calls this feature “forwarding”.
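
A forwarded registration in Windsor looks roughly like this (the type names are illustrative):

container.Register(
    Component.For<IUserReader, IUserWriter>()   // both interfaces forward
             .ImplementedBy<UserRepository>()); // to one implementation
// Resolving IUserReader or IUserWriter now yields the same component
// (and, with a singleton lifestyle, the same instance).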

How can you Inject your own code?

How good is the container at handling the case where it doesn’t create the object itself?  Can you say something like this?

Component.For<IConnection>()
    .CreatedBy(() => ConnectionFactory.NewConnection());

Windsor’s story here is rather painful, with two facilities defined which use reflection to run.  On the other hand, they support Dynamic Proxy out of the box, so intercepting method calls to the interfaces is pretty simple and powerful.

Can you Create a Derived Container?

I am, frankly, amazed this does not come up more often.  It should be relatively easy to create a container based upon another container, overriding and extending parts of the configuration.  This is actually extremely useful.  Binsor has the Extend keyword (you’ll need to check the unit tests for documentation) which achieves this, but frankly this is too important a feature to be left to the DSL; it should be a fundamental part of the container.  Certainly there’s no easy way to achieve this in Windsor without using Binsor.  I think there will probably be a whole separate post about this.

Automated deployments #1: What’s on your server?

No really, what’s actually on your server?  If your first answer isn’t “erm”, you’re either very good or you don’t understand the question.  Servers are huge; they can store War and Peace on their hard drives without you even noticing.  For that matter, they can store any number of PDFs of books on patterns and practices without any appreciable benefit to civilization, but I don’t think that’s really the fault of the servers.  It’s practically impossible to really know what’s on there.  What’s worse, the way most people do development, they make the job harder for themselves.

I had a meeting with our auditor today and thanked my lucky stars that we had automated deployments.  Automated deployments save an awful lot of effort if they’re done right, and really save your hide when people start poking around your process.  Let’s talk about a really simple question: what’s on your server?

If you tell me it’s version 1.2.6, I’m going to have a few questions.

  • What was in version 1.2.6?  Is there a label in source control?
  • Was every file checked in?
  • What build does that correspond to?
  • How can you check that the build is what got deployed?
  • How about the config, is that in source control?  The actual config that’s on the server right now.
  • How do you know nothing’s changed on the server since then?

Look at Microsoft, or any large company, and they’ve got this sorted out.  It’s internal development teams that tend to have a problem.  When people ask these questions:

  • What’s changed since we last deployed?
  • What could this affect?
  • Can we rollback?

You want to have good answers.  And absolutely fundamental to this is: know what’s on your server.  Exactly.

First, you need to have a build server.  Download and love CruiseControl.NET.  Builds on local machines always turn out to have the wrong version, a reference to something that isn’t in source control, or a dependency that isn’t well worked out.  A real pain for anyone starting with this is that it turns out your solution files aren’t really as flexible as you’d like.  You can get going with MSBuild, but there’s a reason every open source project uses NAnt.  (NAnt is far from perfect, but it’s a heck of a lot easier than MSBuild for anything slightly complex.)

Anyway, here are my answers:

  • Version numbers are build numbers.  “1.2” is just for PR, it’s got nothing to do with versioning.  Call it what you like (you can call it Vista if you must) but the real version number is the build number from the build server.
  • Build Servers will only build files that are checked in.
  • I said that version numbers are build numbers, right?
  • We label every assembly that gets built with the build number (I stick it in the summary in the DLL).  This makes it really easy to just check the version on the server.  Also, we stick the information in an About Box, or a web service call.
  • The actual config on the server isn’t in source control, but the environmental delta is.  The deployment process stamps the config with the information as well.
  • Making sure that nothing’s been changed is harder, because no-one’s written a general tool for doing so, but taking a hash of the directory straight after deployment and checking it each day will catch most of that.  (You can hash files individually for better diagnostics.)  Tracking every setting on the server is probably more trouble than it’s worth, but I do have a tool for downloading IIS Virtual Directory settings to an XML file, because that turned out to be on the right side of the cost/benefit calculation.
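
For what it’s worth, here’s a minimal sketch of the directory-hashing idea; it isn’t the tool I use, just an illustration built on the standard crypto classes:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class DeploymentAudit {
    // Hash every file under the deployment directory; record the result
    // at deploy time and re-run it on a schedule to detect drift.
    public static string HashDeployment(string root) {
        using (var sha = SHA256.Create()) {
            var builder = new StringBuilder();
            var files = Directory.GetFiles(root, "*", SearchOption.AllDirectories)
                                 .OrderBy(file => file);  // stable ordering
            foreach (var file in files) {
                var hash = sha.ComputeHash(File.ReadAllBytes(file));
                // One line per file makes diagnostics easier later.
                builder.AppendLine(file + ": " + BitConverter.ToString(hash));
            }
            return builder.ToString();
        }
    }
}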

Your answers don’t need to be the same, but I guarantee you your life will be easier when you have answers to these questions.  Importantly, the work scales: the more people join your team, the more this stuff matters.  Incidentally, you can do all of this in TFS.  I know, I’ve done it.  And I’ve regretted not using CruiseControl.NET, NUnit, SVN and NAnt every time.  Open source, depressingly, appears to be better documented than the stuff I paid for.

Mono’s still got a long way to go

Reading this didn’t impress me massively, and not only because it’s a reheated blog post.  Don’t get me wrong, the Mono team has done some superb work, but it’s really not ready for primetime.  Miguel does a phenomenal job of cheerleading, but let’s take a look at this particular example.  That’s not an industry standard benchmark they’re running there, it’s some code on some guy’s blog.  It’s quite a nice and interesting blog, but it’s nowhere close to the mainstream.

Sadly, where Mono still falls down is meat-and-potatoes issues.  Look at the number of patches the Ubuntu team need to make to each release before they’re happy packaging it.  Look at the weird behaviour problems that the Castle team discuss on their mailing lists (e.g. why on earth does Mono 1.9 believe it can convert from a string to an interface?  Don’t they have a regression test for that?).  Worst of the lot, however, has to be the garbage collector.

Getting the garbage collector wrong is second only to getting the compiler wrong.  People won’t understand what the problem is, but they’ll suffer when it doesn’t work right.  Mono currently uses the Boehm garbage collector, which is a non-compacting, conservative collector designed for C and C++.  If you use vanilla .NET, you don’t need to know about the garbage collector for the most part (unless you’re doing stupid things with finalizers); however, if you’re running on Mono, the same program that runs fine on .NET can give you nonsense like this:  http://www.mail-archive.com/mono-list@lists.ximian.com/msg22436.html.  (Incidentally, the suggested remedy is a sticking-plaster over a sword wound.)

At the moment, the only real solution to this problem is to use allocation patterns that the Boehm GC likes, which is ridiculous to anyone who has stopped worrying about memory fragmentation for the last five years.  In fairness, the Mono Project is planning to address this at some point towards the end of the year.  Then all I’ll be worried about is their quality problems.

Unforeseen Consequences: My lavatory made a funny noise

Now, everyone in my office has already heard this story, but it deserves a wider audience.

Six months ago, I bought a house.  One of the lavatories made a funny noise when it flushed.  Actually, not so much funny as extremely loud.  This noise would go on for about a minute.  It actually sounded like the house was shaking apart.  I ignored this for months and worked around the problem.  Being a geek, I figured out that running the bath at the same time stopped the noise, so I knew I was dealing with a resonance problem.  I’m not, however, a plumber, and had no idea what was starting the whole thing off.

There are a couple of weird things about the house.  One is that the lower bathroom had a shower head, but no shower rail.  Not particularly wanting to soak my new house every time I used the shower, I used the shower upstairs.  Finally, I (or should I say, my significantly more organised wife) got the plumber ’round.  He installed a shower rail, took one look at the loo and determined it had the wrong washer on it.  Replacing the washer for a fiver, he managed to fix a problem that I’d assumed was going to cost me thousands.

I then went away for a couple of days, and came back to a leak in my kitchen.  Water was seeping through from the ceiling.  I went nuts, thinking the house was about to fall down.  I phoned up the plumber and he agreed to come back on the Sunday morning.  (Our plumber, you will appreciate, is an absolute brick.  Couldn’t praise him more highly.)  In the morning, we started discussing the problem.  Maurice (really) first wanted to check that he hadn’t drilled through a pipe.  He was quite happy to admit that he had done so before, but he doubted this was the problem since we’d have a lot more water leaking.  We then started on a relatively serious discussion of whether it was better to rip up the floorboards or break through the plaster.  Another difficulty was working out from where exactly it was leaking.  Finally, I asked him if it was possible that fixing the washer had affected something else.  Maurice said “No, that can’t happen.  Let me explain why.”  He lifted off the cover of the lavatory tank, stared at it and said “There’s no overflow”.

For those of you who don’t know, the overflow is a pipe out of the back of your loo that goes outside.  In the event of a minor problem, you end up with water being dumped outside your property.  Since the property can handle rain, it’s not an urgent problem and is easily fixed.  What my loo did, was drop the water from the overflow onto the floor, and eventually through the kitchen ceiling.  Basically, the guy who’d installed it in the first place had done a dreadful, incompetent job.  So now I have a bucket where the overflow should be and another date with Maurice where he’s going to install some overflows.

The reason I mention this is, the experience was nigh on identical to conversations I have with my manager about some legacy systems I deal with:

  • Problems sometimes aren’t as serious as their symptoms suggest.
  • Fixing one thing may highlight a problem somewhere else.
  • Always explain to someone else why something can’t possibly happen.
  • An audit won’t find every problem.
  • You’re always going to get these problems when you’re taking over a badly done job.

And sadly, sometimes you won’t get lucky and will have to rip up the floorboards to figure out what’s going wrong.

How to support Default.aspx in ASP.NET MVC

If you’re trying to use MVC incrementally on an existing project, this can be a bit of a pain.  There must be some way of getting the routing logic to redirect itself, but in practice the following does the trick (assuming you’re using the standard routing).

    public class HomeController : Controller {
        public ActionResult Index()
        {
            return new RedirectResult("~/Default.aspx");
        }
    }
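
For reference, this relies on the standard route registration from the default project template, which looks roughly like this:

    routes.MapRoute(
        "Default",                                              // route name
        "{controller}/{action}/{id}",                           // URL pattern
        new { controller = "Home", action = "Index", id = "" }  // defaults
    );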

Understanding MapReduce #1: The Assumptions

I finally had a light-bulb go off in my head about MapReduce.  To be honest, part of the problem with understanding it is that the implementation most of us look at (Hadoop) has a considerable amount of implementation detail visible at all times.  I’m going to try to explain some of the fundamentals behind it in terms of C#, which has two great advantages:

  • I know it
  • LINQ gives us a fairly decent syntax for expressing algorithms declaratively.

Now, as everyone knows, Map and Reduce are Lisp terms for projection and aggregation respectively.  In LINQ, these are called Select (or SelectMany) and Aggregate (or just “apply function”).  MapReduce simply applies one and then the other.

        public static TResult MapReduce1
            <TResult, TMapped, TInput>
            (
            Func<TInput, TMapped> map,
            Func<IEnumerable<TMapped>, TResult> reduce,
            IEnumerable<TInput> inputs) {

            return reduce(
                from input in inputs
                select map(input)
                );
        }
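
To make that concrete, here’s a quick hypothetical usage: a word count across a couple of documents (Sum comes from System.Linq):

        var documents = new[] { "the quick brown fox", "jumped over" };
        int totalWords = MapReduce1<int, int, string>(
            document => document.Split(' ').Length,  // map: document to word count
            counts => counts.Sum(),                  // reduce: add the counts up
            documents);
        // totalWords == 6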

That’s it!  So why is it so clever?  Well, what Google did was to change the assumptions a bit.  The irony is that by adding in more conditions, they actually came up with something more general, not less.  So, let’s take a look at some of those assumptions:

  • The map always returns a list. 
  • The reduce function operates on the same input type as output type.
  • The reduce function is idempotent.  In plain English, if you reduce the output of a reduce, your output will be equal to your input.

The first one’s a gimme.  Returning a list doesn’t make a blind bit of difference.  You could just return one item for every input and you’d be back to the original function.  However, the restriction on the reduce is hugely powerful.  In particular, it allows for the distribution of partial reduces.  I’m not going to show that in code today.

Version 2 of the code looks pretty similar:

        public static IEnumerable<TResult> MapReduce2
            <TResult, TInput>
            (
            Func<TInput, IEnumerable<TResult>> map,
            Func<IEnumerable<TResult>, IEnumerable<TResult>> reduce,
            IEnumerable<TInput> inputs) {

            return reduce(
                from input in inputs
                from mapped in map(input)
                select mapped
                );
        }

We’ve got an extra from clause to deal with, but otherwise this is pretty tame.  Note that we’ve made the reduce return a list as well.  Again, it doesn’t make much of a difference.  We’ll abstract away the concept of applying a map.

        public static IEnumerable<TResult> MapReduce2b
            <TResult, TInput>
            (
            Func<TInput, IEnumerable<TResult>> map,
            Func<IEnumerable<TResult>, IEnumerable<TResult>> reduce,
            IEnumerable<TInput> inputs) {

            Func<IEnumerable<TInput>, IEnumerable<TResult>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return reduce(applyMap(inputs));
        }

Now things get interesting.  MapReduce assumes that you’re using Tuples everywhere.  This is the most important step.  The point is, it groups on the basis of the keys.  We can also use different keys for mapped data and the results of reduces, although the type system restricts how useful that could be.  Version 3 does look somewhat more complex.

        public class Tuple<TKey, TValue> 
        {
            public TKey Key;
            public TValue Value;
        }

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce3
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<TKey, IEnumerable<TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => from result in results
                           group result.Value by result.Key into grouped
                           from reduced in reduce(grouped.Key, grouped)
                           select reduced;
            Func<IEnumerable<TInput>, IEnumerable<Tuple<TKey, TValue>>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return applyReduce(applyMap(inputs));
        }

The important bit is the way we’ve redefined the reduce operation.  Now the reduce operation operates on a list of values for a particular key (it can still return whatever it likes).  The applyReduce function demonstrates how this concept of reduce maps onto the old concept of reduce.
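
To see the shape of this, here’s a hypothetical word count: the map emits a (word, 1) tuple per word, and the reduce sums the counts for each key:

        var results = MapReduce3<string, int, string>(
            document => document.Split(' ')
                .Select(word => new Tuple<string, int> { Key = word, Value = 1 }),
            (word, counts) => new[] {
                new Tuple<string, int> { Key = word, Value = counts.Sum() } },
            new[] { "the quick the" });
        // results: ("the", 2), ("quick", 1)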

The LINQ syntax obscures one thing we’ve overlooked so far: how the grouping actually works.  The Hadoop implementation makes this far from explicit as well.  Hadoop does it by requiring all keys to implement “WritableComparable”.  The direct translation would be to require TKey to implement IComparable.  However, we’ll go with a more .NET like way of doing things using IEqualityComparer<TKey>.  Here’s version 3 with an IEqualityComparer.

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce4
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<TKey, IEnumerable<TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEqualityComparer<TKey> groupRule,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => results
                            .GroupBy(result => result.Key, result => result.Value, groupRule)
                            .SelectMany(grouped => reduce(grouped.Key, grouped));
            Func<IEnumerable<TInput>, IEnumerable<Tuple<TKey, TValue>>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return applyReduce(applyMap(inputs));
        }
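
The groupRule is what decides whether two keys are “the same”.  For instance, passing StringComparer.OrdinalIgnoreCase (which implements IEqualityComparer<string>) makes the hypothetical word count case-insensitive:

        var results = MapReduce4<string, int, string>(
            document => document.Split(' ')
                .Select(word => new Tuple<string, int> { Key = word, Value = 1 }),
            (word, counts) => new[] {
                new Tuple<string, int> { Key = word, Value = counts.Sum() } },
            StringComparer.OrdinalIgnoreCase,  // "The" and "the" now group together
            new[] { "The quick the" });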

Now, I’ve tried to avoid talking about distribution concerns in this post, but here we’re forced into it.  The results of maps will potentially be transmitted across the network.  Therefore, it makes sense for the grouping to actually occur during the map.  Again, you might not see this in the Hadoop examples as the grouping is actually performed by the OutputCollector.  While we’re here, we’ll observe that the Hadoop standard of taking two inputs to the reduce function doesn’t make much sense in an environment in which IGrouping is a standard concept.  Thus, we can move the grouping call to the map as follows:

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce5
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<IGrouping<TKey, TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEqualityComparer<TKey> groupRule,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<IGrouping<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => results.SelectMany(reduce);
            Func<IEnumerable<TInput>, IEnumerable<IGrouping<TKey, TValue>>> applyMap =
                mapInputs => mapInputs
                    .SelectMany(map)
                    .GroupBy(result => result.Key, result => result.Value, groupRule);
            return applyReduce(applyMap(inputs));
        }

The problem with writing it out like this is that the Func definitions get to be most of the code.  Let’s see it again, simplified:

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce6
            <TKey, TValue, TInput>
            (
                Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
                Func<IGrouping<TKey, TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
                IEqualityComparer<TKey> groupRule,
                IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<IGrouping<TKey, TValue>>> collectOutput =
                mapped => mapped.GroupBy(result => result.Key, result => result.Value, groupRule);
            return collectOutput(inputs.SelectMany(map)).SelectMany(reduce);
        }

Now, Hadoop goes one stage further by insisting that the inputs also be tuples.  It then has a file handling system for generating those tuples from files.  Let us just, for the moment, observe that actually generating the list of inputs may be an expensive operation in itself.  So, we need to be able to deal with batches of inputs.  We’ll leave that problem until next time.

When will I ever learn?

So, I just noticed before deploying the latest version of a system that one of the drop downs wasn’t populating on the front end.  Now, the front end is far from perfect, but this part of the code actually has pretty good test coverage.  So it was a bit puzzling as to why this hadn’t been flagged by the build.  Diving into some controller tests I knocked together about nine months ago, I find the following line.

            IgnoreStaticData(view);

Need I say more?

It seems like I need to keep re-learning the lesson: anything that you’re not testing is wrong.

The LINQ vs Generators Shootout, Round #1

I have to admit, Python is growing on me.  I’m still not entirely convinced of the utility of IronPython, especially given that Boo exists (why don’t more scripting languages allow you to meddle with compilation?).  However, Python as CPython or Jython is actually a rather interesting beast, with some very cool stuff being done on it (I really like the look of Pylons, for instance; I’ll probably write something up on that in the future).

I thought I should probably expand on my remark that LINQ was more powerful than list comprehension.  It was pointed out to me that Python supports both lazy and greedy evaluation (it calls the greedy form list comprehensions and the lazy form generators).  LINQ is purely lazy, although adding “ToList” onto the end will typically deal with that if it is a problem (and it would be if you used it naively).
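
A two-line sketch of the difference in C#:

var numbers = new[] { 1, 2, 3 };
var lazy = numbers.Select(n => n * 2);  // nothing evaluated yet: deferred
var greedy = lazy.ToList();             // evaluated here, exactly once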

So, how is LINQ a better form of list comprehension?  Four reasons:

  • It’s implemented as a data model, allowing stuff such as LINQ to NHibernate to exist.
  • It supports group by
  • It supports order by
  • It supports intermediate assignments through let and into

The first is probably the most technically impressive, but it’s also the most controversial.  It means that LINQ is much more than just a list comprehension system, but no-one’s got enough experience of it yet to know exactly how these features are best used.

The grouping is cool, although I have to admit I’ve rarely needed it.  The ordering, on the other hand, is huge.  Python’s model for sorting is essentially C#’s IComparable model.  If you’ve ever tried to sort by three keys, you’ll know the problems with it.  In contrast, you can just specify the keys and let LINQ sort it out for you.
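
For example, a three-key sort in LINQ needs no comparer class at all (assuming a hypothetical people collection):

var sorted = people.OrderBy(person => person.LastName)
                   .ThenBy(person => person.FirstName)
                   .ThenByDescending(person => person.Age);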

The final one is probably the most useful of the lot, even if it seems minor.  Take a look at the following code:

public static string Deduce7ZipDirectory(IEnumerable<string> output)
{
    var regex = new Regex(@"\s(?<Folder>[^\s\\]+[^:])[\\]");
    var result = (from value in output
                  let match = regex.Match(value)
                  where match != Match.Empty
                  select match.Groups["Folder"].Value)
        .FirstOrDefault();
    return result;
}

I actually wrote this code to parse the output of 7zip’s command line list function and I think it’s pretty elegantly declarative.  I’m not entirely happy with the debugging story, however.  You can put breakpoints within the LINQ statement, but seeing the local variables doesn’t seem to work for me.  Ironically, this is a bigger problem for C# than it is for JavaScript or Python, simply because it’s possible to write rather complex things in these statements.

Personal Note

I made the mistake of posting shortly before I went on holiday for several weeks.  I’d like to thank everyone who commented; I learnt a lot.  Amongst the things I learned was that I really need to get around to writing an “About Me” page, mostly because of my aversion to posting noise rather than signal.  For the record, my name is Julian Birch and I live in London.

Spider Python: Some Notes on IronPython

I’ve converted my Retlang spider example over to IronPython in order to get a feel for the differences.  Here are some notes:

  • Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful, and it supports deferred execution, while list comprehensions are evaluated greedily.  UPDATE: Thanks to Mark for pointing out that generators support deferred execution.  There’s still no syntax for grouping or ordering, but these are relatively rare cases.
  • Every description of the Python syntax I ever see emphasizes the fact you don’t need to put in braces.  Pity they don’t spend more time telling you that you have to put in colons; that would actually be useful knowledge.  This really bit me when I learnt Boo, which is syntactically very similar.
  • IronPython 1.0 targets CPython 2.4.  This sounds fine until you realize that this was released in 2004.  A fair bit has happened since then, not limited to the introduction of an inline if syntax.
  • While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
  • The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.
  • Tabs are the devil as far as indents in Boo and Python are concerned.  I highly recommend using something that shows tabs explicitly, and then eliminating the lot.
  • Instance methods that explicitly take self feel awkward to me.
  • Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward.  UPDATE:  I’ve now learnt about decorators, thanks to Ken.  @staticmethod is relatively readable.
  • The casting to delegate isn’t quite as slick as it is in C#.  Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.
  • The lambda syntax is pretty elegant, but so is C#3’s.  Indeed, C#3 seems to have the edge.
  • Python’s regular expressions are powerful, but not quite as good as .NET’s.  In particular
    • search does what you’d expect match to do
    • findall doesn’t return matches.  Rather, it returns the value of the first group of the matches.  This is often actually what you wanted, but it’s a bit peculiar.  The semantics of “Matches” in .NET are much easier to understand.
  • It did seem rather slow compared to the C# version.  There are, of course, way too many variables here to make a final judgement, but it was disappointing.

So, I’m still waiting for my ESR moment.  Can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

  • You’ll need to explicitly declare where the Retlang DLL is.
  • I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.
  • It handles proxies better than the original version.
  • There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML.  I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

 

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

clr.AddReferenceToFileAndPath("""c:\WhereeverRetlangIs\Retlang.dll""")

from Retlang import *

def Search(baseUrl, spiderThreadsCount):
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()

    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    spiderTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        # Bind the spider as a default argument: a bare closure would
        # capture the loop variable and leave every queue using the last one.
        spiderChannel.Subscribe(queue,
            lambda url, spider=spider: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        searchUrls = lambda pattern: [match for match in re.findall(pattern, content)]

        urls = [self.ToAbsoluteUrl(pageUrl, url)
                for url in searchUrls("href=[']([^'<>]+)[']")
                    + searchUrls('href=["]([^"<>]+)["]')
                    + searchUrls("href=([^'\" <>]+)")
                if url is not None and url.Length > 0
                    and self.IsInternalLink(url)
                    and url[0] != '#'
                    and not url.endswith(".css")
                    and not re.search("css[.]axd", url)]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)

    def IsInternalLink(self, url):
        url = url.lower()
        if url == '"' or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True

    def ToAbsoluteUrl(url, relativeUrl):
        if re.search("//", relativeUrl):
            return relativeUrl
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                raise Exception("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request: " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()

    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            # Path.GetExtension returns the extension with its leading dot.
            if Path.GetExtension(url) != ".css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        if self._urlsInProcess == 0:
            self._waitHandle.Set()


for url in Search("http://www.yourtargeturl.com/", 5):
    print url

 


IronPython: I really hope this isn’t going to happen in the final version.

I mean, seriously, how is someone expected to debug this?

[screenshot: an IronPython stack trace]

The bit I really dislike: try to spot the name of the function that threw the error.  I guess it’ll be slicker when it’s finished…

For that matter, I’d rather not see this appear, either:

[screenshot: another unhelpful IronPython error message]

In fairness, many of Boo’s error messages are pretty useless.
