Understanding Inversion of Control Containers

I thought I’d write a bit about how I understand the philosophy of IoC containers.  I realize I’m actually building up to something that I started talking about a while ago.  I’m probably not saying anything that Martin Fowler didn’t say before, but I have my own slant on it.  To start off with, I’d like to just review what we actually mean by various terms:

  • Inversion of Control (IoC) is a general name for the pattern where an object isn’t responsible for managing the lifecycle of the services it uses. 
  • The simplest way to implement this (in .NET) is passing services in through the constructor.  This is termed constructor injection.
  • Typically, services are passed in using interfaces, which eases testability.  However, Inversion of Control is not about testability.

So what is an IoC container?  It’s a configuration tool.  That’s it.  Typically, it implements the constructor injection pattern like so (there’s a code sketch after the list):

  • For each object registered, you usually specify:
    • A name for the component
    • The interface it implements
    • The class that implements it.
  • For primitive values, you just say what the constructor parameter is and what the value should be.
  • For interfaces, you can either leave the implementation unspecified, in which case you get the default, or specify a particular component reference.
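
To make that concrete, here’s a minimal sketch of what such a registration typically looks like, using a Windsor-style fluent API (the component name and types here are purely illustrative):

container.Register(
    Component.For<IDownloader>()           // the interface it implements
             .ImplementedBy<Downloader>()  // the class that implements it
             .Named("downloader")          // a name for the component
             .Parameters(                  // a primitive constructor parameter
                 Parameter.ForKey("timeout").Eq("30")));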

Actually, there is one other thing the container does: it handles lifecycles.  This is a major feature that people often take for granted.  The clue is in the name, really.  Containers are things that hold objects, not produce them.  Containers typically allow you to specify the lifecycle of the object e.g.

  • one instance per process (Singleton)
  • one instance per thread
  • one instance per HttpContext

This lifecycle management is crucial to the use of IoC containers in most environments.  The catch is that it can have side effects you do not expect.  For instance, if you call a parameterized resolve on an object with a singleton lifecycle, the object will only ever have the first set of parameters passed in.  Any others will be ignored (the moral of this story is to always use transient lifecycles when dealing with run-time parameters).
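
Here’s a sketch of that gotcha, assuming a Windsor-style parameterized resolve (the exact overloads vary by container and version):

// Assume IDownloader was registered with a singleton lifestyle.
var first = container.Resolve<IDownloader>(
    new Dictionary<string, object> { { "target", new Uri("http://a.example.com") } });
var second = container.Resolve<IDownloader>(
    new Dictionary<string, object> { { "target", new Uri("http://b.example.com") } });
// first and second are the same instance: the constructor ran once,
// so the second "target" was silently ignored.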

A fundamental part of the philosophy of IoC containers is that they should be extremely low footprint and non-invasive.  The code should not need to know it is running in a container.  Nor should the interfaces.  There are, however, a number of times that you do need to know about the container.  The obvious one is when reasoning about lifecycle management, but there are also a number of places where the abstraction gets broken.  Having the abstraction broken is not as painful as having no abstraction at all, but it can be a distraction.

Evaluation of Containers

There are, of course, a lot of subtleties about containers.  Quite a lot of people come to the conclusion that the libraries out there are too “heavy-weight” and that they would be better off rolling their own.  If you’re one of those people, hopefully after reading this list you will either decide to refocus your efforts on improving the existing libraries, or you will have a USP that merits the duplication of effort.  (Or you just want to have fun, which is always a valid reason for writing code.)  I’ve listed some of those subtleties here:

Most of this is specific to Castle Windsor, since it’s the one I’ve worked with most, but many of these questions are common across implementations and are things you should watch out for when evaluating.  I will reiterate that whilst it is easy to write a simple IoC container, writing a very good one such as Castle is a challenge.

Are Primitives Strings?

My personal bugbear is that IoC containers started out when XML was fashionable.  As a consequence, there’s a tendency in most of them to treat everything as a string.  Since these days there’s a move towards DSLs such as Binsor or fluent configuration, the requirement that parameters be strings is out of date.  There are a number of side effects of this.  Castle Windsor RC3, for instance, fails one of its unit tests in a UK environment due to different date formats.  Equally, adding a primitive type that isn’t easily expressed as a string is painful.  Custom Type Converters are a useful concept for dealing with text-based file formats, but seriously, why can’t you say

Component.For<IDownloader>()
    .ImplementedBy<Downloader>()
    .Parameters(
        Parameter.ForKey("target").Eq(new Uri("http://www.google.com"))
    );

The current way of achieving this is unnecessarily verbose.

How are Lists Handled?

If there is one thing I heartily dislike about Castle, it’s the list handling.  Ironically, in many ways, the list handling is strong: it’s relatively easy to register an array of arrays of strings, for instance.  However, once you leave primitives, it gets more ambitious.  If you create a constructor parameter of IEnumerable<IService>, it will by default pass in a list of all components that are registered with the IService interface.  There are a number of problems with this (see the sketch after this list):

  • The worst is that it gets in the way of the second simplest use case of a list: one where you specified a list of component references yourself.  If you try this, you end up with a type conversion error.
  • It can’t handle super-interfaces; it will only ever do exact matches.
  • You can’t specify that you care about interfaces on the registered implementations.  Thus, requesting IEnumerable<IDisposable> wouldn’t return the “Right Thing” (all registered disposable objects) even if you could specify that you wanted super-interfaces.
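
To illustrate the default behaviour, here’s a minimal sketch with hypothetical types (depending on the version, Windsor may need a collection sub-resolver enabled for this to work at all):

public class ServicePipeline {
    private readonly IEnumerable<IService> services;

    // By default, the container supplies every component registered
    // against IService, not a list of references you chose yourself.
    public ServicePipeline(IEnumerable<IService> services) {
        this.services = services;
    }
}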

I would advise anyone evaluating a container to pay particular attention to how you can specify lists of components, because it comes up a lot in real use cases.

What Issues are there with Open/Closed Generics?

There are always a couple of bugs to do with open and closed generics.  Castle recently fixed another one.  In March of this year, it wasn’t possible to express this concept in StructureMap:

Component.For<IMessageHandler<string>>()
    .ImplementedBy<MessageHandler>();

Indeed, this issue was pretty much why I moved to Castle in the first place.  These days you’ve got to come up with something fairly involved to run into a problem (e.g. an open generic type relying on a closed one).  However, if you’re using one of the many less-popular frameworks, or rolling your own, you need to watch out for this.

How does the Container Deal with Multiple Interfaces?

If you register the same class as the implementation of multiple interfaces, typically you will end up with multiple instances.  It’s possible to mitigate this by using explicit component references, but that’s not a perfect solution.  Sometimes you want a service that exposes different interfaces to different consumers.  Castle Windsor calls this feature “forwarding”.
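
A forwarded registration in Windsor looks roughly like this (the type names are illustrative):

container.Register(
    Component.For<IUserReader, IUserWriter>()   // both interfaces forward
             .ImplementedBy<UserRepository>()); // to one implementation
// Resolving IUserReader or IUserWriter now yields the same component
// (and, with a singleton lifestyle, the same instance).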

How can you Inject your own code?

How good is the container at handling the case where it doesn’t create the object itself?  Can you say something like this?

Component.For<IConnection>()
    .CreatedBy(() => ConnectionFactory.NewConnection());

Windsor’s story here is rather painful, with two facilities defined which use reflection to run.  On the other hand, they support Dynamic Proxy out of the box, so intercepting method calls to the interfaces is pretty simple and powerful.

Can you Create a Derived Container?

I am, frankly, amazed this does not come up more often.  It should be relatively easy to create a container based upon another container, overriding and extending parts of the configuration.  This is actually extremely useful.  Binsor has the Extend keyword (you’ll need to check the unit tests for documentation) which achieves this, but frankly this is too important a feature to be left to the DSL; it should be a fundamental part of the container.  Certainly there’s no easy way to achieve this in Windsor without using Binsor.  I think there will probably be a whole separate post about this.

Automated deployments #1: What’s on your server?

No really, what’s actually on your server?  If your first answer isn’t “erm”, you’re either very good or you don’t understand the question.  Servers are huge; they can store War and Peace on their hard drives without you even noticing.  For that matter, they can store any number of PDFs of books on patterns and practices without any appreciable benefit to civilization, but I don’t think that’s really the fault of the servers.  It’s practically impossible to really know what’s on there.  What’s worse, the way most people do development, they make the job harder for themselves.

I had a meeting with our auditor today and thanked my lucky stars that we had automated deployments.  Automated deployments save an awful lot of effort if they’re done right, and really save your hide when people start poking around your process.  Let’s talk about a really simple question: what’s on your server?

If you tell me it’s version 1.2.6, I’m going to have a few questions.

  • What was in version 1.2.6?  Is there a label in source control?
  • Was every file checked in?
  • What build does that correspond to?
  • How can you check that the build is what got deployed?
  • How about the config, is that in source control?  The actual config that’s on the server right now.
  • How do you know nothing’s changed on the server since then?

Look at Microsoft, or any large company, and they’ve got this sorted out.  It’s internal development teams that tend to have a problem.  When people ask these questions:

  • What’s changed since we last deployed?
  • What could this affect?
  • Can we rollback?

You want to have good answers.  And absolutely fundamental to this is: know what’s on your server.  Exactly.

First, you need to have a build server.  Download and love CruiseControl.NET.  Builds on local machines always turn out to have the wrong version, a reference to something that isn’t in source control, or a dependency that isn’t well worked out.  A real pain for anyone starting with this is that it turns out your solution files aren’t really as flexible as you’d like.  You can get going with MSBuild, but there’s a reason every open source project uses NAnt.  (NAnt is far from perfect, but it’s a heck of a lot easier than MSBuild for anything slightly complex.)

Anyway, here are my answers:

  • Version numbers are build numbers.  “1.2” is just for PR, it’s got nothing to do with versioning.  Call it what you like (you can call it Vista if you must) but the real version number is the build number from the build server.
  • Build Servers will only build files that are checked in.
  • I said that version numbers are build numbers, right?
  • We label every assembly that gets built with the build number (I stick it in the summary in the DLL).  This makes it really easy to just check the version on the server.  Also, we stick the information in an About Box, or a web service call.
  • The actual config on the server isn’t in source control, but the environmental delta is.  The deployment process stamps the config with the information as well.
  • Making sure that nothing’s been changed is harder, because no-one’s written a general tool for doing so, but taking a hash of the directory straight after deployment and checking it each day will catch most of that.  (You can hash files individually for better diagnostics.)  Tracking every setting on the server is probably more trouble than it’s worth, but I do have a tool for downloading IIS Virtual Directory settings to an XML file, because that turned out to be on the right side of the cost/benefit calculation.
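
For what it’s worth, here’s a minimal sketch of the directory-hashing idea; it isn’t the tool I use, just an illustration built on the standard crypto classes:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class DeploymentAudit {
    // Hash every file under the deployment directory; record the result
    // at deploy time and re-run it on a schedule to detect drift.
    public static string HashDeployment(string root) {
        using (var sha = SHA256.Create()) {
            var builder = new StringBuilder();
            var files = Directory.GetFiles(root, "*", SearchOption.AllDirectories)
                                 .OrderBy(file => file);  // stable ordering
            foreach (var file in files) {
                var hash = sha.ComputeHash(File.ReadAllBytes(file));
                // One line per file makes diagnostics easier later.
                builder.AppendLine(file + ": " + BitConverter.ToString(hash));
            }
            return builder.ToString();
        }
    }
}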

Your answers don’t need to be the same, but I guarantee you your life will be easier when you have answers to these questions.  Importantly, the work scales: the more people join your team, the more this stuff matters.  Incidentally, you can do all of this in TFS.  I know, I’ve done it.  And I’ve regretted not using CruiseControl.NET, NUnit, SVN and NAnt every time.  Open source, depressingly, appears to be better documented than the stuff I paid for.

Mono’s still got a long way to go

Reading this didn’t impress me massively, and not only because it’s a reheated blog post.  Don’t get me wrong, the Mono team has done some superb work, but it’s really not ready for primetime.  Miguel does a phenomenal job of cheerleading, but let’s take a look at this particular example.  That’s not an industry standard benchmark they’re running there, it’s some code on some guy’s blog.  It’s quite a nice and interesting blog, but it’s nowhere close to the mainstream.

Sadly, where Mono still falls down is meat-and-potatoes issues.  Look at the number of patches the Ubuntu team need to make to each release before they’re happy packaging it.  Look at the weird behaviour problems that the Castle team discuss on their mailing lists (e.g. why on earth does Mono 1.9 believe it can convert from a string to an interface?  Don’t they have a regression test for that?).  Worst of the lot, however, has to be the garbage collector.

Getting the garbage collector wrong is second only to getting the compiler wrong.  People won’t understand what the problem is, but they’ll suffer when it doesn’t work right.  Mono currently uses the Boehm garbage collector, which is a non-compacting, conservative collector designed for C and C++.  If you use vanilla .NET, you don’t need to know about the garbage collector for the most part (unless you’re doing stupid things with finalizers); however, if you’re running on Mono, the same program that runs fine on .NET can give you nonsense like this:  http://www.mail-archive.com/mono-list@lists.ximian.com/msg22436.html.  (Incidentally, the suggested remedy is a sticking-plaster over a sword wound.)

At the moment, the only real solution to this problem is to use allocation patterns that the Boehm GC likes, which is ridiculous to anyone who has stopped worrying about memory fragmentation for the last five years.  In fairness, the Mono Project is planning to address this at some point towards the end of the year.  Then all I’ll be worried about is their quality problems.

Unforeseen Consequences: My lavatory made a funny noise

Now, everyone in my office has already heard this story, but it deserves a wider audience.

Six months ago, I bought a house.  One of the lavatories made a funny noise when it flushed.  Actually, not so much funny as extremely loud.  This noise would go on for about a minute.  It actually sounded like the house was shaking apart.  I ignored this for months and worked around the problem.  Being a geek, I figured out that running the bath at the same time stopped the noise, so I knew I was dealing with a resonance problem.  I’m not, however, a plumber, and had no idea what was starting the whole thing off.

There are a couple of weird things about the house.  One is that the lower bathroom had a shower head, but no shower rail.  Not particularly wanting to soak my new house every time I used the shower, I used the shower upstairs.  Finally, I (or should I say, my significantly more organised wife) got the plumber ’round.  He installed a shower rail, took one look at the loo and determined it had the wrong washer on it.  Replacing the washer for a fiver, he managed to fix a problem that I’d assumed was going to cost me thousands.

I then went away for a couple of days, and came back to a leak in my kitchen.  Water was seeping through from the ceiling.  I went nuts, thinking the house was about to fall down.  I phoned up the plumber and he agreed to come back on the Sunday morning.  (Our plumber, you will appreciate, is an absolute brick.  Couldn’t praise him more highly.)  In the morning, we started discussing the problem.  Maurice (really) first wanted to check that he hadn’t drilled through a pipe.  He was quite happy to admit that he had done so before, but he doubted this was the problem since we’d have a lot more water leaking.  We then started on a relatively serious discussion of whether it was better to rip up the floorboards or break through the plaster.  Another difficulty was working out from where exactly it was leaking.  Finally, I asked him if it was possible that fixing the washer had affected something else.  Maurice said “No, that can’t happen.  Let me explain why.”  He lifted off the cover of the lavatory tank, stared at it and said “There’s no overflow”.

For those of you who don’t know, the overflow is a pipe out of the back of your loo that goes outside.  In the event of a minor problem, you end up with water being dumped outside your property.  Since the property can handle rain, it’s not an urgent problem and is easily fixed.  What my loo did, was drop the water from the overflow onto the floor, and eventually through the kitchen ceiling.  Basically, the guy who’d installed it in the first place had done a dreadful, incompetent job.  So now I have a bucket where the overflow should be and another date with Maurice where he’s going to install some overflows.

The reason I mention this is, the experience was nigh on identical to conversations I have with my manager about some legacy systems I deal with:

  • Problems sometimes aren’t as serious as their symptoms suggest.
  • Fixing one thing may highlight a problem somewhere else.
  • Always explain to someone else why something can’t possibly happen.
  • An audit won’t find every problem.
  • You’re always going to get these problems when you’re taking over a badly done job.

And sadly, sometimes you won’t get lucky and will have to rip up the floorboards to figure out what’s going wrong.

How to support Default.aspx in ASP.NET MVC

If you’re trying to use MVC incrementally on an existing project, this can be a bit of a pain.  There must be some way of getting the routing logic to redirect itself, but in practice the following does the trick (assuming you’re using the standard routing).

    public class HomeController : Controller {
        public ActionResult Index()
        {
            return new RedirectResult("~/Default.aspx");
        }
    }
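
For reference, this relies on the standard route registration from the default project template, which looks roughly like this:

    routes.MapRoute(
        "Default",                                              // route name
        "{controller}/{action}/{id}",                           // URL pattern
        new { controller = "Home", action = "Index", id = "" }  // defaults
    );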

Understanding MapReduce #1: The Assumptions

I finally had a light-bulb go off in my head about MapReduce.  To be honest, part of the problem with understanding it is that the implementation most of us look at (Hadoop) has a considerable amount of implementation detail visible at all times.  I’m going to try to explain some of the fundamentals behind it in terms of C#, which has two great advantages:

  • I know it
  • LINQ gives us a fairly decent syntax for expressing algorithms declaratively.

Now, as everyone knows, Map and Reduce are Lisp terms for projection and aggregation respectively.  In LINQ, these are called Select (or SelectMany) and Aggregate (or just “apply function”).  MapReduce simply applies one and then the other.

        public static TResult MapReduce1
            <TResult, TMapped, TInput>
            (
            Func<TInput, TMapped> map,
            Func<IEnumerable<TMapped>, TResult> reduce,
            IEnumerable<TInput> inputs) {

            return reduce(
                from input in inputs
                select map(input)
                );
        }
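
To make that concrete, here’s a quick hypothetical usage: a word count across a couple of documents (Sum comes from System.Linq):

        var documents = new[] { "the quick brown fox", "jumped over" };
        int totalWords = MapReduce1<int, int, string>(
            document => document.Split(' ').Length,  // map: document to word count
            counts => counts.Sum(),                  // reduce: add the counts up
            documents);
        // totalWords == 6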

That’s it!  So why is it so clever?  Well, what Google did was to change the assumptions a bit.  The irony is that by adding in more conditions, they actually came up with something more general, not less.  So, let’s take a look at some of those assumptions:

  • The map always returns a list. 
  • The reduce function operates on the same input type as output type.
  • The reduce function is idempotent.  In plain English, if you reduce the output of a reduce, your output will be equal to your input.

The first one’s a gimme.  Returning a list doesn’t make a blind bit of difference.  You could just return one item for every input and you’d be back to the original function.  However, the restriction on the reduce is hugely powerful.  In particular, it allows for the distribution of partial reduces.  I’m not going to show that in code today.

Version 2 of the code looks pretty similar:

        public static IEnumerable<TResult> MapReduce2
            <TResult, TInput>
            (
            Func<TInput, IEnumerable<TResult>> map,
            Func<IEnumerable<TResult>, IEnumerable<TResult>> reduce,
            IEnumerable<TInput> inputs) {

            return reduce(
                from input in inputs
                from mapped in map(input)
                select mapped
                );
        }

We’ve got an extra from clause to deal with, but otherwise this is pretty tame.  Note that we’ve made the reduce return a list as well.  Again, it doesn’t make much of a difference.  We’ll abstract away the concept of applying a map.

        public static IEnumerable<TResult> MapReduce2b
            <TResult, TInput>
            (
            Func<TInput, IEnumerable<TResult>> map,
            Func<IEnumerable<TResult>, IEnumerable<TResult>> reduce,
            IEnumerable<TInput> inputs) {

            Func<IEnumerable<TInput>, IEnumerable<TResult>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return reduce(applyMap(inputs));
        }

Now things get interesting.  MapReduce assumes that you’re using Tuples everywhere.  This is the most important step.  The point is, it groups on the basis of the keys.  We can also use different keys for mapped data and the results of reduces, although the type system restricts how useful that could be.  Version 3 does look somewhat more complex.

        public class Tuple<TKey, TValue> 
        {
            public TKey Key;
            public TValue Value;
        }

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce3
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<TKey, IEnumerable<TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => from result in results
                           group result.Value by result.Key into grouped
                           from reduced in reduce(grouped.Key, grouped)
                           select reduced;
            Func<IEnumerable<TInput>, IEnumerable<Tuple<TKey, TValue>>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return applyReduce(applyMap(inputs));
        }

The important bit is the way we’ve redefined the reduce operation.  Now the reduce operation operates on a list of values for a particular key (it can still return whatever it likes).  The applyReduce function demonstrates how this concept of reduce maps onto the old concept of reduce.
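
To see the shape of this, here’s a hypothetical word count: the map emits a (word, 1) tuple per word, and the reduce sums the counts for each key:

        var results = MapReduce3<string, int, string>(
            document => document.Split(' ')
                .Select(word => new Tuple<string, int> { Key = word, Value = 1 }),
            (word, counts) => new[] {
                new Tuple<string, int> { Key = word, Value = counts.Sum() } },
            new[] { "the quick the" });
        // results: ("the", 2), ("quick", 1)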

The LINQ syntax obscures one thing we’ve overlooked so far: how the grouping actually works.  The Hadoop implementation makes this far from explicit as well.  Hadoop does it by requiring all keys to implement “WritableComparable”.  The direct translation would be to require TKey to implement IComparable.  However, we’ll go with a more .NET like way of doing things using IEqualityComparer<TKey>.  Here’s version 3 with an IEqualityComparer.

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce4
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<TKey, IEnumerable<TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEqualityComparer<TKey> groupRule,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => results
                            .GroupBy(result => result.Key, result => result.Value, groupRule)
                            .SelectMany(grouped => reduce(grouped.Key, grouped));
            Func<IEnumerable<TInput>, IEnumerable<Tuple<TKey, TValue>>> applyMap =
                mapInputs => mapInputs.SelectMany(map);
            return applyReduce(applyMap(inputs));
        }
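
The groupRule is what decides whether two keys are “the same”.  For instance, passing StringComparer.OrdinalIgnoreCase (which implements IEqualityComparer<string>) makes the hypothetical word count case-insensitive:

        var results = MapReduce4<string, int, string>(
            document => document.Split(' ')
                .Select(word => new Tuple<string, int> { Key = word, Value = 1 }),
            (word, counts) => new[] {
                new Tuple<string, int> { Key = word, Value = counts.Sum() } },
            StringComparer.OrdinalIgnoreCase,  // "The" and "the" now group together
            new[] { "The quick the" });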

Now, I’ve tried to avoid talking about distribution concerns in this post, but here we’re forced into it.  The results of maps will potentially be transmitted across the network.  Therefore, it makes sense for the grouping to actually occur during the map.  Again, you might not see this in the Hadoop examples as the grouping is actually performed by the OutputCollector.  While we’re here, we’ll observe that the Hadoop standard of taking two inputs to the reduce function doesn’t make much sense in an environment in which IGrouping is a standard concept.  Thus, we can move the grouping call to the map as follows:

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce5
            <TKey, TValue, TInput>
            (
            Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
            Func<IGrouping<TKey, TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
            IEqualityComparer<TKey> groupRule,
            IEnumerable<TInput> inputs) {
            Func<IEnumerable<IGrouping<TKey, TValue>>, IEnumerable<Tuple<TKey, TValue>>> applyReduce =
                results => results.SelectMany(reduce);
            Func<IEnumerable<TInput>, IEnumerable<IGrouping<TKey, TValue>>> applyMap =
                mapInputs => mapInputs
                    .SelectMany(map)
                    .GroupBy(result => result.Key, result => result.Value, groupRule);
            return applyReduce(applyMap(inputs));
        }

The problem with writing it out like this is that the Func definitions get to be most of the code.  Let’s see it again, simplified:

        public static IEnumerable<Tuple<TKey, TValue>> MapReduce6
            <TKey, TValue, TInput>
            (
                Func<TInput, IEnumerable<Tuple<TKey, TValue>>> map,
                Func<IGrouping<TKey, TValue>, IEnumerable<Tuple<TKey, TValue>>> reduce,
                IEqualityComparer<TKey> groupRule,
                IEnumerable<TInput> inputs) {
            Func<IEnumerable<Tuple<TKey, TValue>>, IEnumerable<IGrouping<TKey, TValue>>> collectOutput =
                mapped => mapped.GroupBy(result => result.Key, result => result.Value, groupRule);
            return collectOutput(inputs.SelectMany(map)).SelectMany(reduce);
        }

Now, Hadoop goes one stage further by insisting that the inputs also be tuples.  It then has a file handling system for generating those tuples from files.  Let us just, for the moment, observe that actually generating the list of inputs may be an expensive operation in itself.  So, we need to be able to deal with batches of inputs.  We’ll leave that problem until next time.

When will I ever learn?

So, I just noticed before deploying the latest version of a system that one of the drop downs wasn’t populating on the front end.  Now, the front end is far from perfect, but this part of the code actually has pretty good test coverage.  So it was a bit puzzling as to why this hadn’t been flagged by the build.  Diving into some controller tests I knocked together about nine months ago, I find the following line.

            IgnoreStaticData(view);

Need I say more?

It seems like I need to keep re-learning the lesson: anything that you’re not testing is wrong.

The LINQ vs Generators Shootout, Round #1

I have to admit, Python is growing on me.  I’m still not entirely convinced of the utility of IronPython, especially given that Boo exists (why don’t more scripting languages allow you to meddle with compilation?).  However, Python as CPython or Jython is actually a rather interesting beast, with some very cool stuff being done on it (I really like the look of Pylons, for instance; I’ll probably write something up on that in the future).

I thought I should probably expand on my remark that LINQ was more powerful than list comprehension.  It was pointed out to me that Python supports both lazy and greedy evaluation (it calls the greedy form list comprehensions and the lazy form generators).  LINQ is purely lazy, although adding “ToList” onto the end will typically deal with that if it is a problem (and it would be if you used it naively).
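
A two-line sketch of the difference in C#:

var numbers = new[] { 1, 2, 3 };
var lazy = numbers.Select(n => n * 2);  // nothing evaluated yet: deferred
var greedy = lazy.ToList();             // evaluated here, exactly once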

So, how is LINQ a better form of list comprehension?  Four reasons:

  • It’s implemented as a data model, allowing stuff such as LINQ to NHibernate to exist.
  • It supports group by
  • It supports order by
  • It supports intermediate assignments through let and into

The first is probably the most technically impressive, but it’s also the most controversial.  It means that LINQ is much more than just a list comprehension system, but no-one’s got enough experience of it yet to know exactly how these features are best used.

The grouping is cool, although I have to admit I’ve rarely needed it.  The ordering, on the other hand, is huge.  Python’s model for sorting is essentially C#’s IComparable model.  If you’ve ever tried to sort by three keys, you’ll know the problems with it.  In contrast, you can just specify the keys and let LINQ sort it out for you.
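
For example, a three-key sort in LINQ needs no comparer class at all (assuming a hypothetical people collection):

var sorted = people.OrderBy(person => person.LastName)
                   .ThenBy(person => person.FirstName)
                   .ThenByDescending(person => person.Age);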

The final one is probably the most useful of the lot, even if it seems minor.  Take a look at the following code:

public static string Deduce7ZipDirectory(IEnumerable<string> output)
{
    var regex = new Regex(@"\s(?<Folder>[^\s\\]+[^:])[\\]");
    var result = (from value in output
                  let match = regex.Match(value)
                  where match != Match.Empty
                  select match.Groups["Folder"].Value)
        .FirstOrDefault();
    return result;
}

I actually wrote this code to parse the output of 7zip’s command line list function and I think it’s pretty elegantly declarative.  I’m not entirely happy with the debugging story, however.  You can put breakpoints within the LINQ statement, but seeing the local variables doesn’t seem to work for me.  Ironically, this is a bigger problem for C# than it is for JavaScript or Python, simply because it’s possible to write rather complex things in these statements.

Personal Note

I made the mistake of posting shortly before I went on holiday for several weeks.  I’d like to thank everyone who commented; I learnt a lot.  Amongst the things I learned was that I really need to get around to writing an “About Me” page, mostly because of my aversion to posting noise rather than signal.  For the record, my name is Julian Birch and I live in London.

Spider Python: Some Notes on IronPython

I’ve converted my Retlang spider example over to IronPython in order to get a feel for the differences.  Here are some notes:

  • Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful, and it supports deferred execution, while list comprehensions are evaluated greedily.  UPDATE: Thanks to Mark for pointing out that generators support deferred execution.  There’s still no syntax for grouping or ordering, but these are relatively rare cases.
  • Every description of the Python syntax I ever see emphasizes the fact you don’t need to put in braces.  Pity they don’t spend more time telling you that you have to put in colons; that would actually be useful knowledge.  This really bit me when I learnt Boo, which is syntactically very similar.
  • IronPython 1.0 targets CPython 2.4.  This sounds fine until you realize that this was released in 2004.  A fair bit has happened since then, not limited to the introduction of an inline if syntax.
  • While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
  • The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.
  • Tabs are the devil as far as indents in Boo and Python are concerned.  I highly recommend using something that shows tabs explicitly, and then eliminating the lot.
  • Instance methods that explicitly take self feel awkward to me.
  • Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward.  UPDATE:  I’ve now learnt about decorators, thanks to Ken.  @staticmethod is relatively readable.
  • The casting to delegate isn’t quite as slick as it is in C#.  Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.
  • The lambda syntax is pretty elegant, but so is C#3’s.  Indeed, C#3 seems to have the edge.
  • Python’s regular expressions are powerful, but not quite as good as .NET’s.  In particular
    • search does what you’d expect match to do
    • findall doesn’t return matches.  Rather, it returns the value of the first group of the matches.  This is often actually what you wanted, but it’s a bit peculiar.  The semantics of “Matches” in .NET are much easier to understand.
  • It did seem rather slow compared to the C# version.  There are, of course, way too many variables here to make a final judgement, but it was disappointing.

So, I’m still waiting for my ESR moment.  Can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

  • You’ll need to explicitly declare where the Retlang DLL is.
  • I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.
  • It handles proxies better than the original version.
  • There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML.  I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

 

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

clr.AddReferenceToFileAndPath("""c:\WhereeverRetlangIs\Retlang.dll""")

from Retlang import *

def Search(baseUrl, spiderThreadsCount):
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()

    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    spiderTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        # Bind the spider as a default argument: a bare closure would
        # capture the loop variable and leave every queue using the last one.
        spiderChannel.Subscribe(queue,
            lambda url, spider=spider: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        searchUrls = lambda pattern: [match for match in re.findall(pattern, content)]

        urls = [self.ToAbsoluteUrl(pageUrl, url)
                for url in searchUrls("href=[']([^'<>]+)[']")
                    + searchUrls('href=["]([^"<>]+)["]')
                    + searchUrls("href=([^'\" <>]+)")
                if url is not None and url.Length > 0
                    and self.IsInternalLink(url)
                    and url[0] != '#'
                    and not url.endswith(".css")
                    and not re.search("css[.]axd", url)]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)

    def IsInternalLink(self, url):
        url = url.lower()
        if url == '"' or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True

    def ToAbsoluteUrl(url, relativeUrl):
        if re.search("//", relativeUrl):
            return relativeUrl
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                raise Exception("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request: " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()

    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            # Path.GetExtension returns the extension with its leading dot.
            if Path.GetExtension(url) != ".css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        if self._urlsInProcess == 0:
            self._waitHandle.Set()


for url in Search("http://www.yourtargeturl.com/", 5):
    print url

 


IronPython: I really hope this isn’t going to happen in the final version.

I mean, seriously, how is someone expected to debug this?

[screenshot: an IronPython stack trace]

The bit I really dislike: try to spot the name of the function that threw the error.  I guess it’ll be slicker when it’s finished…

For that matter, I’d rather not see this appear, either:

[screenshot: another unhelpful IronPython error message]

In fairness, many of Boo’s error messages are pretty useless.
