Spider Python: Some Notes on IronPython

I’ve converted my Retlang spider example over to IronPython in order to get a feel for the differences.  Here are some notes:

  • Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful, and it supports deferred execution, while list comprehensions are evaluated greedily.  UPDATE: Thanks to Mark for pointing out that generators support deferred execution (there’s a short sketch after this list).  There’s still no syntax for grouping or ordering, but these are relatively rare cases.
  • Every description of the Python syntax I ever see emphasizes the fact that you don’t need to put in braces.  Pity they don’t spend more time telling you that you have to put in colons; that would actually be useful knowledge.  This really bit me when I learnt Boo, which is syntactically very similar.
  • IronPython 1.0 targets CPython 2.4.  This sounds fine until you realize that 2.4 was released in 2004.  A fair bit has happened since then, not least the introduction of an inline if syntax.
  • While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
  • The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.
  • Tabs are the devil as far as indents in Boo and Python are concerned.  I highly recommend using something that shows tabs explicitly, and then eliminating the lot.
  • Instance methods that explicitly take self feel awkward to me.
  • Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward.  UPDATE:  I’ve now learnt about decorators, thanks to Ken.  @staticmethod is relatively readable.
  • The conversion to a delegate isn’t quite as slick as it is in C#.  Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.
  • The lambda syntax is pretty elegant, but so is C#3’s.  Indeed, C#3 seems to have the edge.
  • Python’s regular expressions are powerful, but not quite as good as .NET’s.  In particular:
    • search does what you’d expect match to do
    • findall doesn’t return matches.  Rather, when the pattern contains a group, it returns the value of the first group for each match.  This is often actually what you wanted, but it’s a bit peculiar; the semantics of “Matches” in .NET are much easier to understand.  (See the second sketch after this list.)
  • It did seem rather slow compared to the C# version.  There are, of course, way too many variables here to make a final judgement, but it was disappointing.
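
To make a couple of those points concrete, here’s a quick sketch.  It’s a toy snippet rather than part of the spider (the UrlHelper classes are made up), and the conditional expression line needs Python 2.5, which is precisely the 2.4 gripe above:

squares_now = [n * n for n in range(5)]       # list comprehension: evaluated immediately
squares_later = (n * n for n in range(5))     # generator expression: deferred, LINQ-style

value = None
default = 0
result = value if value is not None else default   # the "inline if" (conditional expression), 2.5+

class UrlHelper:                              # toy class, just to show the two static-method styles
    def Normalise(url):
        return url.lower()
    Normalise = staticmethod(Normalise)       # the old incantation...

class UrlHelper2:
    @staticmethod                             # ...and the decorator, which reads rather better
    def Normalise(url):
        return url.lower()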
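
And here’s the regular expression behaviour in miniature.  Again, a toy snippet with made-up HTML; finditer is the nearest thing I know of to .NET’s Matches:

import re

content = "<a href='/one'>1</a> <a href='/two'>2</a>"

print re.match("href", content)       # None: match only succeeds at the start of the string
print re.search("href", content)      # a match object: search scans the whole string

# With a single group in the pattern, findall returns that group's value for each match...
print re.findall("href='([^']+)'", content)             # ['/one', '/two']

# ...whereas finditer hands back real match objects, closest to .NET's Matches.
print [m.group(1) for m in re.finditer("href='([^']+)'", content)]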

So, I’m still waiting for my ESR moment.  Can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

  • You’ll need to explicitly declare where the Retlang DLL is.
  • I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.
  • It handles proxies better than the original version.
  • There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML.  I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

 

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

clr.AddReferenceToFileAndPath("""c:\WhereeverRetlangIs\Retlang.dll""")

from Retlang import *

def Search(baseUrl, spiderThreadsCount):
    # Wire up the channels and queues, then block until the tracker signals completion.
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()

    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    spiderTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        spiderChannel.Subscribe(queue, lambda url: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        searchUrls = lambda pattern: [match for match in re.findall(pattern, content)]

        # Pull out href values in single quotes, double quotes, or unquoted.
        urls = [self.ToAbsoluteUrl(pageUrl, url)
                for url in searchUrls("href=[']([^'<>]+)[']")
                    + searchUrls('href=["]([^"<>]+)["]')
                    + searchUrls("href=([^'\" <>]+)")
                if url is not None and url.Length > 0
                    and self.IsInternalLink(url)
                    and url[0] != '#'
                    and not url.endswith(".css")
                    and not re.search("css[.]axd", url)
                ]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)

    def IsInternalLink(self, url):
        url = url.lower()
        if url == '"' or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True

    def ToAbsoluteUrl(url, relativeUrl):
        if re.search("//", relativeUrl):
            return relativeUrl
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                raise Exception("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request: " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()

    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            if Path.GetExtension(url) != "css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        if self._urlsInProcess == 0:
            self._waitHandle.Set()


for url in Search("http://www.yourtargeturl.com/", 5):
    print url

 


IronPython: I really hope this isn’t going to happen in the final version.

I mean, seriously, how is someone expected to debug this?

[Screenshot: an IronPython exception stack trace]

The bit I really dislike: try to spot the name of the function that threw the error.  I guess it’ll be slicker when it’s finished…

For that matter, I’d rather not see this appear, either:

[Screenshot: another error message]

In fairness, many of Boo’s error messages are pretty useless.


More Great Things about Retlang

I’ve lost count of the number of times I’ve seen a technology that looked great in the sample, but didn’t hold up when I took it for a proper test drive.  Mike Rettig, on the other hand, has really thought through the use cases.  So when you try to take a sample program and hit the real world, you discover he’s been there before you.  Here are some highlights:

  • Everything implements interfaces.  You can mock or stub pretty much everything.  (You might want to create your own abstract factory, though.)
  • Not only does everything have an interface, the interfaces are finely grained, making stubbing even easier.  You don’t need to simulate channel subscription if you only use channel publishing, for instance.
  • The publish/subscribe model is very robust:
    • You can send a single message to multiple listeners
    • You can handle a message synchronously on the sending thread (providing you’re careful about thread-safety)
    • You can batch the processing of messages, allowing you to ignore duplicates.
  • You can set up timed events, one-shot or repeating.  This is pretty vital to allow long-running services to report their state and perform periodic clean-ups.
  • It’s really fast.  If you’ve got a performance problem, I can assure you it’s your problem, not Retlang’s.
  • You can inject your own behaviours nearly everywhere.  For example, you can set up a common error trap and logging using ICommandExecutor.
  • A single queue can hold lots of different messages.  This is vital for complex interactions.

While I’m here, a couple of things I don’t like:

  • It would be nice if there were an interface that incorporated ICommandQueue and ICommandTimer.
  • Equally, I’d rather like IQueueChannel to implement IChannelPublisher.
  • The default command executor kills the thread if an action throws an exception.  This is a pretty aggressive default behaviour.

You’ll gather that these are pretty minor quibbles about an excellent library.


Using Retlang to implement a simple web spider

The Retlang wiki is a bit short on the sort of messy examples that I find useful when learning a product, so I thought I’d write one of my own.  The following is a 200-line web spider.  I’ll go through it and explain how it works and why you’d build it like this.  I recently used techniques similar to this to get a FIX processor to run 30 times faster.  Seriously.  Retlang’s that good.

Five minute introduction to Retlang

Here’s how Retlang works:

  • A Context is a Thread/Queue pair.  That is to say, a thread with an associated queue.  (In practice, we actually use PoolQueues in the code, but the semantics are the same.)
  • Messages are sent one-way to Contexts across Channels.
  • Contexts subscribe to Channels by specifying a function to be called when the message comes off the queue.
  • Messages are processed in the exact order in which they were transmitted.
  • Typically, all of a given context’s messages are handled by a single object.  This is usually termed the service.

Now, the important thing with Retlang is that it is designed to prevent you from having to put lock statements everywhere.  This results in a couple of restrictions:

  • You shouldn’t use shared state.
  • Messages must be either immutable or serializable.  (Immutable is faster.)

You can actually violate the restrictions if you know what you’re doing.  The problem is, once you violate the restrictions, you need to start worrying about thread safety again.  You’ll also need to worry about maintainability.  Although Retlang doesn’t prevent you from using other techniques and threading models, you lose a lot of the readability when you do so.

There is a third restriction:  You shouldn’t wait for another Context to finish doing something.  In fact, you can do this, but you should always try to avoid it, since you can quite quickly kill your performance by doing so.

NB:  Actually, threads and contexts are slightly different, but if you want to understand the differences, you’re better off reading Mike’s Blog.  I’ve just called it a thread for simplicity here.
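
To see the model in action, here’s roughly the smallest Retlang program I can manage.  I’ve sketched it in IronPython to match the script version of the spider elsewhere on this page; the DLL path is a placeholder, the names are made up, and the calls are the same ones the C# code below uses:

import clr
clr.AddReferenceToFileAndPath("""c:\WhereeverRetlangIs\Retlang.dll""")   # placeholder path
from Retlang import *
from System.Threading import AutoResetEvent

finished = AutoResetEvent(False)

def Greet(name):
    print "Hello, " + name
    finished.Set()

greetings = Channel[str]()        # messages travel one-way over a channel
queue = PoolQueue()               # the context the subscriber will run on
queue.Start()
greetings.Subscribe(queue, lambda name: Greet(name))   # wrapped in a lambda so it converts cleanly to a delegate

greetings.Publish("Retlang")      # publish from any thread; the handler runs on the queue's context
finished.WaitOne()                # block until the handler has run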

Shared State

The program works as follows:

  • The Spider class reads a page and works out what URLs are in the page.
  • The SpiderTracker class keeps track of what pages have been found.

In the code, there are five spiders.  However, there can only be one spider tracker, which co-ordinates all of the spiders.  Since I’ve already told you that you can’t have shared state, you might be wondering how this is handled.  The answer is that you associate the SpiderTracker itself with a context.  All modifications to and results from the Tracker come through the same Retlang architecture.  The Spiders each run on their own Context.

We only ever need to transmit strings, which are immutable.  Channels are one-way by design, so we need to pass the following messages:

  • Please scan this URL  (SpiderTracker to Spider)
  • I’ve found this URL (Spider to SpiderTracker)
  • I’ve finished scanning this URL (Spider to SpiderTracker)

Distributing the work load is handled by a QueueChannel, which automatically sends messages to the next Spider waiting for a message.  An alternative implementation would be to create separate channels for each Spider.

Halting

The last message is, in some senses, not necessary.  Without it, every page would get scanned.  However, the program would never finish.  One of the trickiest problems with asynchronous communication and processing is actually figuring out what is going on and when you’re finished.  With synchronous systems, you can usually determine both just from the call stack; it takes a bit more effort to display that information to the screen, but not a lot.

Therefore, having set up the Retlang contexts, the main thread then needs to wait for the tracker to indicate that it is finished.  The tracker, in turn, counts how many pages are currently being scanned.  When that hits zero, we’re finished.  Retlang doesn’t provide its own facility for doing this, reasoning that using .Net’s WaitHandles is good enough.
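
The counting half of this is small enough to show on its own.  This is just a sketch (in Python for brevity, with made-up names); the real SpiderTracker below does exactly this, plus the bookkeeping of which URLs it has already seen:

from System.Threading import AutoResetEvent

class WorkTracker:                       # illustrative name, not the real class
    def __init__(self, waitHandle):
        self._waitHandle = waitHandle
        self._inProcess = 0
    def Started(self, item):             # called when a page is handed to a spider
        self._inProcess += 1
    def Finished(self, item):            # called when a spider reports back
        self._inProcess -= 1
        if self._inProcess == 0:
            self._waitHandle.Set()       # wakes the main thread blocked in WaitOne()

waitHandle = AutoResetEvent(False)
tracker = WorkTracker(waitHandle)
# ... wire Started/Finished into the tracker's context, publish the first URL ...
# waitHandle.WaitOne()                   # main thread blocks here until the count hits zero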

The Code

Okay, you’ve waited long enough:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;
using System.Threading;

using Retlang;

class Program
{
    static void Main(string[] args)
    {
        string baseUrl = "http://www.yourblogname.net/Blog/";
        int spiderThreadsCount = 5;
        foreach (string url in Search(baseUrl, spiderThreadsCount))
        {
            Console.WriteLine(url);
        } 
        Console.ReadLine();
    }

    private static IEnumerable<string> Search(string baseUrl, int spiderThreadsCount)
    {
        // NB Make sure folders end in a slash: the code fails otherwise since it can't distinguish between
        // a folder and a file
        var queues = new List<IProcessQueue>();
        var spiderChannel = new QueueChannel<string>();
        var spiderTrackerChannel = new Channel<string>();
        var finishedTrackerChannel = new Channel<string>();
        
        var waitHandle = new AutoResetEvent(false);
        var spiderTracker = new SpiderTracker(spiderChannel, waitHandle);

        var spiderTrackerQueue = new PoolQueue();
        spiderTrackerQueue.Start();
        spiderTrackerChannel.Subscribe(spiderTrackerQueue, spiderTracker.FoundUrl);
        finishedTrackerChannel.Subscribe(spiderTrackerQueue, spiderTracker.FinishedWithUrl);
        for (int index = 0; index < spiderThreadsCount; index++)
        {
            var queue = new PoolQueue();
            queues.Add(queue);
            queue.Start();
            var spider = new Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl);
            // Strictly speaking, we only need one Spider that listens to multiple threads
            // since it has no internal state.
            // However, since this is an example, we'll avoid playing with fire and do
            // it the sensible way.
            spiderChannel.Subscribe(queue, spider.FindReferencedUrls);
        }
        spiderTrackerChannel.Publish(baseUrl);

        waitHandle.WaitOne();
        return spiderTracker.FoundUrls;
    }

    class Spider
    {
        IChannelPublisher<string> _spiderTracker;
        IChannelPublisher<string> _finishedTracker;
        string _baseUrl;

        public Spider(IChannelPublisher<string> spiderTracker, 
            IChannelPublisher<string> finishedTracker, string baseUrl)
        {
            _spiderTracker = spiderTracker;
            _finishedTracker = finishedTracker;
            _baseUrl = baseUrl.ToLowerInvariant();
        }

        public void FindReferencedUrls(string pageUrl)
        {
            string content = GetContent(pageUrl);
            var urls = from url in Urls(content, "href='(?<Url>[^'<>]+)'")
                           .Union(Urls(content, "href=\"(?<Url>[^\"<>]+)\""))
                           .Union(Urls(content, "href=(?<Url>[^'\" <>]+)"))
                       where url != null && url.Length > 0
                           && IsInternalLink(url)
                           && url[0] != '#'
                           && !url.Contains("&lt")
                           && !url.Contains("[")
                           && !url.Contains("\\")
                           && !url.EndsWith(".css")
                           && !url.Contains("css.axd")
                       select ToAbsoluteUrl(pageUrl, url);
            foreach (var newUrl in urls)
            {
                _spiderTracker.Publish(newUrl);
            }
            _finishedTracker.Publish(pageUrl);
        }

        static int BaseUrlIndex(string url)
        {
            // This finds the first / after //
            return url.IndexOf('/', url.IndexOf("//") + 2);
        }

        string ToAbsoluteUrl(string url, string relativeUrl)
        {
            if (relativeUrl.Contains("//"))
            {
                return relativeUrl;
            }
            int hashIndex = relativeUrl.IndexOf('#');
            if (hashIndex >= 0)
            {
                relativeUrl = relativeUrl.Substring(0, hashIndex);
            }
            if (relativeUrl.Length > 0)
            {
                bool isRoot = relativeUrl.StartsWith("/");
                int index = isRoot ? BaseUrlIndex(url) : url.LastIndexOf('/') + 1;
                if (index < 0)
                {
                    throw new ArgumentException(string.Format("The url {0} is not correctly formatted.", url));
                }
                return url.Substring(0, index) + relativeUrl;
            }
            return null;
        }

        bool IsInternalLink(string url)
        {
            url = url.ToLowerInvariant();
            if (url.StartsWith(_baseUrl))
            {
                return true;
            }
            if (url.StartsWith("http") || url.StartsWith("ftp") || url.StartsWith("javascript"))
            {
                return false;
            }
            if (url.Contains("javascript-error"))
            {
                return false;
            }
            return true;
        }

        static IEnumerable<string> Urls(string content, string pattern)
        {
            var regex = new Regex(pattern);
            // Why exactly doesn't MatchCollection implement IEnumerable<Match>?
            return from match in regex.Matches(content).Cast<Match>()
                   select match.Groups["Url"].Value;
        }

        static string GetContent(string url)
        {
            var request = WebRequest.Create(url);
            request.Proxy = WebRequest.DefaultWebProxy;
            try
            {
                using (var response = request.GetResponse())
                {
                    using (var reader = new StreamReader(response.GetResponseStream()))
                    {
                        return reader.ReadToEnd();
                    }
                }
            }
            catch (WebException ex)
            {
                Console.Error.WriteLine("Problem reading url {0}, message {1}.", url, ex.Message);
                return "";
            }
        }
    }

    class SpiderTracker
    {
        // NB We care about case.
        HashSet<string> _knownUrls = new HashSet<string>(StringComparer.InvariantCulture);
        IQueueChannel<string> _spider;
        int _urlsInProcess = 0;
        AutoResetEvent _waitHandle;

        public SpiderTracker(IQueueChannel<string> spider, AutoResetEvent waitHandle)
        {
            _spider = spider;
            _waitHandle = waitHandle;
        }

        public IEnumerable<string> FoundUrls
        {
            get { return from url in _knownUrls orderby url select url; }
        }

        public void FoundUrl(string url)
        {
            if (!_knownUrls.Contains(url))
            {
                _knownUrls.Add(url);
                if (Path.GetExtension(url) != "css")
                {
                    _urlsInProcess++;
                    _spider.Publish(url);
                }
            }
        }

        public void FinishedWithUrl(string url)
        {
            _urlsInProcess--;
            Console.WriteLine(_urlsInProcess);
            if (_urlsInProcess == 0)
            {
                _waitHandle.Set();
            }
        }
    }
}

Caveats

Well, it’s only 200 lines, so it’s hardly going to be feature complete.  Here’s some restrictions:

  • You can’t really run 5 WebRequests simultaneously (.NET’s default connection limit is two per host), so the 5 queues are actually kind of pointless.  They do handle 2 threads quite well, though.  The code does nothing to fix this.  If someone can point me in the right direction, I’ll release an updated version.
  • There are undoubtedly links that should be ignored that aren’t.  Subtext’s EditUris are an example.  In general, the HTML parsing is extremely simplistic, but it’s not the point of the exercise.
  • It doesn’t read robots.txt.  Please don’t run this against sites you don’t have permission to spider.
  • It doesn’t respect nofollow.
  • It doesn’t clean up its threads after completion.

UPDATE: I’ve tidied up the code slightly.  It’s now got a couple more heuristics about dud urls (it turns out that running the code against a blog full of url scanning code is an eye-opener… 😉 ).  I’ve also tidied up the proxy handling.  The IronPython version is here.


Upgrading Skins to Subtext 2.0

The good news is, skins have hardly changed in Subtext 2.0.  The skins were very powerful in 1.9, so there wasn’t a big reason to change them.  The bad news is, they’re not quite the same.  If you’ve got your own skin, here’s all you need to do:

  • Copy across any non-standard skin you were using from your version 1 folder.
  • Copy across the skin settings from the old Admin/Skin.config to the new one.

If the skin is locked, put a space in the web.config first, which will unlock the skin.  Note that if you’re using a modified factory skin, you’ll still need to copy across the Skin.config entries, because SubText 2 has modified the location of the style-sheets on the standard skins.

Now, if you then view your page and all of your styling has disappeared, that’s probably because you haven’t updated the Skin.config correctly.  Once you’ve done that, you might discover that images have disappeared.  This is because of a change in the way Subtext includes stylesheets for skins.  (The new way is better, it’s just a breaking change.)  Your stylesheets are now served by a virtual file called css.axd.  Do a view source to see the url, then view the url in the browser to see the output.  This puts all of your stylesheets in one place, which is the root of the skins folder.  Since your old skin may have kept the stylesheets in another folder, this breaks the stylesheet.  In my case, a quick search and replace of “../images” to “images” fixed the problem, but it depends on the architecture of your skin.

The only other gotcha I’ve noticed is that Subtext now uses XHTML transitional by default.  Again, this is actually a good thing, but it can affect rendering.  It certainly changes the behaviour of IE’s positioning.  This is usually subtle, but it is the reason for the image problem I described previously.


Upgrading to Subtext 2.0

It’s a source of embarrassment that most of the traffic for my web site is uploads of blogging software, but there’s nothing better for embarrassment than publicizing it, or so I was told.  So, if you haven’t done the upgrade yet, here’s my routine for doing it:

  • Export your existing blog to BlogML, back up the database, do whatever you can to avoid borking your site because of the upgrade.
  • Make a copy of your current web.config.  The configs aren’t compatible, but you’ll want to refer to it.
  • I uploaded the new software to a new folder (Blog2), rather than overwriting the old one.
  • Since you’re upgrading, delete the SQL dump files (they’re in App_Data, Subtext2.0.mdf and Subtext2.0_log.ldf).  They’re huge and they’ll slow your upload up.
  • If you’re feeling confident, you can avoid uploading some skins as well.  I’d hang onto the default skin, though.
  • Now swap the new folder in.  For me, this involved renaming Blog to Blog1, creating a new Blog folder and moving everything from Blog2 into Blog.  If you were more worried about downtime, you’d set up Blog2 as a virtual directory and slap an App_Offline.htm into the Blog folder at this point.
  • Edit the web.config and switch customErrors to “Off” (remember the capital O, I’ve lost hours of life over that…)
  • Fix the HostEmailAddress.  This is the “forgotten admin password” email address.
  • Put the connection string in from the old config file.  (I’ve noticed that they’ve put a clear right at the start of the connectionStrings section.  This prevents a lot of stupid configuration problems.)
  • Now you can go to the blog.  You’ll get the “we’re being upgraded” message; click through and log on.
  • If you’ve forgotten your password, the password reset won’t work and you need to run the query at the bottom of this page to sort things out.  If you’ve only forgotten your admin user name, you can find it out just by selecting from the subtext_hosts table.  Either, obviously, is painful in a hosted environment.
  • Hit the button and hopefully you’ll be upgraded!

Next, time to upgrade your skin; it’s a pain but not hard.  I’ve put that into a separate post.

Finally, switch customErrors back to RemoteOnly.
