Spider Python: Some Notes on IronPython

I’ve converted my Retlang spider example over to IronPython in order to get a feel for the differences. Here are some notes:

Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful: it supports deferred execution, while list comprehensions are evaluated eagerly. UPDATE: Thanks to Mark for pointing out that generators support deferred execution. There’s still no dedicated syntax for grouping or ordering, but these are relatively rare cases.
Every description of the Python syntax I ever see emphasizes the fact you don’t need to put in braces. Pity they don’t spend more time telling you that you have to put in colons; that would actually be useful knowledge. This really bit me when I learnt Boo, which is syntactically very similar.
IronPython 1.0 targets CPython 2.4. This sounds fine until you realize that CPython 2.4 was released in 2004. A fair bit has happened since then, not least the introduction of an inline if syntax in CPython 2.5.
While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.
Tabs are the devil as far as indents in Boo and Python are concerned. I highly recommend using something that shows tabs explicitly, and then eliminating the lot.
Instance methods that explicitly take self feel awkward to me.
Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward. UPDATE: I’ve now learnt about decorators, thanks to Ken. @staticmethod is relatively readable.
Casting to a delegate isn’t quite as slick as it is in C#. Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.
The lambda syntax is pretty elegant, but so is C#3’s. Indeed, C#3 seems to have the edge.
Python’s regular expressions are powerful, but not quite as good as .NET’s. In particular:
- search does what you’d expect match to do: it scans the whole string, while match only tries the start of the string
- findall doesn’t return matches. Rather, when the pattern contains a group, it returns the value of that group for each match (or a tuple of values if there are several groups). This is often actually what you wanted, but it’s a bit peculiar. The semantics of “Matches” in .NET are much easier to understand.
The IronPython version did seem rather slow compared to the C# version. There are, of course, way too many variables here to make a final judgement, but it was disappointing.
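
Two of the notes above (deferred execution with generators, and the search/match/findall semantics) are easy to check for yourself. The following is only a sketch in plain CPython (2.6 or later, or 3.x) rather than IronPython 1.0; the record helper and the sample HTML string are made up for illustration.

```python
import re

# --- Deferred execution: a generator expression does no work until iterated ---
calls = []

def record(n):
    """Hypothetical helper: logs each call so the evaluation order is visible."""
    calls.append(n)
    return n * 2

eager = [record(n) for n in range(3)]      # list comprehension: runs record() at once
calls_after_list = list(calls)

lazy = (record(n) for n in range(3, 6))    # generator expression: record() not called yet
calls_after_genexp = list(calls)

first = next(lazy)                         # pulls exactly one item through record()
calls_after_next = list(calls)

# --- Regular expressions: match vs search vs findall ---
html = 'see <a href="a.html">A</a> and <a href="b.html">B</a>'  # made-up sample

anchored = re.match(r'href="([^"]+)"', html)    # only tries the start of the string
scanned = re.search(r'href="([^"]+)"', html)    # scans for the first occurrence

no_group = re.findall(r'href="[^"]+"', html)               # whole matches
one_group = re.findall(r'href="([^"]+)"', html)            # just the group's text
two_groups = re.findall(r'href="([^".]+)\.([^"]+)"', html)  # one tuple per match

# finditer yields match objects, which is closer to .NET's Matches
whole_matches = [m.group(0) for m in re.finditer(r'href="([^"]+)"', html)]
```

If you want match objects rather than group values, finditer is probably the nearer equivalent to what the C# version did with Matches.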

So, I’m still waiting for my ESR moment. Can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

You’ll need to explicitly declare where the Retlang DLL is.
I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.
It handles proxies better than the original version.
There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML. I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

clr.AddReferenceToFileAndPath(r"c:\WhereeverRetlangIs\Retlang.dll")

from Retlang import *
def Search(baseUrl, spiderThreadsCount):
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()
    
    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    spiderTrackerChannel.Subscribe(spiderTrackerQueue, 
                lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue, 
                lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        spiderChannel.Subscribe(queue, lambda url: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()
        

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        # re.findall already returns a list of the captured group values
        searchUrls = lambda pattern: re.findall(pattern, content)

        urls = [self.ToAbsoluteUrl(pageUrl, url)
                   for url in searchUrls("href=[']([^'<>]+)[']")
                         + searchUrls("href=[\"]([^\"<>]+)[\"]")
                         # unquoted href values: stop at a quote, space or angle bracket
                         + searchUrls("href=([^'\" <>]+)")
                   if url is not None and len(url) > 0
                        and self.IsInternalLink(url) 
                        and url[0] != '#'
                        and not url.endswith(".css")
                        and not re.search("css[.]axd", url)
               ]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)
    
    def IsInternalLink(self, url):
        url = url.lower()
        if url == '"' or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True
    
    def ToAbsoluteUrl(url, relativeUrl):
        if re.search("//", relativeUrl):
            return relativeUrl
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                raise ValueError("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request:  " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()
            
    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            # Path.GetExtension includes the leading dot
            if Path.GetExtension(url) != ".css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        if self._urlsInProcess == 0:
            self._waitHandle.Set()

    
for url in Search("http://www.yourtargeturl.com/", 5):
    print url
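
The listing above still uses the “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” form, since the code wasn’t updated after the decorator tip. As a rough sketch of what the decorator version might look like, here is the same URL logic pulled out into a standalone class that runs in plain CPython. The class name UrlTools, the snake_case names and the example.com URLs are made up for illustration and are not part of the original spider.

```python
class UrlTools(object):
    """Hypothetical stand-in for the Spider class, holding only the URL helper."""

    @staticmethod
    def to_absolute_url(url, relative_url):
        # Anything containing "//" is treated as already absolute.
        if "//" in relative_url:
            return relative_url
        # Drop a trailing fragment such as "#top".
        hash_index = relative_url.find('#')
        if hash_index >= 0:
            relative_url = relative_url[:hash_index]
        if not relative_url:
            return None
        if relative_url.startswith("/"):
            # Root-relative: keep only the scheme and host of the page URL.
            index = url.find('/', url.find("//") + 2)
        else:
            # Page-relative: keep everything up to and including the last slash.
            index = url.rfind('/') + 1
        if index < 0:
            raise ValueError("The url %s is not correctly formatted." % url)
        return url[:index] + relative_url


# No instance is needed to call a static method.
page = "http://example.com/a/b.html"
sibling = UrlTools.to_absolute_url(page, "c.html")
rooted = UrlTools.to_absolute_url(page, "/x/y.html")
```

The decorator keeps the static-ness next to the def line, which is easier to spot than an assignment at the bottom of the class.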

Technorati Tags: Retlang,IronPython

One thought on “Spider Python: Some Notes on IronPython”

Julian says:

November 30, -0001 at 12:00 am

Mark, Ken: Thanks for your corrections. I’ve updated the main article (but not modified the code). Audax: The problem comes when you’re using the syntax I describe above, where you’re trying to pass in an instance method. I don’t know if this is a result of Python’s semantics, or just IronPython’s integration with delegates. However, the effect is that SomeFunction(spiderTracker.FoundUrl) (where SomeFunction is a .NET function that takes a delegate) actually sends a null into SomeFunction. Hardly desirable behaviour.


Published by

Julian Birch
