Spider Python: Some Notes on IronPython

I’ve converted my Retlang spider example to IronPython in order to get a feel for the differences.  Here are some notes:

  • Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful, and it supports deferred execution, while list comprehensions are evaluated greedily.  UPDATE: Thanks to Mark for pointing out that generators support deferred execution.  There’s still no syntax for grouping or ordering, but these are relatively rare cases.
  • Every description of Python syntax I’ve ever seen emphasizes the fact that you don’t need to put in braces.  It’s a pity they don’t spend more time telling you that you do have to put in colons; that would actually be useful knowledge.  This really bit me when I learnt Boo, which is syntactically very similar.
  • IronPython 1.0 targets CPython 2.4.  This sounds fine until you realize that this was released in 2004.  A fair bit has happened since then, not limited to the introduction of an inline if syntax.
  • While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
  • The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.
  • Tabs are the devil as far as indents in Boo and Python are concerned.  I highly recommend using something that shows tabs explicitly, and then eliminating the lot.
  • Instance methods that explicitly take self feel awkward to me.
  • Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward.  UPDATE:  I’ve now learnt about decorators, thanks to Ken.  @staticmethod is relatively readable.
  • The casting to delegate isn’t quite as slick as it is in C#.  Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.
  • The lambda syntax is pretty elegant, but so is C#3’s.  Indeed, C#3 seems to have the edge.
  • Python’s regular expressions are powerful, but not quite as good as .NET’s.  In particular:
    • search does what you’d expect match to do (match only succeeds at the start of the string)
    • findall doesn’t return matches.  Rather, when the pattern contains a single group, it returns the text of that group for each match (with no groups you get the whole matched text, and with several groups you get tuples).  This is often actually what you wanted, but it’s a bit peculiar.  The semantics of “Matches” in .NET are much easier to understand.
  • It did seem rather slow compared to the C# version.  There are, of course, way too many variables here to make a final judgement, but it was disappointing.
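
A few of the language points above can be tried out in plain Python.  This is only a minimal sketch, not part of the spider: the sample URLs and the names shout, section and UrlTools are made up for illustration, and the conditional expression needs Python 2.5 or later, so that line won’t run on IronPython 1.0.

```python
from itertools import groupby

# Hypothetical sample data, used only for illustration.
urls = ["/b/page2", "/a/page1", "/b/page1", "/a/page2"]

# A list comprehension does all of its work immediately.
eager = [u.upper() for u in urls]

# A generator expression defers the work until something iterates over it.
calls = []

def shout(u):
    calls.append(u)  # record each call so the laziness is visible
    return u.upper()

lazy = (shout(u) for u in urls)
assert calls == []            # nothing has been evaluated yet
first = next(lazy)
assert calls == ["/b/page2"]  # only the first item has been processed

# There is no ordering or grouping syntax, but sorted() and
# itertools.groupby cover the same ground.  groupby only groups
# adjacent items, hence the sort beforehand.
section = lambda u: u.split("/")[1]
grouped = dict((key, list(items))
               for key, items in groupby(sorted(urls), key=section))

# Conditional expression: value_if_true if condition else value_if_false
label = "internal" if urls[0].startswith("/") else "external"


class UrlTools(object):
    # The decorator replaces "strip_fragment = staticmethod(strip_fragment)".
    @staticmethod
    def strip_fragment(url):
        return url.split("#", 1)[0]
```

Here grouped ends up as a dictionary keyed by the first path segment, which is roughly what a LINQ group-by followed by ToDictionary would give you.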

So, I’m still waiting for my ESR moment.  Can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

  • You’ll need to explicitly declare where the Retlang DLL is.
  • I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.
  • It handles proxies better than the original version.
  • There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML.  I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

 

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

# Placeholder path: point this at wherever Retlang.dll lives on your machine.
clr.AddReferenceToFileAndPath(r"c:\WhereeverRetlangIs\Retlang.dll")

from Retlang import *

def Search(baseUrl, spiderThreadsCount):
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()

    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    # The explicit lambdas matter: passing the bound method directly
    # fails at runtime (see the notes above).
    spiderTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        spiderChannel.Subscribe(queue, lambda url: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        searchUrls = lambda pattern: [match for match in re.findall(pattern, content)]

        # Single-quoted, double-quoted and unquoted href values, in that order.
        urls = [self.ToAbsoluteUrl(pageUrl, url)
                for url in searchUrls("href=[']([^'<>]+)[']")
                    + searchUrls("href=[\"]([^\"<>]+)[\"]")
                    + searchUrls("href=([^'\" <>]+)")
                if url is not None and url.Length > 0
                    and self.IsInternalLink(url)
                    and url[0] != '#'
                    and not url.endswith(".css")
                    and not re.search("css[.]axd", url)
                ]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)

    def IsInternalLink(self, url):
        url = url.lower()
        # Guard against a bare quote character being captured as a url.
        if url == "\"" or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True

    def ToAbsoluteUrl(url, relativeUrl):
        # Anything containing "//" is treated as already absolute.
        if re.search("//", relativeUrl):
            return relativeUrl
        # Index of the first '/' after the scheme, i.e. the end of the host part.
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                # ValueError rather than Exception, which "from System import *" shadows.
                raise ValueError("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request: " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()

    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            # Path.GetExtension includes the leading dot.
            if Path.GetExtension(url) != ".css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        # Once nothing is in flight, release the thread waiting in Search.
        if self._urlsInProcess == 0:
            self._waitHandle.Set()


for url in Search("http://www.yourtargeturl.com/", 5):
    print url
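
The regular expression notes above are easier to see on a small input.  This is a sketch in plain Python rather than part of the spider: the html snippet is invented, and the patterns are simplified relatives of the ones in FindReferencedUrls.

```python
import re

# Hypothetical input, for illustration only.
html = '<a href="/about">About</a> <a href=\'/contact\'>Contact</a>'

# re.match only succeeds at the very start of the string...
at_start = re.match("href", html)
# ...whereas re.search scans the whole string, like Regex.Match in .NET.
anywhere = re.search("href", html)

# One group: findall returns the text of that group for each match.
one_group = re.findall('href="([^"<>]+)"', html)

# No groups: findall returns the whole matched text.
no_group = re.findall('href="[^"<>]+"', html)

# Several groups: findall returns one tuple per match.
two_groups = re.findall(r'''href=(["'])([^"'<>]+)\1''', html)

# finditer yields match objects, which is the closest analogue
# of Regex.Matches in .NET.
via_matches = [m.group(1) for m in re.finditer('href="([^"<>]+)"', html)]
```

If you want match objects rather than strings, finditer is probably the one to reach for; findall is the shortcut for the common case of pulling out a single group.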

 


Published by

Julian Birch

Full time dad, does a bit of coding on the side.

One thought on “Spider Python: Some Notes on IronPython”

  1. Mark, Ken: Thanks for your corrections. I’ve updated the main article (but not modified the code).

     Audax: The problem comes when you’re using the syntax I describe above, where you’re trying to pass in an instance method. I don’t know if this is a result of Python’s semantics, or just IronPython’s integration with delegates. However, the effect is that SomeFunction(spiderTracker.FoundUrl) (where SomeFunction is a .NET function that takes a delegate) actually sends a null into SomeFunction. Hardly desirable behaviour.
