Using Retlang to implement a simple web spider

The Retlang wiki is a bit short on the sort of messy examples that I find useful when learning a product, so I thought I’d write one of my own. The following is a 200-line web spider. I’ll go through it and explain how it works and why you’d build it like this. I recently used techniques similar to this to get a FIX processor to run 30 times faster. Seriously. Retlang’s that good.

Five minute introduction to Retlang

Here’s how Retlang works:

A Context is a Thread/Queue pair. That is to say, a thread with an associated queue. (In practice, we actually use PoolQueues in the code, but the semantics are the same.)
Messages are sent one-way to Contexts across Channels.
Contexts subscribe to Channels by specifying a function to be called when the message comes off the queue.
Messages are processed in the exact order in which they were transmitted.
Typically, all of a given context’s messages are handled by a single object. This is usually termed the service.

Now, the important thing with Retlang is that it is designed to prevent you from having to put lock statements everywhere. This results in a couple of restrictions:

You shouldn’t use shared state.
Messages must be either immutable or serializable. (Immutable is faster.)

You can actually violate the restrictions if you know what you’re doing. The problem is, once you violate the restrictions, you need to start worrying about thread safety again. You’ll also need to worry about maintainability. Although Retlang doesn’t prevent you from using other techniques and threading models, you lose a lot of the readability when you do so.

There is a third restriction: You shouldn’t wait for another Context to finish doing something. In fact, you can do this, but you should always try to avoid it, since you can quite quickly kill your performance by doing so.

NB: Actually, threads and contexts are slightly different, but if you want to understand the differences, you’re better off reading Mike’s Blog. I’ve just called it a thread for simplicity here.

Shared State

The program works as follows:

The Spider class reads a page and works out what URLs are in the page.
The SpiderTracker class keeps track of what pages have been found.

In the code, there are five spiders. However, there can only be one spider tracker, which co-ordinates all of the spiders. Since I’ve already told you that you can’t have shared state, you might be wondering how this is handled. The answer is that you associate the SpiderTracker itself with a context. All modifications to and results from the Tracker comes through the same Retlang architecture. The Spiders each run on their own Context.

We only ever need to transmit strings, which are immutable. Now, Channels are one way, and are one way by design, so we need to pass the following messages:

Please scan this URL (SpiderTracker to Spider)
I’ve found this URL (Spider to SpiderTracker)
I’ve finished scanning this URL (Spider to SpiderTracker)

Distributing the work load is handled by a QueueChannel, which automatically sends messages to the next Spider waiting for a message. An alternative implementation would be to create separate channels for each Spider.

Halting

The last message is, in some senses, not necessary. Without it, every page would get scanned. However, the program would never finish. One of the trickiest problems with asynchronous communication and processing is actually figuring out what is going on and when you’re finished. With synchronous systems, you can usually determine both just from the call stack; it takes a bit more effort to display that information to the screen, but not a lot.

Therefore, having set up the Retlang contexts, the main thread then needs to wait for the track to indicate that it is finished. The tracker, in turn, counts how many pages are currently being scanned. When that hits zero, we’re finished. Retlang doesn’t provide its own facility for doing this, reasoning that using .Net’s WaitHandles is good enough.

The Code

Okay, you’ve waited long enough:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;
using System.Threading;

using Retlang;

class Program
{
    static void Main(string[] args)
    {
        string baseUrl = "http://www.yourblogname.net/Blog/";
        int spiderThreadsCount = 5;
        foreach (string url in Search(baseUrl, spiderThreadsCount))
        {
            Console.WriteLine(url);
        } 
        Console.ReadLine();
    }

    private static IEnumerable<string> Search(string baseUrl, int spiderThreadsCount)
    {
        // NB Make sure folders end in a slash: the code fails otherwise since it can't distinguish between
        // a folder and a file
        var queues = new List<IProcessQueue>();
        var spiderChannel = new QueueChannel<string>();
        var spiderTrackerChannel = new Channel<string>();
        var finishedTrackerChannel = new Channel<string>();
        
        var waitHandle = new AutoResetEvent(false);
        var spiderTracker = new SpiderTracker(spiderChannel, waitHandle);

        var spiderTrackerQueue = new PoolQueue();
        spiderTrackerQueue.Start();
        spiderTrackerChannel.Subscribe(spiderTrackerQueue, spiderTracker.FoundUrl);
        finishedTrackerChannel.Subscribe(spiderTrackerQueue, spiderTracker.FinishedWithUrl);
        for (int index = 0; index < spiderThreadsCount; index++)
        {
            var queue = new PoolQueue();
            queues.Add(queue);
            queue.Start();
            var spider = new Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl);
            // Strictly speaking, we only need one Spider that listens to multiple threads
            // since it has no internal state.
            // However, since this is an example, we'll avoid playing with fire and do
            // it the sensible way.
            spiderChannel.Subscribe(queue, spider.FindReferencedUrls);
        }
        spiderTrackerChannel.Publish(baseUrl);

        waitHandle.WaitOne();
        return spiderTracker.FoundUrls;
    }

    class Spider
    {
        IChannelPublisher<string> _spiderTracker;
        IChannelPublisher<string> _finishedTracker;
        string _baseUrl;

        public Spider(IChannelPublisher<string> spiderTracker, 
                IChannelPublisher<string> finishedTracker, string baseUrl)
        {
            _spiderTracker = spiderTracker;
            _finishedTracker = finishedTracker;
            _baseUrl = baseUrl.ToLowerInvariant();
        }

        public void FindReferencedUrls(string pageUrl) {
            string content = GetContent(pageUrl);
            var urls = from url in Urls(content, "href='(?<Url>[^'<>]+)'")
                            .Union(Urls(content, "href="(?<Url>[^"<>]+)""))
                            .Union(Urls(content, "href=(?<Url>[^'" <>]+)"))
                       where url != null && url.Length > 0
                            && IsInternalLink(url) && url[0] != '#'
                            && !url.Contains("&lt")
                            && !url.Contains("[")
                            && !url.Contains("\")
                            && !url.EndsWith(".css")
                            && !url.Contains("css.axd")
                       select ToAbsoluteUrl(pageUrl, url);
            foreach (var newUrl in urls) {
                _spiderTracker.Publish(newUrl);
            }
            _finishedTracker.Publish(pageUrl);
        }

        static int BaseUrlIndex(string url)
        {
            // This finds the first / after //
            return url.IndexOf('/', url.IndexOf("//") + 2);
        }

        string ToAbsoluteUrl(string url, string relativeUrl)
        {
            if (relativeUrl.Contains("//"))
            {
                return relativeUrl;
            }
            int hashIndex = relativeUrl.IndexOf('#');
            if (hashIndex >= 0)
            {
                relativeUrl = relativeUrl.Substring(0, hashIndex);
            }
            if (relativeUrl.Length > 0)
            {
                bool isRoot = relativeUrl.StartsWith("/");
                int index = isRoot 
                    ? BaseUrlIndex(url) 
                    : url.LastIndexOf('/') + 1;
                if (index < 0) {
                    throw new ArgumentException(string.Format("The url {0} is not correctly formatted.", url));
                }
                return url.Substring(0, index) + relativeUrl;
            }
            return null;
        }

        bool IsInternalLink(string url)
        {
            url = url.ToLowerInvariant();
            if (url.StartsWith(_baseUrl))
            {
                return true;
            }
            if (url.StartsWith("http") || url.StartsWith("ftp") || url.StartsWith("javascript"))
            {
                return false;
            }
            if (url.Contains("javascript-error"))
            {
                return false;
            }
            return true;
        }

        static IEnumerable<string> Urls(string content, string pattern)
        {
            var regex = new Regex(pattern);
            // Why exactly doesn't MatchCollection implement IEnumerable<Match> ?
            return from match in regex.Matches(content).Cast<Match>()
                   select match.Groups["Url"].Value;
        }

        static string GetContent(string url)
        {
            var request = WebRequest.Create(url);
            request.Proxy = WebRequest.DefaultWebProxy;
            try {
                using (var response = request.GetResponse()) {
                    using (var reader = new StreamReader(response.GetResponseStream())) {
                        return reader.ReadToEnd();
                    }
                }
            } catch (WebException ex) {
                Console.Error.WriteLine("Problem reading url {0}, message {1}.", url, ex.Message);
                return "";
            }
        }
    }

    class SpiderTracker
    {
        // NB We care about case.
        HashSet<string> _knownUrls = new HashSet<string>(StringComparer.InvariantCulture);
        IQueueChannel<string> _spider;
        int _urlsInProcess = 0;
        AutoResetEvent _waitHandle;

        public SpiderTracker(IQueueChannel<string> spider, AutoResetEvent waitHandle)
        {
            _spider = spider;
            _waitHandle = waitHandle;
        }

        public IEnumerable<string> FoundUrls
        {
            get { 
                return from url in _knownUrls 
                         orderby url 
                         select url; 
            }
        }

        public void FoundUrl(string url)
        {
            if (!_knownUrls.Contains(url))
            {
                _knownUrls.Add(url);
                if (Path.GetExtension(url) != "css")
                {
                    _urlsInProcess++;
                    _spider.Publish(url);
                }
            }
        }

        public void FinishedWithUrl(string url)
        {
            _urlsInProcess--;
            Console.WriteLine(_urlsInProcess);
            if (_urlsInProcess == 0)
            {
                _waitHandle.Set();
            }
        }
    }    
}

Caveats

Well, it’s only 200 lines, so it’s hardly going to be feature complete. Here’s some restrictions:

You can’t really run 5 WebRequests simultaneously, so the 5 queues are actually kind of pointless. They do handle 2 threads quite well, though. The code does nothing to fix this. If someone can point me in the right direction, I’ll release an updated version.
There are undoubtedly links that should be ignored that aren’t. Subtext’s EditUris are an example. In general, the HTML parsing is extremely simplistic, but it’s not the point of the exercise.
It doesn’t read robots.txt. Please don’t run this against sites you don’t have permission to spider.
It doesn’t respect nofollow.
It doesn’t clean up its threads after completion.

UPDATE: I’ve tidied up the code slightly. It’s now got a couple more heuristics about dud urls (it turns out that running the code against a blog full of url scanning code is an eye-opener… 😉 ). I’ve also tidied up the proxy handling. The IronPython version is here.

Technorati Tags: Retlang,Concurrency

One thought on “Using Retlang to implement a simple web spider”

Julian says:

November 30, -0001 at 12:00 am

Okay, in answer to a)There are three steps: find a url, search the url and mark it as finished. Find and search are different operations because when we’re spidering we quite often hit the same URL multiple times. Equally, of these operations, the only one that can be done in parallel is the actual search.So, the QueueChannel has “search this url” messages, the tracker channel has “found this url” messages and the finished channel has “I’m done with this url” messages.As for b), it’s important to understand that the only purpose of the “finished” messages is to work out when to terminate the process. In fact, part of the idea behind the design of the code was that services shouldn’t have to know about their interactions: this simplifies testing. So, no, the tracker doesn’t need to do anything with its worker processes: the search function orchestrates everything.In larger code, I follow the same pattern: setup, low knowledge interactions, tear down at the end.Incidentally, it’s worth pointing out that Retlang is pure .NET and doesn’t follow the Erlang design slavishly. There’s no “receive” concept in Retlang: it just calls functions.Anyway, I hope this helps explain what’s going on.

LikeLike

Using Retlang to implement a simple web spider

Five minute introduction to Retlang

Shared State

Halting

The Code

Caveats

Published by

Julian Birch

One thought on “Using Retlang to implement a simple web spider”

Leave a comment Cancel reply

Five minute introduction to Retlang

Shared State

Halting

The Code

Caveats

Share this:

Related

Published by

Julian Birch

One thought on “Using Retlang to implement a simple web spider”

Leave a comment Cancel reply