Retlang 0.4

Retlang 0.4 has been out for quite a while, but I’ve never written about it.  Worse, my example code doesn’t work in it, which has garnered complaints (in my defence, the Retlang wiki is out of date as well).  The version is again a restructuring that you shouldn’t apply to your own code without a bit of thought, but it’s added a killer feature: the ability to interact with WinForms.  Or WPF if you’re that way inclined.

Anyway, I’ve drawn up a new version of the spider, but it differs quite a bit from the previous versions:

  • It now has something resembling an architecture.
  • Keeping track of progress is now handled separately from keeping track of which URLs have been encountered.
  • I’ve used the ability to throw a command directly onto a fiber.  (Note for old users:  where you used to say “Context” or “Queue”, you now say “Fiber”)
  • I’ve abstracted out the basic graph-walking functionality from the URL reading functionality.  It’s not clear to me that it’s more readable this way, although the advantages of abstraction tend to grow with scale.
  • If you pass an argument of “interactive” in, it will show a progress bar.

I still haven’t gone with Mike Rettig’s suggestion to group channels together into service classes, which are then passed into the constructors of their consumers.  Instead, I’ve mostly gone for a single top-level routine that sets up the whole routing.  However, this ran into trouble when dealing with the new feature, the ability to switch on a progress bar.

Basically, UI fibers look the same in Retlang as threads, but ultimately they’re not.  So the main set up routine needed access to the particular fiber.  I achieved that by actually making the fiber a property of the object.  At one point in the design (yes, there is some design…) this produced a chicken-and-egg problem. 

  • I wanted the form to have a fiber property
  • I wanted to pass that in in the constructor to the form
  • But the fiber itself wanted the form in its constructor

This looked like a serious problem until I realized that there was a better design, which separated out the job of tracking progress from reporting progress.  This is reflective of a more general fact about retlang: it really punishes you sometimes if your design isn’t SOLID.  This isn’t a criticism, in fact it’s one of the best features of Retlang: it pushes you towards better designs.

Anyway, enough rambling and on with the code.  Feedback on whether or not this is better at illustrating principles than the old version will be appreciated.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;
using System.Threading;
using System.Windows.Forms;
using Retlang.Channels;
using Retlang.Core;
using Retlang.Fibers;

delegate IEnumerable<TNode> NodeWalker<TNode>(TNode root);

interface IProgressCounter<TNode> {
    void NodeQueued(TNode node);
    void NodeCompleted(TNode node);

    IFiber Fiber { get; }

    void WaitUntilComplete();
}

interface IGraphListener<TNode>
{
    bool HasEncounteredNode(TNode node);

    void ProcessNode(TNode node);
}

class GeneralGraphListener<TNode> : IGraphListener<TNode>
{
    private readonly HashSet<TNode> alreadyProcessed;

    public GeneralGraphListener(IEqualityComparer<TNode> comparer)
    {
        alreadyProcessed = new HashSet<TNode>(comparer);
    }

    public virtual bool HasEncounteredNode(TNode node)
    {
        return alreadyProcessed.Contains(node);
    }

    public virtual void ProcessNode(TNode node)
    {
        alreadyProcessed.Add(node);
    }

    public IEnumerable<TNode> NodesEncountered
    {
        get { return alreadyProcessed; }
    }
}

static class RetlangMacros
{
    internal static Action<TValue> Distribute<TValue>(Func<Action<TValue>> processorFactory, int numberOfProcessors)
    {
        var distributeChannel = new QueueChannel<TValue>();
        for (int index = 0; index < numberOfProcessors; index++) {
            var processor = processorFactory();
            var queue = new ThreadFiber();
            queue.Start();
            distributeChannel.Subscribe(queue, processor);
        }
        return distributeChannel.Publish;
    }

    internal static IPublisher<TNode> CreateGraphWalker<TNode>(
        Func<NodeWalker<TNode>> nodeWalkerFactory, 
        int numberOfProcessors, 
        IGraphListener<TNode> listener,
        IProgressCounter<TNode> progressCounter)
    {
        var foundNodeChannel = new Channel<TNode>();
        var enqueuedNodeChannel = new Channel<TNode>();
        var finishedNodeChannel = new Channel<TNode>();

        Func<Action<TNode>> nodeProcessorFactory = () => {
            var walker = nodeWalkerFactory();
            return node => {
                foreach (TNode child in walker(node)) {
                    foundNodeChannel.Publish(child);
                }
                finishedNodeChannel.Publish(node);
            };
        };
        var walkChildren = Distribute(nodeProcessorFactory, numberOfProcessors);
        var trackerQueue = new PoolFiber();
        trackerQueue.Start();
        enqueuedNodeChannel.Subscribe(progressCounter.Fiber, progressCounter.NodeQueued);
        finishedNodeChannel.Subscribe(progressCounter.Fiber, progressCounter.NodeCompleted);
        foundNodeChannel.Subscribe(trackerQueue, node =>
        {
            if (!listener.HasEncounteredNode(node)) {
                listener.ProcessNode(node);
                enqueuedNodeChannel.Publish(node);
                walkChildren(node);
            }
        });

        return foundNodeChannel;
    }
}


class ProgressCounter<TNode> : IProgressCounter<TNode>
{
    private readonly IFiber fiber;
    private readonly IProgressReporter reporter;
    private readonly AutoResetEvent waitHandle;
    int nodesEncountered;
    int nodesProcessed;
    

    internal ProgressCounter(AutoResetEvent waitHandle, IFiber fiber, IProgressReporter reporter)
    {
        this.waitHandle = waitHandle;
        this.fiber = fiber;
        this.reporter = reporter;
    }

    public IFiber Fiber
    {
        get { return fiber; }
    }

    public virtual void NodeQueued(TNode node)
    {
        nodesEncountered++;
    }

    public virtual void NodeCompleted(TNode node) {
        nodesProcessed++;
        reporter.Report(nodesProcessed, nodesEncountered);
        if (nodesProcessed == nodesEncountered) {
            waitHandle.Set();
        }
    }

    ///NOTE that this routine is called on a separate fiber from the other functions in
    ///this class.  All other classes stick to their fibers.
    public virtual void WaitUntilComplete() {
        waitHandle.WaitOne();
    }
}

interface IProgressReporter : IDisposable
{
    void Report(int nodesProcessed, int nodesEncountered);
}

class ConsoleProgressReporter : IProgressReporter
{
    public void Report(int nodesProcessed, int nodesEncountered)
    {
        Console.WriteLine(string.Format("{0}/{1}", nodesProcessed, nodesEncountered));
    }

    public void Dispose()
    {
        // Not needed
    }
}

class FormProgressReporter : Form, IProgressReporter {
    private readonly ProgressBar progressBar;

    public FormProgressReporter()
    {
        progressBar = new ProgressBar();
        SuspendLayout();
        progressBar.Dock = DockStyle.Fill;
        progressBar.Location = new System.Drawing.Point(0, 0);
        progressBar.Name = "progressBar";
        progressBar.Size = new System.Drawing.Size(292, 266);
        progressBar.TabIndex = 0;
        Controls.Add(progressBar);
        Height = 75;
        Width = 600;
        ResumeLayout();
    }

    public void Report(int nodesProcessed, int nodesEncountered)
    {
        Text = string.Format("{0}/{1}", nodesProcessed, nodesEncountered);
        SuspendLayout();
        progressBar.Maximum = nodesEncountered;
        progressBar.Value = nodesProcessed;
        ResumeLayout();
    }
}

class Program
{
    static void Main(string[] args)
    {
        string baseUrl = "http://www.yourwebsite.xyzzy/";
        int spiderThreadsCount = 15;
        bool hasUI = args.Length > 0;
        if (hasUI)
        {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
        }
        
        var waitHandle = new AutoResetEvent(false);
        var reporter = hasUI
            ? (IProgressReporter)new FormProgressReporter()
            : new ConsoleProgressReporter();
        var form = reporter as FormProgressReporter;
        if (form != null) {
            new Thread(() => Application.Run(form)).Start();
        }
        using (var fiber = hasUI 
            ? new FormFiber(form, new BatchAndSingleExecutor())
            : (IFiber) new PoolFiber())
        {
            fiber.Add(reporter);
            var progressCounter = new ProgressCounter<string>(
                waitHandle, fiber, reporter);
            var listener = new GeneralGraphListener<string>(
                StringComparer.InvariantCultureIgnoreCase);
            progressCounter.Fiber.Start();
            var walker = RetlangMacros.CreateGraphWalker(
                () => new Spider(baseUrl).FindReferencedUrls,
                spiderThreadsCount,
                listener,
                progressCounter
            );
            walker.Publish(baseUrl);
            progressCounter.WaitUntilComplete();
            foreach (string url in listener.NodesEncountered.OrderBy(url => url)) {
                Console.WriteLine(url);
            }
            Console.ReadLine();
            fiber.Enqueue(reporter.Dispose);
        }
    }

    class Spider
    {
        string _baseUrl;

        public Spider(string baseUrl)
        {
            _baseUrl = baseUrl.ToLowerInvariant();
        }

        public IEnumerable<string> FindReferencedUrls(string pageUrl) {
            if (Path.GetExtension(pageUrl) == "css")
            {
                return new string[0];
            }
            string content = GetContent(pageUrl);
            return from url in Urls(content, "href='(?<Url>[^'<>]+)'")
                            .Union(Urls(content, "href="(?<Url>[^"<>]+)""))
                            .Union(Urls(content, "href=(?<Url>[^'" <>]+)"))
                       where url != null && url.Length > 0
                            && IsInternalLink(url) && url[0] != '#'
                            && !url.Contains("&lt")
                            && !url.Contains("[")
                            && !url.Contains("\")
                            && !url.EndsWith(".css")
                            && !url.Contains("css.axd")
                       select ToAbsoluteUrl(pageUrl, url);
        }

        static int BaseUrlIndex(string url)
        {
            // This finds the first / after //
            return url.IndexOf('/', url.IndexOf("//") + 2);
        }

        string ToAbsoluteUrl(string url, string relativeUrl)
        {
            int hashIndex = relativeUrl.IndexOf('#');
            if (hashIndex >= 0) {
                relativeUrl = relativeUrl.Substring(0, hashIndex);
            }
            if (relativeUrl.Contains("//"))
            {
                return relativeUrl;
            }
            if (relativeUrl.Length > 0)
            {
                bool isRoot = relativeUrl.StartsWith("/");
                int index = isRoot 
                    ? BaseUrlIndex(url) 
                    : url.LastIndexOf('/') + 1;
                if (index < 0) {
                    throw new ArgumentException(string.Format("The url {0} is not correctly formatted.", url));
                }
                var result = url.Substring(0, index) + relativeUrl;
                return result;
            }
            return null;
        }

        bool IsInternalLink(string url)
        {
            url = url.ToLowerInvariant();
            if (url.StartsWith(_baseUrl))
            {
                return true;
            }
            if (url.StartsWith("http") || url.StartsWith("ftp") || url.StartsWith("javascript"))
            {
                return false;
            }
            if (url.Contains("javascript-error"))
            {
                return false;
            }
            return true;
        }

        static IEnumerable<string> Urls(string content, string pattern)
        {
            var regex = new Regex(pattern);
            // Why exactly doesn't MatchCollection implement IEnumerable<Match> ?
            return from match in regex.Matches(content).Cast<Match>()
                   select match.Groups["Url"].Value;
        }

        static string GetContent(string url)
        {
            var request = WebRequest.Create(url);
            request.Proxy = WebRequest.DefaultWebProxy;
            try {
                using (var response = request.GetResponse()) {
                    using (var reader = new StreamReader(response.GetResponseStream())) {
                        return reader.ReadToEnd();
                    }
                }
            } catch (WebException ex) {
                Console.Error.WriteLine("Problem reading url {0}, message {1}.", url, ex.Message);
                return "";
            }
        }
    }
}

 

Technorati Tags: ,

Published by

Julian Birch

Full time dad, does a bit of coding on the side.

One thought on “Retlang 0.4”

  1. Somehow the GUI version is behaving different to the console version (set hasGUI to true explicit and run as a windows app. from VS2008 vs. the console app., using the original code)Somehow the console app. finds more URLs in the to spider pages using the same base url.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s