A Node.js Web Spider

It’s been a while since I’ve written this kata.  It’s different every time I do it.  This being node.js, all IO is asynchronous, and there’s only one thread.  I fairly quickly realized that I didn’t necessarily want to hammer the remote server, so I needed to come up with a way of keeping the number of requests down.  This being JavaScript, there are vastly more regular expressions in the code than if I’d written in C#.  Being a little too fond of LISP at the moment, there’s hardly any classes and everything’s done via closures.  Hope you like CoffeeScript.

The WorkManager exposes one method: please.  The parameters amount to a function call with a callback: the function, its arguments, then the callback.  The first eight please calls are executed immediately; the others are queued.

http = require 'http'
fs = require 'fs'
site = 'yourwebsite.com'
rootUrl = '/blog'
baseUrl = "http://#{site}#{rootUrl}"
client = http.createClient 80, site

# Throttle the number of requests
class WorkManager
  constructor : (@maxRequests) ->
    @queue = []
    @count = 0
  please : (f, args..., callback) ->
    @count++
    task = =>
      args.push =>
        @count--
        @queue.pop()?() # Pop and execute
        callback.apply this, arguments
      f.apply this, args
    if @count > @maxRequests # @count already includes this task
      @queue.push task
    else
      task()
wm = new WorkManager 8 # Don't bother the website too much
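
For readers who don't speak CoffeeScript, here is a rough plain-JavaScript sketch of the same throttling idea.  It is not a line-for-line translation: createWorkManager, stats and fakeGet are made-up names, it counts only running tasks, and it serves the backlog oldest-first, whereas the version above pops the newest task.

```javascript
// Sketch only: a throttle in the spirit of WorkManager, in plain JavaScript.
function createWorkManager(maxRequests) {
  const queue = [];   // tasks waiting for a free slot
  let running = 0;    // tasks started but not yet called back

  function please(f, ...rest) {
    const callback = rest.pop();      // the last argument is the caller's callback
    const task = () => {
      running++;
      f(...rest, (...results) => {
        running--;
        const next = queue.shift();   // oldest waiting task, if any
        if (next) next();
        callback(...results);
      });
    };
    if (running >= maxRequests) queue.push(task);
    else task();
  }

  return { please, stats: () => ({ running, queued: queue.length }) };
}

// Usage with a fake, hand-cranked "get" so the order of events is easy to follow.
const pending = [];
const fakeGet = (url, done) => pending.push(() => done('body of ' + url));
const received = [];
const manager = createWorkManager(2);
for (const url of ['/a', '/b', '/c']) {
  manager.please(fakeGet, url, (body) => received.push(body));
}
const statsBefore = manager.stats();  // two running, one queued
pending[0]();                         // "/a" finishes, which lets "/c" start
const statsAfter = manager.stats();   // two running again, nothing queued
```

With a limit of two, the third request only starts once the first one has called back.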

Next, we need a routine to get HTML files down.  It also performs rudimentary logging.  This is the only output.

# Perform web requests
count = 0
get = (url, callback) ->
  req = client.request 'GET', url, {'host' : site}
  count++
  console.log "Requested #{count} : #{url}"
  result = ""
  req.on 'response', (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) -> result += chunk
    res.on 'end', ->
      count--
      console.log "Received #{count} : #{url}"
      callback result
  req.end()

Finally, we need some actual spidering to take place.  The spider consists of three functions. 

  • isInternal is used to prevent our spider from attempting to slurp the internet.
  • standardize is used to prevent us pulling down the same page twice.
  • The spider function itself just asks the WorkManager to get a page.  When the page is pulled down, it searches it for URLs, throws away external URLs, standardizes the internal ones and spiders them if necessary.

# The actual spider logic

# Detect internal links
prefixes = (new RegExp "^" + p, "i" for p in ['http','ftp','javascript'])
isInternal = (url) ->
  return true if url.indexOf(baseUrl) == 0
  return false if p.test url for p in prefixes
  return 0 == url.indexOf 'javascript-error'

# Standardize urls
parentRegExp = new RegExp "[^/]+/[.][.]/"
baseRegExp = new RegExp "//[^/]+(/.*)"
standardize = (url, base) ->
  url = url.replace /#.*/, '' # ignore #urls
  url = url.replace /\/default[.]aspx$/, '/' # ignore default.aspx
  url = url.replace /%20/g, '+' # standardize spaces as +s
  url = url.replace parentRegExp, '' while parentRegExp.test url
  # take out parent references
  m = baseRegExp.exec url
  if m && m.length
    m[1]
  else if url[0] =='/'
    url
  else
    base+url

# Spider urls
internalUrls = {}
externalUrls = {}
regexps = -> new RegExp p, "gi" for p in [
    'href=\'([^\'<>]+)\'',
    'href="([^"<>]+)"',
    'href=([^\'" <>]+)'
]
spider = (url) ->
  base = standardize /.*\//.exec(url)[0], ''
  onContent = (content) ->
    for re in regexps()
      while match = re.exec content
        url = match[1].toLowerCase()
        if isInternal url
          url = standardize url, base
          unless internalUrls[url]
            internalUrls[url] = true
            spider url
        else
          externalUrls[url] = true
  wm.please get, url, onContent
spider rootUrl
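
The URL clean-up is the easiest part to try in isolation.  Here is a minimal sketch of the same rules in plain JavaScript; standardizeUrl is an illustrative name, and the sample URLs are made up.

```javascript
// Sketch of the clean-up rules: fragments, default.aspx, spaces, "dir/../", host.
function standardizeUrl(url, base) {
  let u = url
    .replace(/#.*/, '')                  // ignore #fragments
    .replace(/\/default[.]aspx$/, '/')   // treat default.aspx as its directory
    .replace(/%20/g, '+');               // standardize spaces as +
  const parent = /[^\/]+\/[.][.]\//;     // matches "dir/../"
  while (parent.test(u)) u = u.replace(parent, '');
  const m = /\/\/[^\/]+(\/.*)/.exec(u);  // strip scheme and host if present
  if (m) return m[1];
  return u[0] === '/' ? u : base + u;    // relative links hang off the base
}

const absolute = standardizeUrl('http://yourwebsite.com/blog/a/../b/default.aspx#top', '/blog/');
const relative = standardizeUrl('my%20post.html', '/blog/');
const rooted = standardizeUrl('/blog/x#y', '');
```

Three differently spelled links to the same place collapse to one key each, which is what stops the spider fetching a page twice.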

It’s interesting to observe that in Clojure, “please get, url, onContent” would be “please get url onContent” whilst “get url, onContent” would be “get url onContent”.  The lisp style doesn’t think you’re doing anything weird here.  Most other languages do.

Just shy of 100 lines, including a work manager and comments.  Of course, by the time you’ve compiled it to JavaScript, it’s about 150 with no comments.  About 35 of those extra lines read “close curly brace”.

UPDATE: Embarrassingly, the original version of this code didn’t quite work.  I clearly didn’t pay enough attention to exactly which version I put up.  The problem was that the regular expressions were static, which means you can only use them once in JavaScript.  CoffeeScript made this an easy bug to fix: putting a “->” in front of the regexps definition turns it into a function, and adding () where it is used invokes it.
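
One way a shared regular expression goes wrong in JavaScript is that a regular expression with the g flag remembers where its last match ended.  Whether that is exactly what bit the original code is an assumption; the sketch below, with made-up page strings, only shows the mechanism and why a factory sidesteps it.

```javascript
// Sketch: a shared /g regular expression carries state between strings.
const shared = /href="([^"]+)"/g;
const pageOne = '<a href="/a"></a><a href="/b"></a>';
const pageTwo = '<a href="/c"></a>';

const firstLink = shared.exec(pageOne)[1];   // "/a"
const positionAfterFirst = shared.lastIndex; // the next search starts from here
const missed = shared.exec(pageTwo);         // resumes mid-string, so "/c" is skipped

// A factory hands out a fresh regular expression, and fresh state, every time.
const makeLinkRegExp = () => /href="([^"]+)"/g;
const found = makeLinkRegExp().exec(pageTwo)[1];
```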

As shown in the comments, I used to pass a “c” flag to the regular expressions, which is completely useless.  That has been removed.

The only other change is an undocumented feature of CoffeeScript.  The code originally read “t = pop();t() if t?”.  Now, in CoffeeScript, “t?()” calls t if it is not null.  This allows the two lines to be simplified to “pop()?()”, which is either elegant or hermetic, depending on your experience.
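
For comparison, here is a small sketch of the same “call it if there is one” move in plain JavaScript.  The closest built-in equivalent is the optional call “?.()”, which arrived in ES2020 and skips the call when the value is null or undefined; the tasks and log names are just for illustration.

```javascript
// Sketch: calling "whatever pop() returned, if anything".
const tasks = [];
const log = [];
tasks.push(() => log.push('first'));
tasks.push(() => log.push('second'));

// Spelled out: take the task, then guard the call by hand.
const t = tasks.pop();
if (typeof t === 'function') t();

// Compact: optional call.
tasks.pop()?.();   // runs the remaining task
tasks.pop()?.();   // nothing left, pop() gives undefined, so nothing is called
```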


JavaScript: Everything’s an Instance

I’ve been coming to the opinion for some time now that static methods aren’t the problem.  Global variables are the problem.  You may think that your code doesn’t have many globals in it, but effectively, it’s littered with immutable global variables:

  • Static Methods are immutable globals
  • Classes are immutable globals
  • Namespaces are immutable globals

If this all sounds a bit like I’m saying “Look at all these trees!  There must be a wood nearby!” you’re right.  The point I’m making is that techniques such as dependency inversion are all geared to reducing the impact of this immutable baggage we’re carrying around.

Let’s contrast the Java/C# approach with JavaScript:

  • function() just declares a variable.  It’s of type function.
  • This variable has properties that you can add or remove, just like any other object.
  • One of these, prototype, is normally considered to be the “class” of the object.
  • The prototype, however, is just another instance object.
  • The global object is, itself, an instance you can manipulate.
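
A short sketch makes the list above concrete.  Greeter, callCount and sharedSetting are invented for the example.

```javascript
// Sketch: a function is a value with properties, and its prototype is just an object.
function Greeter(name) { this.name = name; }
const kind = typeof Greeter;                 // "function"

Greeter.callCount = 0;                       // add a property to the function itself
Greeter.prototype.greet = function () {
  Greeter.callCount++;
  return 'hello ' + this.name;
};

const g = new Greeter('world');
const firstGreeting = g.greet();

// Swap the method on the prototype; the existing instance picks it up.
Greeter.prototype.greet = function () { return 'goodbye ' + this.name; };
const secondGreeting = g.greet();

const callsSeen = Greeter.callCount;
delete Greeter.callCount;                    // and remove the property again

// The global object can be manipulated like any other instance.
globalThis.sharedSetting = 'on';
const visibleEverywhere = typeof sharedSetting;
```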

But that’s not the coolest bit.  The coolest bit is “require”.  Require is the function you use in CommonJS to import a module.  However,

  • Require just returns an instance.
  • You can actually replace require (although you’re unlikely to, because it has properties allowing you to tweak its behaviour)
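
As a rough illustration, assuming Node's CommonJS loader: the sketch below wraps require rather than truly replacing it, and makeRequire and fakeFs are made-up names.

```javascript
// Sketch: require hands back an ordinary object, and the loader can be wrapped.
const os = require('os');
const osAgain = require('os');
const sameInstance = os === osAgain;   // both names point at one cached object

// Hypothetical helper: look in a table of stand-ins first, then defer to the real loader.
function makeRequire(overrides, realRequire) {
  return (name) =>
    Object.prototype.hasOwnProperty.call(overrides, name)
      ? overrides[name]
      : realRequire(name);
}

const fakeFs = { readFileSync: () => 'stub contents' };
const stubbedRequire = makeRequire({ fs: fakeFs }, require);
const contents = stubbedRequire('fs').readFileSync('/no/such/file');
const untouched = stubbedRequire('path') === require('path');
```

Code that is handed stubbedRequire gets the stand-in for fs and the real thing for everything else.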

I’m not sure even Clojure allows you to just replace the module loader (although bindings are extremely powerful). 

To put it another way, if namespaces aren’t a problem, how come their names are so long?

Hacking State in Vows.js with CoffeeScript

I can’t tell if you’re meant to do this kind of thing to vows.  I honestly can’t.  It’s either a demonstration of the power of the architecture, or it’s a phenomenal hack that shouldn’t be allowed out in broad daylight.  Either way, I’m probably thinking too LISP-ily for my own good.

Let’s say that you’re trying to test the behaviour of a workflow.  Under certain conditions, certain things should happen.  The problem is, some of those certain conditions are pretty verbose.  In fact, if you’ve got three yes/no decisions to make, you’re left having to set up eight different scenarios (more if you’re testing the intermediate states).

Now, in most testing environments, this is what you have “setup” for.  However, it only really works if you can have nested setup procedures, like “before” in RSpec.  Vows, on the other hand, creates a topic once and then tests run against that one topic.  Not really ideal for testing workflows.  So, I thought, “why not make the topic itself a factory?”  That way, I could call the topic in each test and set it up repeatedly.  Then it occurred to me that, ideally, later topics should contain instructions on how to get the topic into the correct state, reducing the amount of repetition that we saw in my previous post.  Finally it occurred to me that, ultimately, the entire batch specification is just a hash table.

So, I wrote a function that rewrites a batch to do workflows:

withSetup = (batch) ->
  setupTopic = (f) -> (topic) -> ->
    # N.B. The topic is a factory.  The setupTopic function returns a factory as well
    return f() unless topic? # Resolve item
    # Apply item to topic and return topic
    t = topic()
    f(t)
    t
  inner = (item) ->
    # The item is a test
    # Take the topic, resolve it and run tests in "item"
    return ((topic) -> item topic()) unless typeof item == 'object'
    # The item is a batch
    for k,v of item
      item[k] = (if k == 'topic' then setupTopic else inner) v
    item
  inner batch
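
Stripped of the Vows plumbing, the factory-chaining trick looks something like this in plain JavaScript.  This is a sketch of the idea only: chainSetup and the events array are invented for the example and are not part of Vows.

```javascript
// Sketch: each level wraps its parent's factory and adds one more setup step.
function chainSetup(parentFactory, step) {
  return () => {
    if (!parentFactory) return step();   // root level: the step builds the object
    const topic = parentFactory();       // rebuild everything above us from scratch
    step(topic);                         // then apply this level's setup
    return topic;
  };
}

const root = chainSetup(null, () => ({ events: [] }));
const started = chainSetup(root, (t) => t.events.push('start'));
const answered = chainSetup(started, (t) => t.events.push('answer'));

const fresh = root();        // no events yet
const deep = answered();     // has been through both setup steps
const another = answered();  // a separate object, set up the same way
```

Every call walks the chain again, so each test gets its own object in the right state.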

All you need to do is add withSetup to the addBatch invocation.

vows.describe('Guessing Game').addBatch(withSetup({
  'Player is playing a guessing game' : {
    topic : -> new game.Player(new StubEmitter(), guessingGameFactory)
    'should be able to start a game' : (p) ->
      assert.isFalse p.game?
      p.client.emit 'message', { action : 'start' }
      assert.isNotNull p.game
      assert.equal p.client.data.question, "Guess what number I'm thinking of"
    'after game has started' : {
      topic : (p) -> p.client.emit 'message', { action : 'start' }
      'correct guess' : (p) ->
        p.client.emit 'message', { action : 'answer', answer : 1 }
        assert.isTrue p.client.data.wasRight
      'wrong guess' : (p) ->
        p.client.emit 'message', { action : 'answer', answer : 2 }
        assert.isFalse p.client.data.wasRight
      'after correct answer' : {
        topic : (p) -> p.client.emit 'message', { action : 'answer', answer : 1 }
        "We're now on the second question" : (p) ->
          assert.equal p.playerActions.game.currentQuestionCount, 2
      }
    }
  }
})).export module

As I say, I can’t figure out if this works because Cloudhead’s really smart or I’m really stupid.

Getting Syntax Colouring in HTML using Vim

If you’re wondering how I got the syntax colouring on my CoffeeScript post, here’s the important bits:

  • The VIM :TOhtml command absolutely rocks.  (Thanks to the Vim gurus who pointed me towards it.)
  • The molokai colour scheme is very pretty.
  • The ability to :so! a list of commands in Vi saves a lot of time.

Here’s the commands I use (formatted, of course, using the program itself)

/<body
:s/bgcolor="/background : /
:s/" text="/; color : /
:s/<body /<pre class='code' style="/
:s/"><.*//
A; -moz-background-size: auto auto; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial">^[kdgg
/<\/body
dGo</pre>^[
:2,-1s/<span.*\/span>/<span>/
G:2,-1s/<br>/<\/span>/

Vim’s colour-coding of keyboard macros leaves something to be desired, but hey, you can’t have everything.  What I do have is something significantly easier to use than the solution I’ve been using for Visual Studio.  There’s a couple of things the Visual Studio solution gets right, like highlighting mutable variables and showing interfaces and classes in different colours, but there’s a lot more manual work involved.

If you are using it, make sure the HTML is the only buffer currently visible.  I’d be very glad to hear from someone who knows how to fix that particular bug.  For non-VIM users, this may look like line noise, but I assure you you’ll be able to read it after a fairly short period of time practicing Vim.  More usefully, you’ll be able to write your own.  One final tip: ^[ is a single character: use CTRL-Q CTRL-[ to get it on Windows.


CoffeeScript: Looking for the Catch

I’m just starting out with CoffeeScript, but it’s currently looking incredible.  Normally I loathe languages that rewrite other languages, but CoffeeScript appears to be well thought out and targets the JavaScript syntax whilst leaving the semantics pretty much unchanged.  Runtime errors are found in the generated files, not the originals, which is a bit rubbish but I’m guessing fixable (as long as V8 supports it).

Let me share with you some code I wrote the other day.  (Feedback on the idioms in the code is welcome: I’m just starting with node.js.)

var util = require('util'),
    EventEmitter = require('events').EventEmitter;

function SocketPlayer(client) {
  EventEmitter.call(this);
}
util.inherits(SocketPlayer, EventEmitter);

exports.Game = function (questionFactory, questionCount, player, setTimeout) {
  var self = this;
  this.setTimeout = setTimeout || global.setTimeout; // the parameter shadows the global timer
  this.questionFactory = questionFactory;
  this.delay = 1000;
  this.questionCount = questionCount;
  this.currentQuestionCount = 0;

  function advanceQuestion(wasRight) {
    self.currentQuestionCount++;
    self.currentQuestion = questionFactory();
    var dto = self.currentQuestion.dto();
    dto.wasRight = wasRight;
    player.send(dto);
  };
  player.on('answer', function(data) {
    console.log(data.answer);
    var wasRight = self.currentQuestion.isCorrect(data.answer);
    if (wasRight) {
      advanceQuestion(true);
    } else {
      var explanation = self.currentQuestion.explanationDto();
      explanation.delay = self.delay;
      explanation.wasRight = wasRight;
      player.send(explanation);
      self.setTimeout(advanceQuestion, self.delay);
      self.delay *= 2;
    }
  });

  advanceQuestion(false);
}

So, this represents an abstract game.  The player gets the next question if they get one right, an explanation of the right answer if they get it wrong.  (If it helps, think of hangman.)  I think this code is alright.  However, take a look at the version in CoffeeScript:

util = require('util')
EventEmitter = require('events').EventEmitter

class SocketPlayer extends EventEmitter

class exports.Game
  constructor : (@questionFactory, @questionCount, @player, @setTimeout) ->
    @setTimeout ?= global.setTimeout
    @delay = 1000
    @currentQuestionCount = 0
    @advanceQuestion = (wasRight) ->
      @currentQuestionCount++
      @currentQuestion = @questionFactory() # Returns the next question
      dto = @currentQuestion.dto()
      dto.wasRight = wasRight # but the answer to the previous question
      @player.send(dto)
    @player.on 'answer', (data) =>
      console.log data.answer
      if wasRight = @currentQuestion.isCorrect data.answer
        @advanceQuestion true
      else
        explanation = @currentQuestion.explanationDto()
        explanation.delay = @delay
        explanation.wasRight = wasRight
        @player.send explanation
        @setTimeout (=> @advanceQuestion(false)), @delay
        @delay *= 2
    @advanceQuestion false

How About Testing?

Here’s the tests I wrote in JavaScript:

var vows = require('vows'),
    eyes = require('eyes'),
    assert = require('assert'),
    game = require('../src/game.js'),
    util = require('util'),
    EventEmitter = require('events').EventEmitter;
function guess(n) {
  this.isCorrect = function(a) { return a == n; }
  this.explanationDto = function() {
    return { correctAnswer : n };
  }

  this.dto = function() { return { question : "Guess what number I'm thinking of" }}
}

function StubPlayer() {
  EventEmitter.call(this)
  var self = this
  this.send = function(data) { self.data = data }
}
util.inherits(StubPlayer, EventEmitter);

function GuessingGame() {
  this.player = new StubPlayer();
  game.Game.call(this,
    function() { return new guess(1); },
    10,
    this.player,
    function(action, delay) {  });
  return this;
}

vows.describe('Guessing Game').addBatch({
  'Given a guessing game' : {
    topic : GuessingGame,
    'when you guess correctly' : {
      topic : function(topic) {
        var t = new GuessingGame();
        t.player.emit('answer', { 'answer' : 1 });
        return t;
      },
      'then it says you were right' : function(topic) {
        eyes.inspect(topic.player.data);
        assert.isTrue(topic.player.data.wasRight);
      }
    },
    'when you guess wrong' : {
      topic : function(topic) {
        var t = new GuessingGame();
        t.player.emit('answer', { 'answer' : 2 });
        return t;
      },
      'then it says you were wrong' : function(topic) {
        assert.isFalse(topic.player.data.wasRight);
      }
    }
  }
}).export(module);

Again, let’s see what it was in CoffeeScript.

vows = require('vows')
eyes = require('eyes')
assert = require('assert')
game = require('../src/game2.js')
EventEmitter = require('events').EventEmitter

class Guess
  constructor : (n) ->
    @isCorrect = (a) -> a == n
    @explanationDto = -> { correctAnswer : n }
    @dto = -> { question :  "Guess what number I'm thinking of" }

class StubPlayer extends EventEmitter
  send : (data) -> @data = data

class GuessingGame extends game.Game
  constructor : ->
    super((-> new Guess 1),
      10,
      new StubPlayer,
      -> 0)

vows.describe('Guessing Game CoffeeScript example').addBatch({
  'Given a guessing game' : {
    topic : GuessingGame
    'when you guess correctly' : {
      topic : (topic) ->
        t = new GuessingGame
        t.player.emit 'answer', { 'answer' : 1 }
        return t
      'then it says you were right' : (topic) ->
        # eyes.inspect topic.player.data
        assert.isTrue topic.player.data.wasRight
    }
    'when you guess wrong' : {
      topic : (topic) ->
        t = new GuessingGame()
        t.player.emit 'answer', { 'answer' : 2 }
        return t
      'then it says you were wrong' : (topic) ->
        # eyes.inspect topic
        assert.isFalse topic.player.data.wasRight
    }
  }
}).export module

The actual vows stuff is slightly shorter, but it’s the setting up of the stub classes that really makes CoffeeScript shine.  StubPlayer is a very good example of this.  All it needs to be is an EventEmitter that captures what it is sent.  The sheer ceremony involved in declaring that in JavaScript was pretty painful.

The Smooth and the Rough

Although this code is pretty short, it’s a pretty good tour of some tricksy things about CoffeeScript.  First, it’s worth understanding that CoffeeScript really is JavaScript.  If you don’t understand JavaScript, you won’t get anywhere with CoffeeScript.  Let’s see some examples of this:

  • the difference between -> and => requires you to understand how “this” behaves in JavaScript. 
  • If you take a look at the function “advanceQuestion”, you’ll see that it’s instantiated in the constructor, not added to the prototype.  This makes no sense until you realize the function is called from the constructor, and in JavaScript, the whole idea of calling a prototype method from the constructor makes no sense at all.
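
The -> versus => distinction maps directly onto ordinary functions versus arrow functions in JavaScript.  A minimal sketch, with Counter and runLater invented for the example (runLater stands in for anything that calls your callback later, such as a timer):

```javascript
// Sketch: how "this" behaves when a method is handed off as a bare callback.
function Counter() {
  this.count = 0;
  this.looseIncrement = function () {      // like CoffeeScript's ->
    if (this instanceof Counter) this.count++;
    return this instanceof Counter;
  };
  this.boundIncrement = () => {            // like CoffeeScript's =>
    this.count++;
    return this instanceof Counter;
  };
}

const runLater = (callback) => callback(); // calls the function with no receiver
const counter = new Counter();
const looseSawCounter = runLater(counter.looseIncrement); // "this" is not the counter
const boundSawCounter = runLater(counter.boundIncrement); // "this" is still the counter
```

Only the bound version manages to update the count.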

Some other cool things about constructors:

  • extends deals with all of that util.inherits nonsense
  • super is vastly more pleasant than copying class names all over the place.  But you still need to explicitly call super
  • parameters that begin with an @ sign are automatically made instance fields (more languages should have this feature)
  • Just like in JavaScript, the classes are just variables.  Declare a variable as “exports.Game” and you’ll export “Game” from the module.  Everything else is private.
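
To see what the @ parameters save you, here is a sketch using the class syntax that later versions of JavaScript (ES2015 onwards) picked up.  Animal and Dog are made-up classes; the point is the line CoffeeScript would write for you.

```javascript
// Sketch: extends and super exist in modern JavaScript, but fields are still assigned by hand.
class Animal {
  constructor(name) {
    this.name = name;     // CoffeeScript: constructor : (@name) ->
  }
  describe() { return this.name + ' is an animal'; }
}

class Dog extends Animal {
  constructor(name, trick) {
    super(name);          // still has to be called explicitly
    this.trick = trick;   // an @trick parameter would make this line unnecessary
  }
  describe() { return super.describe() + ' that can ' + this.trick; }
}

const rex = new Dog('Rex', 'sit');
const description = rex.describe();
```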

The ugly:

  • You don’t have to type in brackets, but you need to be careful where you leave them out.  For instance, if “addBatch” doesn’t have an explicit bracket, the implicit close bracket ends up after module, rather than before export.  (This might be fixable.)
  • Debugging pretty much requires you to read the generated JavaScript.  Luckily, it’s pretty good JavaScript.