A Node.js Web Spider

It’s been a while since I last did this kata, and it comes out differently every time. This being node.js, all IO is asynchronous and there’s only one thread. I fairly quickly realized that I didn’t want to hammer the remote server, so I needed a way of keeping the number of concurrent requests down. This being JavaScript, there are vastly more regular expressions in the code than if I’d written it in C#. Being a little too fond of Lisp at the moment, there are hardly any classes and everything’s done via closures. Hope you like CoffeeScript.

The work manager exposes one method: please. Its parameters amount to a function call plus a callback. Calls to please execute immediately until maxRequests requests are outstanding; after that they are queued.

http = require 'http'
fs = require 'fs'
site = 'yourwebsite.com'
rootUrl = '/blog'
baseUrl = "http://#{site}#{rootUrl}"
client = http.createClient 80, site

# Throttle the number of requests
class WorkManager
  constructor : (@maxRequests) ->
    @queue = []
    @count = 0
  please : (f, args..., callback) ->
    @count++
    task = =>
      args.push =>
        @count--
        @queue.pop()?() # Pop and execute
        callback.apply this, arguments
      f.apply this, args
    if (@count >= @maxRequests)
      @queue.push task
    else
      task()

wm = new WorkManager 8 # Don't bother the website too much
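For comparison, here is roughly the same throttling idea in plain JavaScript. This is a hand-written sketch, not the compiler's output; makeWorkManager and the single-argument please signature are my own simplifications of the CoffeeScript above.

```javascript
// A sketch of the same throttle: run tasks immediately while few enough
// calls are outstanding, queue the rest, and start a queued task
// whenever a callback fires.
function makeWorkManager(maxRequests) {
  var queue = [];
  var count = 0;
  return {
    please: function (f, arg, callback) {
      count++;
      var task = function () {
        f(arg, function (result) {
          count--;
          var next = queue.pop(); // pop and execute, as in the original
          if (next) next();
          callback(result);
        });
      };
      if (count >= maxRequests) queue.push(task);
      else task();
    }
  };
}
```

As in the CoffeeScript, queued tasks are popped off the end of the array, so the queue is last-in, first-out.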

Next, we need a routine to get HTML files down.  It also performs rudimentary logging.  This is the only output.

# Perform web requests
count = 0
get = (url, callback) ->
  req = client.request 'GET', url, {'host' : site}
  count++
  console.log "Requested #{count} : #{url}"
  result = ""
  req.on 'response', (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) -> result += chunk
    res.on 'end', ->
      count--
      console.log "Received #{count} : #{url}"
      callback result
  req.end()

Finally, we need some actual spidering to take place.  The spider consists of three functions. 

  • isInternal prevents our spider from attempting to slurp the entire internet.
  • standardize prevents us from pulling down the same page twice under different names.
  • The spider function itself just asks the work manager to get a page.  When the page is pulled down, it searches it for URLs, throws away external URLs, standardizes the internal ones and spiders them if necessary.

# The actual spider logic

# Detect internal links
prefixes = (new RegExp "^" + p, "i" for p in ['http','ftp','javascript'])
isInternal = (url) ->
  return true if url.indexOf(baseUrl) == 0
  return false for p in prefixes when p.test url
  true # anything left is a relative url, which is internal

# Standardize urls
parentRegExp = new RegExp "[^/]+/[.][.]/"
baseRegExp = new RegExp "//[^/]+(/.*)"
standardize = (url, base) ->
  url = url.replace /#.*/, '' # ignore #urls
  url = url.replace /\/default[.]aspx$/, '/' # ignore default.aspx
  url = url.replace /%20/g, '+' # standardize spaces as +s
  # take out parent references
  url = url.replace parentRegExp, '' while parentRegExp.test url
  m = baseRegExp.exec url
  if m && m.length
    m[1]
  else if url[0] =='/'
    url
  else
    base+url

# Spider urls
internalUrls = {}
externalUrls = {}
regexps = -> new RegExp p, "gi" for p in [
    "href='([^'<>]+)'",
    'href="([^"<>]+)"',
    "href=([^'\" <>]+)"
]
spider = (url) ->
  base = standardize /.*\//.exec(url)[0], ''
  onContent = (content) ->
    for re in regexps()
      while match = re.exec content
        url = match[1].toLowerCase()
        if isInternal url
          url = standardize url, base
          unless internalUrls[url]
            internalUrls[url] = true
            spider url
        else
          externalUrls[url] = true
  wm.please get, url, onContent
spider rootUrl

It’s interesting to observe that in Clojure, “please get, url, onContent” would be “please get url onContent” whilst “get url, onContent” would be “get url onContent”.  The lisp style doesn’t think you’re doing anything weird here.  Most other languages do.

Just shy of 100 lines, including a work manager and comments.  Of course, by the time you’ve compiled it to JavaScript, it’s about 150 with no comments.  About 35 of those extra lines read “close curly brace”.

UPDATE: Embarrassingly, the original version of this code didn’t quite work.  I clearly didn’t pay enough attention to exactly which version I put up.  The problem was that the regular expressions were shared static instances, and a global regex in JavaScript keeps its lastIndex state between uses, so you can only run it over a document once.  CoffeeScript made this an easy bug to fix: putting a “->” in front of the regexps variable turns it into a function, and adding “()” where it is used invokes it, so fresh regexes are built on every call.
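The statefulness is easy to see in plain JavaScript. This standalone illustration uses one of the spider's three patterns; the regexps function at the end mirrors the fix described above.

```javascript
// A regex with the g flag keeps its lastIndex between exec() calls,
// so a single shared instance "runs out" of matches.
var shared = /href="([^"<>]+)"/gi;
var page = '<a href="/blog/one">';

var first = shared.exec(page);   // finds the link
var second = shared.exec(page);  // null: lastIndex is now past the match

// The fix mirrors the CoffeeScript change: build fresh regexes per call.
function regexps() {
  return ['href="([^"<>]+)"'].map(function (p) {
    return new RegExp(p, 'gi');
  });
}
```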

As shown in the comments, I used to pass a “c” flag to the regular expressions, which is completely useless.  That has been removed.

The only other change is an undocumented feature of CoffeeScript.  The code originally read “t = @queue.pop(); t() if t?”.  Now, in CoffeeScript, “t?()” calls t only if it is not null or undefined.  This allows the two statements to be simplified to “@queue.pop()?()”, which is either elegant or hermetic, depending on your experience.
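In JavaScript terms, the existential call desugars to a null check on a temporary. This is a hand-written equivalent, not the compiler's exact output:

```javascript
// What "queue.pop()?()" amounts to: pop once, and only call the
// result if it is neither null nor undefined.
function popAndRun(queue) {
  var task = queue.pop();
  if (task != null) {
    return task();
  }
}
```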


Published by

Julian Birch

Full time dad, does a bit of coding on the side.

One thought on “A Node.js Web Spider”

  1. @Maksym Sorry it’s taken so long to respond. The code’s working again now, with minor modifications. The good news is that it wasn’t an instability in coffeescript versions, it was my own idiocy. Well, it’s good news for coffeescript users, anyway…
