A Node.js Web Spider

It’s been a while since I last did this kata, and it comes out differently every time. This being node.js, all IO is asynchronous and there’s only one thread. I fairly quickly realized that I didn’t want to hammer the remote server, so I needed a way of keeping the number of concurrent requests down. This being JavaScript, there are vastly more regular expressions in the code than if I’d written it in C#. Being a little too fond of Lisp at the moment, there are hardly any classes and everything’s done via closures. Hope you like CoffeeScript.

The work manager exposes one method: please. Its parameters amount to a function call plus a callback. Calls to please execute immediately until maxRequests requests are outstanding; after that they are queued.

http = require 'http'
fs = require 'fs'
site = 'yourwebsite.com'
rootUrl = '/blog'
baseUrl = "http://#{site}#{rootUrl}"
client = http.createClient 80, site

# Throttle the number of requests
class WorkManager
  constructor : (@maxRequests) ->
    @queue = []
    @count = 0
  please : (f, args..., callback) ->
    @count++
    task = =>
      args.push =>
        @count--
        @queue.pop()?() # Pop and execute
        callback.apply this, arguments
      f.apply this, args
    if (@count >= @maxRequests)
      @queue.push task
    else
      task()

wm = new WorkManager 8 # Don't bother the website too much
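For comparison, here is roughly the same throttling idea in plain JavaScript. This is a hand-written sketch, not the compiler's output; makeWorkManager and the single-argument please signature are my own simplifications of the CoffeeScript above.

```javascript
// A sketch of the same throttle: run tasks immediately while few enough
// calls are outstanding, queue the rest, and start a queued task
// whenever a callback fires.
function makeWorkManager(maxRequests) {
  var queue = [];
  var count = 0;
  return {
    please: function (f, arg, callback) {
      count++;
      var task = function () {
        f(arg, function (result) {
          count--;
          var next = queue.pop(); // pop and execute, as in the original
          if (next) next();
          callback(result);
        });
      };
      if (count >= maxRequests) queue.push(task);
      else task();
    }
  };
}
```

As in the CoffeeScript, queued tasks are popped off the end of the array, so the queue is last-in, first-out.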

Next, we need a routine to get HTML files down.  It also performs rudimentary logging.  This is the only output.

# Perform web requests
count = 0
get = (url, callback) ->
  req = client.request 'GET', url, {'host' : site}
  count++
  console.log "Requested #{count} : #{url}"
  result = ""
  req.on 'response', (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) -> result += chunk
    res.on 'end', ->
      count--
      console.log "Received #{count} : #{url}"
      callback result
  req.end()

Finally, we need some actual spidering to take place.  The spider consists of three functions. 

  • isInternal prevents our spider from attempting to slurp the entire internet.
  • standardize prevents us from pulling down the same page twice under different names.
  • The spider function itself just asks the work manager to get a page.  When the page is pulled down, it searches it for URLs, throws away external URLs, standardizes the internal ones and spiders them if necessary.

# The actual spider logic

# Detect internal links
prefixes = (new RegExp "^" + p, "i" for p in ['http','ftp','javascript'])
isInternal = (url) ->
  return true if url.indexOf(baseUrl) == 0
  return false for p in prefixes when p.test url
  true # anything left is a relative url, which is internal

# Standardize urls
parentRegExp = new RegExp "[^/]+/[.][.]/"
baseRegExp = new RegExp "//[^/]+(/.*)"
standardize = (url, base) ->
  url = url.replace /#.*/, '' # ignore #urls
  url = url.replace /\/default[.]aspx$/, '/' # ignore default.aspx
  url = url.replace /%20/g, '+' # standardize spaces as +s
  # take out parent references
  url = url.replace parentRegExp, '' while parentRegExp.test url
  m = baseRegExp.exec url
  if m && m.length
    m[1]
  else if url[0] =='/'
    url
  else
    base+url

# Spider urls
internalUrls = {}
externalUrls = {}
regexps = -> new RegExp p, "gi" for p in [
    "href='([^'<>]+)'",
    'href="([^"<>]+)"',
    "href=([^'\" <>]+)"
]
spider = (url) ->
  base = standardize /.*\//.exec(url)[0], ''
  onContent = (content) ->
    for re in regexps()
      while match = re.exec content
        url = match[1].toLowerCase()
        if isInternal url
          url = standardize url, base
          unless internalUrls[url]
            internalUrls[url] = true
            spider url
        else
          externalUrls[url] = true
  wm.please get, url, onContent
spider rootUrl

It’s interesting to observe that in Clojure, “please get, url, onContent” would be “please get url onContent” whilst “get url, onContent” would be “get url onContent”.  The lisp style doesn’t think you’re doing anything weird here.  Most other languages do.

Just shy of 100 lines, including a work manager and comments.  Of course, by the time you’ve compiled it to JavaScript, it’s about 150 with no comments.  About 35 of those extra lines read “close curly brace”.

UPDATE: Embarrassingly, the original version of this code didn’t quite work.  I clearly didn’t pay enough attention to exactly which version I put up.  The problem was that the regular expressions were shared static instances, and a global regex in JavaScript keeps its lastIndex state between uses, so you can only run it over a document once.  CoffeeScript made this an easy bug to fix: putting a “->” in front of the regexps variable turns it into a function, and adding “()” where it is used invokes it, so fresh regexes are built on every call.
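The statefulness is easy to see in plain JavaScript. This standalone illustration uses one of the spider's three patterns; the regexps function at the end mirrors the fix described above.

```javascript
// A regex with the g flag keeps its lastIndex between exec() calls,
// so a single shared instance "runs out" of matches.
var shared = /href="([^"<>]+)"/gi;
var page = '<a href="/blog/one">';

var first = shared.exec(page);   // finds the link
var second = shared.exec(page);  // null: lastIndex is now past the match

// The fix mirrors the CoffeeScript change: build fresh regexes per call.
function regexps() {
  return ['href="([^"<>]+)"'].map(function (p) {
    return new RegExp(p, 'gi');
  });
}
```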

As shown in the comments, I used to pass a “c” flag to the regular expressions, which is completely useless.  That has been removed.

The only other change is an undocumented feature of CoffeeScript.  The code originally read “t = @queue.pop(); t() if t?”.  Now, in CoffeeScript, “t?()” calls t only if it is not null or undefined.  This allows the two statements to be simplified to “@queue.pop()?()”, which is either elegant or hermetic, depending on your experience.
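In JavaScript terms, the existential call desugars to a null check on a temporary. This is a hand-written equivalent, not the compiler's exact output:

```javascript
// What "queue.pop()?()" amounts to: pop once, and only call the
// result if it is neither null nor undefined.
function popAndRun(queue) {
  var task = queue.pop();
  if (task != null) {
    return task();
  }
}
```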


Published by

Julian Birch

Full time dad, does a bit of coding on the side.

One thought on “A Node.js Web Spider”

  1. @Maksym Sorry it’s taken so long to respond. The code’s working again now, with minor modifications. The good news is that it wasn’t an instability in coffeescript versions, it was my own idiocy. Well, it’s good news for coffeescript users, anyway…
