It’s been a while since I’ve written this kata, and it’s different every time I do it. This being node.js, all IO is asynchronous and there’s only one thread. I fairly quickly realized that I didn’t want to hammer the remote server, so I needed a way of keeping the number of concurrent requests down. This being JavaScript, there are vastly more regular expressions in the code than if I’d written it in C#. I’m a little too fond of LISP at the moment, so there are hardly any classes and nearly everything’s done via closures. Hope you like CoffeeScript.
The work manager exposes one method: please. Its parameters amount to a function call with a callback: the function, then its arguments, then the callback. The first eight please calls are executed immediately; the others are queued until a running one finishes.
    http = require 'http'
    fs = require 'fs'

    site = 'yourwebsite.com'
    rootUrl = '/blog'
    baseUrl = "http://#{site}#{rootUrl}"
    client = http.createClient 80, site

    # Throttle the number of requests
    class WorkManager
      constructor : (@maxRequests) ->
        @queue = []
        @count = 0
      please : (f, args..., callback) ->
        @count++
        task = =>
          # Wrap the callback so that finishing a task starts a queued one
          args.push =>
            @count--
            @queue.pop()?() # Pop and execute
            callback.apply this, arguments
          f.apply this, args
        # @count already includes this call, so compare with > to let
        # exactly @maxRequests tasks run at once
        if @count > @maxRequests
          @queue.push task
        else
          task()

    wm = new WorkManager 8 # Don't bother the website too much
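To watch the throttling happen without touching a network, here is a minimal sketch of the same idea written as a closure rather than a class. The names makeLimiter and fakeGet are made up for this example, and the limit is 2 rather than 8 to keep the trace short. Unlike the work manager above, it hands out queued work oldest-first (shift) rather than newest-first (pop).

```coffeescript
# Closure-based sketch of the throttle (makeLimiter is a hypothetical name).
makeLimiter = (maxRequests) ->
  queue = []
  running = 0
  start = (f, args, callback) ->
    running++
    f args..., (results...) ->
      running--
      waiting = queue.shift()          # oldest waiting task first
      start waiting... if waiting?
      callback results...
  please = (f, args..., callback) ->
    if running < maxRequests
      start f, args, callback
    else
      queue.push [f, args, callback]
  please

# Fake asynchronous job: it records that it started and parks its callback,
# so the demo decides when each "request" finishes.
started = []
pending = []
fakeGet = (url, callback) ->
  started.push url
  pending.push -> callback "body of #{url}"

please = makeLimiter 2
received = []
for url in ['/a', '/b', '/c']
  please fakeGet, url, (body) -> received.push body

console.log started    # '/a' and '/b' have started, '/c' is queued
pending.shift()()      # finish '/a', which lets '/c' start
console.log started
console.log received
```

Only two jobs start up front; the third starts the moment the first one calls back.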
Next, we need a routine to pull HTML files down. It also performs rudimentary logging, which is the program’s only output.
    # Perform web requests
    count = 0
    get = (url, callback) ->
      req = client.request 'GET', url, {'host' : site}
      count++
      console.log "Requested #{count} : #{url}"
      result = ""
      req.on 'response', (res) ->
        res.setEncoding 'utf8'
        res.on 'data', (chunk) -> result += chunk
        res.on 'end', ->
          count--
          console.log "Received #{count} : #{url}"
          callback result
      req.end()
Finally, we need some actual spidering to take place. The spider consists of three functions.
- isInternal is used to prevent our spider from attempting to slurp the internet.
- standardize is used to prevent us pulling down the same page twice.
- The spider function itself just asks the work manager to get a page. When the page is pulled down, it searches it for URLs, sets the external ones aside, standardizes the internal ones and spiders any it hasn’t already seen.
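To make “the same page twice” concrete, here is a small worked example of standardization. It is a sketch: it repeats the standardize routine in trimmed form so that it runs on its own, and the sample links and the '/blog/' base are made up for illustration.

```coffeescript
# Trimmed copy of the standardize routine, repeated so this runs on its own.
parentRegExp = new RegExp "[^/]+/[.][.]/"
baseRegExp = new RegExp "//[^/]+(/.*)"

standardize = (url, base) ->
  url = url.replace(/#.*/, '')                  # ignore #fragments
  url = url.replace(/\/default[.]aspx$/, '/')   # ignore default.aspx
  url = url.replace(/%20/g, '+')                # spaces become +
  url = url.replace(parentRegExp, '') while parentRegExp.test url
  m = baseRegExp.exec url
  if m
    m[1]            # absolute URL: keep only the path
  else if url[0] is '/'
    url             # already rooted
  else
    base + url      # relative: resolve against the current folder

# Made-up sample links, as they might appear on a page under /blog/
samples = [
  'http://yourwebsite.com/blog/default.aspx#comments'
  '/blog/#top'
  'posts/../about%20me/'
]
results = (standardize s, '/blog/' for s in samples)
console.log results
```

The first two links collapse to the same key, '/blog/', so the page behind them would only be requested once; the third resolves to '/blog/about+me/'.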
    # The actual spider logic

    # Detect internal links
    prefixes = (new RegExp "^" + p, "i" for p in ['http','ftp','javascript'])
    isInternal = (url) ->
      return true if url.indexOf(baseUrl) == 0
      for p in prefixes
        return false if p.test url
      return 0 == url.indexOf 'javascript-error'

    # Standardize urls
    parentRegExp = new RegExp "[^/]+/[.][.]/"
    baseRegExp = new RegExp "//[^/]+(/.*)"
    standardize = (url, base) ->
      url = url.replace /#.*/, '' # ignore #urls
      url = url.replace /\/default[.]aspx$/, '/' # ignore default.aspx
      url = url.replace /%20/g, '+' # standardize spaces as +s
      # take out parent references (replace returns a new string, so assign it)
      url = url.replace parentRegExp, '' while parentRegExp.test url
      m = baseRegExp.exec url
      if m && m.length
        m[1]
      else if url[0] == '/'
        url
      else
        base + url

    # Spider urls
    internalUrls = {}
    externalUrls = {}
    # A function, so that every page gets fresh regular expressions
    regexps = ->
      new RegExp p, "gi" for p in [
        'href=\'([^\'<>]+)\'',
        'href="([^"<>]+)"',
        'href=([^\'" <>]+)'
      ]
    spider = (url) ->
      base = standardize /.*\//.exec(url)[0], ''
      onContent = (content) ->
        for re in regexps()
          while match = re.exec content
            url = match[1].toLowerCase()
            if isInternal url
              url = standardize url, base
              unless internalUrls[url]
                internalUrls[url] = true
                spider url
            else
              externalUrls[url] = true
      wm.please get, url, onContent

    spider rootUrl
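To see what the three href patterns actually pick up, here is a small sketch run against a made-up HTML fragment. The extractLinks helper and the sample string are mine, not part of the spider; the patterns are the same three (single-quoted, double-quoted and unquoted attribute values).

```coffeescript
# The three href patterns: single-quoted, double-quoted, unquoted.
patterns = [
  'href=\'([^\'<>]+)\''
  'href="([^"<>]+)"'
  'href=([^\'" <>]+)'
]

# Fresh regular expressions on every call.
regexps = -> (new RegExp p, 'gi' for p in patterns)

# Hypothetical helper: collect every captured link, lowercased.
extractLinks = (content) ->
  links = []
  for re in regexps()
    while match = re.exec content
      links.push match[1].toLowerCase()
  links

# Made-up page fragment with one link in each quoting style.
sample = '<a href="/Blog/Post-1">one</a> <a href=\'../about/\'>two</a> <a href=http://example.com/>three</a>'
links = extractLinks sample
console.log links
```

Note that the links come back grouped by pattern rather than in page order: '../about/' first, then '/blog/post-1', then 'http://example.com/'. For the spider that makes no difference, since each URL is only checked against the set of pages already seen.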
It’s interesting to observe that in Clojure, “please get, url, onContent” would be “please get url onContent” whilst “get url, onContent” would be “get url onContent”. The Lisp style doesn’t think you’re doing anything weird here. Most other languages do.
Just shy of 100 lines, including a work manager and comments. Of course, by the time you’ve compiled it to JavaScript, it’s about 150 with no comments. About 35 of those extra lines read “close curly brace”.
UPDATE: Embarrassingly, the original version of this code didn’t quite work. I clearly didn’t pay enough attention to exactly which version I put up. The problem was that the regular expressions were static, which means you can only use them once in JavaScript. CoffeeScript made this an easy bug to fix: putting a “->” in front of the regexps value turns it into a function, and adding () where it is used invokes it.
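The post doesn’t spell out the mechanism, but one well-known way a shared regular expression misbehaves in JavaScript is that a regex with the g flag remembers where its last match ended (lastIndex) and resumes from there on the next exec call, even against a different string. The sketch below, with made-up sample strings, shows that behaviour and how building a fresh regex per use avoids it.

```coffeescript
pattern = 'href="([^"]+)"'

# One shared global regex: the second exec resumes where the first stopped.
shared = new RegExp pattern, 'g'
first = shared.exec '<a href="/one">'
resumeAt = shared.lastIndex             # position just after the first match
second = shared.exec '<a href="/two">'  # starts scanning at resumeAt, finds nothing

console.log first[1]    # /one
console.log resumeAt    # 14
console.log second      # null

# A function that builds a fresh regex each time has no such memory.
fresh = -> new RegExp pattern, 'g'
third = fresh().exec '<a href="/one">'
fourth = fresh().exec '<a href="/two">'
console.log third[1], fourth[1]    # /one /two
```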
As shown in the comments, I used to pass a “c” flag to the regular expressions, which is completely useless. That has been removed.
The only other change uses an undocumented feature of CoffeeScript. The code originally read “t = pop();t() if t?”. Now, in CoffeeScript, “t?()” calls t if it is not null. This allows the two lines to be simplified to “pop()?()”, which is either elegant or hermetic, depending on your experience.
@Maksym Sorry it’s taken so long to respond. The code’s working again now, with minor modifications. The good news is that it wasn’t an instability in CoffeeScript versions, it was my own idiocy. Well, it’s good news for CoffeeScript users, anyway…
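For anyone who hasn’t met it, here is a minimal sketch of the two forms side by side, using a made-up queue of one function. As far as I know the compiled check is slightly stricter than “not null”: the call only happens if the value is a function.

```coffeescript
queue = []
calls = []
queue.push -> calls.push 'long form'

# Long form: pop, then call only if something came back.
t = queue.pop()
t() if t?

# Short form on a now-empty queue: pop() gives undefined, nothing is called
# and nothing throws.
nothing = queue.pop()?()

# Short form with something in the queue: popped and called in one go.
queue.push -> calls.push 'short form'
queue.pop()?()

console.log calls      # both functions ran, in order
console.log nothing    # undefined
```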