Plan Zero

Spidering the web with CasperJS


Figure: The CasperJS web spider in action

For a project I've been working on I needed a simple spider which would, given a start URL, recursively collect all the URLs it could find.

In the past I've used the excellent PhantomJS headless WebKit browser for automation, but writing complex navigation scenarios can be a bit long-winded. Enter CasperJS. Built on top of PhantomJS, it simplifies the process and provides some nice syntactic sugar to boot.

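The spider I wrote grabs the first page and finds all of its links. Each new URL is pushed onto the end of an array, and the next URL to visit is shifted off the front, so the array behaves as a queue and links are followed in the order in which they were found. Going recursive is key: the call that opens a page doesn't block, so the spider only moves on to the next URL from inside the callback for the current page. A plain loop would run through the list before any page had finished loading, and so before any new links had been found.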
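To make the push/shift idea concrete, here is a minimal sketch in plain JavaScript. It is not the spider itself and doesn't use CasperJS at all: the `site` map, `fetchLinks` and `crawl` are all made-up names for illustration, and the "pages" are just an in-memory object rather than real HTTP requests.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/b', '/c'],
  '/b': ['/'],
  '/c': []
};

// Stand-in for a page load that reports its links through a callback.
// Here it calls back immediately; a real page load would call back later.
function fetchLinks(url, callback) {
  callback(site[url] || []);
}

// Visit pages in the order their URLs were discovered.
function crawl(startUrl, fetch, done) {
  const pending = [startUrl];        // URLs waiting to be visited
  const seen = new Set([startUrl]);  // URLs already queued, to avoid repeats
  const visited = [];                // URLs in the order they were visited

  function next() {
    if (pending.length === 0) {
      done(visited);
      return;
    }
    const url = pending.shift();     // take the oldest pending URL
    fetch(url, function (links) {
      visited.push(url);
      links.forEach(function (link) {
        if (!seen.has(link)) {
          seen.add(link);
          pending.push(link);        // newly found URLs go on the end
        }
      });
      next();                        // recurse only once this page is done
    });
  }

  next();
}

crawl('/', fetchLinks, function (visited) {
  console.log(visited.join(' '));    // prints "/ /a /b /c"
});
```

The important bit is that `next()` is called from inside the callback rather than from a loop, so each page's links are in the queue before the following URL is chosen.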

The following code is the core spider, which should be easy to adapt for most purposes: