crawlers and spiders

It’s Monday morning, and I’ve just reviewed the suggestions I put forward earlier for storing the team’s information. My boss is going to go with me on DokuWiki, but for some reason the lack of a database backend is making him nervous. The search functionality is currently absolutely fine, but that’s with 50 docs and we might need to handle 5000. We need a spider.

The thing with DokuWiki is that it stores its information in files, which is fine because it is a series of pages, or documents, and that’s what file systems were invented for! However, if you want to look for a particular word or phrase then every one of those documents has to be opened, read and closed … and that’s slow. So I’m looking for a thing of some kind which will index the information in those files (without choking on the markup) while I’m not looking, and then deliver very fast search results.
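To make the idea concrete, here is a minimal sketch of the kind of index I mean: read every page once, remember which pages contain which words, and answer searches from that lookup table instead of reopening the files. It is an illustration only, not how DokuWiki or any of the tools below actually work. The function names are made up, it assumes the pages are plain `.txt` files under one directory, and it sidesteps the markup crudely by keeping only runs of letters and digits.

```python
import re
from collections import defaultdict
from pathlib import Path

# Keeping only alphanumeric runs is a crude way to ignore wiki markup
# such as ====== headings ====== or [[links]]; it is not a real parser.
WORD = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and return its words, dropping punctuation."""
    return WORD.findall(text.lower())


def build_index(pages_dir):
    """Read every .txt page once and map each word to the pages containing it."""
    index = defaultdict(set)
    for path in Path(pages_dir).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for word in tokenize(text):
            index[word].add(path.stem)
    return index


def search(index, query):
    """Return the sorted names of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

The slow part (opening every file) happens once in `build_index`, which could run overnight; `search` then only touches the in-memory table. A real tool would also store the index on disk, rank results and handle phrases, none of which this sketch attempts.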

I haven’t got anything working yet, so this is kind of theoretical, and I’ll come back and update this when I get a solution in place. Here’s the current shortlist.

ZSearch from the Zend Framework

Except it needs PHP5 and we’re running PHP4. Not sure whether I should try to work with it or what.

Xapian

I’ve come across http://www.xapian.org/ which looks promising … except I’m working on Windows and I’ll have to compile stuff, and the IT proxy isn’t working and the main one won’t let me download executables. Back to some real work and save this project for another day!

EDIT: you can read the follow-up articles here and here