Lucene search functionality – following on from crawlers and spiders

The search functionality works!!

I have PHP 5.1.2 on a Windows XP machine (I know, I know, it’s my work desktop) and Zend Framework preview release 1.3. I’m using it to index the files in a local copy of DokuWiki – this is an easy starting point as the pages are text files and all held in directories.

zend_lib.php
I have a library file which holds some settings used both when building the index and when searching it. This file also includes a function which turns the URL of the returned page into a path that DokuWiki can understand.

zend_text.php
The indexing page starts from $startpoint (set in zend_lib.php) and recurses into directories indexing file size, url and content of each file.

zend_text2.php
The search page shows some searching tips and a search box. When a search is performed, it prints links to the matching pages in DokuWiki along with their relevance scores, ordered by relevance.

I’m so excited that I’m including the files here for now, and I might think about turning this into a proper DokuWiki plugin. Personally I like their current search functionality, but this exercise is a PHB requirement. Here are the files:

zend_lib.php

zend_text.php

zend_text2.php

The Beauty of Vim

I work with vim and it’s fabulous. Although I’ve been a casual Linux shell user for some years, until now I’ve never had to get to grips with vim as my main editor, one I’m using eight hours a day. And I love it.

Cheat Sheets

It’s vital to get a good cheat sheet to start with. This is like a menu of commands to remind you how to do things. Then when you think “wouldn’t it be cool if this program did …”, you can look up how to do it (and I guarantee vim has the feature you wanted, whatever it is). Some of my favourite tricks not always listed on cheat sheets are:

gv reselect your most recent selection
% when on a bracket (either ( or { ), jump to its partner

For more cheat sheets, probably best to look on my del.icio.us page (here and linked in left hand bar) as I keep my favourite links of the moment updated there

Colour Syntax

I have finally managed to get my vim working with colour highlighting which is making my life much easier (and prettier, of course). I’m running vim 6.2 on AIX 5.3 and found that the only way to get my vim into colourful mode was to turn on the syntax and set my terminal to dtterm.

To change your terminal type, at the prompt type:

export TERM=dtterm

Then when you do

echo $TERM

it should tell you that your term type is now dtterm. Unfortunately the change in terminal type made my function keys stop working (argh – see my earlier post on this topic). As a compromise I have aliased vim to set the terminal type when it runs, by adding the following line to my .kshrc file (if you’re running bash then add it to .bashrc instead):

alias vim='vim -T dtterm'

Arrow keys

I normally use h,j,k,l to navigate in vim (left, down, up and right respectively), but I get stressed by the cursor not wrapping at the end of the lines. I googled for the problem and found that adding this to my .kshrc helped:

set -o emacs
alias __A=$(print '\0020') # ^P = up = previous command
alias __B=$(print '\0016') # ^N = down = next command
alias __C=$(print '\0006') # ^F = right = forward a character
alias __D=$(print '\0002') # ^B = left = back a character
alias __H=$(print '\0001') # ^A = home = beginning of line

Regex

Regular expressions in vim are more powerful than I can imagine, and I’m loving the find-and-replace, especially because you can use the text you matched in your replacement expression by typing \0 as part of the expression. It’s so powerful.
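
As a rough illustration outside vim, sed offers the same idea, with & standing for the whole match (vim accepts & as well as \0). This is only a sketch: the function name wrap_numbers and the sample text are made up for the example.

```shell
# Wrap every run of digits in square brackets.
# "&" in the replacement stands for the whole matched text.
wrap_numbers() {
    sed 's/[0-9][0-9]*/[&]/g'
}

echo "order 42 shipped in 3 days" | wrap_numbers
# prints: order [42] shipped in [3] days
```

The equivalent in vim would be something like :%s/[0-9][0-9]*/[\0]/g.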

For my next trick, I will figure out how to group parts of my pattern to use bits of them in the replacement – I feel a tutorial coming on.
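
A minimal sketch of grouping, shown here with sed rather than vim: \( and \) mark a group, and \1, \2 refer back to those groups in the replacement. Vim’s default pattern syntax uses the same notation, for example :s/\(foo\) \(bar\)/\2 \1/. The name swap_pair is hypothetical.

```shell
# Swap the two sides of a key=value pair.
# \1 is the text matched by the first \( \) group, \2 the second.
swap_pair() {
    sed 's/\([a-z]*\)=\([a-z]*\)/\2=\1/'
}

echo "left=right" | swap_pair
# prints: right=left
```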

Opera’s Favourite Icon

I’ve been getting wound up recently by Opera spamming my Apache logs with errors about missing favicon.ico files. So here are some instructions for removing this annoying default behaviour:

http://groups.google.co.uk/group/opera.general/browse_thread/thread/601683ed17b42762/ac5685ea6a310180?lnk=st&q=opera+favicon+request+error&rnum=1#ac5685ea6a310180

Symptom

You’ll spot the problem because there will be lines in the Apache error.log file which look like this (this error is from a Windows machine):

File does not exist: C:/www/favicon.ico, referer:

Normally, on a public website, I’d ignore this unless you do have a favicon set up. However, I’m developing locally, so it’s my copy of Opera that is causing this crud in the files.
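
To get a feel for how much of the log is favicon noise, something like the following could be used. It is a sketch only: count_favicon_errors is a made-up name, and the log location depends on how Apache is installed.

```shell
# Count lines in an Apache error log that complain about a missing favicon.ico.
# $1 = path to the error log (the location varies by installation).
count_favicon_errors() {
    grep -c 'File does not exist: .*favicon\.ico' "$1"
}
```

It would be called as count_favicon_errors followed by the path to your own error.log.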

crawlers and spiders – take two

I made some progress with getting Xapian set up, but not to the point where I was all ready to go. I’m not all that familiar with C++ and somehow I lost the will to live somewhere along the way this afternoon.

PHPDig

I should mention PHPDig here because it is a really good product and widely used. However, a MySQL database is a no-no, as I mentioned earlier.

This leads me back around to ….

ZSearch

The Lucene implementation in the Zend Framework for PHP 5. Wish me luck!

EDIT You can read the follow-up post here

TikiWiki and Oracle

Well, TikiWiki claims to support Oracle … great! So I’ll install it, and try it.

(insert comedy failure noise here)

The installation doesn’t work! Mostly because Oracle object names can be at most thirty characters long, and this product doesn’t respect that when installing on Oracle, so action is needed.
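
One way to spot the offending names before running an install script is sketched below. It assumes a grep that supports -o (GNU and BSD grep do), and the function name find_long_identifiers is hypothetical. It is deliberately crude: it lists every word-like token longer than thirty characters, so long words inside comments or string literals will show up too.

```shell
# List every identifier-like token longer than 30 characters in a SQL file.
# $1 = path to the SQL script to check.
find_long_identifiers() {
    grep -oE '[A-Za-z_][A-Za-z0-9_]*' "$1" | awk 'length($0) > 30' | sort -u
}
```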

Here’s the file of corrected statements I ran to get all the tables created successfully and also reinstate triggers and indexes that failed (I’m not promising it’s perfect). Where I needed to modify a corresponding PHP page, that’s documented as well. I hope this helps someone in the future – me, next time I need to do this, perhaps?

tiki_installation_corrections.txt

crawlers and spiders

It’s Monday morning, and I’ve just reviewed the suggestions for storing information for the team that I put forward earlier. My boss is going to go with me on DokuWiki, but for some reason the lack of a database backend is making him nervous. The search functionality is currently absolutely fine, but that’s with 50 docs and we might need to handle 5000. We need a spider.

The thing with DokuWiki is that it stores its information in files, which is fine because it is a series of pages, or documents, and that’s what file systems were invented for! However if you want to look for a particular word or phrase then you will need to open and close each one of those documents … and that’s slow. So I’m looking for a thing of some kind which will index my information out of those files (not choking on the markup) while I’m not looking, and then deliver very fast search results.
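
The idea of indexing ahead of time can be sketched with a toy word-to-file index. This is not how Lucene or Xapian work internally, just an illustration of the principle: read every page once, store which words appear where, and answer searches from that list instead of reopening the pages. The names build_index and search_index, and the assumption that pages are .txt files, are mine.

```shell
# Build a simple index: one "word<TAB>file" line per distinct word per page.
# $1 = directory of text pages, $2 = index file to write.
build_index() {
    find "$1" -type f -name '*.txt' | while IFS= read -r page; do
        # Split on anything that is not a letter or digit, lowercase, de-duplicate.
        tr -cs '[:alnum:]' '\n' < "$page" \
            | tr '[:upper:]' '[:lower:]' \
            | sort -u \
            | awk -v f="$page" 'NF { print $0 "\t" f }'
    done > "$2"
}

# Look a word up in the index without touching the pages themselves.
# $1 = index file, $2 = word to search for (case-insensitive).
search_index() {
    awk -F '\t' -v w="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')" \
        '$1 == w { print $2 }' "$1"
}
```

The slow part (reading every file) happens once in build_index, which could run overnight; search_index then only scans a single file. Markup is not stripped here, so wiki syntax characters simply act as word separators.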

I haven’t got anything working yet so this is kind of theoretical and I’ll come back and update this when I get a solution in place, but here’s the current shortlist.

ZSearch from the Zend Framework
Except it needs PHP5 and we’re running PHP4. Not sure whether I should try to work with it or what.

Xapian

I’ve come across http://www.xapian.org/ which looks promising … except I’m working on Windows and I’ll have to compile stuff, and the IT proxy isn’t working and the main one won’t let me download executables. Back to some real work and save this project for another day!

EDIT you can read the follow-up articles here and here

Thoughts of Wikis

I’m implementing a new information-keeping system where I work, and trying to find something that will fit in with a number of requirements. Here’s a quick summary of the task and how I’m getting along:

Requirements

  • Allow text
  • Allow attachments (files and pictures)
  • integrate with existing extranet signon
  • re-brand to match extranet
  • allow conversion of existing files from
    1. knowledgebase
    2. dokuwiki
    3. html pages
  • fine-grained access control for groups of users – it, programmers, symphony users, customers, etc
  • consideration for scaling of solution
  • lowest possible effort needed to edit/add info

Preferences

  • Oracle-driven if a database is needed
  • powerful search functionality
  • ideally free!

Products

TikiWiki

The only PHP-driven Oracle-backed product on the market. TikiWiki is relatively straightforward to install. It is very complex for the purpose, as it is fully-fledged groupware with a CMS; the wiki is just one module.

pros

  • oracle-driven
  • written in php
  • skinnable
  • support for output in PDF

cons

  • overkill
  • no fine-grained access control
  • no hooks for adding our authentication or interface with existing standards
  • rather buggy under Oracle (especially the installation!)

MediaWiki

Experimental Oracle support apparently – testing with MySQL

MediaWiki is the engine behind Wikipedia – it is widely used and understood. It is written in PHP and traditionally runs on MySQL; there is some support for Oracle, but this is not widely used and not really supported by the project developers.

pros

  • widely used product – plenty of community support
  • good search functionality
  • fine-grained access control (hides things you don’t have access to – very nice)
  • LDAP authentication supported

cons

  • horrible markup (not very strict, not block-level, hard to parse or convert from)
  • difficult to convert existing documents
  • standard of Oracle implementation unknown – I can’t get it to install! Likely to be poor and/or patchy

DokuWiki

Simple, text-backed storage. DokuWiki is the first information management system we implemented at Symphony to test the idea. It uses flat file storage, which we saw as a potential storage problem; however, I’m seeing examples of people saying it’s fine up to 40k pages or so. http://www.pmwiki.org/wiki/PmWiki/FlatFileAdvantages

pros

  • lightweight, easy to install and brand
  • text files can be readable with or without frontend
  • simple and clear markup
  • easy to convert documents to it
  • syntax highlight in code blocks

cons

  • poor search functionality, also slow due to the file access. Could use external spider e.g. http://www.phpdig.net
  • potentially poor scalability as the file structure grows

I’ve got a Google Analytics account

Yay! My invite arrived this morning for Google Analytics. I’ve got this site and one other all hooked up to it and will wait and see how it all turns out. Both sites are very low traffic but that’s OK.

So far the features are good. I needed a Google account to log in with, and I can give access to others as well – they also need a Google account. I’ve given my co-owner of one of the sites access to the reports, which was very easy.

I’ll write more about how I get on once I get some statistics to look at.

Textile Knowledge Nuggets

I’ve run into some problems formatting an article about markup, which I was writing in textpattern, which uses markup …. you can see where this is going. Well I learned some new tricks!

escaping from textile

To prevent a block from being processed, just use double equals signs around what you are interested in (don’t know how to show you an example without breaking stuff so I won’t try!)

This is much better than the results from ... or bc. where your code can still get processed.

to make a block style persist

When using a block quote to show lots of lines of code, or verses of a song, use a double dot, like this:

bq..

Then everything else you write

Even if it has line breaks

Will carry on being in that style until you start a new style

p.

such as a paragraph

Many thanks to AllPhilosophy for these! http://allphilosophy.com/home/guide/rich