crawlers and spiders – take two

I made some progress with getting Xapian set up, but not to the point where I was all ready to go. I’m not all that familiar with C++ and somehow I lost the will to live somewhere along the way this afternoon.

PHPDig

I should mention PHPDig here because it is a really good product and widely used. However, a MySQL database is a no-no, as I mentioned earlier.

This leads me back around to ….

ZSearch

The Lucene implementation in the Zend Framework for PHP 5. Wish me luck!

EDIT You can read the follow-up post here

TikiWiki and Oracle

Well, TikiWiki claims to support Oracle … great! So I’ll install it, and try it.

(insert comedy failure noise here)

The installation doesn’t work! Mostly because Oracle object names can be no longer than thirty characters, and this product doesn’t respect that when installing on Oracle, so action is needed.

Here’s the file of corrected statements I ran to get all the tables created successfully and also reinstate triggers and indexes that failed (I’m not promising it’s perfect). Where I needed to modify a corresponding PHP page, that’s documented as well. I hope this helps someone in the future – me, next time I need to do this, perhaps?

tiki_installation_corrections.txt
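A check along these lines could flag the offending names before running an install script. This is only a rough sketch in Python, not what I actually used: the regular expression covers just a few CREATE statement types, and the table, index and trigger names in the sample are made up for illustration.

```python
import re

# The thirty-character limit described above; adjust if your Oracle version differs.
MAX_LEN = 30

# Captures the object type and name in a handful of CREATE statements.
CREATE_RE = re.compile(
    r"\bCREATE\s+(?:OR\s+REPLACE\s+)?(?:UNIQUE\s+)?"
    r"(TABLE|INDEX|SEQUENCE|TRIGGER)\s+\"?([A-Za-z_][\w$#]*)\"?",
    re.IGNORECASE,
)


def find_long_identifiers(sql, max_len=MAX_LEN):
    """Return (object_type, name) pairs whose name is longer than max_len."""
    return [
        (kind.upper(), name)
        for kind, name in CREATE_RE.findall(sql)
        if len(name) > max_len
    ]


# Hypothetical statements, not taken from the real TikiWiki installer.
sample = """
CREATE TABLE tiki_pages (id NUMBER);
CREATE INDEX tiki_user_assigned_modules_position_idx ON tiki_pages (id);
CREATE OR REPLACE TRIGGER short_trg BEFORE INSERT ON tiki_pages
FOR EACH ROW BEGIN NULL; END;
"""

for kind, name in find_long_identifiers(sample):
    print(f"{kind} {name} is {len(name)} characters long")
```

Anything it reports needs renaming in the SQL, and in any PHP page that refers to the same name.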

crawlers and spiders

It’s Monday morning, and I’ve just reviewed the suggestions for storing information for the team that I put forward earlier. My boss is going to go with me on DokuWiki but for some reason the lack of a database backend is making him nervous. The search functionality is currently absolutely fine but that’s with 50 docs and we might need to handle 5000. We need a spider.

The thing with DokuWiki is that it stores its information in files, which is fine because it is a series of pages, or documents, and that’s what file systems were invented for! However if you want to look for a particular word or phrase then you will need to open and close each one of those documents … and that’s slow. So I’m looking for a thing of some kind which will index my information out of those files (not choking on the markup) while I’m not looking, and then deliver very fast search results.
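To make the idea concrete, here is a minimal sketch in Python of the kind of index such a tool builds: each word maps to the set of pages containing it, so a search never has to reopen the files. It is an illustration under assumptions, not how any of the products below work. The function names are mine, the markup handling is naive (it simply ignores anything that isn’t a letter or digit), and the loader assumes the pages are plain .txt files under one directory.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and keep only runs of letters and digits."""
    return WORD_RE.findall(text.lower())


def build_index(pages):
    """Map each word to the set of page names that contain it.

    pages is a dict of page name -> page text.
    """
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


def load_pages(root):
    """Read every .txt file under root (an assumed layout) into a dict."""
    root = Path(root)
    return {
        str(path.relative_to(root)): path.read_text(encoding="utf-8", errors="ignore")
        for path in root.rglob("*.txt")
    }


# Two made-up pages standing in for real wiki files.
pages = {
    "start": "====== Welcome ====== The fax server uses PCL.",
    "howto:fax": "Restart the fax server from the admin console.",
}
index = build_index(pages)
print(sorted(search(index, "fax server")))
```

The slow part (reading every file) happens once, when the index is built; a real spider would also keep the index on disk and update it as pages change.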

I haven’t got anything working yet so this is kind of theoretical and I’ll come back and update this when I get a solution in place, but here’s the current shortlist.

ZSearch from the Zend Framework
Except it needs PHP5 and we’re running PHP4. Not sure whether I should try to work with it or what.

Xapian

I’ve come across http://www.xapian.org/ which looks promising … except I’m working on Windows and I’ll have to compile stuff, and the IT proxy isn’t working and the main one won’t let me download executables. Back to some real work and save this project for another day!

EDIT you can read the follow-up articles here and here

Thoughts of Wikis

I’m implementing a new information-keeping system where I work, and trying to find something that will fit in with a number of requirements. Here’s a quick summary of the task and how I’m getting along:

Requirements

  • Allow text
  • Allow attachments (files and pictures)
  • integrate with existing extranet signon
  • re-brand to match extranet
  • allow conversion of existing files from
    1. knowledgebase
    2. dokuwiki
    3. html pages
  • fine-grained access control for groups of users – it, programmers, symphony users, customers, etc
  • consideration for scaling of solution
  • lowest possible effort needed to edit/add info

Preferences

  • Oracle-driven if a database is needed
  • powerful search functionality
  • ideally free!

Products

TikiWiki

The only PHP-driven, Oracle-backed product I could find. TikiWiki is relatively straightforward to install. It is very complex for our purpose, as it is fully-fledged groupware with a CMS and the wiki is just one module.

pros

  • oracle-driven
  • written in php
  • skinnable
  • support for output in PDF

cons

  • overkill
  • no fine-grained access control
  • no hooks for adding our authentication or interface with existing standards
  • rather buggy under Oracle (especially the installation!)

MediaWiki

Experimental Oracle support apparently – testing with MySQL

MediaWiki is the engine behind Wikipedia – it is widely used and understood. It is PHP-driven and traditionally runs on MySQL. There is some support for Oracle, but this is not widely used and not really supported by the project developers.

pros

  • widely used product – plenty of community support
  • good search functionality
  • fine-grained access control (hides things you don’t have access to – very nice)
  • LDAP authentication supported

cons

  • horrible markup (not very strict, not block-level, hard to parse or convert from)
  • difficult to convert existing documents
  • standard of Oracle implementation unknown – I can’t get it to install! Likely to be poor and/or patchy

DokuWiki

Simple, text-backed storage. DokuWiki is the first information management system we implemented at Symphony to test the idea. It uses flat file storage, which we saw as a potential scaling problem. However, I’m seeing examples of people saying it’s fine up to 40k pages or so. http://www.pmwiki.org/wiki/PmWiki/FlatFileAdvantages

pros

  • lightweight, easy to install and brand
  • text files can be readable with or without frontend
  • simple and clear markup
  • easy to convert documents to it
  • syntax highlight in code blocks

cons

  • poor search functionality, also slow due to the file access. Could use external spider e.g. http://www.phpdig.net
  • potentially poor scalability as the file structure grows

I’ve got a Google Analytics account

Yay! My invite arrived this morning for Google Analytics. I’ve got this site and one other all hooked up to it and will wait and see how it all turns out. Both sites are very low traffic but that’s OK.

So far the features are good. I needed a Google account to log in with and can give access to others as well – they also need a Google account. I’ve given my co-owner of one of the sites access to the reports, which was very easy.

I’ll write more about how I get on once I get some statistics to look at.

Textile Knowledge Nuggets

I’ve run into some problems formatting an article about markup, which I was writing in Textpattern, which uses markup …. you can see where this is going. Well, I learned some new tricks!

escaping from textile

To prevent a block from being processed, just use double equals signs around what you are interested in (don’t know how to show you an example without breaking stuff so I won’t try!)

This is much better than the results from ... or bc. where your code can still get processed.

to make a block style persist

When using a block quote to show lots of lines of code, or verses of a song, use a double dot, like this:

bq..

Then everything else you write

Even if it has line breaks

Will carry on being in that style until you start a new style

p.

such as a paragraph

Many thanks to AllPhilosophy for these! http://allphilosophy.com/home/guide/rich

Apache FOP: formatting objects is fun

I’ve been working on a tricky problem at work this week (and last week as well actually, it’s been really really tricky in fact). We need to be able to output a form in both PDF (Portable Document Format) and PCL (Printer Command Language), because our fax system can only handle PCL.

Ghostscript

I had a look at using Ghostscript. It’s been around a while and is widely used, freely available and, by all accounts, stable. I had some trouble getting it working initially but I think it would have done the job.

Apache FOP

The Apache Project has a project called FOP (Formatting Objects Processor), which is part of their XML Graphics project. It’s a module that takes a particular type of XML called Formatting Objects (now a W3C Recommendation, formally named XSL-FO and referred to here as xml:fo), which represents a document together with information about its presentation.

Since xml:fo is a recognised standard, it’s a great format to choose to implement the conversions to PCL and PDF. Other output formats are also available, with more on the way, so it’s an application that can be adapted to meet other needs as they arise.

XSL transformations

Since xml:fo is a standard and it’s XML, it should be possible to get any number of XML formats (including Open Office or Word XML) transformed into it using an XSL (eXtensible Stylesheet Language) stylesheet. I tried out a couple of these from http://www.antennahouse.com/. Although they worked well with the sample files, I had trouble with the xml:fo produced from my own XHTML files.

AntennaHouse clearly have a lot of knowledge in this area though, and their site is well worth a visit for background reading on this topic. I suspect that part of the problem was that FOP only has a partial implementation of the xml:fo specification, so although I was feeding it valid xml:fo, it didn’t know what to do with all of it. There is a rewrite in progress so I expect that newer versions will be much more robust.

Final Solution

In the end (since I only wanted a simple one-page form), I settled on writing the xml:fo by hand, producing really great results in both formats and with images as well. I’ve also been asked to look into programs to generate this output. They’re mostly commercial, but if I come across anything interesting I’ll add it here. Apache FOP is a great project and I hope it doesn’t lose its momentum!
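For anyone wondering what a one-page form looks like in this format, here is a small Python sketch that builds a minimal Formatting Objects document. It isn’t my actual form: the function name, the master name "form-page", the page size and the text are all made up for illustration. The element names and the namespace are the standard ones from the specification.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Qualify a tag name with the Formatting Objects namespace."""
    return f"{{{FO_NS}}}{tag}"


def build_form(title, lines):
    """Return a one-page FO document as a string (hypothetical helper)."""
    root = ET.Element(fo("root"))

    # Page layout: a single A4 page master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "form-page",  # arbitrary name, referenced below
        "page-height": "29.7cm",
        "page-width": "21cm",
        "margin-top": "2cm",
        "margin-bottom": "2cm",
        "margin-left": "2cm",
        "margin-right": "2cm",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: one page sequence flowing into the body region.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "form-page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    heading = ET.SubElement(flow, fo("block"), {
        "font-size": "16pt",
        "font-weight": "bold",
        "space-after": "12pt",
    })
    heading.text = title

    for line in lines:
        block = ET.SubElement(flow, fo("block"), {"font-size": "11pt"})
        block.text = line

    return ET.tostring(root, encoding="unicode")


document = build_form("Fax cover sheet", ["To: Accounts", "Pages: 1"])
print(document)
```

Saved to a file, this is the sort of input FOP takes. As far as I know its command line accepts the file with -fo and an output option such as -pdf or -pcl, but check the documentation for your version.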

Ringing the (password) changes

I have milestones in my working life, and I mark time by them. They help me to think quantitatively about how much time has passed since a particular point in time or event. It’s helpful because it enables me to think clearly about whether a colleague would have been involved in a particular activity at that particular time. And the nature of the milestone? Password changes!

I am pretty conscientious about passwords. I always have different ones for different things, with lowercase, numbers, punctuation and sometimes uppercase. As a sufferer of DOuble-CApital-itis, I am not a big fan of uppercase but I make the effort sometimes. Of course there are exceptions, such as the one password I use for all random website registrations, but I’m in good company with that. Because of these password habits, changing a password that I use every day is a big event! I have to think of something that my brain can hold on to, and train myself to type the new one rather than the old. I sometimes change existing passwords for no reason; I just think it’s good practice. What I really hate is being forced to use a password I don’t want, or change it when I am not ready!

I have recently changed employment, no particular reason just the next step on the ladder really. At my old workplace, I typed my password every single time I opened an internet browser, or logged onto another machine. I can’t deal with too many windows on the taskbar so I was opening and closing browsers all day. I must have typed it fifty times most days. The password complexity rules were there, but they didn’t really get in my way. I was forced by the system to change my password every three months. Three months is quite short when you are subconsciously typing that same password in so often! Still, the password change would roll around, marking a change in season, and I’d spend three days swearing at having typed in the wrong password on autopilot. When my password expired with a week of my notice still left to work, my boss (I guess tired of all that swearing) extended the expiry period to save me the pain.

So here I am, bright and enthusiastic in my new job. Day one, I have to choose a new password. No problem. Four weeks later, I get prompted to change my password. OK, well that’s a pain because I find password changes difficult but hey, I’m new, and I’ll just grin and bear it – after all, I don’t have to type my password for the web proxy here, just when I log in or unlock my machine. That’s still quite a few times though as I don’t leave my desk to go anywhere without locking it. So …. you can guess what’s coming next. Eight weeks into the new job and the password change box is back. My mind is too full to manage another “good” password so I try out something insecure – all lower case characters. And it accepts.

There’s something about this “security” which bothers me immensely. Most password setup systems come with tickboxes, to turn on “features”, such as

  • require mixed case
  • require at least one number
  • require some punctuation
  • ban password recycling
  • ban similar passwords
  • force password change

The sysadmin starts to read the list, ticks the top few boxes, decides this is a Good Thing and ticks them all – the system is as secure as possible – Right???

This is how security myths start, and “force password change” is not something where (more often == better). A few months from now, I’m going to be a gibbering wreck, with my plain text password post-it-ed onto my monitor, and not locking the console when I walk away.