crawlers and spiders

It’s Monday morning, and I’ve just reviewed the suggestions I put forward earlier for storing the team’s information. My boss is going to go with me on DokuWiki, but for some reason the lack of a database backend is making him nervous. The search functionality is currently absolutely fine, but that’s with 50 docs and we might need to handle 5000. We need a spider.

The thing with DokuWiki is that it stores its information in files, which is fine because it is a series of pages, or documents, and that’s what file systems were invented for! However, if you want to look for a particular word or phrase then every search has to open and read each one of those documents … and that’s slow. So I’m looking for something that will index my information out of those files (without choking on the markup) while I’m not looking, and then deliver very fast search results.
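To make the idea concrete, here is a rough sketch of that kind of offline index in plain shell. It is not what any of the tools below actually do; it assumes the pages are .txt files under one directory, and the function names and the tab-separated index format are made up for illustration. Splitting on anything that isn’t a letter or digit is a crude way of not choking on the markup.

```shell
#!/bin/sh
# build_index DIR INDEX_FILE (hypothetical helper)
# Writes one "word<TAB>file" line for every distinct word in every .txt page.
build_index() {
    dir=$1
    index=$2
    find "$dir" -type f -name '*.txt' | while IFS= read -r file; do
        # Break the page into lowercase words, one per line, dropping markup
        # characters, then tag each distinct word with the page it came from.
        tr -cs '[:alnum:]' '\n' < "$file" |
            tr '[:upper:]' '[:lower:]' |
            sort -u |
            awk -v f="$file" 'NF { printf "%s\t%s\n", $0, f }'
    done | sort > "$index"
}

# search_index INDEX_FILE WORD (hypothetical helper)
# Prints the pages containing WORD by reading only the index, not the pages.
search_index() {
    word=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v w="$word" '$1 == w { print $2 }' "$1"
}
```

The slow part (reading every page) happens once in build_index, which could run overnight; a search then only scans a single sorted file.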

I haven’t got anything working yet so this is kind of theoretical and I’ll come back and update this when I get a solution in place, but here’s the current shortlist.

ZSearch from the Zend Framework
Except it needs PHP5 and we’re running PHP4. Not sure whether I should try to work with it or what.

Xapian

I’ve come across http://www.xapian.org/ which looks promising … except I’m working on Windows and I’ll have to compile stuff, and the IT proxy isn’t working and the main one won’t let me download executables. Back to some real work and save this project for another day!

EDIT you can read the follow-up articles here and here

Thoughts of Wikis

I’m implementing a new information-keeping system where I work, and trying to find something that will fit in with a number of requirements. Here’s a quick summary of the task and how I’m getting along:

Requirements

  • Allow text
  • Allow attachments (files and pictures)
  • integrate with existing extranet signon
  • re-brand to match extranet
  • allow conversion of existing files from
    1. knowledgebase
    2. dokuwiki
    3. html pages
  • fine-grained access control for groups of users – it, programmers, symphony users, customers, etc
  • consideration for scaling of solution
  • lowest possible effort needed to edit/add info

Preferences

  • Oracle-driven if a database is needed
  • powerful search functionality
  • ideally free!

Products

TikiWiki

The only PHP-driven, Oracle-backed product on the market. TikiWiki is relatively straightforward to install, but it is very complex for our purpose: it is fully-fledged groupware with a CMS, and the wiki is just one module.

pros

  • oracle-driven
  • written in php
  • skinnable
  • support for output in PDF

cons

  • overkill
  • no fine-grained access control
  • no hooks for adding our authentication or interface with existing standards
  • rather buggy under Oracle (especially the installation!)

MediaWiki

Experimental Oracle support apparently – testing with MySQL

MediaWiki is the engine behind Wikipedia, so it is widely used and understood. It is written in PHP and traditionally runs on MySQL; there is some support for Oracle, but this is not widely used and not really supported by the project developers.

pros

  • widely used product – plenty of community support
  • good search functionality
  • fine-grained access control (hides things you don’t have access to – very nice)
  • LDAP authentication supported

cons

  • horrible markup (not very strict, not block-level, hard to parse or convert from)
  • difficult to convert existing documents
  • standard of Oracle implementation unknown – I can’t get it to install! Likely to be poor and/or patchy

DokuWiki

Simple, text-backed storage. DokuWiki is the first information management system we implemented at Symphony to test the idea. It uses flat file storage, which we saw as a potential problem; however, I’m seeing examples of people saying it’s fine up to 40k pages or so. http://www.pmwiki.org/wiki/PmWiki/FlatFileAdvantages

pros

  • lightweight, easy to install and brand
  • text files can be readable with or without frontend
  • simple and clear markup
  • easy to convert documents to it
  • syntax highlight in code blocks

cons

  • poor search functionality, also slow due to the file access. Could use external spider e.g. http://www.phpdig.net
  • potentially poor scalability as the file structure grows

I’ve got a Google Analytics account

Yay! My invite for Google Analytics arrived this morning. I’ve got this site and one other all hooked up to it and will wait and see how it all turns out. Both sites are very low traffic but that’s OK.

So far the features are good. I needed a Google account to log in with, and I can give access to others as well, though they also need a Google account. I’ve given my co-owner of one of the sites access to the reports, which was very easy.

I’ll write more about how I get on once I get some statistics to look at.

Textile Knowledge Nuggets

I’ve run into some problems formatting an article about markup, which I was writing in textpattern, which uses markup …. you can see where this is going. Well I learned some new tricks!

escaping from textile

To prevent a block from being processed, just use double equals signs around what you are interested in (don’t know how to show you an example without breaking stuff so I won’t try!)

This is much better than the results from ... or bc. where your code can still get processed.

to make a block style persist

When using a block quote to show lots of lines of code, or verses of a song, use a double dot, like this:

bq..

Then everything else you write

Even if it has line breaks

Will carry on being in that style until you start a new style

p.

such as a paragraph

Many thanks to AllPhilosophy for these! http://allphilosophy.com/home/guide/rich

Apache FOP: formatting objects is fun

I’ve been working on a tricky problem at work this week (and last week as well actually; it’s been really, really tricky in fact). We need to be able to output a form in both PDF (Portable Document Format) and PCL (Printer Command Language), because our fax system can only handle PCL.

Ghostscript

I had a look at using Ghostscript. It’s been around a while and is widely used, freely available and, by all accounts, stable. I had some trouble getting it working initially, but I think it would have done the job.

Apache FOP

The Apache Project has a project called FOP (Formatting Objects Processor), which is part of their XML Graphics project. It’s a module that takes a particular XML format called Formatting Objects (now a W3C recommendation, formally named XSL-FO; I call it xml:fo in this post), which represents a document together with information about its presentation, and renders it into output formats such as PDF and PCL.

Since xml:fo is a recognised standard, it’s a great format to choose for implementing the conversions to PCL and PDF. Other output formats are also available, with more on the way, so it’s an application that can be adapted to meet other needs as they arise.

XSL translations

Since xml:fo is a standard and it’s XML, it should be possible to get any number of XML formats (including Open Office or Word XML) translated into it using an XSL (eXtensible Stylesheet Language) stylesheet. I tried out a couple of these from http://www.antennahouse.com/. They worked well with the sample files, but I had trouble with the xml:fo they produced from my own xhtml files.

AntennaHouse clearly have a lot of knowledge in this area though, and their site is well worth a visit for background reading on this topic. I suspect that part of the problem was that FOP only has a partial implementation of the xml:fo specification, so although I was feeding it valid xml:fo, it didn’t know what to do with all of it. There is a rewrite in progress so I expect that newer versions will be much more robust.

Final Solution

In the end (since I only wanted a simple one-page form), I settled on writing the xml:fo by hand, which produced really great results in both formats, with images as well. I’ve also been asked to look into programs to generate this output. They’re mostly commercial, but if I come across anything interesting I’ll add it here. Apache FOP is a great project and I hope it doesn’t lose its momentum!
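For anyone wanting to try the hand-written route, the sketch below shows roughly what it looks like. It is not our real form: the file names, page size and text are placeholders, and write_form is just a name I made up. The two fop calls use the launcher script’s -fo, -pdf and -pcl options and are skipped if fop isn’t on the PATH.

```shell
#!/bin/sh
# write_form FILE (hypothetical helper): write a minimal one-page
# formatting-objects document. All content here is placeholder text.
write_form() {
    cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="form"
        page-height="29.7cm" page-width="21cm"
        margin-top="2cm" margin-bottom="2cm"
        margin-left="2cm" margin-right="2cm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="form">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="16pt" font-weight="bold">Order form</fo:block>
      <fo:block space-before="12pt">Customer name:</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
EOF
}

write_form form.fo

# Render the same source twice: once for people, once for the fax system.
if command -v fop >/dev/null 2>&1; then
    fop -fo form.fo -pdf form.pdf
    fop -fo form.fo -pcl form.pcl
fi
```

The nice part of this approach is that there is a single source file, so the PDF and the PCL can’t drift apart.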

Ringing the (password) changes

I have milestones in my working life, and I mark time by them. They help me to think quantitatively about how much time has passed since a particular point in time or event. It’s helpful because it enables me to think clearly about whether a colleague would have been involved in a particular activity at that particular time. And the nature of the milestone? Password changes!

I am pretty conscientious about passwords. I always have different ones for different things, with lowercase, numbers, punctuation and sometimes uppercase. As a sufferer of DOuble-CApital-itis, I am not a big fan of uppercase but I make the effort sometimes. Of course there are exceptions, such as the one password I use for all random website registrations, but I’m in good company with that. Because of these password habits, changing a password that I use every day is a big event! I have to think of something that my brain can hold on to, and train myself to type the new one rather than the old. I sometimes change existing passwords for no reason; I just think it’s good practice. What I really hate is being forced to use a password I don’t want, or change it when I am not ready!

I have recently changed employment, for no particular reason; it’s just the next step on the ladder really. At my old workplace, I typed my password every single time I opened an internet browser, or logged onto another machine. I can’t deal with too many windows on the taskbar so I was opening and closing browsers all day. I must have typed it fifty times most days. The password complexity rules were there, but they didn’t really get in my way. I was forced by the system to change my password every three months. Three months is quite short when you are subconsciously typing that same password in so often! Still, the password change would roll around, marking a change in season, and I’d spend three days swearing at having typed in the wrong password on autopilot. When my password expired with a week of my notice still left to work, my boss (I guess tired of all that swearing) extended the expiry period to save me the pain.

So here I am, bright and enthusiastic in my new job. Day one, I have to choose a new password. No problem. Four weeks later, I get prompted to change my password. OK, well that’s a pain because I find password changes difficult but hey, I’m new, and I’ll just grin and bear it – after all, I don’t have to type my password for the web proxy here, just when I log in or unlock my machine. That’s still quite a few times though as I don’t leave my desk to go anywhere without locking it. So …. you can guess what’s coming next. Eight weeks into the new job and the password change box is back. My mind is too full to manage another “good” password so I try out something insecure – all lower case characters. And it accepts.

There’s something about this “security” which bothers me immensely. Most password setup systems come with tickboxes, to turn on “features”, such as

  • require mixed case
  • require at least one number
  • require some punctuation
  • ban password recycling
  • ban similar passwords
  • force password change

The sysadmin starts to read the list, ticks the top few boxes, decides this is a Good Thing and ticks them all – the system is as secure as possible – Right???

This is how security myths start, and “force password change” is not something where (more often == better). A few months from now, I’m going to be a gibbering wreck, with my plain text password post-it-ed onto my monitor, and not locking the console when I walk away.

putty function keys in AIX

I’m having fun and games with AIX! I have two telnet clients, putty (fantastic client) and KEA!. I’ve been using KEA! successfully but would like to switch to putty, however the function keys didn’t result in the same escape characters being sent as in KEA!.

After a post to google groups and a very helpful link I finally started to make some progress. I only need keys F1 to F20 working and I’m there.

What I had to do was set putty’s setting under Terminal -> Keyboard -> The function keys and keypad to “Xterm R6”. This doesn’t exactly match what KEA! output but it’s close. The big difference is that with KEA! I use ctrl with F1 to F10 to access the keys F11 to F20; on putty this is shift instead of ctrl.

In case it is any help to anyone, here are the outputs of F1 to F12 on the first line, shift and F1 to F12 on the second line, and ctrl and F1 to F12 on the third line, for both putty and KEA!.

putty:



^[OP^[OQ^[OR^[OS^[[15~^[[17~^[[18~^[[19~^[[20~^[[21~^[[23~^[[24~
^[[23~^[[24~^[[25~^[[26~^[[28~^[[29~^[[31~^[[32~^[[33~^[[34~^[[23~^[[24~
^[OP^[OQ^[OR^[OS^[[15~^[[17~^[[18~^[[19~^[[20~^[[21~^[[23~^[[24~

and KEA!:



^[OP^[OQ^[OR^[OS^[[17~^[[18~^[[19~^[[20~^[[21~^[[23~^[[24~
l
^[[23~^[[24~^[[25~^[[26~^[[28~^[[29~^[[31~^[[32~^[[33~^[[34~^[[23~^[[24~

NB F5 is missing from the KEA! one since pressing it while running cat caused cat to coredump!!
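If you’d rather read the putty rows above as a table, here is the same information as a small lookup function. It is only a transcription of my own output in “Xterm R6” mode (with shift and F1 to F10 acting as F11 to F20), not anything official, and key_name is a name I made up. The sequences are written in the same ^[ notation as above, where ^[ stands for the escape character.

```shell
#!/bin/sh
# key_name SEQ (hypothetical helper): name the function key for a sequence
# written in caret notation, based on the putty "Xterm R6" output above.
key_name() {
    case "$1" in
        '^[OP')   echo F1 ;;
        '^[OQ')   echo F2 ;;
        '^[OR')   echo F3 ;;
        '^[OS')   echo F4 ;;
        '^[[15~') echo F5 ;;
        '^[[17~') echo F6 ;;
        '^[[18~') echo F7 ;;
        '^[[19~') echo F8 ;;
        '^[[20~') echo F9 ;;
        '^[[21~') echo F10 ;;
        '^[[23~') echo F11 ;;   # also sent by shift+F1
        '^[[24~') echo F12 ;;   # also sent by shift+F2
        '^[[25~') echo F13 ;;   # shift+F3
        '^[[26~') echo F14 ;;   # shift+F4
        '^[[28~') echo F15 ;;   # shift+F5
        '^[[29~') echo F16 ;;   # shift+F6
        '^[[31~') echo F17 ;;   # shift+F7
        '^[[32~') echo F18 ;;   # shift+F8
        '^[[33~') echo F19 ;;   # shift+F9
        '^[[34~') echo F20 ;;   # shift+F10
        *)        echo unknown ;;
    esac
}
```

Note the gaps in the numbering (no 16, 22, 27 or 30), which is why you can’t just add an offset to get from the key number to the sequence.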

Anyway that was today’s crash course in escape chars, my putty is now working well and I can use it for the apps which use the function keys too, which is good news.

my first shell script

I have a new job, which involves working on an AIX box over SSH. It’s the first time I’ve used the command line to do everything I do in a day and it’s an education, especially as I’m new to korn shell.

Today I wrote my first shell script. It sets the title of my putty window so I don’t get confused about which window is which. Here it is:


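# Set the putty window title: the first argument if given, else the current directory.
# ESC ] 0 ; text BEL is the xterm "set window title" sequence.
# $1 must be quoted, otherwise "test -n" with no argument is always true.
if test -n "$1"
then
        echo "\033]0;$1\007"
else
        echo "\033]0;$PWD\007"
fi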