Spindle

I switched a few site searches over from a Swish-e CGI to a Java-based package called Spindle. These are my notes. The summary is: it works very well.


Original: July 2002. This version: $Id: index.html,v 1.6 2004/06/20 21:34:51 richard Exp $


Introduction

My needs for site search are pretty simple: a fast indexing tool to scan a web site; and a fast search tool that will find relevant stuff.

Swish-e is a C-based lump of code plus a bit of Perl to do the site crawling. It works fine, but I've found it gets a bit slow. Also, the CGI architecture for performing the search doesn't scale too well, and I wasn't in the mood for fixing it.

Lucene has been around for a while as an indexing tool, but I've ignored it because in my day job I need multilingual search, and Lucene doesn't yet have stemmers for many languages (more are being added, so it's worth checking regularly). However, for a couple of personal web sites I didn't need anything other than English, so I decided to try Lucene.

To make life easier, Bitmechanic have wrapped Lucene with a web crawler and called the package Spindle. I installed version 0.90 (dated 2002-03-30).

Indexing

The first step with Spindle is to create an index. I downloaded and installed the Spindle JAR files on a Linux server and used this script ("spindle") to fire off the indexing:

#!/bin/sh
# spindle -- Runs the spindle HTTP spider to index a site

SPINDLE=~/local/spindle-0.90

CLASSPATH=$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar
java -cp $CLASSPATH com.bitmechanic.spindle.Spider "$@"

Note that contrary to the documentation you do need jsse.jar for indexing (at least you do with JDK 1.3). You can download this jar from the Java Secure Socket Extension site.
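
A missing jar only shows up as a Java class-loading error once the spider starts, so it can help to check the classpath first. The following is a minimal sketch, not part of Spindle: the check_jars function is a hypothetical helper, and the install location is assumed to be the one used in the script above.

```shell
#!/bin/sh
# check_jars -- hypothetical helper, not part of Spindle.
# $1 = colon-separated classpath. Prints each entry that is not a regular
# file and returns non-zero if any entry was missing.
check_jars() {
    missing=0
    old_ifs=$IFS
    IFS=:
    for jar in $1; do
        if [ ! -f "$jar" ]; then
            echo "missing: $jar" >&2
            missing=1
        fi
    done
    IFS=$old_ifs
    return $missing
}

# Assumed install location, matching the "spindle" script.
SPINDLE=${SPINDLE:-$HOME/local/spindle-0.90}
CLASSPATH=$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar

if check_jars "$CLASSPATH"; then
    echo "all jars present"
else
    echo "fix the classpath before running the spider"
fi
```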

The command to index this very site is:

spindle -u http://www.dallaway.com/ -d /home/richard/html/dallaway/spindle/ -e .cgi -e .jar -e .zip -e /comment -v -dt p -dt span -dt h1 -n

In summary this says: crawl www.dallaway.com, storing the index in the "spindle" directory, ignoring any .cgi, .jar or .zip files and anything matching /comment, oh, and be verbose about it. I'll explain the other command line arguments later, in the section on my hacks.
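
If you index more than one site, the repeated -e options get tedious to type. Here is a small sketch that assembles the argument list from a start URL, an index directory and a list of exclusions. The spider_args function is a hypothetical helper; it only uses the -u, -d, -e and -v options described above and leaves out the options from my hacks.

```shell
#!/bin/sh
# spider_args -- hypothetical helper, not part of Spindle.
# $1 = start URL, $2 = index directory, remaining arguments = exclusions.
# Prints the Spider arguments one per line.
spider_args() {
    url=$1
    dir=$2
    shift 2
    printf '%s\n' -u "$url" -d "$dir"
    for pattern in "$@"; do
        printf '%s\n' -e "$pattern"
    done
    printf '%s\n' -v
}

# Show the arguments for this site.
spider_args http://www.dallaway.com/ /home/richard/html/dallaway/spindle/ .cgi .jar .zip
```

As long as none of the values contain whitespace, the output can be passed straight to the wrapper script, for example: ./spindle $(spider_args http://www.dallaway.com/ /home/richard/html/dallaway/spindle/ .cgi .jar .zip)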

Having been used to Swish-e, I expected my first site index to take an hour or so. What I saw was this: Indexed 1047 URLs (2546 KB) in 137 seconds. That's fast... take my word for it. The main parameter for tuning performance is probably the number of crawling threads (the default, which I used, is 2).
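
To put that log line in perspective, here is the arithmetic as a short script. The rate function is just a hypothetical convenience wrapper around awk; the numbers are the ones reported above.

```shell
#!/bin/sh
# Figures taken from the Spindle log line quoted above.
urls=1047
kbytes=2546
seconds=137

# rate -- divide $1 by $2 and print the result to one decimal place.
# awk does the division because plain sh arithmetic is integer-only.
rate() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

echo "URLs per second: $(rate "$urls" "$seconds")"   # 7.6
echo "KB per second:   $(rate "$kbytes" "$seconds")" # 18.6
```

So that run handled between seven and eight pages a second with the default two threads.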

Having built an index, you'll want to check what's in it from the command line. Here's the script I use ("search"):

#!/bin/sh

SPINDLE=~/local/spindle-0.90

CLASSPATH=$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar:~/local/listlib-0.91/lib/listlib.jar
java -cp $CLASSPATH com.bitmechanic.spindle.Search "$@"

The documentation doesn't mention that you also need Bitmechanic's listlib library on the classpath, but I found that I did need it.

The above script allows you to search the index using a Lucene search phrase. For example, to search the index for the word "auction":

$ ./search /home/richard/html/dallaway/spindle auction
http://www.dallaway.com/acad/cbd/; Title: Outsourced component development; Score: 0.21613583
Description: Outsourced component development Late in 1999 I was asked to develop a small Java applet for a colleague who was working on a bigger project for a client. This wasn't something I was really interested in and nor was it my speciality, but I thought it would
http://www.dallaway.com/acad/index.html; Title: Writing; Score: 0.213919
Description: Things I have written down My notes on getting going with Java Web Start for deploying applications over the internet, and learning how to digitally sign code on the cheap. My first attempt at commissioning some code development via an internet auction. My

Two results, showing the page, the title, the score and the description (summary) of the page.
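
If you only want the scores and URLs, the output is easy to filter. This is a sketch that assumes the result lines always look like the two above (URL first, "Score: ..." last, separated by "; "); the hits function is a hypothetical helper, not something that ships with Spindle.

```shell
#!/bin/sh
# hits -- hypothetical filter for the output of the "search" script.
# Reads search output on stdin and prints "score<TAB>url" for each result
# line, skipping the Description lines.
hits() {
    awk -F'; ' '/^https?:\/\/.*; Score: / {
        score = $NF
        sub(/^Score: /, "", score)
        printf "%s\t%s\n", score, $1
    }'
}

# Example using lines from the output shown above.
hits <<'EOF'
http://www.dallaway.com/acad/cbd/; Title: Outsourced component development; Score: 0.21613583
Description: Outsourced component development Late in 1999 I was asked to develop a small Java applet
http://www.dallaway.com/acad/index.html; Title: Writing; Score: 0.213919
EOF
```

In normal use you would pipe the search script into it: ./search /home/richard/html/dallaway/spindle auction | hits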

Searching

To provide this search functionality on the web, it's just a matter of changing the JSP that comes with Spindle. It'd probably be better to write a servlet or Struts action that forwards to a display JSP, for proper MVC, but the single JSP works fine for me on the small web sites where I'm using Spindle.

To deploy the JSP, you need to have commons-beanutils.jar, lucene-1.2-rc4.jar, listlib.jar and spindle.jar in WEB-INF/lib, and you also need to deploy WEB-INF/listlib.tld. These all come in the Spindle and/or listlib distributions from Bitmechanic. Later versions of Lucene will probably work, but I've not tried them out yet.
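
Before restarting the servlet container it can be worth confirming that everything is in place. The sketch below checks a web application directory for the files listed above, as they would be with the stock distributions. The check_webapp function is a hypothetical helper, and the web application root you pass to it is whatever your container uses.

```shell
#!/bin/sh
# check_webapp -- hypothetical helper, not part of Spindle.
# $1 = root of the web application (the directory containing WEB-INF).
# Prints each required file that is missing and returns non-zero if any are.
check_webapp() {
    root=$1
    status=0
    for f in \
        WEB-INF/lib/commons-beanutils.jar \
        WEB-INF/lib/lucene-1.2-rc4.jar \
        WEB-INF/lib/listlib.jar \
        WEB-INF/lib/spindle.jar \
        WEB-INF/listlib.tld
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return $status
}
```

Call it with the directory that holds WEB-INF, for example: check_webapp /path/to/webapp && echo "ready to deploy"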

The index is set in the top of the JSP:


<jsp:setProperty name="search" property="dir"
                   value="/home/richard/html/dallaway/spindle"/>

And that's pretty much it. You can try the searching by using the "Find" box at the top of this page.

My hacks

I've made five changes to the Spindle distribution:

You can download the modified source of Spider.java, ListContainer.java and TagToken.java, or you can download a drop-in replacement for spindle.jar and listlib.jar (this is built for spindle 0.90 using JDK 1.4). Note that this version of spindle.jar includes listlib, so replace your copy of spindle.jar with this version and remove your copy of listlib.jar.

Conclusions

Lucene works and is shockingly fast. Spindle makes it an (almost) out-of-the-box no-brainer for web site search. However, check with the Lucene project, because they are working on something which will replace Spindle; Spindle does not seem to be actively developed and contains a few bugs.

Resources

  1. Lucene home page. Good for docs on how to search.
    http://jakarta.apache.org/lucene/docs/index.html
  2. Lucene application extensions. The plan to replace Spindle.
    http://jakarta.apache.org/lucene/docs/luceneplan.html
  3. Spindle home page.
    http://www.bitmechanic.com/projects/spindle/
  4. Listlib home page.
    http://www.bitmechanic.com/projects/taglib/listlib/
  5. Jakarta Newsletter. Good for keeping up to date with the various Apache Jakarta projects.
    http://jakarta.apache.org/site/news/