Search, Hold the Server

Being a content site, whichElement.com pretty much had to have search, which was a tall order considering that I didn't want to have any server-side components involved.

At first I figured I would just let Google index the site and hook up a Google search box to solve the problem. That was certainly an option. I would have to do some SEO magic to make it work correctly, but it was doable. In fact, Ray had solved this problem already.

But then I got to thinking: wouldn't it be cooler to rise to the challenge of search without a server? Why, yes, yes it would. I broke my needs up into two parts:

  • An index of the site's content
  • A mechanism for searching the index and displaying the results

I kicked around a few ideas, but finally settled on creating a JSON file containing an array of objects with title, url, summary, and condensed content for each page. With a file like that, all I would have to do is search through the JSON to find results. So the second part of my search was a snap. All I had to do was:

  • Pull down the JSON file
  • Run searches against that JSON file
  • Present the results

All of this was pretty easy to do, and not revolutionary.
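
For illustration, here's a minimal sketch of what that client-side piece could look like. The file name (/search/index.json), the field names, and the use of fetch are my assumptions here, not necessarily what the site actually ships.

// Minimal sketch of the client-side search, assuming the build writes an
// array of {title, url, summary, content} objects to /search/index.json.
// (File name and field names are illustrative, not the site's actual ones.)
function searchIndex(query) {
  return fetch('/search/index.json')
    .then(function (response) { return response.json(); })
    .then(function (entries) {
      var term = query.toLowerCase();
      // Keep any entry whose title or condensed content mentions the term.
      return entries.filter(function (entry) {
        return entry.title.toLowerCase().indexOf(term) !== -1 ||
               entry.content.toLowerCase().indexOf(term) !== -1;
      });
    });
}

// Present the results as a simple list of links (assumes a <ul id="results">).
searchIndex('canvas').then(function (results) {
  document.getElementById('results').innerHTML = results.map(function (entry) {
    return '<li><a href="' + entry.url + '">' + entry.title + '</a>: ' +
           entry.summary + '</li>';
  }).join('');
});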

The difficult part was making the index in the first place. The added difficulty was that I wanted to use JavaScript for everything, so I couldn't just use a shell script or some other easy way of indexing the files. Basically I wanted to remove barriers to entry, and an OS X shell script would be an obstacle for Windows-based HTML developers who want to get involved.

Trying to do this with JavaScript in a browser was very hard. While there is a File API for the browser, you can't use it to point at arbitrary directories on the file system, like the site's own folders; you can only really point it at a sandboxed space. That made indexing kinda impossible.

I said I had to use JavaScript, but I didn't say I had to use a browser. Enter Rhino, the JavaScript interpreter written in Java. Rhino gave me the ability to call Java's file IO classes from JavaScript, which made indexing the content easy. Now, this might be a bit of a cheat, since I am basically calling Java, which in this case is a decidedly server-side technology. I rationalized my way out of it: ANT is required to build the project, but knowing how to fire off an ANT build and being forced to write full Java are two different things. I'd love to hear if any of you are put off by this.
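
To make that concrete, here's a tiny illustration of the kind of thing Rhino lets you do: call Java's java.io classes straight from JavaScript and run the script with the Rhino shell. The directory name is just a placeholder.

// Tiny illustration of calling Java file IO from Rhino-hosted JavaScript.
// Run it with the Rhino shell, e.g.: java -jar js.jar list.js
var dir = new java.io.File('content');   // placeholder directory name
var files = dir.listFiles();
for (var i = 0; i < files.length; i++) {
  if (files[i].isFile()) {
    print(files[i].getName() + ' (' + files[i].length() + ' bytes)');
  }
}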

Rhino gave me the ability to run JavaScript from the command line, or from ANT. Since we publish whatever gets checked into the GitHub repository, and we publish that code through ANT, I could just reindex as part of the build whenever new content comes in. New content triggers a reindex, so the index is always up to date and generated on my terms: only JavaScript.

What it actually does:

  • Reads in all of the HTML files in the site
  • Filters out the ones I don't want in search results
  • Grabs the title, url, and content from each
  • Writes that content out to a JSON file on disk
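
The real indexer is linked below; what follows is just a rough sketch of that shape, written for the Rhino shell. The paths, the skip list, and the tag-stripping regex are my own stand-ins, not the actual implementation.

// Rough sketch of a Rhino-based indexer along these lines (the real code is
// linked below). Paths and the skip list here are illustrative stand-ins.
var siteRoot = new java.io.File('site');
var skip = ['search.html', '404.html'];   // pages to keep out of the index
var index = [];

function readFile(file) {
  var reader = new java.io.BufferedReader(new java.io.FileReader(file));
  var line, text = '';
  while ((line = reader.readLine()) !== null) { text += line + '\n'; }
  reader.close();
  return text;
}

function walk(dir) {
  var files = dir.listFiles();
  for (var i = 0; i < files.length; i++) {
    var f = files[i];
    var name = String(f.getName());
    if (f.isDirectory()) {
      walk(f);
    } else if (/\.html$/.test(name) && skip.indexOf(name) === -1) {
      var html = readFile(f);
      var titleMatch = html.match(/<title>([\s\S]*?)<\/title>/i);
      index.push({
        title:   titleMatch ? titleMatch[1] : name,
        url:     String(f.getPath()).replace(String(siteRoot.getPath()), ''),
        content: html.replace(/<[^>]+>/g, ' ')    // crude tag stripping
                     .replace(/\s+/g, ' ')
                     .replace(/^\s+|\s+$/g, '')
      });
    }
  }
}

walk(siteRoot);

// Write the index out as JSON for the client-side search to pull down.
// (JSON.stringify is built into newer Rhino releases.)
var out = new java.io.FileWriter('search/index.json');
out.write(JSON.stringify(index));
out.close();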

It’s not perfect. Search is pretty primitive – I don’t know how far it will scale. But for now, I have a pretty cool solution to my problem.

Here’s the indexer code:

http://snipplr.com/json/63962

7 thoughts on “Search, Hold the Server”

  1. It seems like you're trying to build an awful lot of infrastructure to avoid using a server-side technology. While this may all be working well for a few pieces of content, I think it's going to cause you problems down the road when you try to scale the content. It just seems like once your site has more than a handful of pieces of content, performance issues will arise.

    If you end up with thousands of pages (granted, unlikely, but let’s plan for the best case scenario) then doing this kind of search isn’t going to be very efficient.

    It just seems like you’re really fighting the no server-side thing. It seems like the better solution is to have the server-side generate static compiled content and only use the server-side technology for doing truly dynamic things–such as search.

  2. @Dan I hear you, but that’s sort of the point. Let’s see how this works out. Let’s kill an assumption (you have to have a server side tech) and see if it is still true.

    If I have problems scaling, it will be because it’s successful. I’ll worry about that if it actually happens.

  3. It’s your court, so your rules. You can do what you like. I haven’t looked very deeply into the details, but it sure seems like Rhino is run on the server. In my book, that is a server side technology, and you are breaking your own rule.

    If I’m mistaken and Rhino runs in the browser, then your rules are intact.

  4. @Gus, your mind's your own court too, so feel free to convict in absentia. So why am I focused on “no server” on this project? To make sure anyone could take the source code and start playing with it, without the need for an application server. The Rhino code is run outside the browser, but as part of the build; a user never initiates that processing. So maybe I should focus on “no application server” instead of “no server.”

  5. The Rhino dependency does feel like a *little* bit of a barrier to entry, but not overly so – it can easily be mitigated by good documentation. I love the fact that the search index is built with the build process in this case – it makes perfect sense for this kind of site.

    As for scale, there may be ways to enhance the performance of this JavaScript-based index that can be explored if and when it becomes an issue. I'd be surprised if there aren't other efforts out there in this vein (though I struggled to find them when I was looking to do exactly this same job). Anyone up for a JavaScript port of Lucene?!

  6. @Dominic, the Rhino dependency is kinda wrapped up in the ANT one. If you can run the ANT script then you can run the Rhino scripts, as they are abstracted by the ANT script.
