Search, Hold the Server

For a content site, it was pretty important that we have search, which was pretty hard considering that I didn’t want to have any server-side components involved.

At first I thought I would just let Google index the site, and hook up a Google search box on the site to solve the problem. That was certainly an option.  I thought I would have to do some SEO magic to make it happen correctly, but it was doable. In fact, Ray had solved this problem already.

But then I got to thinking, wouldn’t it be cooler to rise to the challenge of a search without a server? Why, yes, yes it would. I broke up my needs into two parts:

  • An index of the site’s content
  • A mechanism for searching the index and displaying the results.

I kicked around a few ideas, but finally settled on the idea of creating a JSON file that had an array of objects with title, url, summary, and condensed content info. If I had such a file, all I would have to do is search through the JSON to find results. So the second part of my search was a snap.  All I had to do was:

  • Pull down the JSON file
  • Run searches against that JSON file
  • Present the results

All of this was pretty easy to do, and not revolutionary.
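
As a rough illustration of those three steps, here is a minimal sketch rather than the site's actual code. The sample entries, the function names (searchIndex, renderResults, escapeHtml), and the /search-index.json path in the comment are all made up for the example; only the title/url/summary/content shape comes from the description above.

```javascript
// Hypothetical index: an array of { title, url, summary, content } objects.
const sampleIndex = [
  { title: "article", url: "/elements/article.html",
    summary: "Self-contained content.",
    content: "article self-contained composition blog post" },
  { title: "div", url: "/elements/div.html",
    summary: "Generic container.",
    content: "div generic container no semantic meaning" }
];

// In a browser the index would be pulled down first, for example:
//   fetch("/search-index.json").then(r => r.json()).then(index => { ... });
// (the path is an assumption, not the real file name).

// Return the entries in which every query word appears in the title or content.
function searchIndex(index, query) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return [];
  return index.filter(function (entry) {
    const haystack = (entry.title + " " + entry.content).toLowerCase();
    return words.every(function (word) { return haystack.includes(word); });
  });
}

// Escape text before dropping it into markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a plain HTML list of results as a string.
function renderResults(results) {
  if (results.length === 0) return "<p>No results.</p>";
  const items = results.map(function (r) {
    return '<li><a href="' + escapeHtml(r.url) + '">' + escapeHtml(r.title) +
      "</a> " + escapeHtml(r.summary) + "</li>";
  });
  return "<ul>" + items.join("") + "</ul>";
}
```

Calling renderResults(searchIndex(sampleIndex, "generic container")) gives a one-item list linking to the div page. A plain substring match like this is about as primitive as search gets, but it needs nothing beyond the JSON file and the browser.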

The difficult part was making the index in the first place.  The added difficulty was that I wanted to use JavaScript for everything.  So I couldn’t just use a shell script or some other easy way of indexing the files. Basically I wanted to remove barriers to entry, and an OS X shell script would create an obstacle for Windows-based HTML developers who want to get involved.

Trying to do this with JavaScript in a browser was very hard. While there is a File API for the browser, you can’t use it to point at arbitrary directories on the file system, like the one holding the site itself; you can only really point it at a sandboxed space. This made indexing kinda impossible.

I said I had to use JavaScript, but I didn’t say I had to use a browser. Enter Rhino, a JavaScript interpreter written in Java. Rhino gave me the ability to call Java File IO classes from JavaScript.  This allowed for easy indexing of the content. Now this might be a bit of a cheat since I am basically calling Java, which is a decidedly server-side technology in this case.  I rationalized my way out of it. ANT is required to build the project, but knowing how to fire off an ANT build and being forced to write full Java are two different things.  I’d love to hear if any of you are put off by this.

Rhino gave me the ability to run JavaScript from the command line, or from ANT.  Since we publish whatever gets checked in to the github repository, and we publish that code through ANT, I could just reindex as part of the build whenever new content comes in. New content causes a reindex, the index is always up to date, and generated on my terms – only JavaScript.

What it actually does:

  • Reads in all HTML files in the site
  • Filters ones that I don’t want in search results.
  • Grabs the title, url, and content from each
  • Writes out this content to JSON on disk
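
Those steps can be sketched roughly as follows. This is not the real indexer: it skips the Rhino and Java file IO entirely and works on an in-memory map of path to HTML, so everything here (the pages sample, the /templates/ filter rule, the 140-character summary, and the function names) is an assumption made for illustration. The regex-based tag stripping is also deliberately crude.

```javascript
// Hypothetical input: site-relative path -> raw HTML.
// Reading the files from disk is left out of this sketch.
const pages = {
  "/elements/article.html":
    "<html><head><title>article</title></head>" +
    "<body><h1>article</h1><p>Self-contained   content.</p></body></html>",
  "/templates/header.html": "<div>shared header</div>"
};

// Assumed filter rule: index HTML files, but skip shared templates.
function shouldIndex(path) {
  return path.endsWith(".html") && !path.startsWith("/templates/");
}

// Pull the text of the <title> element, or "" when there is none.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : "";
}

// Drop scripts, styles, the head, and all tags, then collapse whitespace.
function extractContent(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .replace(/<head\b[\s\S]*?<\/head>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Build the array of index entries, one per page that passes the filter.
function buildIndex(pages) {
  return Object.keys(pages).filter(shouldIndex).sort().map(function (path) {
    const content = extractContent(pages[path]);
    return {
      title: extractTitle(pages[path]),
      url: path,
      summary: content.slice(0, 140), // assumed summary rule
      content: content
    };
  });
}

// The string that would be written out to the JSON file on disk.
const json = JSON.stringify(buildIndex(pages), null, 2);
```

In the Rhino version, the pages map would be filled by walking the site directory with Java's file classes, and json would be written back out the same way.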

It’s not perfect. Search is pretty primitive – I don’t know how far it will scale. But for now, I have a pretty cool solution to my problem.

Here’s the indexer code:

First New Contributor on

Wow, I’m happy to say we have our first new contribution to  Adam Tuttle fired off a pull request to me this morning to fix some grammatical and spelling issues. I was happy to bring his changes in.

This is a great example of how you can contribute to Open Source projects you like without necessarily spending a tremendous amount of time or writing a crap ton of code.  Adam noticed some spelling and grammar issues, probably because I have the spelling ability of a Russian Sex Spammer. Awesome. He fixed it. He contributed it back.

I cannot tell you how happy I am to have someone correct me.  Feel free to pitch in.

I’m pleased to unveil a little project I’ve been working on for a few weeks now: is a reference site for answering the question “Which elements should I use to mark up this HTML semantically?” I’ve been joined in this effort by my coworker, Ray Camden. We’re pleased to put this out there, and eager to see what you can do with it.

The Story Behind It

I was (and still am) incredibly impressed by HTML5 Please.  I think it’s a fantastically on-target site. It showcases its technology and hits on a specific need and fills it brilliantly. I wanted to do something in the same vein without just copying it.  Around the time that I was feeling this, I got into an argument on semantics with someone. Specifically they were asking questions about when they should use article versus div. Basically I explained what I knew of the spec for article. I gave some analysis, and made a recommendation.

When I was done with the argument I had an idea for a site:  A reference that would help people choose for themselves which tags to use semantically without being authoritarian.  I also wanted to set the tone that there isn’t one right answer to these things – that “semantically correct” isn’t a binary thing, but a position on a continuum.

The Technology Behind It

Another important thing for us in doing this was choice of technology. We placed a couple of constraints on the project:

  • We wanted to be open to other people contributing and offer a few channels for that.
  • No content management or wiki software
  • All code and content would be in HTML/JavaScript/CSS; no server-side technology

To achieve this we made a few choices.  To go open and collaborative without having a wiki, we went with github. Not the usual answer for a content site, but I think we can make it work.  The choice of no server-side tech (other than a vanilla web server) came about so as to not discourage contributions from anyone.  PHP, Ruby, ColdFusion, some JVM language, Python – whatever your back end, you have to know HTML/JavaScript/CSS. So let’s not skew one way, when most of the contributions can be made very simply with the front-end stack.

Working in those constraints wasn’t always easy. Ray got tired of copying and pasting template code around despite my incredibly stupid protestations that “No, it will be okay; we can work that way.” So he came up with a cool way to handle that with some JavaScript and .htaccess magic. I had to come up with a way to provide search without having any sort of server-side tech. We’re not sure if other people will be cool contributing under these constraints, but obviously I hope so.

Get Involved

We’re open to contributions.  We do all of the publishing through an automated build process that looks at the github repository for the project. So git is the path to getting into production.  We’re open to forks and pull requests.  We’re also open to contributions through email.  Basically if you want to contribute, drop us a line and we’ll figure out how to work with you to get you in.