James Aylett: Simple search for Django using Haystack and Xapian

Published at
Tuesday 20th October, 2009
Update: note that this is now pretty old, and so probably won’t work. Further, Django now supports full text search if you’re using PostgreSQL, which is probably a better starting point for getting some basic search up and running.

Search isn’t that hard

Back when programmers were real programmers, and everyone wrote in BCPL, search was hard. You had to roll your own indexing code, write your own storage backend, and probably invent your own ranking function. If you could do all that, you were probably working in the field of information retrieval yourself, which was kind of cheating.

These days, not only are there libraries that will do all the work for you, but there’s direct integration with a growing number of ORMs and web frameworks. I’m going to briefly show how to use one specific combination — Xapian and Django, bound together using Haystack — to add basic search functionality to a simple web application.

The Haystack tutorial is pretty good, but provides more options and different paths than are needed to demonstrate it working. Additionally, for licensing reasons the Xapian backend isn’t a fully-supported part of Haystack, so the options it spends most time talking about are Solr, a document search system built in Java on top of Lucene, and Whoosh, a pure-Python search library. There are ongoing problems with using Whoosh in a live environment, but people are having lots of success with both Solr and Xapian as backends; if Solr is your thing, run through the setup from the Haystack tutorial and either keep going, or come back here; after the initial setup, Haystack makes all the core features work across the different backends. If you’ve never worked with Solr or Java before, Xapian is often easier to get going, and hopefully this walk-through is easier to follow than Haystack’s by leaving out the things you don’t need.

Alternatives

Of course, these being the days of mashups, you could just pull in search from somewhere else, like Google Custom Search. This gives you something out of the box fairly quickly, and can be integrated into your site; it certainly looks like a viable alternative. However there are a number of reasons why you might not want to use it, which include:

For other reasons, it’s no longer really acceptable to build a “search engine” using things like FULLTEXT indexes in MySQL, so we’ll skip right over that.

The app we’ll use

We’ll demonstrate this using a simple bookmarking system. We’re really only going to look at the Django models involved, and how to search across them; all of this integrates nicely with the Django admin, so if you want to follow along at home, use that to input some data.

I’ve called the entire project bookmarker, and the app that contains the following models bookmarks. If you create a project using Gareth Rushgrove’s django-project-templates, start the bookmarks app and you should be ready to go.

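from django.db import models
# SluggableModel and LooselyAuditedModel also need importing from
# django_auto_sluggable and django_audited_model respectively (see below).

class RemotePage(SluggableModel):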
  # meta; these are updated each time someone bookmarks the page
  title = models.CharField(max_length=255)
  summary = models.TextField()
  uri = models.URLField(verify_exists=False, unique=True)

  def bookmarks(self):
    from bookmarker.bookmarks.models import Bookmark
    return Bookmark.objects.filter(page=self)

  @models.permalink
  def get_absolute_url(self):
    return ('page', (), { 'slug': self.slug } )

  def __unicode__(self):
    return self.title

class Bookmark(LooselyAuditedModel):
  # what have we bookmarked?
  page = models.ForeignKey(RemotePage)

  @staticmethod
  def bookmarks_for_request(request):
    if request.user.is_authenticated():
      return Bookmark.objects.filter(created_by=request.user)
    else:
      return Bookmark.get_stashed_in_session(request.session)

  def __unicode__(self):
    return self.page.title

SluggableModel is from django_auto_sluggable, automatically managing the (inherited) slug field, and LooselyAuditedModel is from django_audited_model, which provides fields tracking when and who created and last modified an object. Both are projects of mine available on GitHub.

The idea is that we store a reference to each unique URL bookmarked, as RemotePage objects; each actual bookmark becomes a Bookmark object. We’re then going to search through the RemotePage objects, based on the title and summary.

The only remaining hard part about search — at least, fairly simple search — is figuring out what to make available for searching. The right way to think about this is to start at the top, at the user: figure out what sort of questions they will want to ask, and then find out where the data to answer those questions will come from.

There’s some good work ongoing on how to approach this problem; for instance, search patterns is a project to document the behavioural and design patterns helpful for this.

For now let’s just assume a very basic scenario: our users will want to find out what pages people have bookmarked about a subject, or matching a set of words. The information for this is all going to come from the RemotePage objects, specifically from the title and summary information extracted when the bookmarks are created.

Search fields

Most search systems have the concept of fields. You’re probably already familiar with the idea from using Google, where for instance you can restrict your search to a single website using the site field, with a search such as site:tartarus.org james aylett.
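To make the idea of fields concrete, here is a toy sketch of pulling field restrictions out of a query string. It is not how Google, Xapian or Haystack actually parse queries; the function name and the list of known fields are made up for illustration.

```python
def split_field_query(query, known_fields=("site", "title")):
    """Split a query into ({field: [values]}, [free-text terms]).

    Hypothetical helper: a token of the form field:value is treated as a
    restriction only when the field name is one we know about.
    """
    restrictions = {}
    terms = []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and field in known_fields and value:
            restrictions.setdefault(field, []).append(value)
        else:
            terms.append(token)
    return restrictions, terms


restrictions, terms = split_field_query("site:tartarus.org james aylett")
print(restrictions)  # {'site': ['tartarus.org']}
print(terms)         # ['james', 'aylett']
```

A real query parser also has to deal with quoting, boolean operators and stemming, which is a good reason to let the search library do it.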

In this case, we’re going to create two main search fields: title and text, with the latter being the important one, built from both the title and summary. We’ll make the title more important than the summary, in the hope that people give their webpages useful names summarising their contents.

The title field will only be built out of the title, although we’re not going to need this to achieve our goal; it’s just to show how it’s done.

We need to install Xapian, which is available as packages for Debian, Ubuntu, Fedora and others, and as source code. We only care about the core and the Python bindings; if building from source these are in xapian-core and xapian-bindings. See the Xapian downloads page for more information.

We also need Haystack, available on PyPI as django-haystack; and also xapian-haystack, available on PyPI as xapian-haystack. So the following should get you up and running:

easy_install django-haystack
easy_install xapian-haystack

Indexing

Haystack makes indexing reasonably simple, and automatic on saving objects. There’s a bit of administration to set up in settings.py first:
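A minimal sketch of what those settings might look like, assuming the Haystack 1.x setting names (HAYSTACK_SITECONF, HAYSTACK_SEARCH_ENGINE and, for the Xapian backend, HAYSTACK_XAPIAN_PATH). The index directory name is an arbitrary choice of mine, and later Haystack releases use a different configuration format, so check the documentation for the version you have installed.

```python
import os

# Fragment for settings.py; merge the last two entries into your
# existing INSTALLED_APPS rather than replacing it.
INSTALLED_APPS = [
    # ... your existing apps ...
    'bookmarker.bookmarks',
    'haystack',
]

# Module that Haystack imports on start-up to discover search indexes.
HAYSTACK_SITECONF = 'bookmarker.haystack'

# Which backend to use (assumed Haystack 1.x style name).
HAYSTACK_SEARCH_ENGINE = 'xapian'

# Where the Xapian index lives on disk. 'search_index' is an arbitrary
# name; in a real project you would probably anchor this path to the
# directory containing settings.py rather than the working directory.
HAYSTACK_XAPIAN_PATH = os.path.join(os.path.abspath('.'), 'search_index')
```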

Then we need to set up bookmarker.haystack. Create bookmarker/haystack/__init__.py containing:

import haystack
haystack.autodiscover()

When the Haystack app is set up, it will find this module, import it, and find the search index we’re about to create as bookmarker/bookmarks/search_indexes.py:

from haystack import indexes, site
from bookmarker.bookmarks.models import RemotePage

class RemotePageIndex(indexes.SearchIndex):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')

site.register(RemotePage, RemotePageIndex)

If this looks a lot like an admin.py, then that’s good: it’s doing a similar job of registering a helper class alongside the RemotePage model that will take care of indexing it. Haystack then provides a separate facility for us to search this, which we’ll use later on.

We’ll take the text field first. It’s marked document=True, which you’ll want to put on the main search field in each index. Secondly, and more importantly, it’s marked use_template=True. This means that it won’t be simply extracted from a field on the RemotePage object, but rather that a Django template will be rendered to provide the index data. The template is stored under the slightly weird name search/indexes/bookmarks/remotepage_text.txt, which is built up from the app name (bookmarks), the model name (remotepage) and the field name in the SearchIndex (text). The contents I’ve used are:

{{ object.title }}
{{ object.title }}
{{ object.title }}
{{ object.title }}
{{ object.title }}

{{ object.summary }}

Okay, so this looks a little weird. In order to increase the weight of the title field (i.e. how important it is if words from the user’s search are found in the title), we have to repeat its contents. Haystack doesn’t provide another approach to this (and it’s unclear how it ever could, from the way it’s designed). In fact, writing a template like this can actually cause problems with phrase searching: say the title of a page is “Man runs the marathon”. The phrase search "marathon man" (including quotes) should not match the page, but it will, because repeating the title puts the final “marathon” of one copy directly before the opening “Man” of the next.
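The phrase problem is easy to reproduce without a search engine at all. The following is a toy sketch with a naive tokeniser and phrase matcher (both my own, not Haystack or Xapian code), assuming the indexer records word positions continuously across the line breaks in the rendered template. The summary text is invented.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(tokens, phrase):
    """True if the words of phrase appear consecutively in tokens."""
    words = tokenize(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


title = "Man runs the marathon"
summary = "A report from the finish line."  # made-up example data

# Indexing the title once, versus five times as in the template above.
plain = tokenize(title + "\n" + summary)
repeated = tokenize("\n".join([title] * 5) + "\n\n" + summary)

print(repeated.count("marathon"))                 # 5: the upweighting we wanted
print(contains_phrase(plain, "marathon man"))     # False
print(contains_phrase(repeated, "marathon man"))  # True: the false match
```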

The title field is simpler, and just gets its data from the title on the model, as specified by the model_attr parameter.

To build the index for the first time, run python manage.py reindex. Normally, saving an object will cause it to be reindexed, so any pages you create or modify after that will have their entry in the search index updated.

Searching

Adding search to the web interface is now really simple; in fact, because Haystack provides a default search view for us, we don’t have to write any code at all. We do two things: first, we create a search endpoint in our URLconf:

(r'^search/', include('haystack.urls')),

and then create a template search/search.html:

<h1>Search</h1>

<form method="get" action=".">
  <label for='id_q'>{{ form.q.label }}</label>
  {{ form.q }}
  <input type="submit" value="Search">
</form>

{% if page.object_list %}
  <ol>
    {% for result in page.object_list %}
      <li>
        <a href='{{ result.object.get_absolute_url }}'>{{ result.object.title }}</a>
        ({{ result.object.bookmarks.count }} bookmarks)
      </li>
    {% endfor %}
  </ol>
{% else %}
  <p>No results found.</p>
{% endif %}

And you can go to http://127.0.0.1:8000/search/ or wherever, and search any pages you’ve created.

Limitations of Haystack

When you get into writing your own search code, Haystack has a lot of cleverness on that side, making it appear much like database queries against Django’s ORM. I don’t actually think this is necessary, and in fact might be confusing in the long term, but it does make it easier to get into if you have no prior experience of search.

However the indexing side of Haystack is very limited. Because of the decision to use templates to generate the content for search fields (which is a fairly Django-ish choice), all sorts of sophisticated index strategies are simply impossible. We saw above the workaround required to upweight a particular ORM field when building a search field, and the dangers of that approach. An inability to finely control data from different sources going into a particular search field is a major drawback of Haystack’s design.
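To show the kind of control I mean, here is a toy indexer in plain Python. It is only a sketch under my own assumptions: the function names, the weights and the size of the position gap are all hypothetical, and none of this is Haystack or Xapian API. Each source field gets its own weight, and a gap in word positions between fields stops phrases from matching across a field boundary.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"\w+", text.lower())


def index_fields(fields, gap=100):
    """Index a list of (text, weight) pairs into one search field.

    Returns (term weights, term positions). Hypothetical toy code.
    """
    weights = Counter()
    positions = {}
    pos = 0
    for text, weight in fields:
        for term in tokenize(text):
            pos += 1
            weights[term] += weight  # upweight without repeating text
            positions.setdefault(term, []).append(pos)
        pos += gap  # gap so a phrase cannot span two source fields
    return weights, positions


def phrase_match(positions, phrase):
    """True if the words of phrase occur at consecutive positions."""
    words = tokenize(phrase)
    if not words or any(w not in positions for w in words):
        return False
    return any(
        all(start + k in positions[w] for k, w in enumerate(words))
        for start in positions[words[0]]
    )


weights, positions = index_fields([
    ("Man runs the marathon", 5),                   # title, weighted x5
    ("A marathon report from the finish line", 1),  # summary (made up)
])

print(weights["marathon"])                            # 6
print(phrase_match(positions, "marathon man"))        # False
print(phrase_match(positions, "runs the marathon"))   # True
```

The title still counts five times as much as the summary, but because it is only indexed once the bogus "marathon man" phrase match goes away.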

Also, Haystack is limited when it comes to excluding objects from the search index. It has a mechanism which allows this to work, sometimes; but it often requires reindexing the entire database, or at least a particular model, to bring things in and out of the index later. There seems to be no way of using a single ORM field to indicate whether a given object should currently be searchable or not (i.e. in the index or not). This makes it difficult to support the common pattern of marking objects as visible or not using a published field; for instance, the User object in django.contrib.auth.models does something similar with the is_active field.

Going further

Haystack has a bunch more features, allowing more sophisticated fields, and searching involving ranges and so forth. Check out its documentation for more information.

If you’re building an app that needs complex search, I’m not convinced that Haystack is the right choice. However it provides a very easy route in, and once you’ve got something up and running you can start analysing the real use of search in your system to help you plan something more sophisticated.

And having some search is much, much better than having none. Haystack is a great way of getting over that first hurdle, and into the world of Django search.