Finding the Bard

Python, Xapian and Shakespeare

James Aylett

xapian.org

Quotable Shakespeare

  • Shall I compare thee to a summer's ... ?
  • Romeo, Romeo, what's the rest of this speech please?
  • There's a great insult about a hedgehog
  • Shakespeare wrote too much
37 plays, 154 sonnets and various lost works, including 1602's apologia for The Merry Wives of Windsor, in which Pistol punctures Falstaff with a pin in act 1 scene 2, and spends the rest of the play giggling uncontrollably.

Can we Google?

Can we Google Y! Search?

  • Shall I compare thee to a summer's ... ? - first hit
  • Romeo, Romeo ... - tough with only the start
  • Hedgehog - really, did you expect it to work?

Can we grep?

  • We can find the exact text
    $ grep summer\'s.day shaks12.txt
      Shall I compare thee to a summer's day?
        as you shall see in a summer's day. But it is very well; what he
        desire in a summer's day. Here is his Majesty.
        Like stinging bees in hottest summer's day,
    
  • Then search again for the whole lot
    $ grep -A14 "Shall I compare thee to a summer's day?" shaks12.txt
      Shall I compare thee to a summer's day?
      Thou art more lovely and more temperate:
      Rough winds do shake the darling buds of May,
      And summer's lease hath all too short a date:
      [...]
    
  • Awkward, and don't know where it came from
Complete Works of William Shakespeare, from Project Gutenberg (copyrighted World Library, Inc.): http://www.gutenberg.org/etext/100

Can we use sed?

:top
/^[0-9]\+\r$/ b title
/^ACT/ { x; s/ACT.*$//; s/.$//; G; x }
/hedgehog/ { x; p; x; p; i\

}
d # goes to top
:title
n; /^.$/ n; x; d; b top;

$ sed -n -f find.sed shaks12.txt
A MIDSUMMER NIGHT'S DREAM
ACT II. SCENE I.
               Thorny hedgehogs, be not seen;

[...]
  • 'hedgehog' is hardcoded ... and eww

What if you get it wrong?

  • Romeo, Romeo ... again
  • Cousins, that strive by factions and by friends ambitiously for rule and empery ...
    $ grep cousins.*factions shaks12.txt
    $ grep cousins.*friends shaks12.txt
    $ grep cousins.*empery shaks12.txt
    $ grep friends.*empery shaks12.txt
    $ grep factions.*friends shaks12.txt
      MARCUS. Princes, that strive by factions and by friends
    
  • If you get a critical word wrong, can take ages to find
  • Lines in blank verse are short, so speeches get split

Let's use Xapian

  • Search engine library
  • Probabilistic and boolean (filtering) operations
  • (Fairly) portable C++, with language bindings
  • Stemming
  • omega: out of the box tool
  • But let's do it directly because it's more fun

Xapian basics

  • Databases contain Documents
  • WritableDatabases take care of locking
  • Documents contain terms (eg: words)
  • Terms can have positions (for proximity searching)
  • In general terms aren't simply words (eg: stemming)
  • Documents also have arbitrary data
  • ... and some other stuff

A Xapian indexer for Shakespeare

  • Jon Bosak's XML version of the plays
  • Structure
    • PLAY, ACT, SCENE
    • SPEECH, STAGEDIR
    • SPEAKER, LINE
    • TITLE (of PLAY, ACT or SCENE)
  • SAX parser
  • State machine
    • foo (PLAY, ACT, SCENE)
    • foo-title
    • speaker, line, stagedir

Indexing strategy

  • One scene = one document
  • Omega compatible terms
    • Initial letter is special
    • Words are lowercased and stemmed
    • Initial capital => R prefixed, unstemmed
    • S prefix: titles ("subject")
    • XS prefix: speaker
    • Q prefix: unique term identifying document
  • Positions: put 10 between each title

Our indexer: setup

class Handler(xml.sax.handler.ContentHandler):
    def __init__(self, filename):
        self.play = ''
        self.act = ''
        self.scene = ''
        self.state = '' # we'll do this using strings
        self.doc = xapian.Document()
        self.position = 1
        self.filename = filename
        self.stemmer = xapian.Stem("en")
        self.act_count = 0
        self.scene_count = 0

Our indexer: start element

class Handler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        if name=='PLAY':
            self.state='play'
        elif name=='PERSONAE':
            self.state='personae'
        elif name in ('INDUCT', 'ACT', 'PROLOGUE', 'EPILOGUE'):
            self.act_count += 1
            self.state='act'
        elif name=='SCENE':
            self.scene_count += 1
            self.state='scene'
        elif name=='TITLE':
            if self.state in ('play', 'personae', 'act', 'scene'):
                self.state="%s-title" % self.state
        elif name=='STAGEDIR':
            self.state='stagedir'
        elif name=='SPEAKER':
            self.state='speaker'
        elif name=='LINE':
            self.state='line'

Our indexer: content

class Handler(xml.sax.handler.ContentHandler):
    def characters(self, content):
        if self.state=='play-title':
            self.play += ' ' + content
        elif self.state=='personae-title':
            pass
        elif self.state=='act-title':
            self.act += ' ' + content
        elif self.state=='scene-title':
            self.scene += ' ' + content
        elif self.state=='stagedir' or self.state=='line':
            self.index_text(content)
        elif self.state=='speaker':
            self.index_text(content)
            self.index_text(content, 'XS')

Our indexer: end element

class Handler(xml.sax.handler.ContentHandler):
    def endElement(self, name):
        if name=='LINE':
            self.state='scene'
        elif name=='SPEAKER':
            self.state='scene'
        elif name=='STAGEDIR':
            self.state='scene'
        elif name=='TITLE':
            if self.state[-6:]=='-title':
                self.state=self.state[:-6]
        elif name=='SCENE':
            unique_term = self.add_stuff() # index titles, create doc data
            get_db().replace_document(unique_term, self.doc)
            self.doc = xapian.Document()
            self.scene = ''
        elif name=='ACT':
            self.act = ''
        elif name=='PLAY':
            self.play = ''

Our indexer: gubbins

class Handler(xml.sax.handler.ContentHandler):
    def add_stuff(self):
        self.index_text(self.play, 'S', 10)
        self.index_text(self.act, 'S', 10)
        self.index_text(self.scene, 'S', 10)
        self.index_text(self.scene, '', 10)
        data = "play=%s\nact=%s\nscene=%s\nfilename=%s" % (safe(self.play),
                safe(self.act), safe(self.scene), self.filename)
        self.doc.set_data(data)
        unique_term = "Q-%s-%i-%i" % (self.filename, self.act_count, self.scene_count)
        self.doc.add_term(unique_term)
        return unique_term

# get_db() is a singleton WritableDatabase constructor:
# xapian.WritableDatabase('shakespeare', xapian.DB_CREATE_OR_OPEN)

def safe(s): # Get rid of characters we can't carry in document data
    return str(s).replace("\n", " ")

def p_plusminus(c):
    return c=='+' or c=='-'

Our indexer: index_text

class Handler(xml.sax.handler.ContentHandler):
    def index_text(self, text, prefix='', position_inc=0, wdfinc=1):
        ...

This is pretty big.

If you're really interested, ask me afterwards.

if __name__ == "__main__":
    import sys
    for fname in sys.argv[1:]:
        index_file(fname)

Let's go!

Search concepts

Build a query

def auto_enquote(x):
    if x.find(" ")==-1:
        return x
    else:
        return '"' + x + '"'
if __name__ == "__main__":
    import sys
    query = " ".join(map(auto_enquote, sys.argv[1:]))
    qp = xapian.QueryParser()
    qp.add_prefix("title", "S")
    qp.add_prefix("speaker", "XS")
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_ALL)
    qp.set_default_op(xapian.Query.OP_OR)
    db = xapian.Database("shakespeare")
    qp.set_database(db)
    q = qp.parse_query(query)

Run the query

    enq = xapian.Enquire(db)
    enq.set_query(q)
    mset = enq.get_mset(0, 10)
    for match in mset:
        data = match[4].get_data()
        def r(x, y):
            x[y.split('=')[0]]=y.split('=')[1]
            return x
        f = reduce(r, data.split('\n'), {})
        print "%2.2i (%3.3i%%) %s, %s, %s" % (match[2]+1,
                                              match[3],
                                              f['play'],
                                              f['act'],
                                              f['scene'])

Some results

$ ./search.py cousins that strive by factions and by friends
01 (080%) The Life of Timon of Athens, ACT III, SCENE V.  The same. The senate-house. The Senate sitting.
02 (067%) The Tragedy of Titus Andronicus, ACT I, SCENE I.  Rome. Before the Capitol.
[...]

Changes to the indexer

class Handler(xml.sax.handler.ContentHandler):
    def __init__(self, filename):
        [...]
        self.speech_count = 0
        self.speech = ''
    def add_stuff(self):
        [...]
        data = "play=%s\nact=%s\nscene=%s\nspeech=%s\nfilename=%s" % (safe(self.play),
                                                                     safe(self.act),
                                                                     safe(self.scene),
                                                                     safe(self.speech),
                                                                     self.filename)
        unique_term = "Q-%s-%i-%i-%i" % (self.filename, self.act_count,
                                         self.scene_count, self.speech_count)
    def startElement(self, name, attrs):
	[...]
        elif name=='STAGEDIR' or name=='SPEECH':
            if self.state!='speech':
                self.speech_count += 1
                self.state='speech'

Changes to the indexer

class Handler(xml.sax.handler.ContentHandler):
    def endElement(self, name):
        if name=='LINE':
            self.state='speech'
        elif name=='SPEAKER':
            self.state='speech'
        elif name=='STAGEDIR':
            if self.state=='speech':
                self.add_doc()
                self.speech = ''
                self.state = 'scene'
            else:
                self.state='speech'
        elif name=='SPEECH':
            self.add_doc()
            self.speech = ''
            self.state = 'scene'
        elif name=='SCENE':
            self.scene = ''
        [...]

Changes to the indexer

class Handler(xml.sax.handler.ContentHandler):
    def characters(self, content):
        [...]
        elif self.state=='stagedir' or self.state=='line':
            self.index_text(content)
            self.speech += "\n" + content
        elif self.state=='speaker':
            self.index_text(content)
            self.index_text(content, 'XS')
            self.speech += "\n" + content

    def add_doc(self):
        unique_term = self.add_stuff()
        get_db().replace_document(unique_term, self.doc)
        self.doc = xapian.Document()

Differences at SPEECH level

Changes to the search script

        import textwrap
        print "%2.2i (%3.3i%%) %s, %s, %s\n%s" % (match[2]+1,
                                                  match[3],
                                                  f['play'],
                                                  f['act'],
                                                  f['scene'],
                                                  "\n".join(textwrap.wrap(f['speech'])))

The new output

$ ./search.py cousins that strive by factions and by friends
01 (083%)  The Tragedy of Titus Andronicus,  ACT I,  SCENE I.  Rome. Before the Capitol.
 MARCUS ANDRONICUS Princes, that strive by factions and by friends
Ambitiously for rule and empery, Know that the people of Rome, for
whom we stand A special party, have, by common voice, In election for
the Roman empery, Chosen Andronicus, surnamed Pius For many good and
great deserts to Rome: A nobler man, a braver warrior, Lives not this
day within the city walls: He by the senate is accit'd home From weary
wars against the barbarous Goths; That, with his sons, a terror to our
foes, Hath yoked a nation strong, train'd up in arms. Ten years are
spent since first he undertook This cause of Rome and chastised with
arms Our enemies' pride: five times he hath return'd Bleeding to Rome,
bearing his valiant sons In coffins from the field; And now at last,
laden with horror's spoils, Returns the good Andronicus to Rome,
Renowned Titus, flourishing in arms. Let us entreat, by honour of his
name, Whom worthily you would have now succeed. And in the Capitol and
senate's right, Whom you pretend to honour and adore, That you
withdraw you and abate your strength; Dismiss your followers and, as
suitors should, Plead your deserts in peace and humbleness.
[...]

Questions?

James Aylett

james@tartarus.org

Bardcamp

Unpack this zip on a python (+ python libxml2, python libxslt) CGI-happy webserver and look at search.html for a live search on the database (you have to run indexer3.py to generate the database first). Uses async HTTP requests from Javascript, and probably only works in Firefox (sorry - very proof of concept).