The Porter Stemming Algorithm

This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems. The algorithm was originally described in Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137. It has since been reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.

The algorithm has been widely used, quoted, and adapted over the past 20 years. Unfortunately, variants of it abound which claim to be true implementations, and this can cause confusion.

This page contains a demonstration of the stemmer, and downloadable versions of it in ANSI C, Java, Perl and other languages. The original stemmer was coded up in BCPL, a language no longer in vogue. In its final surviving form, this BCPL version has three minor points of difference from the published algorithm, and these are clearly marked in the downloadable ANSI C version. They are discussed further below. The ANSI C, Java and Perl versions are exactly equivalent to the original BCPL version, having been tested on a large corpus of English text. The original paper was, I hope, unambiguous, despite a couple of irritating typos, but even so the ANSI C version nowadays acts as a better definition of the algorithm than the original published paper.

Get the algorithm definition
Get the ANSI C version
Get the Java version
Get the Perl version

Daniel van Balen’s Perl version, at http://www.ldc.usb.ve/~vdaniel/porter.pm, runs somewhat faster, although it is less tightly coded.

A Python version of the Porter stemmer was kindly sent to me on 3 Jan 2001 by Vivake Gupta. I know nothing about Python, but I have managed to run it against a large test vocabulary, and so checked that it is correctly encoded. It follows the ANSI C version closely.
Get the Python version

A Csharp version was kindly sent to me on 18 Sep 2001 by André Hazelwood, when he was working for The Official Web Guide. I don’t have access to Csharp so can’t test it out, but André has passed it through the sample vocabulary (see below) and got the correct output.
Get the Csharp version

On 4 Nov 2002 Leif Azzopardi, a PhD student at the University of Paisley in Scotland, sent another Csharp version with the following note: ‘I have modified this [André Hazelwood’s] version to conform to Microsoft’s .NET standards, and now the PorterStemmer class can be imported into any language that has .NET capabilities (such as Delphi 7, Visual Basic .NET, Visual C++ .NET, etc.).’ So,
Get the .NET compliant Csharp version

A version in Common Lisp was sent to me on 21 March 2002 by Steven M. Haflich of Franz Inc. Again, this has not been tested by me, but it came with assurances that it passed the test of the sample vocabulary.
Get the Common Lisp version

A version in Ruby was sent to me on 31 January 2003 by Allen Condit of the University of Nevada, Las Vegas (UNLV), with the customary assurances. It was written by Ray Pereda, working in the Information Science Research Institute (ISRI) there. This release has further improvements by Dave Thomas.
Get the Ruby version

A version in Visual Basic (VB6) was sent to me in April 2003 by Navonil Mustafee, who wrote it while working as a student at Brunel University in London.
Get the Visual Basic version

And a version in Visual Basic .NET (VB7) was sent to me in January 2005 by Christos Attikos, who wrote it while working as a student at the University of Piraeus in Greece. It is, of course, fully .NET compliant.
Get the Visual Basic .NET version

A version in php was sent to me in February 2005 by Richard Heyes, who can be contacted through his website at www.phpguru.org. Richard has kindly allowed the GPL licence that was attached to his script to be waived for inclusion on this site.
Get the php version

A Delphi version was sent to me in April 2004 by Jo Rabin (a friend of a friend).
Get the Delphi version

A Javascript version was sent to me in July 2004 by Andargor, whose amazing game software can be found at www.andargor.com.
Get the Javascript version

A Prolog (thread safe) version was sent to me in October 2005 by Philip Brooks, working at the University of Georgia.
Get the Prolog version

A Haskell version was sent to me in November 2005 by Dmitry Antonyuk.
Get the Haskell version

All these encodings of the algorithm can be used free of charge for any purpose.

And to test the programs out: get a sample vocabulary (0.19 megabytes), and get the corresponding output.

Email any comments, suggestions, queries
Points of difference from the published algorithm

1. There is a new rule in Step 2: (m>0) logi -> log. So archaeology is equated with archaeological etc.

2. The Step 2 rule (m>0) abli -> able is replaced by (m>0) bli -> ble. This is definitely an improvement: possibly is equated with possible etc.

3. The algorithm leaves alone strings of length 1 or 2. In any case a string of length 1 will be unchanged if passed through the algorithm, but strings of length 2 can lose a final s, so as would stem to a and is to i. This hardly matters, since in setting up IR systems as, is, a and i would probably all be stopwords anyway, but not removing the s here is more exact.

This last difference may in fact have always been present in the program from which the published algorithm derives. But at a twenty year distance from the original publication it is difficult to say. It must be emphasised that these differences are very small compared to the variations I have noticed in other encodings of the algorithm.
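As an illustration, here is a minimal Python sketch of the measure m (a word part has the form [C](VC){m}[V], so m counts the VC sequences) and of the two Step 2 points of difference above. The function names are my own, not taken from any of the distributed versions, and this sketches only these two rules, not the full algorithm:

```python
def is_consonant(word, i):
    """True if word[i] acts as a consonant in Porter's sense:
    a, e, i, o, u are vowels, and y is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's measure m: the stem has the form [C](VC){m}[V]."""
    pattern = []
    for i in range(len(stem)):
        t = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != t:
            pattern.append(t)          # collapse runs of the same type
    return "".join(pattern).count("VC")

def step2_differences(word):
    """Hypothetical helper showing only the two Step 2 points of difference."""
    if word.endswith("logi") and measure(word[:-4]) > 0:
        return word[:-1]               # new rule: (m>0) logi -> log
    if word.endswith("bli") and measure(word[:-3]) > 0:
        return word[:-3] + "ble"       # (m>0) bli -> ble, replacing abli -> able
    return word

print(measure("tree"), measure("trouble"), measure("oaten"))   # 0 1 2
print(step2_differences("archaeologi"))                        # archaeolog
print(step2_differences("possibli"))                           # possible
```

In the full algorithm, archaeology first has its final y turned to i in Step 1c, which is why the sketch is fed the intermediate form archaeologi.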
Common errors in other encodings

The algorithm clearly explains that when a set of rules of the type (condition)S1 -> S2 are presented together, only one rule is applied: the one with the longest matching suffix S1 for the given word. This is true whether the rule succeeds or fails (i.e. whether or not S2 replaces S1). Despite this, in most encodings the rules are simply applied in turn until either one of them succeeds or the list runs out.

This leads to small errors in various places, for example in the Step 4 rules (m>1)ement ->, (m>1)ment ->, (m>1)ent -> to remove final ement, ment and ent. Properly, argument stems to argument: the longest matching suffix is -ment, and the stem argu- has measure m equal to 1, so -ment will not be removed. End of Step 4. But if the three rules are applied in turn, then for suffix -ent the stem argum- has measure m equal to 2, and -ent gets removed.

Other common errors are to misinterpret the more delicate rules. Perhaps greater care was required in explaining them. So ((m>1) and (*s or *t))ion is taken to mean (m>1)(s or t)ion. The former means that taking off -ion leaves a stem with measure greater than 1 ending -s or -t; the latter means that taking off -sion or -tion leaves a stem of measure greater than 1. A similar confusion tends to arise in interpreting rule 5b, which reduces a final double L to a single L.

Occasionally one sees cruder errors, for example the test for Y being consonant or vowel set up the wrong way round.

It is interesting that although the published paper explains how to do the tests on the strings S1 by a program switch on the last or last but one letter, none of the encodings I have seen use this technique. This means that most encodings are much slower than necessary.
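To make the longest-match point concrete, here is a hedged Python sketch (the helper names and the three-rule subset are my own, not from any distributed version) contrasting the correct discipline with the apply-in-turn mistake, together with the correct reading of the -ion rule:

```python
def measure(stem):
    """Porter's measure m: number of VC sequences ([C](VC){m}[V]).
    y is treated as a vowel when it follows a consonant."""
    pattern = []
    for ch in stem:
        vowel = ch in "aeiou" or (ch == "y" and pattern and pattern[-1] == "C")
        t = "V" if vowel else "C"
        if not pattern or pattern[-1] != t:
            pattern.append(t)          # collapse runs of the same type
    return "".join(pattern).count("VC")

# A subset of the Step 4 rules: (suffix S1, replacement S2).
# All three share the condition (m>1) on the stem left after removing S1.
STEP4_SUBSET = [("ement", ""), ("ment", ""), ("ent", "")]

def longest_match(word, rules):
    """Correct: apply at most ONE rule, the one with the longest matching S1.
    If that rule's condition fails, no other rule is tried."""
    for suffix, repl in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + repl if measure(stem) > 1 else word
    return word

def first_success(word, rules):
    """The common error: try the rules in turn until one succeeds."""
    for suffix, repl in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 1:
                return stem + repl
    return word

def step4_ion(word):
    """Correct reading of ((m>1) and (*S or *T)) ION -> : removing -ion
    must leave a stem of measure > 1 that ends in s or t."""
    stem = word[:-3]
    if word.endswith("ion") and measure(stem) > 1 and stem[-1:] in ("s", "t"):
        return stem
    return word

print(longest_match("argument", STEP4_SUBSET))   # argument (correct)
print(first_success("argument", STEP4_SUBSET))   # argum    (the common error)
print(step4_ion("adoption"))                     # adopt
print(step4_ion("opinion"))                      # opinion (stem ends in n)
```

Sorting the rules by descending suffix length is one simple way to get the longest-match discipline; the switch-on-last-letter technique mentioned above achieves the same selection much faster.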
FAQs

The most frequently asked question is why word X should be stemmed to x1, when one would have expected it to be stemmed to x2. It is important to remember that the stemming algorithm cannot achieve perfection. On balance it will (or may) improve IR performance, but in individual cases it may sometimes make what are, or what seem to be, errors.
Further resources

All my new work on stemming is at http://snowball.tartarus.org.