C extension for Porter stemmer

While playing with LSI, I noticed that the program ran for far too long and used enormous amounts of memory.
Using the great tool ruby-prof, I found, to my astonishment, that the program spent more time in stemming than in SVD.
So I wanted to see whether a compiled C extension would make a difference. I took the thread-safe Porter algorithm from http://tartarus.org/~martin/PorterStemmer/ and wrapped it with SWIG.
The result was a speedup of about 7x in wall-clock time, close to an order of magnitude (10000 rounds over 11 words):

            user     system      total        real
stem :  3.480000   0.250000   3.730000 (  3.719107)
fstem:  0.440000   0.090000   0.530000 (  0.526526)

This is what I call a “performance boost” 🙂
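For anyone who wants to repeat this kind of comparison, here is a minimal sketch using Ruby's standard Benchmark module, which prints the same user/system/total/real columns. The word list, the `compare_stemmers` helper and the two stand-in lambdas are assumptions made for illustration; they are not the words or stemmers I actually timed.

```ruby
require 'benchmark'

# Hypothetical word list; any 11 words will do.
WORDS = %w[running flies happily connected national relational
           conditional hopefulness adjustment formalize sensibility].freeze

# Times each labelled callable over `rounds` passes of `words`.
# Returns a hash of label => Benchmark::Tms.
def compare_stemmers(stemmers, words: WORDS, rounds: 10_000)
  results = {}
  Benchmark.bm(6) do |bm|
    stemmers.each do |label, fn|
      results[label] = bm.report("#{label}:") do
        # dup each word in case a stemmer edits its argument in place
        rounds.times { words.each { |w| fn.call(w.dup) } }
      end
    end
  end
  results
end

# Stand-ins so the sketch runs on its own. Neither is a Porter stemmer.
naive    = ->(w) { w.sub(/(ing|ly|ed|s)\z/, '') }
identity = ->(w) { w }
compare_stemmers({ 'stem' => naive, 'fstem' => identity }, rounds: 100)
```

To compare real implementations, pass the pure Ruby stemmer and the compiled one as the two callables instead of the stand-ins.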

porter.i (for swig):

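%module stemmer
%{
#include <stdlib.h>
#include <string.h>

/* Defined in the thread-safe stemmer source from tartarus.org */
extern struct stemmer *create_stemmer(void);
extern void free_stemmer(struct stemmer *z);
extern int stem(struct stemmer *z, char *b, int k);

char *stem_word(char *word)
{
  int length, i;
  char *res;
  struct stemmer *z = create_stemmer();
  /* stem() takes the index of the last char and returns the new one */
  length = stem(z, word, (int)strlen(word) - 1);
  free_stemmer(z);  /* release the stemmer state so it does not leak */
  /* length is the index of last char, add one for size and one for '\0' */
  res = (char *)malloc((length + 2) * sizeof(char));
  for (i = 0; i <= length; i++)
    res[i] = word[i];
  res[length + 1] = 0;
  return res;
}
%}
%newobject stem_word;
char *stem_word(char *);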
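Here is a rough sketch of how the wrapper could be built and called from Ruby. The build steps in the comments are the usual SWIG plus mkmf routine and assume that porter.i and the stemmer's C source sit in the current directory. `load_stemmer` and `stem_all` are helper names made up for this example. With `%module stemmer`, SWIG exposes the function as `Stemmer.stem_word`.

```ruby
# Assumed build steps, run in the directory holding porter.i:
#   swig -ruby porter.i     # generates porter_wrap.c
#   ruby extconf.rb         # extconf.rb: require 'mkmf'; create_makefile('stemmer')
#   make                    # builds stemmer.so (or .bundle)

# Loads the compiled extension from `path`; returns the Stemmer module,
# or nil when the extension has not been built yet.
def load_stemmer(path = './stemmer')
  require path
  Stemmer
rescue LoadError
  nil
end

# Stems every word with anything that responds to stem_word.
def stem_all(words, stemmer)
  # dup in case the C code edits the string buffer in place
  words.map { |w| stemmer.stem_word(w.dup) }
end

if (ext = load_stemmer)
  p stem_all(%w[running connected], ext)
else
  warn 'stemmer extension not built yet'
end
```

Because of `%newobject`, SWIG frees the malloc'ed result after copying it into a Ruby string, so the caller has nothing to clean up.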
