Questions about Lucene (2): stemming and lemmatization


Question:

I tried out the stemming and lemmatization described in the article:

  • Reducing a word to its root form, e.g. "cars" to "car". This operation is called stemming.
  • Converting a word to its root form, e.g. "drove" to "drive". This operation is called lemmatization.

The experiment did not work.

The code is as follows:

public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}

Neither singular/plural differences nor other word-form changes are reflected in the results.

Could this be caused by the analyzer?

Answer:

It is indeed the analyzer: StandardAnalyzer performs neither stemming nor lemmatization, so it cannot equate singular with plural forms, or one word form with another.
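A quick way to see this is to print the tokens an analyzer actually produces. The small program below is my own illustration (not from the original question) using the Lucene 3.0 TokenStream API; apart from lowercasing and stop-word removal, the terms should come out unchanged, so "driving" stays "driving" and "cars" stays "cars", which is why a TermQuery for "drove" or "car" finds nothing.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class PrintTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Feed the sample sentence through the analyzer and print each term.
        TokenStream ts = analyzer.tokenStream("desc",
                new StringReader("Hello students was driving cars"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // expected: hello, students, driving, cars
        }
        ts.close();
    }
}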

That article describes the basic principles of full-text search. Understanding them helps you understand Lucene better, but it does not mean Lucene follows that basic process to the letter.

(1) About stemming

A well-known stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and the paper describing it is at http://tartarus.org/~martin/PorterStemmer/def.txt.

You can run simple tests on this page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

cars -> car

driving -> drive

tokenization -> token

However:

drove -> drove

As you can see, stemming reduces words to a root form by applying rules; it cannot recognize irregular word-form changes such as "drove".

The latest Lucene 3.0 already ships a PorterStemFilter class that implements this algorithm. Unfortunately there is no matching Analyzer, but that is easy to write ourselves:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Lower-case the tokens first, then reduce each one to its Porter stem.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
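As a quick sanity check (my own sketch, not part of the original answer), you can print the tokens this analyzer produces; "driving" and "cars" should come out as "drive" and "car", matching the online stemmer above, while the irregular "drove" stays "drove".

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PorterStemDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new PorterStemAnalyzer();
        TokenStream ts = analyzer.tokenStream("desc", new StringReader("driving cars drove"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // expected: drive, car, drove
        }
        ts.close();
    }
}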

Use this analyzer in your program and it will then match singular/plural forms and regular word-form variations:

public void createIndex() throws IOException { 
  Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms")); 
  IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

  Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED); 
  Document doc = new Document(); 
  field.setValue("Hello students was driving cars professionally"); 
  doc.add(field);

  writer.addDocument(doc); 
  writer.optimize(); 
  writer.close(); 
}

public void search() throws IOException { 
  Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms")); 
  IndexReader reader = IndexReader.open(d); 
  IndexSearcher searcher = new IndexSearcher(reader); 
  // Each query below should now find the document, because at index time
  // "cars", "driving" and "professionally" were stemmed to "car", "drive" and "profession".
  TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10); 
  System.out.println(docs.totalHits); 
  docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10); 
  System.out.println(docs.totalHits); 
  docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10); 
  System.out.println(docs.totalHits); 
}

(2) About lemmatization

Lemmatization generally relies on a dictionary; only with one can "drove" be mapped to "drive".
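In Lucene terms, such a dictionary could be plugged in as a TokenFilter that looks each token up and replaces it with its lemma. The following is only a minimal sketch under that assumption; the DictionaryLemmaFilter class and its map are my own illustration, not an existing Lucene class or the lemmatizer discussed below.

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical filter: replaces a token with its dictionary lemma, if one is known.
public final class DictionaryLemmaFilter extends TokenFilter {
    private final Map<String, String> lemmas;
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public DictionaryLemmaFilter(TokenStream input, Map<String, String> lemmas) {
        super(input);
        this.lemmas = lemmas;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String lemma = lemmas.get(termAtt.term());
        if (lemma != null) {
            termAtt.setTermBuffer(lemma); // e.g. "drove" -> "drive", "was" -> "be"
        }
        return true;
    }
}

A real implementation would of course load the dictionary from a file (for example from entries like those in lemmas.xml shown below) and would have to deal with words that have several possible lemmas.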

A web search turned up the European languages lemmatizer [http://lemmatizer.org/]. It is developed in C++ for Linux, but it is worth trying if you are interested.

First, download, compile and install it as the site describes:

libMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it at the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install english dictionaries with some additional features to work with.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install

After installation:

  • /usr/local/include/turglem contains the header files needed to compile your own code.
  • /usr/local/share/turglem/english contains the dictionary files; in lemmas.xml you can see the mapping between "drove" and "drive", and between "was" and "be".
  • /usr/local/lib contains the static libraries libMAFSA.a, libturglem.a, libturglem-english.a and libtxml.a used to build applications.

For example, lemmas.xml contains entries such as:

<l id="DRIVE" p="6" />
<l id="DROVE" p="6" />
<l id="DRIVING" p="6" />

The turglem-english-0.2 directory contains an example test program, test_utf8.cpp:

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <unistd.h> 
#include <turglem/lemmatizer.h> 
#include <turglem/lemmatizer.hpp> 
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv) 
{ 
        char in_s_buf[1024]; 
        char *nl_ptr;

        tl::lemmatizer lem;

        if(argc != 4) 
        { 
                printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]); 
                return -1; 
        }

        lem.load_lemmatizer(argv[1], argv[3], argv[2]);

        while (!feof(stdin)) 
        { 
                fgets(in_s_buf, 1024, stdin); 
                nl_ptr = strchr(in_s_buf, '\n'); 
                if (nl_ptr) *nl_ptr = 0; 
                nl_ptr = strchr(in_s_buf, '\r'); 
                if (nl_ptr) *nl_ptr = 0;

                if (in_s_buf[0]) 
                { 
                        printf("processing %s\n", in_s_buf); 
                        tl::lem_result pars; 
                        size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars); 
                        printf("%d\n", pcnt); 
                        for (size_t i = 0; i < pcnt; i++) 
                        { 
                                std::string s; 
                                u_int32_t src_form = lem.get_src_form(pars, i); 
                                s = lem.get_text<english_utf8_adapter>(pars, i, 0); 
                                printf("PARADIGM %d: normal form '%s'\n", (unsigned int)i, s.c_str()); 
                                printf("\tpart of speech:%d\n", lem.get_part_of_speech(pars, (unsigned int)i, src_form)); 
                        } 
                } 
        }

        return 0; 
}

Compile this file and link against the static libraries. Note the link order, otherwise you may get errors:

g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

Run the compiled program:

./output /usr/local/share/turglem/english/dict_english.auto \
         /usr/local/share/turglem/english/prediction_english.auto \
         /usr/local/share/turglem/english/paradigms_english.bin

Run a few tests. Although I do not yet fully understand how it works internally, the effect of lemmatization is clear:

drove 
processing drove 
3 
PARADIGM 0: normal form 'DROVE' 
        part of speech:0 
PARADIGM 1: normal form 'DROVE' 
        part of speech:2 
PARADIGM 2: normal form 'DRIVE' 
        part of speech:2

was 
processing was 
3 
PARADIGM 0: normal form 'BE' 
        part of speech:3 
PARADIGM 1: normal form 'BE' 
        part of speech:3 
PARADIGM 2: normal form 'BE' 
        part of speech:3
