Highlighting fields in memory-based Lucene indexes

I’m using Lucene more and more these days and digging deeper into a few subjects. Today I’m going to talk about how to handle the new highlighting features available in Lucene 4.1.

One of the main achievements of this new version is the great PostingsHighlighter. Michael McCandless wrote a great piece about it, "A new Lucene highlighter is born", and I encourage you to read it if you want to get serious about highlighting with Lucene :).

Now let’s say you want to use it on a MemoryIndex. I consider the MemoryIndex the best in-memory index type, with more than ~500k queries/s handled and the "perfect" reset() method, so that would be great, right? Sadly it’s only a nice dream: the MemoryIndex doesn’t store anything about the raw data, so... we need a plan B.

Plan B can be the old-fashioned, but still useful, RAMDirectory. It behaves like a normal Directory-based index and lets you store the data you need on the field to match. Here is an example of how to use it:

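        // imports needed by the snippet below (package names as of Lucene 4.1)
        import java.util.Arrays;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.FieldType;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.FieldInfo;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.queryparser.classic.QueryParser;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.RAMDirectory;
        import org.apache.lucene.util.Version;

        final int MAX_DOCS = 10;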
        final String FIELD_NAME = "text";
        final Directory index = new RAMDirectory();
        final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);

        IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_41, analyzer);
        IndexWriter writer = new IndexWriter(index, writerConfig);
        // create document
        Document document = new Document();
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true); // it needs to be stored to be properly highlighted
        type.setTokenized(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); // necessary for PostingsHighlighter
        document.add(new Field(FIELD_NAME, "this an example of text that must be highlighted", type));
        // add it to the index
        writer.addDocument(document);
        writer.commit();
        writer.close();

        Query query = new QueryParser(Version.LUCENE_41, FIELD_NAME, analyzer).parse("example");
        DirectoryReader directoryReader = DirectoryReader.open(index);

        IndexSearcher searcher = new IndexSearcher(directoryReader);
        PostingsHighlighter highlighter = new PostingsHighlighter();
        TopDocs topDocs = searcher.search(query, MAX_DOCS);
        String[] strings = highlighter.highlight(FIELD_NAME, query, searcher, topDocs);
        System.out.println(Arrays.toString(strings));
        // expected output: [this an <b>example</b> of text that must be highlighted]
        directoryReader.close(); // release the reader once we are done with it
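
To make it clearer why the field has to be both stored and indexed with offsets, here is a toy sketch in plain Java (no Lucene involved). It is not how PostingsHighlighter is implemented; the class OffsetHighlightSketch, the Span type and both methods are names I made up for illustration. The point it tries to show: a highlighter needs the character offsets of each hit and the original text to wrap, which is exactly what a store-nothing index cannot give back.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: hypothetical names, not Lucene's implementation. */
final class OffsetHighlightSketch {

    /** Stand-in for the start/end character offsets an index can record for a term hit. */
    static final class Span {
        final int start; // inclusive
        final int end;   // exclusive
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Finds whole-word, case-insensitive hits of term; plays the role of the indexed offsets. */
    static List<Span> offsetsOf(String text, String term) {
        List<Span> spans = new ArrayList<>();
        if (term.isEmpty()) {
            return spans;
        }
        int i = 0;
        while (i + term.length() <= text.length()) {
            int end = i + term.length();
            boolean hit = text.regionMatches(true, i, term, 0, term.length());
            boolean leftOk = i == 0 || !Character.isLetterOrDigit(text.charAt(i - 1));
            boolean rightOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (hit && leftOk && rightOk) {
                spans.add(new Span(i, end));
                i = end; // continue after the hit so spans never overlap
            } else {
                i++;
            }
        }
        return spans;
    }

    /** Wraps each span of the stored text in <b> tags; spans must be sorted and non-overlapping. */
    static String highlight(String storedText, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span s : spans) {
            out.append(storedText, cursor, s.start)
               .append("<b>").append(storedText, s.start, s.end).append("</b>");
            cursor = s.end;
        }
        return out.append(storedText, cursor, storedText.length()).toString();
    }

    public static void main(String[] args) {
        String text = "this an example of text that must be highlighted";
        System.out.println(highlight(text, offsetsOf(text, "example")));
        // prints: this an <b>example</b> of text that must be highlighted
    }
}
```

Take away either input (the offsets or the stored text) and highlight() has nothing to work with, which is the same reason the two FieldType settings are needed in the Lucene snippet.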

Right now I’m honestly considering using both indexes: querying the MemoryIndex heavily, and using the RAMDirectory only when I know there’s a match and I need the highlighting features. Maybe I’m not done digging around this hole and there is a way to make a highlighter work with the MemoryIndex, but I doubt it, both conceptually and after testing everything I could.
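
The control flow of that two-index idea could look roughly like the plain-Java sketch below. Everything in it is hypothetical: TwoTierHighlightSketch, fastMatcher and slowHighlighter are placeholder names, and the lambdas in main() are simple string operations standing in for the real thing. In a real setup I would expect the predicate to wrap something like a MemoryIndex search returning a score above zero, and the function to wrap the RAMDirectory + PostingsHighlighter code shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch of "cheap match first, expensive highlight only on a hit"; all names are hypothetical. */
final class TwoTierHighlightSketch {

    /**
     * Runs the cheap matcher on every document and calls the expensive
     * highlighter only for documents that matched.
     */
    static List<String> highlightMatches(List<String> docs,
                                         Predicate<String> fastMatcher,
                                         Function<String, String> slowHighlighter) {
        List<String> highlighted = new ArrayList<>();
        for (String doc : docs) {
            if (fastMatcher.test(doc)) {                    // tier 1: match only, nothing stored
                highlighted.add(slowHighlighter.apply(doc)); // tier 2: pay for storage and offsets
            }
        }
        return highlighted;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "nothing to see here",
                "this an example of text that must be highlighted",
                "another unrelated document");
        AtomicInteger slowCalls = new AtomicInteger();

        List<String> result = highlightMatches(
                docs,
                doc -> doc.contains("example"),             // stand-in for the fast index
                doc -> {                                    // stand-in for the highlighting index
                    slowCalls.incrementAndGet();
                    return doc.replace("example", "<b>example</b>");
                });

        System.out.println(result);
        System.out.println("slow path calls: " + slowCalls.get());
    }
}
```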

If you think otherwise, and know a way to do so, tell me 🙂

Vale


One comment

  1. Norbert

    Hi, interesting post … could you tell me a use case where I would need the MemoryIndex if I do not get any data back from it?
