Searching a tree structure with Lucene.Net

 

/*
Newsflash!

It's now just a couple of days until we release Beta5.
In it, the Content Query infrastructure moves to the Lucene platform.
Sense/Net 6.0 endorses the "eat your own dogfood" philosophy,
so the Content Query feature is used massively across the system
in the core functionality. Even the repository type system
relies on the Content Query layer.
We expect a nice boost in global system performance
in every content related feature. Read on to find out why.
*/

Lucene is amazing: it is really everything you want from a software component. It's smart, does its job well while asking very little in return, greatly simplifies your (professional) life :), is open source (so you can feel more confident about mitigating the project risks that come with relying on a 3rd party tool), and is extensible and proven.

While implementing the Sense/Net Content Repository Lucene Adapter we gained some experience worth sharing about using Lucene with deeply structured content and path based queries.

As Lucene follows a flat model (one and only one document type, arbitrary fields, no structures), the most obvious approach is to flatten the tree by storing the position of each node as an extra stored/indexed field added to the node's own fields; let's call it the Path field.

The PrefixQuery

Now, finding nodes in a specific sub-tree looks really simple as we have PrefixQuery at hand:

private IEnumerable<DocumentWrapper> Search()
{
    var pathquery = new PrefixQuery(new Term("Path", "c:\\Development"));
    var result = Searcher.Search(pathquery);
    foreach (Document o in result.Iterator)
    {
        yield return new DocumentWrapper(o);
    }
}

In your first couple of test scenarios it works nicely. So you increase the size of your prototype to handle, say, a quarter of a million items, just to run into the dreaded “TooManyClauses was unhandled” exception.


It turns out that the PrefixQuery rewrites into a list of possible TermQueries connected with OR, so the query Path = “C:\Development*” basically transforms into this: Path = “C:\Development\Folder1” OR Path = “C:\Development\Folder2” OR … and so on. The list is capped at 1024 clauses by default (BooleanQuery.MaxClauseCount).
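As an aside, the cap itself is a static setting that can be raised; the method name below matches the Lucene.Net 2.x port of Java's setMaxClauseCount, so treat it as an assumption if you are on a different version:

```csharp
// Raising the clause limit only postpones the problem: the rewritten
// BooleanQuery still enumerates every matching term, so memory and CPU
// costs keep growing with the size of the subtree.
BooleanQuery.SetMaxClauseCount(10000); // default is 1024
```

Raising the limit trades the exception for slower, more memory-hungry queries, so it is a stopgap rather than a fix.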

This clause limit looks really bad, but googling around the topic quickly brings up a solution that appears to work.

The PrefixFilter solution!

The PrefixFilter lets you efficiently filter documents out of the query result based on a prefix criterion.

var query = new MatchAllDocsQuery();
var filter = new PrefixFilter(new Term("Path", "c:\\Development"));
var result = Searcher.Search(query, filter);

The query completes nicely, but with a touch of sluggishness. It has two major drawbacks, however:

- This approach splits the original query into two parts, a query and a filter, which would complicate things to a level that is likely to be error prone in complex query cases (where multiple filter references would be needed)

- Performance drops to mediocre/bad this way: in my test environment (Core2Duo processor, 2 GHz; Lucene index of 219,000 rows, effectively my complete c:\ drive) searching for c:\Development* returned 34,000 hits in 500 milliseconds. What is even worse: the execution time does not drop significantly as the result count gets smaller. You may say: hey, 500 milliseconds is fantastic for such a mighty query! Yes, for the desktop it is. For the “Server Side” it isn't. In our case millions of queries will run a day, most of them around content paths, and if each cost us 500 ms then we are, let's face it, mucked.

So what else do we have?

The ConstantScoreQuery Solution?

The ConstantScoreQuery wraps a filter (a PrefixFilter in our case) and turns it into a Query class, ready to be combined with other query parts. At least it removes the complexity of the split query model.

private IEnumerable<DocumentWrapper> Search()
{
    var prefixFilter = new PrefixFilter(new Term("Path", "c:\\Development"));
    var query = new ConstantScoreQuery(prefixFilter);
    var result = Searcher.Search(query);
    ....
}

But

- The performance is even worse: 800 milliseconds now

- Plus we now have to give up ranking and scoring, or at least settle for constant scores instead of the default relevance scores.

And this is the point where we can be relatively sure that the problem lies in our approach and not in the implementation. Let's think in TERMS.

The Ancestors array property way

Instead of querying the Path field for a prefix match, let's extend our Lucene storage model: on top of the Path field, let's store the path of each and every ancestor of the node in an Ancestor array property.

 

private static Field[] CreateAncestorFields(string path, string separator)
{
    string[] fragments = path.Split(new string[] { separator }, StringSplitOptions.None);
    List<Field> fields = new List<Field>();
    // Stop before the last fragment: only the proper ancestors are stored,
    // not the node's own path.
    for (int i = 0; i < fragments.Length - 1; i++)
    {
        string value = string.Join(separator, fragments, 0, i + 1);
        fields.Add(new Field("Ancestor", value, Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
    }
    return fields.ToArray();
}
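At indexing time the generated fields are simply appended to the node's document. A minimal sketch (the Path field name follows this article's convention, and `writer` is assumed to be an open IndexWriter):

```csharp
var doc = new Document();
doc.Add(new Field("Path", path, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

// Lucene allows repeated field names, so the Ancestor fields behave
// like a multi-valued field: each value becomes its own term.
foreach (var ancestorField in CreateAncestorFields(path, "\\"))
    doc.Add(ancestorField);

writer.AddDocument(doc);
```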

So the document at c:\development\sensenet\luceneadapter\program.cs will have 4 Ancestor fields:

c:
c:\development
c:\development\sensenet
c:\development\sensenet\luceneadapter

Lucene thinks in TERMS. Each of the lines above becomes a TERM of its own, and when we search for content that has a specific term as a value of one of its fields, we are back in Lucene's domain. A simple TermQuery on the Ancestor field produces the desired result in no time:

private IEnumerable<DocumentWrapper> Search()
{
    var query = new TermQuery(new Term("Ancestor", "c:\\Development"));
    var result = Searcher.Search(query);
    ....
}
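Because the subtree condition is now an ordinary TermQuery, it composes with any other clause through a BooleanQuery; no separate filter object is needed. A sketch (the Name field is hypothetical, used only for illustration):

```csharp
var query = new BooleanQuery();
// NOT_ANALYZED fields are matched verbatim, so the casing must be
// exactly what was indexed.
query.Add(new TermQuery(new Term("Ancestor", "c:\\development")),
          BooleanClause.Occur.MUST);
query.Add(new TermQuery(new Term("Name", "program.cs")),
          BooleanClause.Occur.MUST);
var result = Searcher.Search(query);
```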

 
[Benchmark results image; times are in milliseconds.]

 

And what is the trade-off here? The index gets bigger, of course. For ~219K items this meant about 30 extra MBs; compared to the total index size of 1.6 GB, this is acceptable. The memory footprint is small, though: searching the index gobbled up only an additional 4 MB of memory.

Comments (1) -

Tamás Bíró
12/17/2009 6:07:57 PM #

Great solution and well written post. I understood it right away. :)
