Wednesday, October 7, 2015

A search runtime analysis of LIRE on 500k images

Article from http://www.semanticmetadata.net/

Search run time in LIRE depends heavily on the method used for indexing and search. There are two main ways to store data, each with its own linear search strategy, and there is approximate indexing as well. The two storage strategies are (i) to store the actual feature vector in a Lucene text field and (ii) to use the Lucene DocValues data format. While the former allows for easy access, more flexibility and compression, the latter is much faster when accessing raw byte[] data.

Linear search needs to open each and every document and compare the query vector to the one stored in the document. For linear search in Lucene text fields, caching boosts performance: the byte[] data of the feature vectors is read once from the index and then kept in memory. For the DocValues storage format, access is fast enough to allow for linear search.

With approximate indexing, a query string is run against the inverted index and only the k best matching candidates are used to find the n << k actual results by linear search. So first a text search is done, then a linear search on far fewer images is performed [1]. In our tests we used k=500 and n=10.
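The two-stage idea can be sketched in plain Java. This is only an illustration under stated assumptions, not LIRE's implementation: no LIRE or Lucene classes are used, the class and method names are hypothetical, the L1 distance is an arbitrary choice, and the first stage is replaced by a cheap distance on the first few vector components as a stand-in for the text query on the inverted index described in [1].

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: plain JDK only, no LIRE or Lucene classes.
class TwoStageSearchSketch {

    /** L1 distance between two feature vectors of equal length (assumed metric). */
    static double l1(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Ids 0..count-1, i.e. every document in the toy index. */
    static List<Integer> allIds(int count) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < count; i++) ids.add(i);
        return ids;
    }

    /** Linear search: compare the query to every listed vector, return the n nearest ids. */
    static List<Integer> linearSearch(double[][] features, List<Integer> ids, double[] query, int n) {
        List<Integer> sorted = new ArrayList<>(ids);
        sorted.sort(Comparator.comparingDouble(id -> l1(features[id], query)));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /**
     * Stand-in for the inverted-index query (an assumption, not the method of [1]):
     * rank all ids by a cheap distance on the first dims components and keep the best k.
     */
    static List<Integer> candidates(double[][] features, double[] query, int k, int dims) {
        List<Integer> ids = allIds(features.length);
        ids.sort(Comparator.comparingDouble(id -> {
            double sum = 0;
            for (int i = 0; i < dims; i++) sum += Math.abs(features[id][i] - query[i]);
            return sum;
        }));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    /** Approximate search: cheap filter down to k candidates, then exact linear re-ranking to n. */
    static List<Integer> approximateSearch(double[][] features, double[] query, int k, int n, int dims) {
        return linearSearch(features, candidates(features, query, k, dims), query, n);
    }

    public static void main(String[] args) {
        double[][] features = {{0, 0}, {1, 5}, {2, 0}, {10, 10}, {1, 0}};
        double[] query = {0, 0};
        System.out.println(linearSearch(features, allIds(features.length), query, 2));
        System.out.println(approximateSearch(features, query, 3, 2, 1));
    }
}
```

The second stage only ever sees the k candidates, so a true neighbour that the first stage ranks below position k cannot be recovered. That is the source of the recall loss of an approximate index; the tests in this post correspond to k=500 and n=10.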

Tests on 499,207 images have shown that at this scale approximate search already outperforms linear search. The following numbers are search times in ms. Note that the average value per search differs with the number of test runs because of the context of the runs, i.e. the state of the Java VM, OS processes, file systems, etc. But the trend can be seen.

[Image: table of search times in ms]

(*) Start-up latency when filling the cache was 6.098 seconds

(**) Recall with 10 results on ten runs was 0.76, on 100 runs recall was 0.72
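For readers who want to reproduce a recall figure like this, here is a minimal sketch. It assumes that recall means the fraction of the linear search results that also appear in the approximate results, averaged over all runs; the post does not spell out the exact definition, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: recall of an approximate result list against the exact one.
class RecallSketch {

    /** Fraction of the exact (linear search) result ids that the approximate search also returned. */
    static double recall(List<Integer> exact, List<Integer> approximate) {
        if (exact.isEmpty()) throw new IllegalArgumentException("exact result list is empty");
        Set<Integer> truth = new HashSet<>(exact);
        Set<Integer> found = new HashSet<>(approximate);
        found.retainAll(truth); // keep only the ids that are true results
        return (double) found.size() / truth.size();
    }

    /** Mean recall over several runs; both lists hold one result list per run, in the same order. */
    static double meanRecall(List<List<Integer>> exactRuns, List<List<Integer>> approximateRuns) {
        if (exactRuns.isEmpty() || exactRuns.size() != approximateRuns.size()) {
            throw new IllegalArgumentException("need the same non-zero number of runs on both sides");
        }
        double sum = 0;
        for (int i = 0; i < exactRuns.size(); i++) {
            sum += recall(exactRuns.get(i), approximateRuns.get(i));
        }
        return sum / exactRuns.size();
    }
}
```

Under this definition, a single run that returns 8 of the 10 true results has a recall of 0.8.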

In conclusion, with nearly 500,000 images the DocValues approach might be the best choice, as approximate indexing loses around 25% of the results while not boosting runtime performance that much. Further optimizations would be, for instance, query bundling or index splitting in combination with multithreading.
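Index splitting with multithreading could look roughly like the following. It is a sketch under assumptions, not something LIRE provides in this form: the feature vectors are taken to be in memory already, each partition is a contiguous id range searched by its own thread, the L1 distance is an arbitrary choice, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: linear search over index partitions in parallel, plain JDK only.
class SplitIndexSearchSketch {

    /** L1 distance between two feature vectors of equal length (assumed metric). */
    static double l1(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Linear search restricted to the ids in [from, to); returns the n nearest ids of that range. */
    static List<Integer> searchRange(double[][] features, double[] query, int from, int to, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int i = from; i < to; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble(id -> l1(features[id], query)));
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    /** Splits the index into parts ranges, searches them concurrently and merges the partial results. */
    static List<Integer> parallelSearch(double[][] features, double[] query, int n, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            int chunk = (features.length + parts - 1) / parts; // ceiling division
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                final int from = Math.min(p * chunk, features.length);
                final int to = Math.min(from + chunk, features.length);
                futures.add(pool.submit(() -> searchRange(features, query, from, to, n)));
            }
            // Each partition contributes at most n ids, so the merge step sorts at most parts * n ids.
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : futures) merged.addAll(future.get());
            merged.sort(Comparator.comparingDouble(id -> l1(features[id], query)));
            return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every partition returns its own n best ids, the merged list always contains the global n best, so unlike approximate indexing this optimization does not lower recall.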

[1] Gennaro, Claudio, et al. “An approach to content-based image retrieval based on the Lucene search engine library.” Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2010. 55-66.