Wednesday, April 8, 2009

Developing a Document Image Retrieval System

Greetings, my name is Konstantinos Zagoris. I am a close friend of Savvas Chatzichristofis and the developer of img(Anaktisi). I have been invited to this blog to describe a field of Information Retrieval: the Document Image Retrieval System (DIRS) based on word spotting.

This technique performs word matching directly on the document images, bypassing OCR and using word images as queries. The entire system consists of an Offline and an Online procedure. In the Offline procedure, the document images are analyzed and the results are stored in a database. It consists of three main stages: preprocessing, word segmentation and feature extraction. A set of features capable of capturing the word shape while discarding detailed differences due to noise or font variations is used for the word-matching process. The Online procedure consists of four components: the creation of the query image, the preprocessing stage, the feature extraction stage and, finally, the matching procedure.

The overall structure of the Document Image Retrieval System.
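To make the two procedures more concrete, here is a minimal Python sketch of the same flow. It is only an illustration under several assumptions and not the actual DIRS code (which is written in C#): images are already binarized lists of 0/1 pixels, each page is a single text line, words are separated by blank columns, and the only feature is a normalized vertical projection profile. The names `segment_words`, `extract_features`, `build_index` and `search` are hypothetical, and the real descriptor uses a richer set of shape features.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def segment_words(line: Image, min_gap: int = 2) -> List[Image]:
    """Split a text-line image into words at runs of >= min_gap blank columns."""
    width = len(line[0])
    ink = [any(row[x] for row in line) for x in range(width)]
    words, start, gap = [], None, 0
    for x in range(width):
        if ink[x]:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                end = x - gap + 1  # first blank column after the word
                words.append([row[start:end] for row in line])
                start, gap = None, 0
    if start is not None:  # word that runs to the right edge of the line
        words.append([row[start:width - gap] for row in line])
    return words


def extract_features(word: Image, bins: int = 8) -> List[float]:
    """Vertical projection profile, resampled to `bins` values that sum to 1."""
    width = len(word[0])
    columns = [sum(row[x] for row in word) for x in range(width)]
    profile = []
    for b in range(bins):
        lo = b * width // bins
        hi = max(lo + 1, (b + 1) * width // bins)
        profile.append(sum(columns[lo:hi]) / (hi - lo))
    total = sum(profile) or 1.0  # avoid dividing by zero on a blank image
    return [v / total for v in profile]


def build_index(pages: Dict[str, Image]) -> List[Tuple[str, List[float]]]:
    """Offline procedure: segment every page and store one feature vector per word."""
    index = []
    for page_id, page in pages.items():
        for word in segment_words(page):
            index.append((page_id, extract_features(word)))
    return index


def search(query: Image, index: List[Tuple[str, List[float]]],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Online procedure: describe the query image and rank stored words by L1 distance."""
    q = extract_features(query)
    scored = [(page_id, sum(abs(a - b) for a, b in zip(q, feats)))
              for page_id, feats in index]
    return sorted(scored, key=lambda item: item[1])[:top_k]
```

Resampling the profile to a fixed number of bins and normalizing it is one simple way to reduce the effect of font size. In this sketch the query is passed in as a ready-made word image; rendering a typed word into an image is left out.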

In contrast to the descriptors hosted in img(Anaktisi), this descriptor primarily uses shape features. The image below depicts the descriptor and the features that it contains. These features were selected so that they describe the shape of the query words satisfactorily while suppressing small differences due to noise, font size and font type.

zag_fig2

The description of the above features can be found in the journal article:

or in the conference paper (in a compact form):

You can find the presentation of the above conference paper here. Below is a simpler version for web presentation purposes.

A very early (and rough) version of the proposed DIRS is described in the conference paper:

The proposed system was implemented in Visual Studio 2008 and is based on the Microsoft .NET Framework 3.5. The programming language used is C#. For user interaction, the application employs AJAX/JavaScript and HTML.

The document images in the database were created artificially from various texts, and noise was then added to them. Because the source text of every image is known, a text search engine could be implemented in parallel, which makes it easier to verify and evaluate the search results of the DIRS. The database used by the implemented DIRS is Microsoft SQL Server 2005.
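The evaluation idea can be sketched in a few lines of Python. This is a hedged illustration only: the post does not say which kind of noise was added, so salt-and-pepper noise (randomly flipped pixels) is an assumption here, and `add_salt_and_pepper` and `precision_recall` are hypothetical helper names. The second function compares the set of documents returned by the image-based search with the set returned by a plain text search over the known source texts.

```python
import random
from typing import List, Optional, Set, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def add_salt_and_pepper(image: Image, flip_prob: float = 0.05,
                        seed: Optional[int] = None) -> Image:
    """Return a copy of the image where each pixel is flipped with probability flip_prob."""
    rng = random.Random(seed)  # seeded so a noisy test set can be reproduced
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in image]


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compare image-search results (retrieved) with text-search results (relevant)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if the image search returns documents `{"d1", "d2", "d3"}` for a query word and the text search finds that word in `{"d2", "d3", "d4"}`, both precision and recall are 2/3.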

clip_image002

The web address of the implemented system is http://orpheus.ee.duth.gr/irs2_5

The advantage of the described method is its resilience to noise. An example of a noisy document is depicted in the image below. This document is a retrieval result for the word “literature”.

zag_fig3

Read Part II

For more information or questions email me at kzagoris@gmail.com.

Dr Konstantinos Zagoris (http://www.zagoris.gr) received his Diploma in Electrical and Computer Engineering in 2003 from the Democritus University of Thrace, Greece, and his PhD from the same university in 2010. His research interests include document image retrieval, color image processing and analysis, document analysis, pattern recognition, databases and operating systems. He is a member of the Technical Chamber of Greece.
