Thursday, September 10, 2009


Juan C. Caicedo

My research topic is the combination of visual features and text data to improve the responses of an image retrieval system. While writing my research proposal, I used the term "multimodal information retrieval" to indicate that the system takes advantage of the information in text and images simultaneously to answer queries. I found two surveys that mention this approach as a promising and underexplored research direction (see Lew2006 and Datta2008 for more details, especially the latter, page 37, section 3.5: Multimodal Fusion and Retrieval).
Searching academic databases and digital libraries for scholarly articles on "multimodal information retrieval" returns a considerable number of papers. For instance, Google Scholar finds about 200 papers, and the top results are related to image retrieval; I suppose I am also expected to read all 200 of them. In general, the literature indicates that multimodal is a good term for expressing our intention of combining text and image data.
I submitted my research proposal to a doctoral symposium, where it was accepted for presentation and publication. Two of the three referees pointed out that multimodal is a confusing term for our intention of combining text and visual features. Later, at the defense of my research proposal, one of the two committee members also recommended changing the term. So I became confused about the right use of this word. I believe I have enough evidence that multimodal has been used to mean what I intend, but the comments of other experts contradict it.
As far as I understand, multimodal may be used to indicate the interaction between a user and a system through different devices (multimodal interaction), as one of the referees indicated in his review. On the other hand, when someone talks about multimodal data, it means that several sensors measure different aspects of the same phenomenon (such as this). So, from the multimodal data perspective, images and text would be measurements of the same phenomenon: a meaning or a semantic unit. However, that view seems complicated and unnatural to explain and understand.
The discussion about multimodal data in the context of our research is still open. Maybe we can publish a review paper to discuss it with many other people, at an information retrieval conference for instance. Meanwhile, I think I'll avoid the term unless we can be sure that it will be correctly understood.
[Lew2006] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1–19, February 2006.
[Datta2008] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, no. 2, pp. 1–60, April 2008.
