Bhathiya Hemanthage
My research focuses on Multimodal Deep Learning, with a special emphasis on Generalized Visual-Language Grounding with Complex Language Contexts.
Visual Grounding is a well-established area of research in multimodal information processing. However, the generalizability of the tasks commonly grouped under Visual Grounding, such as Referring Expression Comprehension (REC), Phrase Grounding, and Open Vocabulary Detection (OVD), is limited. For example, REC traditionally assumes that an expression always refers to a single region in an image, whereas Phrase Grounding and OVD typically handle simpler language queries, such as class categories or simple noun phrases. These limitations restrict the practical applicability of visual grounding research in real-world scenarios.
My work addresses these issues and can be categorized into two main task areas:
- Core Skill Enhancement: My work on Generalized Referring Expression Comprehension (GREC) [1] focused on eliminating the one-to-one assumption prevalent in REC. My ongoing research on the Described Object Detection (DoD) task extends the boundaries of OVD to include more sophisticated language expressions.
- Applications of Visual-Language Grounding in Complex Contexts: My prior work [2] proposed a Symbolic Scene Representation-based approach for vision-language tasks, focusing on the Situated Interactive Multimodal Conversation (SIMMC) task. More recent work [3] introduced a pseudo-labelling-based modular approach for the Ambiguous Candidate Identification (ACI) subtask within SIMMC, demonstrating the potential of Large Language Models (LLMs) in handling complex language contexts in multimodal dialogues.
My long-term vision is to develop Large Multimodal Models (LMMs) and Multimodal Large Language Models (MLLMs) that inherently ground complex language queries and can be easily fine-tuned for applications such as SIMMC or DTC.
[1] RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets. Hemanthage B., Bilen H., Bartie P., Dondrup C., Lemon O. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
[2] A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation. Hemanthage B., Bartie P., Dondrup C., Lemon O. 15th International Conference on Computational Semantics (IWCS), 2023.
[3] Divide and Conquer: Rethinking Ambiguous Candidate Identification in Multimodal Dialogues with Pseudo-Labelling. Hemanthage B., Dondrup C., Bilen H., Lemon O. 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDial), 2024.