Publications
* indicates equal contribution
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
arXiv Preprint, 2025
[Paper]
[Code]
[Project Page]
We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding, which leverages reinforcement learning to enhance reasoning across complex multi-image and multi-modal scenarios.
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Sule Bai*, Yong Liu*, Yifei Han, Haoji Zhang, Yansong Tang
arXiv Preprint, 2024
[Paper]
[Code]
We propose a training-free method that enhances CLIP's dense representation through self-calibration without introducing new parameters or relying on additional backbones.
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu*, Sule Bai*, Guanbin Li, Yitong Wang, Yansong Tang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[Paper]
[Code]
We propose an open-vocabulary segmentation (OVS) method that calibrates the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP.
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
Shiyi Zhang*, Sule Bai*, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[Paper]
[Code]
We investigate a new problem called narrative action evaluation (NAE) and propose a prompt-guided multimodal interaction framework to address it.