Publications | Soolab Sibei Yang

2026

Vision Transformer Needs More Than Register

Cheng Shi, Yizhou Yu, and Sibei Yang†

Accepted by CVPR, 2026

arXiv Code
WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs

Yulin Zhang, Cheng Shi, and Sibei Yang†

Accepted by CVPR, 2026

arXiv HTML
Direct Language Embedding Enables Gaussian Splatting for Large Scenes

Zhida Li, Jianqiao Zhu, Hejin Huang, Yipeng Qin, Sibei Yang, Guanbin Li

Accepted by CVPR Finding, 2026
Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang†, Xing Chen

Accepted by ICLR, 2026
RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu†, Sibei Yang†

Accepted by ICLR, 2026

arXiv HTML Code

2025

Vision Function Layer in Multimodal LLMs

Cheng Shi, Yizhou Yu, and Sibei Yang†

Accepted by NeurIPS, 2025

arXiv PDF Code
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang†

Accepted by NeurIPS, 2025

arXiv HTML PDF
Discovering Compositional Hallucination in LVLMs

Sibei Yang†, Ge Zheng, Jiajin Tang, Jiaye Qian, Hanzhuo Huang, Cheng Shi

Accepted by NeurIPS, 2025

PDF
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang†

Accepted by NeurIPS, 2025

arXiv PDF Code
Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models

Xu Yue, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang

Accepted by NeurIPS, 2025

arXiv PDF
Sim-DETR: Unlock DETR for Temporal Sentence Grounding

Jiajin Tang*, Zhengxuan Wei*, Yuchen Zhu, Cheng Shi, Guanbin Li, Liang Lin, Sibei Yang†

Accepted by ICCV, 2025

arXiv PDF
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng*, Jiaye Qian*, Jiajin Tang, Sibei Yang†

Accepted by ICCV, 2025

arXiv PDF
No More Sibling Rivalry: Debiasing Human-Object Interaction Detection

Bin Yang*, Yulin Zhang*, Hong-Yu Zhou, Sibei Yang†

Accepted by ICCV, 2025

arXiv PDF
Closed-Loop Transfer for Weakly-supervised Affordance Grounding

Jiajin Tang*, Zhengxuan Wei*, Ge Zheng, Sibei Yang†

Accepted by ICCV, 2025

arXiv PDF
Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

Zhengxuan Wei*, Jiajin Tang*, and Sibei Yang†

Accepted by ICCV, 2025

arXiv PDF Code
Penalizing Boundary Activation for Object Completeness in Diffusion Models

Haoyang Xu, Tianhao Zhao, Sibei Yang, Yutian Lin

Accepted by ICCV, 2025

arXiv PDF Code
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang, Wei Zhang, Xiao Tan, Sibei Yang, Xiang Wan, Xiaonan Luo, Guanbin Li

Accepted by ICCV, 2025

arXiv PDF Code
Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM

Qiyuan Dai, and Sibei Yang†

Accepted by CVPR, 2025

arXiv PDF
Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement

Qiyuan Dai, Hanzhuo Huang, Yu Wu, Sibei Yang†

Accepted by CVPR, 2025

arXiv PDF
Rethinking Query-based Transformer for Continual Image Segmentation

Yuchen Zhu*, Cheng Shi*, Dingyou Wang, Jiajin Tang, Zhengxuan Wei, Yu Wu, Sibei Yang†

Accepted by CVPR, 2025

arXiv PDF Code
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

Chunlin Yu*, Hanqing Wang*, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang

Accepted by CVPR, 2025

arXiv HTML Code
VTON 360: High-fidelity virtual try-on from any viewing direction

Zijian He, Yuwei Ning, Yipeng Qin, Guangrun Wang, Sibei Yang, Liang Lin, Guanbin Li

Accepted by CVPR, 2025

arXiv HTML Code
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren

Accepted by CVPR, 2025

arXiv HTML
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing

Yi Wang, Fenghua Weng, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang

Accepted by ACL, 2025

arXiv PDF Code
Don’t Say No: Jailbreaking LLM by Suppressing Refusal

Yukai Zhou, Jian Lou, Zhijie Huang, Zhan Qin, Sibei Yang, Wenjie Wang

Accepted by ACL, 2025

arXiv PDF Code
MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Hanzhuo Huang*, Yuan Liu*, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang†

Accepted by ICLR, 2025

arXiv HTML Code
CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

Jinpeng Li*, Haiping Wang*, Jiabin Chen, Yuan Liu, Zhiyang Dou, Yuexin Ma, Sibei Yang, Yuan Li, Wenping Wang, Zhen Dong, Bisheng Yang

Accepted by ICLR, 2025

PDF Code
Discovering Influential Neuron Path in Vision Transformers

Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren

Accepted by ICLR, 2025

arXiv Code

2024

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-oriented Perspective

Chaoqi Chen*, Yushuang Wu*, Qiyuan Dai*, Hong-Yu Zhou*, Mutian Xu, Sibei Yang†, Xiaoguang Han†, Yizhou Yu†

Accepted by TPAMI, 2024

arXiv
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

Cheng Shi*, Yulin Zhang*, Bin Yang, Jiajin Tang, Yuexin Ma, Sibei Yang†

Accepted by ECCV, 2024

arXiv Code
Plain-Det: A Plain Multi-Dataset Object Detector

Cheng Shi*, Yuchen Zhu*, and Sibei Yang†

Accepted by ECCV, 2024

arXiv Code
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong, Ge Zheng, Yujing Sun, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma

Accepted by ECCV, 2024

arXiv Code
Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Qiyuan Dai, and Sibei Yang†

Accepted by CVPR, 2024

arXiv
The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation Using Foundation Models

Cheng Shi, and Sibei Yang†

Accepted by ICLR, 2024

arXiv Code
OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers

Han Liang, Jiacheng Bao, Ruichi Zhang, Sihan Ren, Yuecheng Xu, Sibei Yang, Xin Chen, Jingyi Yu, Lan Xu

Accepted by CVPR, 2024

arXiv Code Video
RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

Yumeng Liu*, Yaxun Yang*, Youzhuo Wang*, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, Xuming He, Yuexin Ma

Accepted by IJCAI, 2024

arXiv

2023

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Ge Zheng*, Bin Yang*, Jiajin Tang*, Hong-Yu Zhou, Sibei Yang†

Accepted by NeurIPS, 2023

arXiv Code
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang*, Yufan Feng*, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang†

Accepted by NeurIPS, 2023

arXiv Code
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

Cheng Shi, and Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment

Cheng Shi, and Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

Jiajin Tang*, Ge Zheng*, Jingyi Yu, Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
Temporal Collection and Distribution for Referring Video Object Segmentation

Jiajin Tang, Ge Zheng, and Sibei Yang†

Accepted by ICCV, 2023

HTML PDF
Grounded Image Text Matching with Mismatched Relation Reasoning

Yu Wu*, Yana Wei*, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He†

Accepted by ICCV, 2023

arXiv
Contrastive Grouping with Transformer for Referring Image Segmentation

Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang†

Accepted by CVPR, 2023

PDF Code
CCQ: Cross-Class Query Network for Partially Labeled Organ Segmentation

Xuyang Liu, Bingbing Wen, and Sibei Yang†

Accepted by AAAI, 2023

HTML Code
DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Longwen Zhang*, Qiwei Qiu*, Hongyang Lin*, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang†, Lan Xu†, Jingyi Yu†

Accepted by SIGGRAPH, 2023

arXiv HTML PDF Video
A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis

Hong-Yu Zhou*, Chixiang Lu*, Chaoqi Chen, Sibei Yang, Yizhou Yu†

Accepted by TPAMI, 2023

arXiv
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma

arXiv preprint arXiv:2304.05645, 2023

arXiv

2022

Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding

Cheng Shi, and Sibei Yang†

Accepted by ECCV, 2022

PDF Code

2021

Structured attention network for referring image segmentation

Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, Guanbin Li

Accepted by Transactions on Multimedia, 2021

HTML
Bottom-up shift and reasoning for referring image segmentation

Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, Yizhou Yu

Accepted by CVPR, 2021

PDF
Convnets vs. transformers: Whose visual representations are more transferable?

Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu

Accepted by ICCV, 2021

arXiv
Preservational learning improves self-supervised medical image models by reconstructing diverse contexts

Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Xiaoguang Han, Yizhou Yu

Accepted by ICCV, 2021

arXiv Code

2020

Relationship-embedded representation learning for grounding referring expressions

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by TPAMI, 2020

arXiv
Graph-structured referring expression reasoning in the wild

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by CVPR, 2020

arXiv Code

Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidence of reasoning.
Propagating over phrase relations for one-stage visual grounding

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by ECCV, 2020

HTML

2019

Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks

Xiang He, Sibei Yang, Guanbin Li, Haofeng Li, HuiYou Chang, Yizhou Yu

Accepted by AAAI, 2019

arXiv HTML
Dynamic Graph Attention for Referring Expression Comprehension

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by ICCV, Oct 2019

HTML Video
Cross-Modal Relationship Inference for Grounding Referring Expressions

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by CVPR, Jun 2019

HTML PDF Code

2018

Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning

Weifeng Ge, Sibei Yang, and Yizhou Yu

Accepted by CVPR, Jun 2018

arXiv HTML