Zhuofan Zong
I am a second-year Ph.D. student at MMLab, The Chinese University of Hong Kong, supervised by Prof. Hongsheng Li. I received both my Bachelor's and Master's degrees from Beihang University, where I was supervised by Prof. Biao Leng.
I previously worked as a research intern in the Base Model Department at SenseTime Research, collaborating closely with Guanglu Song and Yu Liu. During my internship at SenseTime, I was a core member of the founding team for frontline R&D projects, including the large vision foundation model, the multimodal interactive model, and the AIGC product SenseMirage.
Feel free to reach out for research discussions, collaborations, or just to have a chat!
Email / Google Scholar / GitHub / WeChat
Research
My research interests lie in Diffusion Models and Multimodal Large Language Models. My earlier work also covered 2D and 3D Visual Perception. (*: Equal Contribution, †: Corresponding Author)
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li†
arXiv, 2024
project page / paper / code
We present EasyRef, the first work capable of modeling the consistent visual elements of various group image references with a single generalist multimodal LLM for diffusion models.
VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu†, Hongsheng Li†
arXiv, 2024
project page / paper / code
We propose a diffusion-based framework for video face swapping, featuring hybrid training, an AIDT dataset, and 3D reconstruction for superior identity preservation and temporal consistency.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Zhuofan Zong*, Bingqi Ma*, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li†, Yu Liu†
NeurIPS, 2024
paper / code
MoVA is a novel MLLM that adaptively routes and fuses multiple task-specific vision experts in a coarse-to-fine mechanism, alleviating the bias of the CLIP vision encoder. Without any bells and whistles, MoVA achieves significant performance gains over current state-of-the-art methods.
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Bingqi Ma*, Zhuofan Zong*, Guanglu Song, Hongsheng Li, Yu Liu†
NeurIPS, 2024
paper / model API
We propose to unleash the prompt encoding capability of large language models (LLMs) for diffusion models. LiDiT-10B surpasses state-of-the-art models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The proposed method is also one of the core technologies powering SenseMirage.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao, Shengju Qian, Xiao Han, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu†, Hongsheng Li†
NeurIPS, 2024 (Spotlight Presentation)
project page / paper / code
We propose Visual CoT, a new pipeline, dataset, and benchmark that enhances the interpretability of MLLMs by incorporating visual Chain-of-Thought reasoning tailored to complex visual inputs.
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu†, Hongsheng Li†
NeurIPS, 2024
project page / paper / code
We propose a fine-tuning strategy that addresses the text-to-image misalignment issue with image-to-text concept matching. The training data includes only text prompts.
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
Zeyue Xue*, Guanglu Song*, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu†, Ping Luo†
NeurIPS, 2023
paper / model API
RAPHAEL proposes space-MoE and time-MoE layers and achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. The proposed method is also one of the core technologies powering SenseMirage.
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
Zhuofan Zong*, Dongzhi Jiang*, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li†, Yu Liu†
ICCV, 2023
paper / code
We propose a new paradigm, named Historical Object Prediction (HoP), for multi-view 3D detection to leverage temporal information more effectively. HoP achieves 68.5% NDS and 62.4% mAP with ViT-L, outperforming all counterparts on the nuScenes detection (camera-only) leaderboard.
DETRs with Collaborative Hybrid Assignments Training
Zhuofan Zong, Guanglu Song, Yu Liu†
ICCV, 2023
paper / code / explainer (in Chinese)
We present a novel collaborative hybrid assignments training scheme and achieve state-of-the-art performance on object detection and instance segmentation tasks. Notably, Co-DETR is the first model to achieve 66.0 box AP and 57.0 mask AP on the COCO test-dev leaderboard.
Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes
Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu†, Ping Luo†
NeurIPS, 2022
paper / code / explainer (in Chinese)
We propose the Adaptive Gradient Variance Modulator (AGVM) to train dense visual predictors with very large batch sizes. It enables training an object detector with one billion parameters in just 3.5 hours, reducing training time by 20.9×, while achieving 62.2 mAP on the COCO val leaderboard.
Self-slimmed Vision Transformer
Zhuofan Zong*, Kunchang Li*, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu†
ECCV, 2022
paper / code
We propose a generic self-slimmed learning approach for ViT token pruning. Our method speeds up ViTs by 1.7× with a negligible accuracy drop, and by up to 3.6× while maintaining 97% of their performance on the ImageNet-1K dataset.
RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object Detection
Zhuofan Zong, Qianggang Cao, Biao Leng†
ACM MM, 2021
paper / code
We introduce RCNet, a novel architecture for multi-scale feature fusion in object detection that addresses the inefficiencies and limitations of traditional Feature Pyramid Networks (FPN). Experiments on the MS COCO dataset show that RCNet brings significant performance gains.
Graph Attention Based Proposal 3D ConvNets for Action Detection
Jun Li, Xianglong Liu†, Zhuofan Zong, Wanru Zhao, Mingyuan Zhang, Jingkuan Song
AAAI, 2021
paper
We propose graph attention-based 3D CNNs (AGCN) for video action detection, addressing the limitations of existing models that overlook intra- and inter-proposal relationships. AGCN achieves state-of-the-art performance and improves the average mAP by 3.7% on the THUMOS 2014 dataset.
Education
Ph.D. student in Multimedia Lab (MMLab) @ The Chinese University of Hong Kong
Sep. 2023 - Present
Advisor: Prof. Hongsheng Li
Master in Computer Technology @ Beihang University
Sep. 2020 - Jan. 2023
Advisor: Prof. Biao Leng
Bachelor in Computer Science and Engineering @ Beihang University
Sep. 2016 - Jun. 2020
Advisor: Prof. Biao Leng
Experience
Research Intern @ SenseTime, Base Model Department
Large vision foundation models
May 2021 - Jan. 2025
Mentor: Guanglu Song
Services
- Conference Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AISTATS
- Journal Reviewer: TMLR, TCSVT, PR, IMAVIS
Teaching
- Fundamentals of Applied Electromagnetics (ELEG3213), Fall 2024
- Introduction to Digital Signal Processing (ELEG3503), Spring 2025