Zhuofan Zong
I am a second-year Ph.D. student at MMLab, The Chinese University of Hong Kong, supervised by Prof. Hongsheng Li. I received both my Bachelor's and Master's degrees from Beihang University, where I was supervised by Prof. Biao Leng.
I previously worked as a research intern in the Base Model Department at SenseTime Research, collaborating closely with Guanglu Song and Yu Liu. During my internship at SenseTime, I was a core member of the founding team for frontline R&D projects, including the large vision foundation model, the multimodal interactive model, and the AIGC product SenseMirage.
Feel free to reach out for research discussions, collaborations, or just to have a chat!
Email / Google Scholar / GitHub / WeChat
Research
My research interests lie in Diffusion Models and Multimodal Large Language Models. My earlier work also covered 2D and 3D Visual Perception. (*: Equal Contribution, †: Corresponding Author)
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li†
arXiv, 2024
project page / paper / code
We present EasyRef, the first work capable of modeling the consistent visual elements of various group image references with a single generalist multimodal LLM for diffusion models.
VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu†, Hongsheng Li†
arXiv, 2024
project page / paper / code
We propose a diffusion-based framework for video face swapping, featuring hybrid training, an AIDT dataset, and 3D reconstruction for superior identity preservation and temporal consistency.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Zhuofan Zong*, Bingqi Ma*, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li†, Yu Liu†
NeurIPS, 2024
paper / code
MoVA is a novel MLLM that adaptively routes and fuses multiple task-specific vision experts in a coarse-to-fine mechanism, alleviating the bias of the CLIP vision encoder. Without any bells and whistles, MoVA achieves significant performance gains over current state-of-the-art methods.
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Bingqi Ma*, Zhuofan Zong*, Guanglu Song, Hongsheng Li, Yu Liu†
NeurIPS, 2024
paper / model API
We propose to unleash the prompt encoding capability of large language models (LLMs) for diffusion models. LiDiT-10B surpasses state-of-the-art models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The proposed method is also one of the core technologies powering SenseMirage.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao, Shengju Qian, Xiao Han, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu†, Hongsheng Li†
NeurIPS, 2024 (Spotlight Presentation)
project page / paper / code
We propose Visual CoT, a new pipeline, dataset, and benchmark that enhances the interpretability of MLLMs by incorporating visual Chain-of-Thought reasoning tailored to complex visual inputs.
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu†, Hongsheng Li†
NeurIPS, 2024
project page / paper / code
We propose a fine-tuning strategy that addresses the text-to-image misalignment issue with image-to-text concept matching. The training data includes only text prompts.
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
Zeyue Xue*, Guanglu Song*, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu†, Ping Luo†
NeurIPS, 2023
paper / model API
RAPHAEL proposes space-MoE and time-MoE layers and achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. The proposed method is also one of the core technologies powering SenseMirage.
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
Zhuofan Zong*, Dongzhi Jiang*, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li†, Yu Liu†
ICCV, 2023
paper / code
We propose a new paradigm, named Historical Object Prediction (HoP), for multi-view 3D detection to leverage temporal information more effectively. HoP achieves 68.5% NDS and 62.4% mAP with ViT-L, outperforming all counterparts on the nuScenes detection (camera-only) leaderboard.
DETRs with Collaborative Hybrid Assignments Training
Zhuofan Zong, Guanglu Song, Yu Liu†
ICCV, 2023
paper / code / explainer (in Chinese)
We present a novel collaborative hybrid assignments training scheme and achieve state-of-the-art performance on object detection and instance segmentation tasks. Notably, Co-DETR is the first model to achieve 66.0 box AP and 57.0 mask AP on the COCO test-dev leaderboard.
Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes
Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu†, Ping Luo†
NeurIPS, 2022
paper / code / explainer (in Chinese)
We propose the Adaptive Gradient Variance Modulator (AGVM) to train dense visual predictors with very large batch sizes. It enables training an object detector with one billion parameters in just 3.5 hours, reducing training time by 20.9×, while achieving 62.2 mAP on the COCO val leaderboard.
Self-slimmed Vision Transformer
Zhuofan Zong*, Kunchang Li*, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu†
ECCV, 2022
paper / code
We propose a generic self-slimmed learning approach for ViT token pruning. Our method speeds up ViTs by 1.7× with a negligible accuracy drop, and by up to 3.6× while maintaining 97% of their performance on the ImageNet-1K dataset.
RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object Detection
Zhuofan Zong, Qianggang Cao, Biao Leng†
ACM MM, 2021
paper / code
We introduce RCNet, a novel architecture for multi-scale feature fusion in object detection that addresses the inefficiencies and limitations of traditional Feature Pyramid Networks (FPN). Experiments on the MS COCO dataset show that RCNet brings significant performance gains.
Graph Attention Based Proposal 3D ConvNets for Action Detection
Jun Li, Xianglong Liu†, Zhuofan Zong, Wanru Zhao, Mingyuan Zhang, Jingkuan Song
AAAI, 2021
paper
We propose graph attention-based 3D CNNs (AGCN) for video action detection, addressing the limitations of existing models that overlook intra- and inter-proposal relationships. AGCN achieves state-of-the-art performance and improves the average mAP by 3.7% on the THUMOS 2014 dataset.
Education
Ph.D. student in Multimedia Lab (MMLab) @ The Chinese University of Hong Kong
Sep. 2023 - Present
Advisor: Prof. Hongsheng Li
Master in Computer Technology @ Beihang University
Sep. 2020 - Jan. 2023
Advisor: Prof. Biao Leng
Bachelor in Computer Science and Engineering @ Beihang University
Sep. 2016 - Jun. 2020
Advisor: Prof. Biao Leng
Experience
Research Intern @ SenseTime, Base Model Department
Large vision foundation models
May 2021 - Jan. 2025
Mentor: Guanglu Song
Services
- Conference Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AISTATS
- Journal Reviewer: TMLR, TCSVT, PR, IMAVIS
Teaching
- Fundamentals of Applied Electromagnetics (ELEG3213), Fall 2024
- Introduction to Digital Signal Processing (ELEG3503), Spring 2025