Zhexiao Xiong

I am a third-year Ph.D. candidate in Computer Science in the Multimodal Vision Research Laboratory (MVRL) at Washington University in St. Louis, advised by Prof. Nathan Jacobs. Before that, I received my bachelor's degree in Electrical and Information Engineering from Tianjin University. I spent a wonderful year at the Institute of Automation, Chinese Academy of Sciences (CASIA), working with Prof. Jinqiao Wang and Dr. Xu Zhao. I was a research intern at OPPO Research.

Email  /  CV  /  Google Scholar  /  LinkedIn  /  GitHub

profile photo
Research

My research lies broadly in computer vision and multi-modal learning, especially generative models and AIGC-related topics, including personalized text-to-image generation, image editing, and cross-view and novel view synthesis. I am also interested in geometric computer vision and its combination with generative models.


I am actively looking for a research internship for summer 2025. Feel free to contact me!

GroundingBooth: Grounding Text-to-Image Customization
Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs

Recent studies in text-to-image customization show great success in generating personalized object variants given several images of a subject. While existing methods focus on preserving the identity of the subject, they often fall short of controlling the spatial relationship between objects. In this work, we introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation while maintaining text-image coherence. With such layout control, our model inherently enables the customization of multiple subjects at once. Our model is evaluated on both layout-guided image synthesis and reference-based customization tasks, showing strong results compared to existing methods. Ours is the first work to achieve joint grounding of both subject-driven foreground generation and text-driven background generation. The project page is available at https://groundingbooth.github.io.
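For intuition, here is a minimal sketch of the masked cross-attention idea (illustrative names and shapes, not our released implementation): each image query position may only attend to the text or subject tokens grounded in its layout region.

import torch

def masked_cross_attention(q, k, v, region_mask):
    """q: (B, Nq, D) image queries; k, v: (B, Nk, D) text/subject tokens;
    region_mask: (B, Nq, Nk) bool, True where query i may attend to token j."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, Nq, Nk)
    scores = scores.masked_fill(~region_mask, float("-inf"))
    # Assumes every query row has at least one attendable token (e.g., a
    # global caption token); an all-masked row would softmax to NaN.
    attn = scores.softmax(dim=-1)
    return attn @ v                                    # (B, Nq, D)

In the full model, masking of this kind would sit inside the diffusion backbone's cross-attention layers, with the masks derived from the grounding boxes.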

Mixed-View Panorama Synthesis using Geospatially Guided Diffusion
Zhexiao Xiong, Xin Xing, Scott Workman, Subash Khanal, Nathan Jacobs
arXiv

We introduce the task of mixed-view panorama synthesis, where the goal is to synthesize a novel panorama given a small set of input panoramas and a satellite image of the area. This contrasts with previous work, which uses only input panoramas (same-view synthesis) or only an input satellite image (cross-view synthesis). We argue that the mixed-view setting is the most natural for supporting panorama synthesis at arbitrary locations worldwide. A critical challenge is that the spatial coverage of panoramas is uneven, with few panoramas available in many regions of the world. We introduce an approach that utilizes diffusion-based modeling and an attention-based architecture for extracting information from all available input imagery. Experimental results demonstrate the effectiveness of our proposed method. In particular, our model can handle scenarios in which the available panoramas are sparse or far from the location of the panorama we are attempting to synthesize.
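As a rough illustration of the attention-based aggregation (the module name, embedding, and shapes here are all assumptions, not our actual architecture), one can inject each panorama's geospatial offset from the target location into its features before attending over them:

import torch
import torch.nn as nn

class GeoGuidedAggregator(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Embeds each panorama's 2D offset (east/north) from the target.
        self.geo_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, pano_feats, rel_offsets):
        """query_feat: (B, 1, D) query, e.g. from the satellite image;
        pano_feats: (B, N, D) features of the N input panoramas;
        rel_offsets: (B, N, 2) offsets of each panorama from the target."""
        keys = pano_feats + self.geo_mlp(rel_offsets)  # inject geospatial cue
        out, _ = self.attn(query_feat, keys, pano_feats)
        return out                                     # (B, 1, D) conditioning

The resulting vector could then condition the diffusion model that synthesizes the target panorama.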

MCPDepth: Panorama Depth Estimation from Multi-Cylindrical Panoramas by Stereo Matching
Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs

We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a two-stage framework for omnidirectional depth estimation via stereo matching between multiple cylindrical panoramas. MCPDepth uses cylindrical panoramas for initial stereo matching and then fuses the resulting depth maps across views. A circular attention module is employed to overcome the distortion along the vertical axis. MCPDepth exclusively utilizes standard network components, simplifying deployment to embedded devices and outperforming previous methods that require custom kernels. We theoretically and experimentally compare spherical and cylindrical projections for stereo matching, highlighting the advantages of the cylindrical projection. MCPDepth achieves state-of-the-art performance with an 18.8% reduction in mean absolute error (MAE) for depth on the outdoor synthetic dataset Deep360 and a 19.9% reduction on the indoor real-scene dataset 3D60.
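For a feel of why the cylindrical projection helps (this helper is a simplified assumption, not our actual preprocessing code): an equirectangular panorama can be resampled onto a cylinder whose rows are uniform in tan(latitude), so vertical structures stay straight and standard stereo kernels apply.

import math
import torch
import torch.nn.functional as F

def equirect_to_cylindrical(pano, out_h, out_w, max_lat=1.0):
    """pano: (B, C, H, W) equirectangular image on CPU (device handling
    omitted); max_lat: latitude cutoff in radians."""
    B = pano.shape[0]
    # Cylinder rows are uniform in tan(latitude), not in latitude itself.
    v = torch.linspace(-1.0, 1.0, out_h) * math.tan(max_lat)
    lat = torch.atan(v)                          # latitude of each output row
    lon = torch.linspace(-math.pi, math.pi, out_w)
    # Normalize to [-1, 1] for grid_sample: x ~ longitude, y ~ latitude.
    gy = lat / (math.pi / 2)
    gx = lon / math.pi
    grid = torch.stack(torch.meshgrid(gy, gx, indexing="ij"), dim=-1)[..., [1, 0]]
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(pano, grid, align_corners=True)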

Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning
Xin Xing, Zhexiao Xiong, Abby Stylianou, Srikumar Sastry, Liyu Gong, Nathan Jacobs
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024
arXiv

We propose a novel approach called Vision-Language Pseudo-Labeling (VLPL), which uses a vision-language model to suggest strong positive and negative pseudo-labels. VLPL outperforms the current SOTA methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and 8.4% on CUB-Birds.
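A minimal sketch of the pseudo-labeling step, assuming OpenAI's CLIP package and made-up thresholds (the released VLPL code differs):

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
class_names = ["dog", "cat", "person", "car"]  # example label space
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def pseudo_labels(image, pos_thresh=0.30, neg_thresh=0.05):
    """Returns per-class labels: 1.0 confident positive, 0.0 confident
    negative, -1.0 left unlabeled. Thresholds here are assumed values."""
    img = preprocess(image).unsqueeze(0).to(device)
    img_feat = model.encode_image(img)
    txt_feat = model.encode_text(text)
    sim = (img_feat / img_feat.norm(dim=-1, keepdim=True)
           @ (txt_feat / txt_feat.norm(dim=-1, keepdim=True)).T).squeeze(0)
    labels = torch.full_like(sim, -1.0)
    labels[sim > pos_thresh] = 1.0
    labels[sim < neg_thresh] = 0.0
    return labels

In this sketch, the entries left at -1 would be ignored by the loss, while the confident pseudo-positives and pseudo-negatives supervise the multi-label classifier alongside the single observed positive.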

StereoFlowGAN: Co-training for Stereo and Flow with Unsupervised Domain Adaptation
Zhexiao Xiong, Feng Qiao, Yu Zhang, Nathan Jacobs
British Machine Vision Conference (BMVC), 2023
arXiv

We introduce a novel training strategy for stereo matching and optical flow estimation that utilizes image-to-image translation between synthetic and real image domains. Our approach enables the training of models that excel in real image scenarios while relying solely on ground-truth information from synthetic images. To facilitate task-agnostic domain adaptation and the training of task-specific components, we introduce a bidirectional feature warping module that handles both left-right and forward-backward directions. Experimental results show competitive performance over previous domain-translation-based methods, substantiating the efficacy of our framework in leveraging the benefits of unsupervised domain adaptation, stereo matching, and optical flow estimation.
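As a simplified example of the warping in one of the two directions (a hypothetical helper; the actual module handles the opposite direction and forward-backward flow analogously), right-view features can be resampled into the left view using the predicted disparity:

import torch
import torch.nn.functional as F

def warp_by_disparity(feat, disp):
    """feat: (B, C, H, W) features from the right view;
    disp: (B, 1, H, W) left-view disparity (positive, in pixels).
    Returns right features resampled into the left view."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    x_src = xs.unsqueeze(0) - disp.squeeze(1)          # left x minus disparity
    gx = 2.0 * x_src / (W - 1) - 1.0                   # normalize to [-1, 1]
    gy = 2.0 * ys.unsqueeze(0).float() / (H - 1) - 1.0
    grid = torch.stack((gx, gy.expand_as(gx)), dim=-1) # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

Warping left to right flips the sign of the disparity, and the flow case replaces the scalar disparity with a two-channel offset field.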

PruneFaceDet: Pruning lightweight face detection network by sparsity training
Nanfei Jiang, Zhexiao Xiong, Hui Tian, Xu Zhao, Xiaojie Du, Chaoyang Zhao, Jinqiao Wang
Cognitive Computation and Systems

We propose a network pruning pipeline, PruneFaceDet, for lightweight face detection networks, which performs sparsity training with L1 regularisation before channel pruning. We compare two thresholding methods for choosing a proper pruning threshold in the channel-pruning stage. Applying the pipeline to a lightweight face detector and evaluating on the WiderFace dataset, we achieve a 56.3% reduction in parameter size with almost no accuracy drop.
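The sparsity-train-then-prune recipe looks roughly like the following network-slimming-style sketch (the hyperparameters and the global-percentile threshold are assumptions; the paper compares two thresholding methods):

import torch
import torch.nn as nn

SPARSITY = 1e-4  # L1 penalty weight on BN scale factors (assumed value)

def bn_l1_penalty(model):
    # Sum of |gamma| over all BatchNorm layers; added to the task loss so
    # unimportant channels are pushed toward zero during training.
    return sum(m.weight.abs().sum()
               for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def channel_keep_masks(model, prune_ratio=0.5):
    # Global percentile threshold over all BN scales; channels whose scale
    # falls below it are marked for removal.
    scales = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    thresh = torch.quantile(scales, prune_ratio)
    return {name: m.weight.detach().abs() > thresh
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}

# During training: loss = detection_loss + SPARSITY * bn_l1_penalty(model)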

Services

Conference Reviewer: ECCV 2024, NeurIPS 2024, ICLR 2025
Internships: CASIA (2021), OPPO (Spring 2022), OPPO US Research Center (Summer 2024)


Thanks to Jon Barron for sharing his website's source code.