Research

My research goal is to build "efficient and effective artificial intelligence systems" so that machines can perceive, understand and interact with the environment. Currently, I mainly focus on three research topics spanning machine learning, computer vision and optimization.

1) Learning Framework: design effective learning frameworks, training tasks, and losses to formulate a problem so that AI models can learn the desired knowledge to handle general or specific tasks.

  • a) Self-Supervised (Multi-Modal) Learning: design effective and efficient self-supervised (multi-modal) learning frameworks that enable AI models to learn the desired knowledge and achieve human-level data understanding and reasoning ability.
    • Single-Modal Learning: PCL (ICLR, 800+ citations) is the first clustering-based contrastive learning method to learn the cluster structure of data, and its improved version, Mugs, develops multi-granular contrastive learning to learn multi-granular representations; a sketch of this kind of objective is given after this list. See more works like SANE (NeurIPS, spotlight), TEC, and PGCL (TNNLS).
    • Multi-Modal Learning: PTP (CVPR & TPAMI) pioneers enhancing the grounding ability of multi-modal models, and LOVA3 empowers models with human-like abilities, including answering, asking and assessing questions. See more works like CLOT (CVPR) for exploring human-like creativity, CoVGT (TPAMI & ECCV) for video question answering, Genixer for empowering multi-modal models as powerful data generators, and Wav-BERT (AAAI) for acoustic and linguistic representation learning.
  • b) Generative Models: design generative models, e.g. diffusion models, that generate image/3D/video data and endow AI models with imagination and creativity akin to those of humans.
    • Image Generation: MDT (ICCV) achieves SoTA image synthesis performance on ImageNet (256x256), and improves the learning speed of DiT, the core component of Sora, by at least 10x. See more works like EditAnything, FDT, ScaleLong (NeurIPS), and PPOGAN (NeurIPS).
    • 3D Generation: Consistent3D (CVPR) pioneers using ODE sampling as guidance in the text-to-3D task, overcoming the unpredictable and unstable SDE guidance in SDS/DreamFusion (the SDS objective is sketched after this list). See more works, e.g., DTC123 (CVPR) and Gamba for image-to-3D generation, and Instant3D for fast text-to-3D generation.
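As referenced under Single-Modal Learning above, a clustering-based contrastive objective typically combines an instance-level InfoNCE term with a prototype-level term. The sketch below shows this general form with illustrative notation (embedding v_i, its positive view v_i', prototypes c over M clustering granularities, temperature τ and per-prototype concentration φ); it is not necessarily the exact formulation used in PCL.

```latex
\mathcal{L} = \sum_{i=1}^{n} \Bigg(
  -\log \frac{\exp(v_i \cdot v_i' / \tau)}{\sum_{j=0}^{r} \exp(v_i \cdot v_j' / \tau)}
  \;-\; \frac{1}{M} \sum_{m=1}^{M}
  \log \frac{\exp(v_i \cdot c_s^{m} / \phi_s^{m})}{\sum_{j=0}^{r} \exp(v_i \cdot c_j^{m} / \phi_j^{m})}
\Bigg)
```

Here c_s^m denotes the prototype of the cluster that v_i is assigned to under the m-th clustering, so the second term pulls each embedding toward its own prototype and away from other prototypes.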
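For the 3D Generation bullet, the SDS guidance mentioned there optimizes 3D parameters θ of a renderer x = g(θ) through a pretrained diffusion model. A common way to write its gradient (following DreamFusion; the notation below is illustrative) is:

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta),\;\; x_t = \alpha_t x + \sigma_t \epsilon
```

Because the noise ε is resampled at every optimization step, this guidance is stochastic and can be unstable; Consistent3D instead derives guidance from a deterministic ODE sampling trajectory.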

2) Network Architecture Design: develop innovative network topologies that possess high capacity and efficiency for acquiring knowledge, thereby improving the overall capability of AI models.

  • Manually-Designed Networks: MetaFormer (CVPR Oral, 600+ citations) replaces self-attention in ViT with pooling or convolution and still achieves impressive performance, challenging the slogan "attention is all you need". It reveals a network design principle: if a network contains both spatial information-exchanging operations (e.g., attention, pooling and convolution) and channel information-exchanging operations (e.g., MLP), then it can perform well; a minimal sketch of this block structure is given after this list. Its improved version, the CAFormer network, sets a new record accuracy of 85.5% on ImageNet under supervised training without extra data, and achieves top-2 performance on ImageNet-C. See more works like IFormer, InceptionNeXt, DualFormer, and SUN (ECCV).
  • Automatically-Designed Networks: PR-DARTS (NeurIPS Oral) automatically designs effective network architectures, reducing the reliance on expert trial and error. It provides the first theory explaining why previous network search methods (a.k.a. AutoML) often collapse by selecting too many skip connections, and then proposes a new method that avoids this collapse and thus automatically selects and combines various network operations, e.g., pooling and convolution, to discover more effective networks; the search relaxation it builds on is sketched below.
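To make the design principle in the Manually-Designed Networks bullet concrete, below is a minimal PyTorch sketch of a generic MetaFormer-style block: a spatial token mixer (pooling here) followed by a channel MLP, each with normalization and a residual connection. The specific layer choices (average pooling, GroupNorm, 4x MLP expansion) are illustrative assumptions, not the exact PoolFormer/CAFormer implementation.

```python
import torch
import torch.nn as nn

class PoolTokenMixer(nn.Module):
    """Spatial information exchange via average pooling (no attention)."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x    # subtract identity; the residual is added outside

class ChannelMLP(nn.Module):
    """Channel information exchange via a two-layer 1x1-conv MLP."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(dim * expansion, dim, kernel_size=1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class MetaFormerBlock(nn.Module):
    """Generic block: norm -> token mixer -> residual, then norm -> channel MLP -> residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # GroupNorm(1, C) ~ channel-wise LayerNorm for (B, C, H, W)
        self.token_mixer = PoolTokenMixer()
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = ChannelMLP(dim)

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# quick shape check
blk = MetaFormerBlock(64)
print(blk(torch.randn(2, 64, 14, 14)).shape)   # torch.Size([2, 64, 14, 14])
```

Swapping PoolTokenMixer for an attention or convolution module leaves the surrounding structure unchanged, which is the point of the abstraction.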
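The collapse that PR-DARTS analyzes arises in the differentiable search relaxation sketched below, where each edge computes a softmax-weighted sum of candidate operations and the weight on the skip connection can come to dominate. This is a minimal sketch of the vanilla DARTS-style relaxation (the candidate set and dimensions are illustrative), not PR-DARTS's remedy.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One DARTS-style edge: a softmax-weighted sum over candidate operations."""
    def __init__(self, dim: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                   # skip connection
            nn.AvgPool2d(3, stride=1, padding=1),            # pooling
            nn.Conv2d(dim, dim, 3, padding=1, bias=False),   # convolution
        ])
        # one architecture parameter per candidate op, learned jointly with the weights
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

edge = MixedOp(16)
print(edge(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])
```

After search, the operation with the largest alpha on each edge is kept; if the skip connection's alpha dominates too often, the discretized network degenerates, which is the failure mode the theory explains.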

3) Parameter Optimizer: design efficient optimizers that train AI models faster and more effectively.

  • Faster Optimizer: Adan is about 2x faster than SoTA optimizers, e.g. Adam, while achieving higher or comparable performance on many networks, e.g., CNNs, ViTs and MAE in the CV field, UNet and ViTs in the AIGC field, GPT-2 and billion-scale LLaMA in the NLP field, and networks in RL tasks (a drop-in usage sketch is given at the end of this section). It has been included in popular deep-learning codebases, e.g., NVIDIA NeMo for training large language and multi-modal models, HuggingFace Timm and OpenMMLab, which both train AI models for CV tasks like classification, detection and segmentation, and Jittor of Tsinghua University for 3D generation; it is also the default optimizer in DreamFusion and MDT for SoTA 3D and image generation tasks, respectively.

    See more works, e.g., SLRLA (NeurIPS), which improves Lookahead, R-SPIDER (TPAMI & AISTATS), and HSDMPG (ICML & TPAMI).

  • Unified Plug-and-Play Acceleration Framework: Win (JMLR & ICLR) can accelerate AdamW, Adam, LAMB and SGD by 1.5x on vision classification and language modeling tasks with both CNN and Transformer backbones.
  • Network Optimization Theory: This work (NeurIPS, 200+ citations) provides the first theory to explain "why SGD generalizes better than Adam in deep learning". See more works like SLRLA (NeurIPS), which analyzes "why Lookahead generalizes better than SGD".
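As mentioned in the Faster Optimizer bullet, Adan is distributed as a drop-in replacement following the standard torch.optim interface. The snippet below is a minimal usage sketch; the import path, hyper-parameter names and values are assumptions for illustration, not a tuned recipe.

```python
import torch
import torch.nn as nn
from adan import Adan   # assumed import path of the released optimizer package

# tiny toy model and batch, just to show the optimizer plugging in
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = Adan(model.parameters(), lr=1e-3, weight_decay=0.02)  # illustrative values

x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()        # same train-loop calls as any torch.optim optimizer
optimizer.zero_grad()
```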