Wentao Hu 胡文韬

Hi, I am Wentao Hu, a Master's student in the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU), advised by Prof. Hanwang Zhang in the MReaL Lab. Before that, I earned my Bachelor's degree in Statistics from Hunan University (HNU).

My research interests lie in AI-generated content (AIGC) and multimodal large language models (MLLMs), and I am eager to explore the unknown.

Curriculum Vitae

Education
  • Nanyang Technological University
    College of Computing and Data Science
    M.Eng. Student
    Aug. 2023 - present
  • Hunan University
    B.S. in Statistics
    Sep. 2019 - Jun. 2023
Experience
  • Central Media Technology Institute (Singapore), Huawei 2012 Laboratories
    AI Research Intern
    Aug. 2024 - May 2025
  • Kuaishou Technology
    Intern
    Dec. 2022 - Feb. 2023
News
2025
May 15: Three papers were submitted to NeurIPS 2025.
May 01: One paper was accepted at ICML 2025.
2024
Aug 19: I started my research internship at the Huawei Singapore Research Center and joined the Selftok team.
Publications
On Path to Multimodal Generalist: Levels and Benchmarks

Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Weiming Wu, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Hanwang Zhang, Shuicheng Yan

International Conference on Machine Learning (ICML) 2025 (Oral)

We propose a General-Level framework, inspired by the five-level capability grading mechanisms in the autonomous driving industry, to assess the performance and generality of Multimodal Large Language Models (MLLMs) across five levels. Central to this framework is the concept of Synergy, which categorizes capabilities based on whether MLLMs preserve synergy across comprehension, generation, and multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present General-Bench, a massive, ever-growing multimodal benchmark that encompasses a broad spectrum of skills, modalities, formats, and capabilities, with over 700 tasks and 325,800 instances.

Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

arXiv preprint, under review, 2025

We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: the Self-Consistency Tokenizer (Selftok). Selftok is a SOTA tokenizer that achieves both high-quality reconstruction and a high compression rate. After representing the training images as Selftok tokens, our VLM, as a pure AR model, achieves SOTA performance in both visual comprehension and generation.

Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Wang Lin*, Liyu Jia*, Wentao Hu*, Kaihang Pan, Zhongqi Yue, Jingyuan Chen, Fei Wu, Hanwang Zhang (* co-first authors)

arXiv preprint, under review, 2025

We propose Phys-AR, an autoregressive video generation framework that incorporates symbolic reasoning into the generation process, thereby maintaining the physical correctness of the generated videos.