Hi, I am Wentao Hu, a Master's student in the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU), advised by Prof. Hanwang Zhang in the Mreal Lab. Before that, I earned my Bachelor's degree in Statistics from Hunan University (HNU).
My research interests lie in AIGC and MLLMs, and I am eager to explore the unknown.
Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Weiming Wu, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Hanwang Zhang, Shuicheng Yan
International Conference on Machine Learning (ICML) 2025 (Oral),
We propose a General-Level framework, inspired by the five-level capability grading mechanisms in the autonomous driving industry, to assess the performance and generality of Multimodal Language Models (MLLMs) across five levels. Central to this framework is the concept of Synergy, which categorizes capabilities based on whether MLLMs preserve synergy across comprehension, generation, and multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present General-Bench, a massive, ever-growing multimodal benchmark that encompasses a broad spectrum of skills, modalities, formats, and capabilities — with over 700 tasks and 325,800 instances.
[arXiv] [Project Page] [Leaderboard] [Huggingface Benchmark] [新智元] [机器之心]
Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang
arXiv Preprint, Under Review, 2025
We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: the Self-Consistency Tokenizer (Selftok). Selftok is a SOTA tokenizer that achieves both high-quality reconstruction and a high compression bit rate. After representing the training images as Selftok tokens, our VLM, as a pure AR model, achieves SOTA performance in both visual comprehension and generation.
Wang Lin*, Liyu Jia*, Wentao Hu*, Kaihang Pan, Zhongqi Yue, Jingyuan Chen, Fei Wu, Hanwang Zhang (* co-first authors)
arXiv Preprint, Under Review, 2025
We propose Phys-AR, an autoregressive video generation framework that incorporates a symbolic reasoning process into generation, thereby maintaining the physical correctness of the generated videos.