Hao Luo · 罗昊

I am a fourth-year Ph.D. student at the School of Computer Science, Peking University, advised by Prof. Zongqing Lu, and expect to graduate in 2027. Before starting my Ph.D., I received my B.E. in Computer Science from Peking University.

My current research focuses on Embodied AI, especially Embodied Foundation Models. Specifically, I work on generalizable vision-language-action models (VLAs) that scale with human videos, and on action-oriented latent representations learned with world models. Previously, I worked on VLMs, particularly for egocentric video understanding. Feel free to reach out if you are interested in discussions or collaborations.

Personally, I am particularly inspired by JEPA and believe that transition-aware, interaction-centric representations can drive progress toward truly generalizable policies with broad in-context reasoning.

Being-H Series @ BeingBeyond

Dexterous Foundation Models Scalable with Human Videos

The Being-H series explores scaling embodied foundation models with human videos.

  • Being-H0 validates that thousands of hours of human videos can pretrain dexterous VLAs, yielding human priors that transfer to downstream tasks.
  • Being-H0.5 further scales this recipe to 35,000+ hours through unified training on human, robot, and visual-text data, enabling cross-embodiment policies.
Being-H0

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo*, et al.

First-listed co-first author

PDF Blog Code HF
Being-H0.5

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo*, et al.

First-listed co-first author

PDF Blog Code HF

Selected Publications

Learning from Videos — VLA

CVPR 2026 JALA
PDF Blog Code

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu

Building on JEPT, JALA scales VLA pretraining by aligning hidden states with inverse dynamics, learning unified latent actions from both labeled and unannotated human videos.
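
To make the latent-action idea above concrete, here is a minimal, hypothetical PyTorch sketch of the generic pattern: an inverse-dynamics head infers a discrete latent action from consecutive observation embeddings, and a forward model supervises it so the latent must summarize the transition. All names and sizes (LatentInverseDynamics, obs_dim=128, 16 latent actions) are illustrative assumptions, not JALA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentInverseDynamics(nn.Module):
    """Toy latent-action learner: an inverse-dynamics head infers a discrete
    latent action from two consecutive observation embeddings, and a forward
    model must reconstruct the next embedding from that action, forcing the
    latent to capture what changed between frames."""

    def __init__(self, obs_dim: int = 128, n_latent_actions: int = 16):
        super().__init__()
        self.inverse = nn.Sequential(          # (z_t, z_{t+1}) -> action logits
            nn.Linear(2 * obs_dim, 256), nn.GELU(),
            nn.Linear(256, n_latent_actions),
        )
        self.action_emb = nn.Embedding(n_latent_actions, 64)
        self.forward_model = nn.Sequential(    # (z_t, action) -> predicted z_{t+1}
            nn.Linear(obs_dim + 64, 256), nn.GELU(),
            nn.Linear(256, obs_dim),
        )

    def loss(self, z_t: torch.Tensor, z_tp1: torch.Tensor) -> torch.Tensor:
        logits = self.inverse(torch.cat([z_t, z_tp1], dim=-1))
        # Straight-through Gumbel-softmax keeps the action discrete while
        # remaining differentiable end to end.
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        action = one_hot @ self.action_emb.weight
        pred = self.forward_model(torch.cat([z_t, action], dim=-1))
        return F.mse_loss(pred, z_tp1)

model = LatentInverseDynamics()
z_t, z_tp1 = torch.randn(8, 128), torch.randn(8, 128)  # stand-ins for frame embeddings
model.loss(z_t, z_tp1).backward()
```

Because the latent actions need no action labels, such a model can in principle be trained on raw video and later grounded with a small amount of labeled robot data.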
CVPR 2026 Spatial-Aware VLA
PDF Blog Code

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

VIPA-VLA enables spatially grounded VLA pretraining from human videos through visual-physical alignment, addressing the gap between 2D visual perception and 3D physical action.

Learning from Videos — Visual Policy

ICLR 2025 JEPT
PDF Code

Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

Hao Luo, Zongqing Lu

Inspired by JEPA, JEPT learns predictive visual representations of task-relevant transitions through joint embedding prediction. To our knowledge, it is the earliest work to bring JEPA-style learning into policy learning. This enables video-conditioned policy learning from unlabeled videos, reducing data cost and improving one-shot generalization.
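
For readers unfamiliar with the JEPA recipe mentioned above, the sketch below illustrates the general pattern: an online context encoder, an EMA-updated target encoder, and a predictor that regresses next-frame embeddings in latent space rather than pixels. This is a minimal, hypothetical illustration, not JEPT's actual model; every name and dimension here is an assumption.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy frame encoder (a ViT would be used in practice)."""
    def __init__(self, in_dim: int = 3 * 64 * 64, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class JEPAStylePretrainer(nn.Module):
    def __init__(self, emb_dim: int = 128, ema: float = 0.99):
        super().__init__()
        self.context_encoder = Encoder(emb_dim=emb_dim)
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)            # targets are updated by EMA only
        self.predictor = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.GELU(), nn.Linear(emb_dim, emb_dim)
        )
        self.ema = ema

    @torch.no_grad()
    def update_target(self) -> None:
        for pt, pc in zip(self.target_encoder.parameters(),
                          self.context_encoder.parameters()):
            pt.mul_(self.ema).add_(pc, alpha=1.0 - self.ema)

    def loss(self, frame_t: torch.Tensor, frame_tp1: torch.Tensor) -> torch.Tensor:
        pred = self.predictor(self.context_encoder(frame_t))  # predict in latent space
        with torch.no_grad():
            target = self.target_encoder(frame_tp1)           # no pixel reconstruction
        return F.mse_loss(pred, target)

model = JEPAStylePretrainer()
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
frame_t, frame_tp1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
opt.zero_grad()
model.loss(frame_t, frame_tp1).backward()
opt.step()
model.update_target()
```

The key design choice is predicting in embedding space, which spares the model from reconstructing pixel-level detail that is irrelevant to control.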
ECCV 2024 PVDR
PDF Code

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

Hao Luo, Bohan Zhou, Zongqing Lu

PVDR learns abstract visual dynamics priors from unlabeled videos and adapts them for efficient policy learning on tasks with domain gaps.
ECCV 2024 CLIP4MC
PDF Code

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang*, Junpeng Yue*, Hao Luo, Ziluo Ding, Zongqing Lu

CLIP4MC learns a more RL-friendly vision-language model from large-scale internet videos, providing better reward signals for open-ended embodied tasks.
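
The pattern described above, scoring agent behavior with a video-language model, can be illustrated in a few lines. This is a generic, hypothetical sketch (vlm_reward and the random embeddings are made up for illustration), not CLIP4MC's exact reward formulation.

```python
import torch
import torch.nn.functional as F

def vlm_reward(clip_emb: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
    """Dense reward: cosine similarity between a video-clip embedding and the
    embedding of a task prompt such as "chop a tree"."""
    return F.cosine_similarity(clip_emb, task_emb, dim=-1)

clip_embs = torch.randn(4, 512)               # embeddings of recent frame windows
task_emb = torch.randn(512)                    # embedding of the task prompt
rewards = vlm_reward(clip_embs, task_emb.expand_as(clip_embs))  # shape (4,)
```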

VLM

NeurIPS 2025 OpenMMEgo
PDF Code

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, Zongqing Lu

We enhance large multimodal models for egocentric video understanding with large-scale first-person QA data and training recipes tailored to first-person dynamics.
ICCV 2025 Highlight Being-VL0.5
PDF Blog Code

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Building on the visual BPE tokenizer proposed in our previous work, we design a complete training framework and present the Being-VL-0.5 model.
ICCV 2025 VideoOrion
PDF

VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng*, Yijiang Li*, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, Zongqing Lu

VideoOrion encodes videos with a two-branch design, using object tokens from a detect-segment-track pipeline to capture object dynamics alongside scene context.

Experience

Apr 2025 - Present

BeingBeyond Technology

Research Intern, Being Exploring-Large Models

Developing the Being-H series of embodied foundation models pretrained from human videos.

Exploring scalable VLA and latent world-action models for embodied intelligence.

Jan 2023 - Mar 2025

Beijing Academy of Artificial Intelligence (BAAI)

Research Intern, Multimodal Interaction Group

Conducted research on video-based representation learning for policy learning and VLMs.

Honors and Services

Awards

Lingjun Pilot Scholarship, Merit Student, Academic Excellence Award, Social Work Award, all at PKU

Reviewer

ICML, NeurIPS (before 2026), ICLR, CVPR, ICCV, ECCV, BMVC

Teaching Assistant

Algorithms, PKU, Spring 2021, 2022
Deep Reinforcement Learning, PKU, Spring 2024