Hao Luo · 罗昊

I am a fourth-year Ph.D. student at the School of Computer Science, Peking University, advised by Prof. Zongqing Lu, and expect to graduate in 2027. Before starting my Ph.D., I received my B.E. in Computer Science from Peking University.

My current research focuses on Embodied AI, especially Embodied Foundation Models. Specifically, I work on generalizable vision-language-action models (VLAs) that scale with human videos, and on action-oriented latent representations learned with world models. Previously, I worked on VLMs, particularly for egocentric video understanding. Feel free to reach out if you are interested in discussions or collaborations.

Personally, I am particularly inspired by JEPA and believe that transition-aware, interaction-centric representations can drive progress toward truly generalizable policies with broad in-context reasoning.

Being-H Series @ BeingBeyond

Dexterous Foundation Models Scalable with Human Videos

The Being-H series explores scaling embodied foundation models with human videos.

  • Being-H0 validates that thousands of hours of human videos can pretrain dexterous VLAs, yielding human priors that transfer to downstream tasks.
  • Being-H0.5 further scales this recipe to 35,000+ hours through unified training on human, robot, and visual-text data, enabling cross-embodiment policies.
Being-H0

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo*, et al.

First-listed co-first author

PDF Blog Code HF
Being-H0.5

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo*, et al.

First-listed co-first author

PDF Blog Code HF

Selected Publications

Learning from Videos — VLA

CVPR 2026 JALA
PDF Blog Code

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu

Building on JEPT, JALA scales VLA pretraining by aligning hidden states with inverse dynamics, learning unified latent actions from both labeled and unannotated human videos.
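
To make the latent-action idea above concrete, here is a minimal, hypothetical PyTorch sketch of the generic pattern: an inverse-dynamics head infers a discrete latent action from consecutive observation embeddings, and a forward model supervises it so the latent must summarize the transition. All names and sizes (LatentInverseDynamics, obs_dim=128, 16 latent actions) are illustrative assumptions, not JALA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentInverseDynamics(nn.Module):
    """Toy latent-action learner: an inverse-dynamics head infers a discrete
    latent action from two consecutive observation embeddings, and a forward
    model must reconstruct the next embedding from that action, forcing the
    latent to capture what changed between frames."""

    def __init__(self, obs_dim: int = 128, n_latent_actions: int = 16):
        super().__init__()
        self.inverse = nn.Sequential(          # (z_t, z_{t+1}) -> action logits
            nn.Linear(2 * obs_dim, 256), nn.GELU(),
            nn.Linear(256, n_latent_actions),
        )
        self.action_emb = nn.Embedding(n_latent_actions, 64)
        self.forward_model = nn.Sequential(    # (z_t, action) -> predicted z_{t+1}
            nn.Linear(obs_dim + 64, 256), nn.GELU(),
            nn.Linear(256, obs_dim),
        )

    def loss(self, z_t: torch.Tensor, z_tp1: torch.Tensor) -> torch.Tensor:
        logits = self.inverse(torch.cat([z_t, z_tp1], dim=-1))
        # Straight-through Gumbel-softmax keeps the action discrete while
        # remaining differentiable end to end.
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        action = one_hot @ self.action_emb.weight
        pred = self.forward_model(torch.cat([z_t, action], dim=-1))
        return F.mse_loss(pred, z_tp1)

model = LatentInverseDynamics()
z_t, z_tp1 = torch.randn(8, 128), torch.randn(8, 128)  # stand-ins for frame embeddings
model.loss(z_t, z_tp1).backward()
```

Because the latent actions need no action labels, such a model can in principle be trained on raw video and later grounded with a small amount of labeled robot data.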
CVPR 2026 Spatial-Aware VLA
PDF Blog Code

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

VIPA-VLA enables spatially grounded VLA pretraining from human videos through visual-physical alignment, addressing the gap between 2D visual perception and 3D physical action.

Learning from Videos — Visual Policy

ICLR 2025 JEPT
PDF Code

Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

Hao Luo, Zongqing Lu

Inspired by JEPA, JEPT learns predictive visual representations of task-relevant transitions through joint embedding prediction. To our knowledge, it is the earliest work to bring JEPA-style learning into policy learning. This enables video-conditioned policy learning from unlabeled videos, reducing data cost and improving one-shot generalization.
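
For readers unfamiliar with the JEPA recipe mentioned above, the sketch below illustrates the general pattern: an online context encoder, an EMA-updated target encoder, and a predictor that regresses next-frame embeddings in latent space rather than pixels. This is a minimal, hypothetical illustration, not JEPT's actual model; every name and dimension here is an assumption.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy frame encoder (a ViT would be used in practice)."""
    def __init__(self, in_dim: int = 3 * 64 * 64, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class JEPAStylePretrainer(nn.Module):
    def __init__(self, emb_dim: int = 128, ema: float = 0.99):
        super().__init__()
        self.context_encoder = Encoder(emb_dim=emb_dim)
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)            # targets are updated by EMA only
        self.predictor = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.GELU(), nn.Linear(emb_dim, emb_dim)
        )
        self.ema = ema

    @torch.no_grad()
    def update_target(self) -> None:
        for pt, pc in zip(self.target_encoder.parameters(),
                          self.context_encoder.parameters()):
            pt.mul_(self.ema).add_(pc, alpha=1.0 - self.ema)

    def loss(self, frame_t: torch.Tensor, frame_tp1: torch.Tensor) -> torch.Tensor:
        pred = self.predictor(self.context_encoder(frame_t))  # predict in latent space
        with torch.no_grad():
            target = self.target_encoder(frame_tp1)           # no pixel reconstruction
        return F.mse_loss(pred, target)

model = JEPAStylePretrainer()
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
frame_t, frame_tp1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
opt.zero_grad()
model.loss(frame_t, frame_tp1).backward()
opt.step()
model.update_target()
```

The key design choice is predicting in embedding space, which spares the model from reconstructing pixel-level detail that is irrelevant to control.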
ECCV 2024 PVDR
PDF Code

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

Hao Luo, Bohan Zhou, Zongqing Lu

PVDR learns abstract visual dynamics priors from unlabeled videos and adapts them for efficient policy learning on tasks with domain gaps.
ECCV 2024 CLIP4MC
PDF Code

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang*, Junpeng Yue*, Hao Luo, Ziluo Ding, Zongqing Lu

CLIP4MC learns a more RL-friendly vision-language model from large-scale internet videos, providing better reward signals for open-ended embodied tasks.
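
The pattern described above, scoring agent behavior with a video-language model, can be illustrated in a few lines. This is a generic, hypothetical sketch (vlm_reward and the random embeddings are made up for illustration), not CLIP4MC's exact reward formulation.

```python
import torch
import torch.nn.functional as F

def vlm_reward(clip_emb: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
    """Dense reward: cosine similarity between a video-clip embedding and the
    embedding of a task prompt such as "chop a tree"."""
    return F.cosine_similarity(clip_emb, task_emb, dim=-1)

clip_embs = torch.randn(4, 512)               # embeddings of recent frame windows
task_emb = torch.randn(512)                    # embedding of the task prompt
rewards = vlm_reward(clip_embs, task_emb.expand_as(clip_embs))  # shape (4,)
```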

VLM

NeurIPS 2025 OpenMMEgo
PDF Code

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, Zongqing Lu

We enhance large multimodal models for egocentric video understanding with large-scale first-person QA data and training recipes tailored to first-person dynamics.
ICCV 2025 Highlight Being-VL0.5
PDF Blog Code

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Building on the visual BPE tokenizer proposed in our previous work, we design a complete training framework and present the Being-VL-0.5 model.
ICCV 2025 VideoOrion
PDF

VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng*, Yijiang Li*, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, Zongqing Lu

VideoOrion encodes videos with a two-branch design, using object tokens from a detect-segment-track pipeline to capture object dynamics alongside scene context.

Experience

Apr 2025 - Present

BeingBeyond Technology

Research Intern, Being Exploring-Large Models

Developing the Being-H series of embodied foundation models pretrained from human videos.

Exploring scalable VLA and latent world-action models for embodied intelligence.

Jan 2023 - Mar 2025

Beijing Academy of Artificial Intelligence (BAAI)

Research Intern, Multimodal Interaction Group

Conducted research on video-based representation learning for policy learning and VLMs.

Honors and Services

Awards

Lingjun Pilot Scholarship, Merit Student, Academic Excellence Award, Social Work Award, all at PKU

Reviewer

ICML, NeurIPS (before 2026), ICLR, CVPR, ICCV, ECCV, BMVC

Teaching Assistant

Algorithms, PKU, Spring 2021, 2022
Deep Reinforcement Learning, PKU, Spring 2024