All Papers

* equal contribution    † corresponding author

Being-H Series

Preprint 2025 Being-H0
PDF Blog Code HF

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu†

Being-H0 acquires dexterous manipulation skills by learning from large-scale human videos in the UniHand dataset via physical instruction tuning. By explicitly modeling hand motions, the resulting foundation model seamlessly transfers from human hand demonstrations to robotic manipulation.
Preprint 2026 Being-H0.5
PDF Blog Code HF

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo*, Ye Wang*, Wanpeng Zhang*, Sipeng Zheng*, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, Zongqing Lu†

Being-H0.5 extends our exploration of human-video pretraining into a more general paradigm of human-centric robot learning. Instead of treating human videos as a narrow source of dexterous supervision, it scales the training recipe to 35,000+ hours and unifies human data, robot data, and visual-text data within a single embodied foundation model. The core goal is cross-embodiment generalization: learning transferable action priors that can bridge heterogeneous embodiments, tasks, and sensory setups.

Learning from Videos — VLA

CVPR 2026 JALA
PDF Blog Code

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu†

Building on JEPT, JALA scales VLA pretraining by aligning hidden states with inverse dynamics to learn unified latent actions from both labeled and unannotated human videos.
CVPR 2026 Spatial-Aware VLA
PDF Blog Code

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu†

VIPA-VLA enables spatially grounded VLA pretraining from human videos through visual-physical alignment, addressing the gap between 2D visual perception and 3D physical action.
Preprint 2026 Rethinking
PDF Blog Code

Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

Ye Wang*, Sipeng Zheng*, Hao Luo*, Wanpeng Zhang*, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin†

A controlled study of VLA scaling shows that EEF-relative alignment is the most robust action default, and that naive heterogeneous data pooling can cause destructive interference.
Preprint 2026 PTR
PDF Blog Code

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, Zongqing Lu†

PTR performs reward-free offline policy improvement by conservatively reweighting offline data according to whether an action leads to an identifiable downstream outcome.
Preprint 2025 DiG-Flow
PDF Blog Code

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models

Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Sipeng Zheng, Qin Jin, Zongqing Lu†

DiG-Flow is a plug-and-play module for flow-matching based VLAs that rebalances control between the autoregressive foundation model and the flow expert.

Learning from Videos — Visual Policy

ICLR 2025 JEPT
PDF Code

Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

Hao Luo, Zongqing Lu†

Inspired by JEPA, JEPT learns predictive visual representations of task-relevant transitions through joint embedding prediction. To our knowledge, it is the earliest work to bring JEPA-style learning into real policy learning. This enables video-conditioned policy learning with unlabeled videos, reducing data cost and improving one-shot generalization.
ECCV 2024 PVDR
PDF Code

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

Hao Luo, Bohan Zhou, Zongqing Lu†

PVDR learns abstract visual dynamics priors from unlabeled videos and adapts them for efficient policy learning in domain-gapped tasks.
ECCV 2024 CLIP4MC
PDF Code

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo, Ziluo Ding, Zongqing Lu†

CLIP4MC learns a more RL-friendly vision-language model from large-scale internet videos, providing better reward signals for open-ended embodied tasks.

VLM

NeurIPS 2025 OpenMMEgo
PDF Code

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, Zongqing Lu†

We enhance large multimodal models for egocentric video understanding with large-scale first-person QA data and training recipes tailored to first-person dynamics.
ICCV 2025 Highlight Being-VL0.5
PDF Blog Code

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu†

Building upon the visual BPE tokenizer proposed in our previous work, we further design a complete training framework and present the Being-VL-0.5 model.
ICCV 2025 VideoOrion
PDF

VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng*, Yijiang Li*, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, Zongqing Lu†

VideoOrion encodes videos with a two-branch design, using object tokens from a detect-segment-track pipeline to capture object dynamics alongside scene context.

Other Publications & Preprints

TMLR 2023 Transformers in Reinforcement Learning
PDF

A Survey on Transformers in Reinforcement Learning

Wenzhe Li*, Hao Luo*, Zichuan Lin*, Chongjie Zhang†, Zongqing Lu†, Deheng Ye†

This survey provides a systematic review of how Transformers are used in reinforcement learning, covering representation learning, model learning, sequential decision-making, and generalist agents. It aims to clarify the motivations, design choices, and challenges of bringing Transformer architectures into RL.
CVPR 2026 OpenT2M
PDF

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu†

OpenT2M improves generalization in text-to-motion generation with large-scale, high-quality motion data. Built on it, MonoFrill shows that strong motion generation does not require overly complicated designs.
Preprint 2023 MDPO
PDF

Model-Based Decentralized Policy Optimization

Hao Luo, Jiechuan Jiang, Zongqing Lu†

MDPO addresses the non-stationarity of decentralized multi-agent learning with a latent dynamics model. It makes policy optimization more stable and closer to monotonic improvement in cooperative multi-agent tasks.