March 31, 2026

From Human Skill to Robotic Mastery

The hand is the instrument of instruments. — Aristotle

Leveraging human data to augment real-world robot data has emerged as a key research direction in contemporary embodied intelligence. However, effectively harnessing such data for model pre-training remains a fundamental open challenge. To tackle this problem, we propose two complementary models, namely Psi-R2 and Psi-W0, along with an integrated closed-loop paradigm that unifies them for systematic policy evaluation, performance optimization, and autonomous self-improvement.

Introduction

The Beginning of Human Data

Historically, advances in artificial intelligence (AI) have been driven by the scaling of training data and computational resources. Embodied intelligence stands apart from established fields such as autonomous driving and large language models in a critical respect: it lacks a pre-existing large-scale data repository and cannot naturally accumulate task-relevant data through commercial deployment. Consequently, scaling up high-quality training data has become the paramount challenge facing the field. Humans continuously perform dexterous bimanual manipulation in daily life and industrial operations, making human behavioral data a natural and invaluable source for training embodied agents. Yet learning from human demonstrations has long been hampered by two fundamental limitations. First, the inherent embodiment gap introduces inconsistencies between human and robotic systems in kinematic structure and dynamic profile. Second, most existing human behavioral data comes from unconstrained egocentric internet videos or small open-source datasets, so neither its quantity nor its quality meets the requirements of large-scale pretraining. How to exploit human data effectively, and what it can deliver once deployed at scale, thus remain critical open questions. Psi-R2 and Psi-W0 represent the first suite of embodied intelligence models pretrained on human demonstration data at the 100,000-hour scale. In the subsequent sections, we elaborate on their architectural designs, the core design principles that motivate them, and the distinct functional roles each model fulfills within the overall framework.

Chapter 1

1. Psi-R2: What Matters in Learning from Human Data

Psi-R2 takes images and natural language instructions as inputs, and generates predictions of future video frames alongside executable robot action sequences. Its backbone is built upon a pre-trained video generative model, aligning it with the emerging notion of a World Action Model (WAM). At the same time, because it maps visual and linguistic inputs to robot actions, it can also be categorized as a vision-language-action (VLA) model. The two characterizations are not mutually exclusive; the distinction lies solely in the adoption of a video generative architecture as the core backbone.

Psi-R2 is initialized and trained on the open-source Wan2.2-IT2V-5B-480P backbone, with the training objective defined as the joint prediction of future video frames and robot control actions. During the pre-training phase, we integrate both real-robot demonstration data and human behavioral data. The robot data is exclusively sourced from PsiBot’s Psi-MobiDex dataset, totaling 5,417 hours. The human data combines motion recordings collected via PsiBot’s in-house exoskeleton glove and uninstrumented bare-hand data, amounting to 95,472 hours that cover 294 distinct scenes, 4,821 task types, and 1,382 object categories. Following pre-training, the model can be fine-tuned using an extremely small volume of real-robot data—fewer than 100 trajectories—to execute long-horizon, fine-grained manipulation tasks including smartphone assembly, industrial packaging, and paper box folding.

For the integration of human data and real-robot data, we adopt a deliberately minimalist processing paradigm: human joint configurations are aligned to the robot kinematic chain via direct kinematic mapping, raw image inputs remain unaltered, and the aligned representations are fed directly into the model as state observations and action labels. In essence, our research focuses on leveraging human data to boost pre-training for a specific target robot embodiment, rather than developing a universal model that generalizes across heterogeneous embodiments. The former is a critical enabler for commercial robotic deployment, while the latter remains an open research problem.
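A minimal sketch of that alignment step, assuming hypothetical joint counts and a plain index-plus-scale mapping (the actual human-to-robot retargeting is more involved and is not specified here):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical dimensions for illustration only.
HUMAN_DOF = 26   # e.g. 6-DoF wrist pose + 20 finger joint angles
ROBOT_DOF = 12   # e.g. a lower-DoF dexterous hand

@dataclass
class RetargetConfig:
    joint_map: List[int]    # robot joint i mirrors human joint joint_map[i]
    scale: List[float]      # per-joint scale for differing joint ranges

def retarget(human_q: List[float], cfg: RetargetConfig) -> List[float]:
    """Map human joint values onto robot action dimensions.

    No inpainting, no latent-space alignment: just index selection and
    range scaling, in the spirit of 'raw data in, raw data out'.
    """
    assert len(cfg.joint_map) == len(cfg.scale)
    return [cfg.scale[i] * human_q[j] for i, j in enumerate(cfg.joint_map)]

cfg = RetargetConfig(joint_map=list(range(ROBOT_DOF)),
                     scale=[1.0] * ROBOT_DOF)
robot_q = retarget([0.1] * HUMAN_DOF, cfg)
assert len(robot_q) == ROBOT_DOF
```

The point of the sketch is what it leaves out: no learned alignment module sits between the two embodiments, only a fixed mapping.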

Over the course of this research, we explored numerous specialized processing pipelines for human data, including image inpainting, keypoint-guided auxiliary loss functions, and shared latent space alignment between human and robot data. Ultimately, we adhere to the core insight of the Bitter Lesson: raw data in, raw data out. Once the training corpus reaches sufficient scale, any hand-engineered module becomes a performance bottleneck. Methods that artificially homogenize human and robot data improve learning efficiency on simple tasks such as pick-and-place, yet severely degrade performance on long-horizon, high-precision manipulation tasks. Most such approaches attempt to blur the distributional discrepancy between human and robot data to facilitate shared representation learning, but ignore the inherent divergence in their dynamic profiles. For fine manipulation tasks, indistinguishability between the two data domains introduces amplified error and task failure. Instead of deploying complex hand-crafted processing, we only perform input-output dimensional alignment and enable the model to learn end-to-end from large-scale raw data. The results demonstrate a clear trend: under limited data scale, the advantage is negligible, yet with sufficient data volume, the minimalist design consistently yields superior performance. Direct utilization of raw inputs and outputs yields the strongest generalization capability, most robust long-horizon behavior, and highest performance ceiling for manipulation tasks.

A key insight derived from this work is that model performance depends less on sophisticated architectural design than on principled understanding and curation of training data, especially human behavioral data. This includes data mixture ratios, diffusion schedulers, curriculum learning strategies, data retrieval mechanisms, and quantitative evaluation protocols. The core challenges involve designing staged training data combinations, retrieving semantically paired data samples, and validating whether actionable knowledge from human data is effectively transferred into the robot policy. Many of these capabilities rely on Psi-W0 as an Action-Conditioned World Model, particularly for rigorous policy evaluation. Performance improvement is only attainable through measurable evaluation, which serves as the definitive verification that the model has authentically absorbed knowledge from human demonstrations. This motivates the joint training of Psi-W0 alongside Psi-R2, with detailed elaboration provided in Chapter 2.
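As a toy illustration of the staged data-mixing side of this curation, the sketch below encodes a piecewise curriculum over training progress. The stage boundaries and ratios are invented for illustration; the actual schedule is not disclosed here.

```python
import random

# Hypothetical curriculum: each stage is (progress_threshold, p_human),
# where p_human is the probability of drawing a human-data sample.
# These numbers are illustrative, not the schedule actually used.
CURRICULUM = [
    (0.5, 0.95),   # early training: almost exclusively human data
    (0.8, 0.70),
    (1.0, 0.50),   # late training: even human/robot mixture
]

def human_data_prob(progress: float) -> float:
    """Piecewise-constant human-data ratio over training progress in [0, 1]."""
    for threshold, p in CURRICULUM:
        if progress <= threshold:
            return p
    return CURRICULUM[-1][1]

def sample_source(progress: float, rng: random.Random) -> str:
    """Draw the data source for one training example."""
    return "human" if rng.random() < human_data_prob(progress) else "robot"

rng = random.Random(0)
sources = [sample_source(0.1, rng) for _ in range(1000)]
# With p_human = 0.95, the vast majority of early samples are human data.
assert sources.count("human") > 900
```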

The end-to-end data pipeline is equally critical to overall system performance. We developed dedicated pipelines for data cleaning, automatic annotation, quality assurance, and expert review. All annotation and quality-control procedures are implemented via auto-labeling, with human intervention reserved for final validation. We further integrate Psi-W0 into the quality-assurance stage, leveraging the world model's visual predictive capability to assign data quality scores. From the pipeline's inception, our objective was for human data to provide not only high-level semantic cues from visual inputs but also fine-grained manipulation details, driving us to maximize the precision of 3D human hand trajectories. For purely egocentric video data, we trained an end-to-end first-person hand detection model that fuses spatiotemporal information to determine whether the left and right hands are in view, and directly predicts their MANO parameters and poses in the camera coordinate system. To acquire camera intrinsics, extrinsics, and image depth, and thereby unify the MANO trajectories into the world coordinate system, we adopted a technical stack combining DPVO and Any4D. For frames where the hands temporarily leave the view, the system fills the gap via smooth interpolation, replacing the error-prone Infiller module from HaWoR. Hand motion recovery from purely egocentric video typically incurs millimeter-scale errors, while PsiBot's exoskeleton glove reduces this error to sub-millimeter precision. For instance, we calibrate kinematic mappings via glove teleoperation prior to data collection, and further correct pose drift and inaccuracies in the post-processing pipeline to ensure the final trajectories can be faithfully replayed on physical robots. A thorough analysis of the data pipeline is presented in Chapter 3.
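The gap-filling step for briefly occluded hands can be illustrated with a toy interpolator. Real MANO parameters are high-dimensional, so a single scalar per frame stands in for the pose here, and the `max_gap` cutoff is an assumed design choice:

```python
def fill_missing(track, max_gap=10):
    """Fill None gaps in a per-frame pose track by linear interpolation.

    `track` is a list of per-frame values; None marks frames where the
    hand left the field of view. Only interior gaps no longer than
    `max_gap` frames are filled, since interpolation is trustworthy only
    for brief occlusions; leading, trailing, or over-long gaps stay None.
    """
    out = list(track)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                  # j = first observed frame after the gap
            if 0 < i and j < len(out) and (j - i) <= max_gap:
                a, b = out[i - 1], out[j]
                for k in range(i, j):   # blend between the two anchor frames
                    t = (k - i + 1) / (j - i + 1)
                    out[k] = a + t * (b - a)
            i = j
        else:
            i += 1
    return out

# A three-frame occlusion between two observed poses is filled linearly.
assert fill_missing([0.0, None, None, None, 4.0]) == [0.0, 1.0, 2.0, 3.0, 4.0]
```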
Finally, slow inference is a well-documented limitation of WAM-based systems; we therefore invested extensive engineering effort in DiT caching, torch.compile, model quantization, and other system-level optimizations, reducing single-pass inference latency from 2.2 seconds to under 100 ms and enabling smooth, dexterous robotic manipulation.
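Of these optimizations, step-level caching is the easiest to illustrate. In the toy below a stand-in "block" is recomputed only every few denoising steps and its cached output is reused in between; the real system caches internal DiT activations rather than whole-step outputs, and the refresh interval here is an arbitrary illustrative choice.

```python
def cached_denoise(num_steps, block, refresh_every=3):
    """Run a denoising loop, recomputing `block` only every few steps."""
    calls = 0
    cached = None
    outputs = []
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = block(step)   # full recomputation
            calls += 1
        outputs.append(cached)     # reuse slightly stale features otherwise
    return outputs, calls

outs, calls = cached_denoise(12, block=lambda s: s * 10, refresh_every=3)
assert calls == 4        # recomputed only on steps 0, 3, 6, and 9
assert outs[4] == 30     # step 4 reuses the output computed at step 3
```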

Click any thumbnail to open the corresponding demo.

Chapter 2

2. Psi-W0: Doing RL in the World Model

Psi-W0 accepts visual observations, natural language instructions, and robot action trajectories as inputs, and predicts the corresponding future video frames. Like Psi-R2, it is built upon a pre-trained video-generation backbone. The core distinction is that action signals enter as conditional inputs that steer the generated video content, which is why the model is termed an Action-Conditioned World Model (AC-WM).

One may question the necessity of an AC-WM given that Psi-R2 already enables future video prediction. The critical justification resides in the modeling of counterfactual outcomes. All training trajectories employed for Psi-R2 are purposeful and successful executions, rendering the model inherently incapable of representing failure scenarios. Nevertheless, failure cases and other counterfactual predictions are indispensable for robust policy learning, particularly within reinforcement learning paradigms. This renders Psi-W0 invaluable not merely as a predictive module, but as a foundational instrument for policy evaluation, performance improvement, and the construction of a closed training flywheel, with reinforcement learning (RL) serving as its core mechanism.

Psi-W0 is initialized and trained using the identical Wan2.2-IT2V-5B-480P backbone as Psi-R2, and adheres to the same fundamental data formatting schema. Beyond the training data utilized for Psi-R2, Psi-W0 additionally incorporates approximately 30% supplementary failure trajectories—either explicitly collected or generated during routine data acquisition and inference procedures—to enable accurate modeling of counterfactual scenarios.

Psi-W0 fulfills two primary functions. The first is policy evaluation during the training of Psi-R2: we execute rollouts of Psi-R2 within scenes derived from human demonstration data, and employ Psi-W0 to verify whether the policy has genuinely acquired the actionable knowledge embedded in those demonstrations. The second, and more foundational, function is the translation of human data into robot-compatible trajectories. The core principle underlying this capability was first proposed in our 2024 publication Object-Centric Dexterous Manipulation from Human Motion Data: the key to deploying human data on robotic systems is to transfer human dynamics to robot dynamics via reinforcement learning. Consider the task of grasping an apple: direct kinematic retargeting of the human grasping motion typically yields slightly biased robot joint configurations due to the embodiment gap, and small as these deviations are, they are enough to prevent precise manipulation.

The most straightforward resolution is to introduce an additional reinforcement-learning fine-tuning stage, such that the retargeted robot trajectory can accomplish physically plausible object grasping. Prior methodologies rely on object reconstruction within simulation environments to conduct RL, yet such pipelines suffer from excessive computational overhead, while the sim-to-real gap further impedes reliable modeling of deformable objects. Our framework replaces the simulator with an AC-WM to overcome these limitations. Given pre-trained Psi-R2 and Psi-W0 models, we feed the trajectory inferred by Psi-R2 from a human demonstration into Psi-W0 for rollout execution, switch the visual and dynamic representations of Psi-W0 to the robot domain, and conduct reinforcement-learning fine-tuning to enable the policy to complete the target task under robot-specific dynamics. In this process, the knowledge embedded in the human demonstration is absorbed into the policy, model performance is enhanced, and novel valid trajectories are generated. We subsequently filter high-quality samples from the generated data and reintroduce them into the training loops of Psi-R2 and Psi-W0 to construct a self-sustaining data flywheel. Furthermore, we can introduce stochastic variations within Psi-W0 to synthesize novel scenes and generate additional training trajectories.
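A caricature of this loop with stub components may help. Below, `world_model_score` stands in for a Psi-W0 rollout followed by a learned success signal, and simple random-search hill climbing stands in for the RL algorithm; both stand-ins, and all numbers, are illustrative assumptions rather than the actual method.

```python
import random

def world_model_score(traj, target):
    """Stub reward: negative distance of the trajectory endpoint to the target.

    A real score would come from rolling the actions through Psi-W0 and
    judging task success from the predicted video.
    """
    return -abs(traj[-1] - target)

def refine_in_world_model(retargeted_traj, target, steps=200, seed=0):
    """Refine a kinematically retargeted trajectory against the world model.

    Each iteration perturbs the action sequence and keeps the candidate
    if the (stub) world-model score improves, mimicking how RL shifts a
    human-derived trajectory toward robot dynamics.
    """
    rng = random.Random(seed)
    traj = list(retargeted_traj)
    best = world_model_score(traj, target)
    for _ in range(steps):
        cand = [q + rng.gauss(0.0, 0.05) for q in traj]
        score = world_model_score(cand, target)
        if score > best:
            traj, best = cand, score
    return traj, best

# A retargeted trajectory whose endpoint misses the target by 0.3:
traj, score = refine_in_world_model([0.0, 0.5, 0.7], target=1.0)
assert score > -0.3   # refinement moved the endpoint closer to the target
```

The refined trajectory is exactly the kind of "new valid data" that can be filtered and fed back into training, closing the flywheel described above.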

Consequently, Psi-W0 is not designed as a secondary task-execution model. In contrast to Psi-R2, its primary objective is not direct task completion, but rather policy evaluation, performance optimization, and simultaneous translation of human data into robot-executable trajectories. This dictates that Psi-W0 requires not only successful task trajectories, but also failed and even non-meaningful samples, to provide the comprehensive distribution of outcomes necessary for effective policy learning within the AC-WM framework.

Psi-W0 translates human-hand demonstrations into manipulation trajectories executable by the robot hand.

Chapter 3

3. What Really Determines Data Value

Although human behavioral data have been utilized for training embodied intelligence models since the early stages of the field, this research direction has seen a remarkable resurgence in recent years. Alongside it, a diverse array of data-collection paradigms has emerged. On the hardware side, some systems employ head-mounted cameras for unconstrained recording; others leverage high-precision optical motion capture to track full-body keypoints; and certain setups integrate tactile gloves to acquire haptic feedback. Collection environments span a similarly wide spectrum: unconstrained in-the-wild settings, mobile manipulation scenarios, confined laboratory rooms, and dedicated data-collection facilities. A critical question thus arises: which categories of human data are genuinely beneficial for large-scale model pre-training? While rigorous theoretical validation remains difficult, months of empirical investigation lead us to a definitive conclusion: within our training pipeline, the signal-to-noise ratio (SNR) is the paramount metric governing whether human data help or hurt model learning, and low-SNR data actively degrade training stability and performance. Concretely, we identify a clear hierarchy in dataset composition: task diversity dominates object diversity, which in turn is vastly more influential than scene diversity. For sensory modalities, the priority follows precise 3D human pose >> tactile feedback > 2D visual features.

In dataset construction, scene diversity matters least for model performance. Benefiting from the strong representational power of pre-trained video-generation models, both the policy model and the world model achieve surprisingly robust cross-background generalization, even when the background diversity of the training corpus is severely limited. Our experiments confirm that models initialized with pre-trained Wan weights readily generalize to unseen scenes in both video generation and policy execution, whereas models without such initialization suffer frequent generalization failures. This indicates that cross-scene generalization is predominantly inherited from the pre-trained video foundation model. In contrast, robotic manipulation is inherently object-centric: every manipulation task can be formulated as interacting with a target object via a specific interaction pattern. Accordingly, object diversity is a critical factor for model generalization. The core determinant of pre-training dataset quality, however, is task diversity, as it directly governs the model's coverage of downstream real-world tasks. Our experiments further reveal that cross-task generalization is the most difficult capability to acquire, with a clear empirical trend: the greater the variety of tasks observed during pre-training, the stronger the cross-task generalization performance. For this reason, we prioritize frequent variation in task types and object categories when constructing the human dataset, and treat background variation as the least critical factor.

Regarding sensory modalities, we adopt a multi-modal collection strategy to acquire as many complementary signals as possible, enabling the model to autonomously discover task-relevant cues. In addition to language instructions, visual observations, and robot joint states, our input pipeline incorporates full-hand fabric-based tactile sensing. Among all modalities, full-hand 3D pose tracking serves as the most critical signal, acting as the essential bridge for upgrading 2D video-generation models to 3D physical manipulation models. We thus devoted substantial engineering efforts to optimizing this module, including high-precision sensing hardware, post-hoc pose error compensation, and systematic kinematic calibration. Psi-W0 provides a reliable paradigm for validating the significance of high-precision 3D pose data. We conduct comparative experiments by feeding 3D trajectories of the same task—collected via devices with varying precision—into the human-to-robot data conversion pipeline described in Chapter 2, and use Psi-W0 to synthesize corresponding robot manipulation trajectories. The results are unambiguous: trajectories captured via PsiBot’s exoskeleton glove yield optimal transfer performance, as they are inherently well-aligned with robot dynamic profiles.

We defer an in-depth discussion of tactile sensing to Chapter 5. As for 2D visual features including keypoints and segmentation masks, they exhibit characteristics analogous to the hand-engineered modules discussed previously: their contribution to model performance becomes negligible under large-scale data regimes, leading us to omit these features entirely in the final pipeline. In practice, we categorize human behavioral data into two distinct classes. The first class is precision-centric data: trajectories defined in the robot kinematic space that, after standardized processing, are functionally equivalent to real-robot demonstration data and can be faithfully replayed on physical robotic platforms. Such data hold the potential to replace real-robot data during model post-training, with high-precision data collected via PsiBot’s exoskeleton glove under controlled conditions serving as a representative instance. The second class is generalization-centric data: these data exhibit lower precision but enable extreme scalability; while they are unsuitable for direct post-training, they can be rapidly accumulated to endow pre-trained models with strong cross-domain generalization capabilities. We posit that both categories of human data are indispensable for building high-performance embodied intelligence systems.

This visualization shows high-precision manipulation data captured with PsiBot's data-collection hardware.

Chapter 4

4. Human Data or Not Human Data, That Is the Question

The community frequently raises two fundamental questions regarding the utilization of human data. First, given sufficient real-robot teleoperation data in the future, can we entirely rely on robot data for model training and dispense with human data? Second, can we pursue the alternative paradigm of training deployable models exclusively from human data, without incorporating any real-robot data? In our perspective, both questions converge to an identical conclusion: for practical commercial deployment, the capability to train models solely from human data may ultimately become indispensable.

Commercial robotic deployment differs fundamentally from laboratory demo development. A core metric that is frequently overlooked in laboratory research is takt time, the standardized cycle time of industrial operations. The standard operating procedure (SOP) of frontline workers is the product of iterative optimization and typically approximates operational optimality. In industrial factories, even a single redundant step or minor slowdown in a workflow can result in financial losses amounting to tens of millions. In laboratory demonstrations, non-functional behaviors are often circumvented by switching tasks or adopting longer yet simpler workflows; such compromises are generally infeasible in real-world deployment. For this reason, the most valuable training data are often those generated by workers executing practical operations in authentic industrial environments.

Human data resolve this dilemma through two key advantages. First, the data collection process can be nearly perfectly aligned with the actual operational workflows of workers: for instance, production line workers can wear data-collection gloves during routine operations, and cashiers can use the same equipment during checkout. This setup eliminates the demand for specialized data collectors, minimizes data acquisition costs, and yields frontline operational data that are highly consistent with the requirements of real-world deployment. Second, and more critically, human data can capture task execution tempos that are unattainable via teleoperation. Teleoperation relies on operators to monitor and control robotic manipulation, which inherently limits operational speed; if the physical velocity limit of a robotic arm is denoted as 1200 units, teleoperation typically only achieves 800 units or lower. In contrast, human workers can generate data at the full physical velocity (1200 units) during natural operation. From a commercial deployment perspective, this renders human data of higher quality than teleoperated robot data, as they faithfully preserve the authentic SOP and operational tempo.

This underscores the critical importance of integrating high-tempo, high-quality human data into model training, and validates the significance of the exclusive human-data training paradigm. This scenario is highly analogous to the sim-to-real transfer problem: verifying the utility of simulated data requires zero-shot sim-to-real evaluation, and validating the effectiveness of human data likewise demands zero-shot human-to-robot transfer, with the simulator replaced by the world model in this framework. This necessitates the collaborative operation of Psi-R2 and Psi-W0, which leverage the human-to-robot data conversion pipeline introduced in the preceding chapter to translate human data into robot-executable trajectories. In our internal experiments, we have attempted to directly substitute real-robot data with converted human data during the post-training stage. Encouraging results have been achieved on simple manipulation tasks such as pick-and-place and bag grasping; however, the success rate remains relatively low for high-precision tasks including smartphone assembly and box insertion. While further research and iteration are required for this direction, we maintain that it represents an indispensable path for the commercialization of embodied intelligence.

Chapter 5

5. Tactile: A Medium for Cross-Embodiment Transfer

From an intuitive perspective, tactile sensing is indispensable for dexterous human manipulation. It is thus natural to ask why the most advanced general-purpose robot “brain” models integrate little to no tactile feedback. The primary constraint is the extreme scarcity of high-quality tactile data from physical robots. In practical deployments, tactile sensors mounted on robotic platforms frequently suffer from instability, and their output formats vary drastically across hardware vendors, making large-scale collection and reuse of robotic tactile data highly challenging. Human-sourced tactile data present a stark contrast: collection costs less than one-tenth as much as robotic tactile acquisition, and the sensing devices are far more portable. This advantage lets us scale up tactile data collection on the human side and leverage such data to enhance the training of embodied intelligence models.

We posit that tactile sensing holds particular significance for human behavioral data, because the fundamental essence of robotic manipulation resides in physical contact. Despite the discrepancies in visual appearance, kinematic architectures, and dynamic profiles between human and robotic systems, the binary and continuous signals that encode the occurrence and state of physical contact are fundamentally shared across embodiments. For this reason, we believe tactile information can serve as a critical enabler for knowledge transfer from human hands to robotic end-effectors. This motivated the integration of tactile sensing capabilities into the design of our custom data-collection glove: we aim to employ tactile signals to compensate for the dynamic gap between human and robotic hands within the world model.

Given that tactile data formats from physical robots are generally incompatible with those from our data-collection glove, and considering that most deployed robotic platforms lack tactile sensing hardware altogether, we adopt a mask-training strategy. When real-robot data are fed as input, we directly mask the tactile modalities; rather than treating tactile signals as observational inputs, we reformulate them as predictive action labels for model training. Experimental results clearly demonstrate that the integration of tactile data yields substantial performance improvements to the world model, while also endowing the model with the ability to anticipate robot-object interactions, indicating that the model has effectively learned contact and collision cues.
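The masking rule can be sketched as follows. The field names and dict-based format are invented for illustration; the real pipeline operates on batched tensors rather than per-sample dicts.

```python
def build_training_example(sample):
    """Route tactile signals per the mask-training strategy described above.

    Tactile is never an observation. For glove-collected human data it
    becomes a prediction target; for real-robot data, which lacks a
    trustworthy tactile stream, the channel is masked and its loss skipped.
    """
    obs = {"image": sample["image"], "proprio": sample["proprio"]}
    targets = {"action": sample["action"]}
    if sample["source"] == "human_glove":
        targets["tactile"] = sample["tactile"]
        targets["tactile_valid"] = True       # supervise tactile prediction
    else:
        targets["tactile"] = None
        targets["tactile_valid"] = False      # masked: no tactile loss
    return obs, targets

obs, tgt = build_training_example({
    "image": "frame_0", "proprio": [0.0], "action": [0.1],
    "tactile": [0.4, 0.2], "source": "human_glove",
})
assert "tactile" not in obs and tgt["tactile_valid"]
```

Treating tactile as a target rather than an input is what lets the same model train on robot data that carries no tactile stream at all.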

Closing

Collaborate with Us

If you are interested in collaborating, please reach out. We are especially eager to partner with companies that are continuously scaling up their data collection.

We are also hiring. If you would like to join us, we would love to hear from you.

If you are interested in our work, potential collaborations, or any other questions, please write to market@psirobot.ai.
