Helix Accelerating Real-World Logistics

Bringing humanoid robots into the workforce is at the heart of Figure’s mission. Today, we’re introducing a new real-world application for Figure robots: logistics package manipulation and triaging. This task demands human-level speed, precision, and adaptability, pushing the boundaries of pixels-to-actions learned manipulation. Last week we introduced Helix, Figure’s internally designed Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control. In this report, we focus on a series of general improvements we made to System 1 (S1) of Helix, the low-level visuo-motor control policy, while iterating on this challenging new commercial use case:

  • Implicit stereo vision – Helix System 1 now has rich 3D understanding, enabling more precise, depth-aware motion.

  • Multi-scale visual representation – The low-level policy captures fine-grained details while retaining scene-level understanding for more accurate manipulation.

  • Learned visual proprioception – Each Figure robot can now calibrate itself, making cross-robot transfer seamless.

  • Sport mode – Using a simple test-time speed-up technique, Helix achieves faster-than-demonstrator execution speed while maintaining high success rates and dexterity.

We also explore the trade-off between data quality and quantity for this particular use case, and show that just 8 hours of well-curated demonstration data can yield a dexterous and flexible policy.

Video 1: Helix accelerating real-world logistics.

The Use Case

Package handling and sorting is a fundamental operation in logistics. It often involves transferring packages from one conveyor belt to another while ensuring the shipping label is correctly oriented for scanning. The task presents several key challenges. Packages come in a wide variety of sizes, shapes, weights, and rigidities – from rigid boxes to deformable bags – which makes the task difficult to replicate in simulation. The system must determine the optimal moment and method for grasping each moving object and reorienting it to expose the label. Furthermore, it needs to track the dynamic flow of numerous packages on a continuously moving conveyor while maintaining high throughput. And because the environment can never be fully predictable, the system must be able to self-correct. Addressing these challenges isn't only a key application for Figure's business; it also yielded general new improvements to Helix System 1 that all other use cases now benefit from.

Architectural Improvements to Helix's Visuo-Motor Policy (System 1)

Visual representation

Where our prior System 1 relied on monocular visual input, our new System 1 now leverages a stereo vision backbone coupled with a multiscale feature extraction network to capture rich spatial hierarchies. Rather than feeding image feature tokens from each camera independently, features from both cameras are merged in a multiscale stereo network before being tokenized, keeping the overall number of visual tokens fed to our cross-attention transformer constant and avoiding computational overhead. The multiscale features allow the system to interpret fine details as well as broader contextual cues, together contributing to more reliable control from vision.
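
To make the idea concrete, below is a minimal sketch of this merge step in PyTorch. All module and parameter names are our own illustration, not Figure's actual architecture: features from the left and right cameras are fused per scale before tokenization, so the transformer sees the same number of visual tokens as it would with a single camera.

```python
import torch
import torch.nn as nn

class MultiScaleStereoMerge(nn.Module):
    """Fuse left/right camera features at several scales, then tokenize.

    Illustrative sketch only; names and sizes are hypothetical.
    """

    def __init__(self, channels=(64, 128, 256), token_dim=512):
        super().__init__()
        # A 1x1 conv per scale fuses the concatenated left/right feature
        # maps back to the original width, so the visual token count stays
        # the same as in the monocular case.
        self.fuse = nn.ModuleList(nn.Conv2d(2 * c, c, 1) for c in channels)
        self.project = nn.ModuleList(nn.Linear(c, token_dim) for c in channels)

    def forward(self, left_feats, right_feats):
        # left_feats / right_feats: per-scale feature maps [B, C_i, H_i, W_i]
        # from a shared backbone applied to each camera.
        tokens = []
        for fuse, proj, l, r in zip(self.fuse, self.project, left_feats, right_feats):
            merged = fuse(torch.cat([l, r], dim=1))   # [B, C_i, H_i, W_i]
            flat = merged.flatten(2).transpose(1, 2)  # [B, H_i * W_i, C_i]
            tokens.append(proj(flat))                 # [B, H_i * W_i, token_dim]
        # Coarse and fine scales are concatenated into one token sequence
        # for the downstream cross-attention transformer.
        return torch.cat(tokens, dim=1)
```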

Cross robot transfer

Deploying a single policy on many robots requires addressing distribution shifts in the observation and action spaces caused by small hardware variations between individual robots. These include sensor calibration differences (affecting input observations) and joint response characteristics (affecting action execution), which can degrade policy performance if not properly compensated for. With a high-dimensional whole-upper-body action space in particular, traditional manual robot calibration doesn't scale across a fleet of robots. Instead, we train a visual proprioception model to estimate the 6D poses of the end effectors entirely from each robot's onboard visual input. This online "self-calibration" enables strong cross-robot policy transfer with minimal downtime.
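
As a rough illustration of what such a model could look like (all names and sizes here are assumptions; Figure's implementation is not public), a small regression head maps pooled onboard visual features to end-effector poses:

```python
import torch
import torch.nn as nn

class VisualProprioception(nn.Module):
    """Hypothetical sketch of a learned self-calibration model: regress
    each end effector's 6D pose directly from onboard camera features."""

    def __init__(self, feat_dim=512, num_effectors=2):
        super().__init__()
        # Per end effector: 3 translation values plus a 6D rotation
        # representation (9 numbers total), a common pose parameterization.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_effectors * 9),
        )

    def forward(self, visual_features):
        # visual_features: [B, feat_dim] pooled features from the robot's
        # onboard cameras.
        return self.head(visual_features)  # [B, num_effectors * 9]

# At deployment, the vision-estimated poses can be compared against poses
# predicted by nominal forward kinematics; the residual serves as a
# per-robot calibration correction (a proper version would compose SE(3)
# transforms rather than subtract parameter vectors).
```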

Figure 1: Scalable online visual calibration unlocks strong cross-robot transfer.

Data curation

On the data side, we took particular care in filtering human demonstrations, excluding the slower, missed, or failed ones. However, we deliberately kept demonstrations that naturally included corrective behavior, provided the failure that prompted the correction was attributable to environmental stochasticity rather than operator error. Working closely with the teleoperators to refine and standardize manipulation strategies also yielded significant improvements.
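
A minimal sketch of this filtering logic, with hypothetical metadata fields (`success`, `duration_s`, `has_correction`, `failure_cause`) standing in for whatever annotations the real pipeline uses:

```python
def curate_demonstrations(demos, max_duration_s):
    """Keep successful, reasonably fast demonstrations; keep corrective
    behavior only when the failure was environmental rather than operator
    error. Field names are hypothetical."""
    curated = []
    for demo in demos:
        if not demo["success"]:
            continue  # drop missed or failed demonstrations
        if demo["duration_s"] > max_duration_s:
            continue  # drop unusually slow demonstrations
        if demo["has_correction"] and demo["failure_cause"] == "operator":
            continue  # drop corrections prompted by operator error
        curated.append(demo)  # corrections due to env stochasticity are kept
    return curated
```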

Inference-time manipulation speedup

Our systems need to approach, and eventually exceed, human manipulation speed. We apply a simple but effective test-time technique that yields faster-than-demonstrator learned behavior: interpolating the policy's action chunk output (we call this "Sport Mode"). Our S1 policies output action "chunks", each representing a series of robot actions at 200 Hz. In practice, we can achieve, for example, a 20% test-time speedup without any modification to the training procedure: we linearly re-sample an action chunk of shape [T x action_dim] (a trajectory of T control steps) to a shorter [0.8 * T x action_dim] trajectory, then execute the shorter chunk at the original 200 Hz control rate.
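
Below is a minimal NumPy sketch of this re-sampling (the function name is ours; the actual implementation details are not public). With scale=0.8, a chunk of length T is shortened to 0.8 * T, matching the 20% speedup example above:

```python
import numpy as np

def resample_action_chunk(chunk, scale=0.8):
    """Sketch of "Sport Mode" chunk re-sampling. `chunk` is a
    [T, action_dim] array of actions executed at 200 Hz; scale=0.8
    shortens it to [0.8 * T, action_dim]."""
    T, action_dim = chunk.shape
    new_T = max(2, int(round(scale * T)))
    # Normalized time stamps for the original and shortened trajectories.
    t_old = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, new_T)
    # Linearly interpolate each action dimension onto the shorter timeline;
    # executing the result at the original 200 Hz control rate plays the
    # same motion faster.
    return np.stack(
        [np.interp(t_new, t_old, chunk[:, d]) for d in range(action_dim)],
        axis=1,
    )
```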

Results and Discussion

We measure the system's performance using the normalized effective throughput T_eff, which represents how fast packages are handled compared to the demonstrator data the system is trained on. This accounts for any time spent resetting the system when necessary. As an example, T_eff > 1.1 represents a manipulation speed 10% faster than the expert trajectories collected for training.
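
One consistent way to compute T_eff under this definition (our reading; the exact formula is not given in the text) is the demonstrator's time per package divided by the policy's wall-clock time per package, with reset time charged to the policy:

```python
def effective_throughput(num_packages, policy_time_s, reset_time_s,
                         demo_time_per_package_s):
    """Normalized effective throughput T_eff, as we read the definition
    above (the exact formula is our assumption). Reset time counts
    against the policy."""
    policy_time_per_package = (policy_time_s + reset_time_s) / num_packages
    return demo_time_per_package_s / policy_time_per_package

# Example: if the demonstrator averaged 5.5 s per package and the policy
# handles 100 packages in 480 s plus 20 s of resets, then
# T_eff = 5.5 / 5.0 = 1.1, i.e. 10% faster than the expert trajectories.
```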

The importance of stereo

Figure 2 (a) shows the impact of adding a multiscale feature extractor and stereo inputs on the system's T_eff. Both multiscale feature extraction and implicit stereo input significantly improve system performance. Particularly noteworthy is the improved robustness to varied package sizes when adding stereo: as shown in Figure 2 (a), the stereo model achieves a 60% increase in throughput over non-stereo baselines.

Figure 2: (a) Ablation study on the impact of various visual representations and (b) effect of data curation on effective throughput.

Additionally, we find that the stereo-equipped S1 can generalize to flat envelopes that the system was never trained on.

Quality over quantity

We find that for a single use case, data quality and consistency matter much more than data quantity. Figure 2 (b) shows that a model trained on curated, high-quality demonstrations achieves 40% higher throughput despite being trained on ⅓ less data.

Sport mode

Speeding up policy execution via linear re-sampling ("sport mode") is surprisingly effective up to a 50% speed-up. This is likely made possible by the high temporal resolution (200 Hz) of the output action chunks. Beyond a 50% speed-up, however, the effective throughput starts to drop substantially as motions become too imprecise and the system needs frequent resets. Figure 3 shows that with a 50% speed increase, the policy achieves faster object handling than the expert trajectories it was trained on (T_eff > 1).

Figure 3: Test-time speed-up via action chunk re-sampling. With a 50% test-time speed-up, S1 achieves higher effective throughput than the demonstration data (T_eff > 1).

Cross-robot transfer

Finally, by leveraging the learned calibration and visual proprioception module, we were able to apply the same policy, initially trained on a single robot’s data, to multiple additional robots. Despite variations in sensor calibration and small hardware differences, the system maintained a comparable level of manipulation performance across all platforms. This consistency underscores the effectiveness of learned calibration in mitigating covariate shifts, effectively reducing the need for tedious per-robot recalibration and making large-scale deployment more practical.

Conclusion

We have shown how a high-quality dataset, combined with architectural refinements such as stereo multiscale vision, online calibration, and a test-time speed-up, can achieve faster-than-demonstrator dexterous robotic manipulation in a real-world logistics triaging scenario, all while using relatively modest amounts of demonstration data. The results highlight the potential of scaling end-to-end visuo-motor policies to complex industrial applications where speed and precision are important.

If you are excited about embodied AI and bringing humanoid robots to the workforce, check out our open roles here.