TacUMI:
A Multi-Modal Universal Manipulation Interface for Contact-Rich Tasks
Abstract
Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interaction, relying solely on visual observations and robot proprioception often fails to reveal the underlying event transitions. This calls for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules.
Building on the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force–torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and a marked improvement as modalities are added, validating that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.
1. Hardware Design
TacUMI is a handheld platform designed for synchronized multi-modal data collection in long-horizon, contact-rich tasks. The system integrates a 6D force–torque sensor, a Vive tracker, and ViTac sensors on the fingertips. A continuous self-locking mechanism allows stable grasps without sustained trigger pressure, ensuring that force–torque readings reflect only external interactions.
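As a rough illustration of what synchronized acquisition involves, the sketch below resamples asynchronous sensor streams onto a common camera clock via per-channel linear interpolation. The stream names, sampling rates, and interpolation choice are assumptions for illustration, not TacUMI's actual drivers.

```python
# Hypothetical sketch: align asynchronous sensor streams (F/T, pose, tactile)
# onto a common reference clock by per-channel linear interpolation.
import numpy as np

def align_stream(t_src, x_src, t_ref):
    """Interpolate each channel of x_src (N, D), sampled at times t_src, onto times t_ref."""
    return np.stack(
        [np.interp(t_ref, t_src, x_src[:, d]) for d in range(x_src.shape[1])],
        axis=1,
    )

# Example with assumed rates: 1 kHz F/T and 100 Hz tracker poses resampled to the camera clock
# ft_sync   = align_stream(t_ft,   ft_wrench, t_camera)   # (T_cam, 6)
# pose_sync = align_stream(t_pose, tcp_pose,  t_camera)   # (T_cam, 7)
# fused     = np.concatenate([ft_sync, pose_sync], axis=1)
```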
The F/T data collected by the teleoperated robot and TacUMI
TacUMI’s continuous locking ensured stable grasps across cable sizes, and its data closely matched that of a teleoperated robot, confirming transferable, robot-consumable measurements.
The F/T data collected by the design without a locking mechanism
The design without a locking mechanism still captured F/T data; however, because the operator had to hold the trigger continuously, an additional, non-constant actuation force was introduced that could not be removed through preprocessing.
The F/T data collected by the design with a ratchet-based locking mechanism
The design with a ratchet-based locking mechanism locked the grasping width only at discrete increments. As a result, the gripper failed to hold the cable securely, causing it to slip out during tightening and leading to unsuccessful task execution.
The F/T data collected by the design with tension spring-driven fingertips
In this design, the fingertips are normally closed by a tension spring and open when the trigger is pulled backward. Although the F/T data are consistent with those collected from teleoperation, the design lacks generalizability due to the fixed stiffness of the tension spring.
2. Event Segmentation Algorithm
Our event segmentation framework consists of four main stages. First, we extract data from the tactile, visual, force–torque, and pose modalities and synchronize them into a fused feature sequence. Second, we apply a sliding window of length 50 with a stride of 10 to split the sequence into overlapping chunks. Third, each window is processed by a sequence model, which captures temporal dependencies and predicts per-frame skill labels. Finally, we use soft voting to merge overlapping predictions and recover an accurate frame-level segmentation of the entire demonstration.
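A minimal sketch of the windowing and soft-voting stages, assuming the fused features are a (T, D) array and `model` returns per-frame class probabilities for a window. The window length and stride come from the text above; `model`, `NUM_CLASSES`, and the function name are placeholders, not the released implementation.

```python
# Minimal sketch of sliding-window inference with soft voting.
import numpy as np

WINDOW, STRIDE = 50, 10
NUM_CLASSES = 6  # assumed number of skill/event classes

def segment_demonstration(features: np.ndarray, model) -> np.ndarray:
    """features: (T, D) fused multi-modal sequence; returns per-frame class labels (T,)."""
    T = features.shape[0]
    prob_sum = np.zeros((T, NUM_CLASSES))
    counts = np.zeros((T, 1))

    starts = list(range(0, max(T - WINDOW, 0) + 1, STRIDE))
    if T > WINDOW and starts[-1] != T - WINDOW:
        starts.append(T - WINDOW)              # make sure the tail frames are covered

    for s in starts:
        window = features[s:s + WINDOW]        # overlapping chunk
        probs = model(window)                  # (len(window), NUM_CLASSES) per-frame probabilities
        prob_sum[s:s + WINDOW] += probs
        counts[s:s + WINDOW] += 1

    avg = prob_sum / np.maximum(counts, 1)     # soft voting: average overlapping predictions
    return avg.argmax(axis=1)                  # frame-level segmentation
```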
Ablation on Model Architectures and Input Modalities
Our ablation study shows that BiLSTM achieves the best segmentation accuracy, clearly outperforming TCN and Transformer. Among input modalities, vision alone performs worst, while adding tactile or F/T signals brings substantial improvements. TCP pose provides little additional benefit, so the best results come from combining tactile and F/T with vision. Class-wise, common phases like idle and grasped are recognized reliably, but short and subtle phases such as released remain challenging.
Frame-wise Accuracy on TacUMI’s data
| Input Modality | BiLSTM | TCN | Transformer |
|---|---|---|---|
| Camera only | 0.7608 | 0.7217 | 0.3180 |
| Camera + Tactile | 0.9076 | 0.8880 | 0.6765 |
| Camera + F/T | 0.8632 | 0.8325 | 0.6645 |
| Camera + Pose | 0.8165 | 0.7675 | 0.4242 |
| Camera + Tactile + F/T | 0.9359 | 0.9051 | 0.7459 |
| Camera + Tactile + F/T + Pose | 0.9402 | 0.8945 | 0.7596 |
Cross-Platform Validation on Robot Data
Models trained on TacUMI data were tested on robot demonstrations. Vision alone performed poorly due to the domain gap, while adding tactile or F/T signals greatly improved accuracy. Combining both nearly closed the gap, confirming that TacUMI data transfers well to robot platforms and showing the importance of multimodal fusion.
Frame-wise Accuracy on the teleoperated robot's data
| Input Modality | BiLSTM | TCN | Transformer |
|---|---|---|---|
| Camera only | 0.2288 | 0.1820 | 0.2227 |
| Camera + Tactile | 0.7474 | 0.6002 | 0.5351 |
| Camera + F/T | 0.6611 | 0.6366 | 0.4126 |
| Camera + Pose | 0.4694 | 0.4636 | 0.2256 |
| Camera + Tactile + F/T | 0.9155 | 0.8793 | 0.8092 |
| Camera + Tactile + F/T + Pose | 0.9104 | 0.7262 | 0.7796 |
3. F/T Data Preprocessing
Using our event segmentation algorithm, we filter trigger artifacts from the raw F/T signals to obtain clean, interaction-only streams, ensuring consistent and transferable data.
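As one way this could look in practice, the sketch below masks F/T samples that fall inside predicted gripper-actuation phases and interpolates over the gaps. The label names and the masking rule are assumptions for illustration and may differ from the actual preprocessing.

```python
# Hypothetical sketch: remove trigger-induced wrench artifacts using the
# per-frame event labels from the segmenter (label names are assumed).
import numpy as np

ACTUATION_PHASES = {"closing", "opening"}  # phases dominated by trigger forces (assumed)

def filter_trigger_artifacts(ft: np.ndarray, labels: list) -> np.ndarray:
    """ft: (T, 6) force-torque samples; labels: per-frame event labels of length T."""
    clean = ft.astype(float).copy()
    mask = np.array([lab in ACTUATION_PHASES for lab in labels])
    clean[mask] = np.nan                       # drop actuation-phase samples
    t = np.arange(len(ft))
    for d in range(ft.shape[1]):               # bridge the gaps channel by channel
        valid = ~np.isnan(clean[:, d])
        if valid.any():
            clean[:, d] = np.interp(t, t[valid], clean[valid, d])
    return clean
```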
4. Cable Mounting Process
We validate our framework on a dual-arm cable mounting task, using multi-modal data from both TacUMI handheld demonstrations and teleoperated Franka robots, showing robust segmentation and cross-platform generalization.