TacUMI:
A Multi-Modal Universal Manipulation Interface for Contact-Rich Tasks
Abstract
Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interaction, relying solely on visual observations and robot proprioception often fails to reveal the underlying event transitions. This calls for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules.
Building on the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force–torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and a marked improvement as modalities are added, validating that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.
1. Hardware Design
TacUMI is a handheld platform designed for synchronized multi-modal data collection in long-horizon, contact-rich tasks. The system integrates a 6D force–torque sensor, a Vive tracker, and ViTac sensors on the fingertips. A continuous self-locking mechanism allows stable grasps without sustained trigger pressure, ensuring that force–torque readings reflect only external interactions.
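As a rough illustration of what synchronized acquisition involves, the sketch below resamples asynchronous sensor streams onto a common camera clock via per-channel linear interpolation. The stream names, sampling rates, and interpolation choice are assumptions for illustration, not TacUMI's actual drivers.

```python
# Hypothetical sketch: align asynchronous sensor streams (F/T, pose, tactile)
# onto a common reference clock by per-channel linear interpolation.
import numpy as np

def align_stream(t_src, x_src, t_ref):
    """Interpolate each channel of x_src (N, D), sampled at times t_src, onto times t_ref."""
    return np.stack(
        [np.interp(t_ref, t_src, x_src[:, d]) for d in range(x_src.shape[1])],
        axis=1,
    )

# Example with assumed rates: 1 kHz F/T and 100 Hz tracker poses resampled to the camera clock
# ft_sync   = align_stream(t_ft,   ft_wrench, t_camera)   # (T_cam, 6)
# pose_sync = align_stream(t_pose, tcp_pose,  t_camera)   # (T_cam, 7)
# fused     = np.concatenate([ft_sync, pose_sync], axis=1)
```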
The F/T data collected by the teleoperated robot and TacUMI
TacUMI’s continuous locking ensured stable grasps across cable sizes, and its data closely matched that of a teleoperated robot, confirming transferable, robot-consumable measurements.
The F/T data collected by the design without a locking mechanism
The design without a locking mechanism still captured F/T data; however, because the operator had to hold the trigger continuously, an additional, non-constant actuation force was introduced that could not be removed through preprocessing.
The F/T data collected by the design with a ratchet-based locking mechanism
The design with a ratchet-based locking mechanism locked the grasping width only at discrete increments. As a result, the gripper failed to hold the cable securely, causing it to slip out during tightening and leading to unsuccessful task execution.
The F/T data collected by the design with tension spring-driven fingertips
In this design, the fingertips are normally closed by a tension spring and open when the trigger is pulled backward. Although the F/T data are consistent with those collected from teleoperation, the design lacks generalizability due to the fixed stiffness of the tension spring.
2. Event Segmentation Algorithm
Our event segmentation framework consists of four main stages. First, we extract data from the tactile, visual, force–torque, and pose modalities and synchronize them into a fused feature sequence. Second, we apply a sliding window of length 50 with a stride of 10 to split the sequence into overlapping chunks. Third, each window is processed by a sequence model, which captures temporal dependencies and predicts per-frame skill labels. Finally, we use soft voting to merge overlapping predictions and recover an accurate frame-level segmentation of the entire demonstration.
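A minimal sketch of the windowing and soft-voting stages, assuming the fused features are a (T, D) array and `model` returns per-frame class probabilities for a window. The window length and stride come from the text above; `model`, `NUM_CLASSES`, and the function name are placeholders, not the released implementation.

```python
# Minimal sketch of sliding-window inference with soft voting.
import numpy as np

WINDOW, STRIDE = 50, 10
NUM_CLASSES = 6  # assumed number of skill/event classes

def segment_demonstration(features: np.ndarray, model) -> np.ndarray:
    """features: (T, D) fused multi-modal sequence; returns per-frame class labels (T,)."""
    T = features.shape[0]
    prob_sum = np.zeros((T, NUM_CLASSES))
    counts = np.zeros((T, 1))

    starts = list(range(0, max(T - WINDOW, 0) + 1, STRIDE))
    if T > WINDOW and starts[-1] != T - WINDOW:
        starts.append(T - WINDOW)              # make sure the tail frames are covered

    for s in starts:
        window = features[s:s + WINDOW]        # overlapping chunk
        probs = model(window)                  # (len(window), NUM_CLASSES) per-frame probabilities
        prob_sum[s:s + WINDOW] += probs
        counts[s:s + WINDOW] += 1

    avg = prob_sum / np.maximum(counts, 1)     # soft voting: average overlapping predictions
    return avg.argmax(axis=1)                  # frame-level segmentation
```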
Ablation on Model Architectures and Input Modalities
Our ablation study shows that BiLSTM achieves the best segmentation accuracy, clearly outperforming TCN and Transformer. Among input modalities, vision alone performs worst, while adding tactile or F/T signals brings substantial improvements. TCP pose provides little additional benefit, so the best results come from combining tactile and F/T with vision. Class-wise, common phases like idle and grasped are recognized reliably, but short and subtle phases such as released remain challenging.
Frame-wise Accuracy on TacUMI’s data
| Input Modality | BiLSTM | TCN | Transformer |
|---|---|---|---|
| Camera only | 0.7608 | 0.7217 | 0.3180 |
| Camera + Tactile | 0.9076 | 0.8880 | 0.6765 |
| Camera + F/T | 0.8632 | 0.8325 | 0.6645 |
| Camera + Pose | 0.8165 | 0.7675 | 0.4242 |
| Camera + Tactile + F/T | 0.9359 | 0.9051 | 0.7459 |
| Camera + Tactile + F/T + Pose | 0.9402 | 0.8945 | 0.7596 |
Cross-Platform Validation on Robot Data
Models trained on TacUMI data were tested on robot demonstrations. Vision alone performed poorly due to the domain gap, while adding tactile or F/T signals greatly improved accuracy. Combining both nearly closed the gap, confirming that TacUMI data transfers well to robot platforms and showing the importance of multimodal fusion.
Frame-wise Accuracy on the teleoperated robot's data
| Input Modality | BiLSTM | TCN | Transformer |
|---|---|---|---|
| Camera only | 0.2288 | 0.1820 | 0.2227 |
| Camera + Tactile | 0.7474 | 0.6002 | 0.5351 |
| Camera + F/T | 0.6611 | 0.6366 | 0.4126 |
| Camera + Pose | 0.4694 | 0.4636 | 0.2256 |
| Camera + Tactile + F/T | 0.9155 | 0.8793 | 0.8092 |
| Camera + Tactile + F/T + Pose | 0.9104 | 0.7262 | 0.7796 |
3. F/T Data Preprocessing
Using our event segmentation algorithm, we filter trigger artifacts from the raw F/T signals to obtain clean, interaction-only streams, ensuring consistent and transferable data.
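As one way this could look in practice, the sketch below masks F/T samples that fall inside predicted gripper-actuation phases and interpolates over the gaps. The label names and the masking rule are assumptions for illustration and may differ from the actual preprocessing.

```python
# Hypothetical sketch: remove trigger-induced wrench artifacts using the
# per-frame event labels from the segmenter (label names are assumed).
import numpy as np

ACTUATION_PHASES = {"closing", "opening"}  # phases dominated by trigger forces (assumed)

def filter_trigger_artifacts(ft: np.ndarray, labels: list) -> np.ndarray:
    """ft: (T, 6) force-torque samples; labels: per-frame event labels of length T."""
    clean = ft.astype(float).copy()
    mask = np.array([lab in ACTUATION_PHASES for lab in labels])
    clean[mask] = np.nan                       # drop actuation-phase samples
    t = np.arange(len(ft))
    for d in range(ft.shape[1]):               # bridge the gaps channel by channel
        valid = ~np.isnan(clean[:, d])
        if valid.any():
            clean[:, d] = np.interp(t, t[valid], clean[valid, d])
    return clean
```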
4. Cable Mounting Process
We validate our framework on a dual-arm cable mounting task, using multi-modal data from both TacUMI handheld demonstrations and teleoperated Franka robots, showing robust segmentation and cross-platform generalization.