sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

Abstract

Understanding articulated objects from monocular video is a crucial yet challenging task in robotics and digital twin creation. Existing methods often rely on complex multi-view setups, high-fidelity object scans, or fragile long-term point tracks that frequently fail in casual real-world captures. In this paper, we present sim2art, a data-driven framework that recovers the 3D part segmentation and joint parameters of articulated objects from a single monocular video captured by a freely moving camera. Our core insight is a robust representation based on per-frame surface point sampling, which we augment with short-term scene flow and DINOv3 semantic features. Unlike previous works that depend on error-prone long-term correspondences, our representation is easy to obtain and exhibits a negligible difference between simulation and reality without requiring domain adaptation. Also, by construction, our method relies on single-viewpoint visibility, ensuring that the geometric representation remains consistent across synthetic and real data despite noise and occlusions. Leveraging a suitable Transformer-based architecture, sim2art is trained exclusively on synthetic data yet generalizes strongly to real-world sequences. To address the lack of standardized benchmarks in the field, we introduce two datasets featuring a significantly higher diversity of object categories and instances than prior work. Our evaluations show that sim2art effectively handles large camera motions and complex articulations, outperforming state-of-the-art optimization-based and tracking-dependent methods. sim2art offers a scalable solution that can be easily extended to new object categories without the need for cumbersome real-world annotations.

Method Overview

Our method takes a sequence of images as input, from which we obtain object masks, depth maps, and camera parameters. We sample 2D points inside the masks, lift them to 3D, and augment them with their scene flow and DINOv3 features. From this representation, we predict the part segmentation, the joint parameters of each part, and the amount of motion of each part at each time step.
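The sampling and lifting step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a metric depth map aligned with the mask, a 4x4 camera-to-world matrix, and uniform random sampling of mask pixels. The function name `sample_and_lift` and its parameters are hypothetical, and the scene flow and DINOv3 feature augmentation are left out.

```python
import numpy as np

def sample_and_lift(mask, depth, K, cam_to_world, num_points=1024, rng=None):
    """Sample pixels inside a mask and back-project them to world-space 3D points.

    Assumptions (not from the paper): pinhole intrinsics K (3x3), metric depth
    with the same resolution as the mask, and a 4x4 camera-to-world pose.
    """
    rng = np.random.default_rng() if rng is None else rng
    vs, us = np.nonzero(mask)  # row (v) and column (u) indices of mask pixels
    if len(us) == 0:
        return np.zeros((0, 2), dtype=np.int64), np.zeros((0, 3))

    # Sample with replacement only when the mask has fewer pixels than requested.
    idx = rng.choice(len(us), size=num_points, replace=len(us) < num_points)
    u, v = us[idx], vs[idx]
    z = depth[v, u]

    # Pinhole back-projection: X_cam = z * K^{-1} [u, v, 1]^T
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts_cam = np.stack([x, y, z], axis=1)

    # Move the points from the camera frame to the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_world = pts_cam @ R.T + t
    return np.stack([u, v], axis=1), pts_world
```

Because each frame is sampled independently, a sketch like this needs no correspondence between the points of different frames, which is in line with the per-frame surface sampling described in the abstract.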

Comparison Results

We compare our method against ReArt, Articulate-Anything, Video2Articulation, FeatClust, and Artipoint.

Real Data

Synthetic Data

Downstream Application

Given the joint parameters and part labels output by our method, we can obtain a full digital twin by optimizing 2D/3D Gaussians.
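To make concrete how joint parameters and part labels together describe an articulated state, the sketch below moves the points of one part according to a joint. The parameterization is an assumption rather than something stated on this page: a joint is taken to be an axis direction, a pivot point on the axis, and a scalar amount (radians for a revolute joint, distance for a prismatic one). The function `articulate` is hypothetical, and the Gaussian optimization itself is not shown.

```python
import numpy as np

def articulate(points, labels, part_id, axis, pivot, amount, joint_type="revolute"):
    """Move the points of one part according to a joint (illustrative only).

    Assumed parameterization: unit axis direction, pivot point on the axis,
    and a scalar amount (radians if revolute, distance if prismatic).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    pivot = np.asarray(pivot, dtype=float)
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)  # normalize the axis direction

    out = points.copy()
    sel = labels == part_id  # only points of the selected part move

    if joint_type == "prismatic":
        # Translate along the joint axis.
        out[sel] = points[sel] + amount * k
    elif joint_type == "revolute":
        # Rodrigues' rotation formula about the axis passing through the pivot.
        p = points[sel] - pivot
        cos_a, sin_a = np.cos(amount), np.sin(amount)
        rotated = p * cos_a + np.cross(k, p) * sin_a + np.outer(p @ k, k) * (1.0 - cos_a)
        out[sel] = rotated + pivot
    else:
        raise ValueError(f"unknown joint type: {joint_type}")
    return out
```

The same transform could in principle be applied to the centers of Gaussians assigned to a part, so that the reconstructed object can be rendered in new articulation states.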

BibTeX

@article{artykov2025sim2art,
  title={sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only},
  author={Artykov, Arslan and Ravaud, Tom and Sautier, Corentin and Lepetit, Vincent},
  journal={arXiv preprint},
  year={2025}
}