sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

LIGM, École des Ponts, IP Paris, CNRS, France

Abstract

Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both the part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on constrained setups such as multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method generalizes strongly to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments.

Method Overview

Our method takes as input a sequence of images, from which we obtain the object masks, the depth maps, and the camera parameters. We sample 2D points over the masks, lift them to 3D, and augment them with their scene flow and DINOv3 features. From this input, we predict the part segmentation, the joint parameters of each part, and the amount of motion of each part at each time step.
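To make the lifting and feature-augmentation steps concrete, below is a minimal NumPy sketch assuming a pinhole camera model; the function names and signatures are illustrative, not the paper's actual code.

import numpy as np

def lift_points_to_3d(pts_2d, depth, K):
    """Back-project sampled 2D pixels to 3D points in camera space.

    pts_2d : (N, 2) array of (u, v) pixels sampled on the object mask
    depth  : (H, W) depth map for the same frame
    K      : (3, 3) camera intrinsics matrix
    """
    u, v = pts_2d[:, 0], pts_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]      # per-point depth
    x = (u - K[0, 2]) * z / K[0, 0]              # x = (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]              # y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)          # (N, 3)

def build_point_features(pts_3d, scene_flow, dino_feats):
    """Concatenate 3D position, scene flow, and DINOv3 features per point."""
    return np.concatenate([pts_3d, scene_flow, dino_feats], axis=-1)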

Comparison Results

Downstream Applications

The output of our method can be used to build a full digital twin of an articulated object.
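As an illustration of how the predicted joint parameters can drive such a digital twin, the following NumPy sketch re-poses a segmented part under either a revolute or a prismatic joint; the function name and the exact joint parameterization are assumptions for illustration, not the paper's code.

import numpy as np

def articulate_part(points, joint_type, axis, pivot, amount):
    """Move a segmented part according to estimated joint parameters.

    points     : (N, 3) 3D points belonging to the part
    joint_type : 'revolute' or 'prismatic'
    axis       : (3,) joint axis direction
    pivot      : (3,) a point on the joint axis (unused for prismatic joints)
    amount     : rotation angle in radians, or translation distance
    """
    axis = axis / np.linalg.norm(axis)
    if joint_type == 'prismatic':
        return points + amount * axis            # slide along the axis
    # Rodrigues' formula: rotate about the axis passing through `pivot`
    p = points - pivot
    c, s = np.cos(amount), np.sin(amount)
    rotated = (p * c
               + np.cross(axis, p) * s
               + axis * (p @ axis)[:, None] * (1 - c))
    return rotated + pivot

Applying this per part, with the per-time-step motion amounts predicted by the method, replays the observed articulation on the reconstructed object.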

BibTeX

@inproceedings{artykov2025sim2art,
  title={sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only},
  author={Artykov, Arslan and Sautier, Corentin and Lepetit, Vincent},
  booktitle={arXiv preprint},
  year={2025}
}