Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Abstract

Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods.

Method Overview

Given a sequence of depth maps and camera poses extracted from a video, we backproject the frames into a sequence of partial point clouds in world coordinates. We jointly optimize a set of superquadric primitives and part assignments, where each primitive is softly assigned to a part via a differentiable allocation, and each part carries its own joint parameters (revolute or prismatic) and per-timestep motion amounts. The optimization is driven by reconstruction and scene flow losses, with regularization encouraging a sparse part and primitive decomposition.
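As an illustration of the first step and of the two joint types, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole intrinsics matrix `K`, a 4x4 camera-to-world pose, and depth values of zero marking invalid pixels. The function names `backproject`, `revolute`, and `prismatic` are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (N, 3) point cloud in world coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row grids
    valid = depth > 0                               # assumption: 0 marks missing depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    # Rigid transform from the camera frame to the world frame.
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]

def revolute(points, axis, pivot, angle):
    """Rotate points by `angle` (radians) about the line through `pivot` along `axis`."""
    k = axis / np.linalg.norm(axis)
    p = points - pivot
    # Rodrigues' rotation formula, applied row-wise.
    rotated = (p * np.cos(angle)
               + np.cross(k, p) * np.sin(angle)
               + np.outer(p @ k, k) * (1.0 - np.cos(angle)))
    return rotated + pivot

def prismatic(points, axis, distance):
    """Translate points by `distance` along the unit direction of `axis`."""
    k = axis / np.linalg.norm(axis)
    return points + distance * k
```

In this sketch, the per-timestep motion amount of a part would correspond to `angle` for a revolute joint and `distance` for a prismatic joint, while `axis` and `pivot` play the role of the joint parameters shared across time.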

Results

Real-World Data Results

We evaluate the performance of Articulation in Prime (AiP) on the AiP-real and Arti4D datasets. We compare against four baselines: Articulate-Anything [1], Artipoint [2], Video2Articulation [3], and ReArt [4].

Qualitative Results on AiP-real. Red arrows denote predicted joint axes. 'x' indicates that the method fails.

**Qualitative Results on Arti4D**. Red, green, and blue arrows denote Ours, Artipoint, and Ground-Truth joint axes, respectively.


Synthetic Data Results

We evaluate the performance of Articulation in Prime (AiP) on the AiP-synth and Video2Articulation-S datasets.

Fitted Superquadrics Results

Beyond serving as proxies to group object points into geometrically and dynamically consistent regions, primitives also provide interpretability. We present primitive decompositions for AiP-synth and AiP-real objects.
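For readers unfamiliar with superquadrics, the sketch below evaluates the standard inside-outside function in the primitive's canonical frame. The page does not state which parameterization the method uses, so this textbook form is an assumption, and the function name `superquadric_inside_outside` is hypothetical. Values below 1 lie inside the primitive, values equal to 1 lie on its surface, and values above 1 lie outside.

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps):
    """Evaluate F(x, y, z) for (N, 3) points in the superquadric's canonical frame.

    scale = (a1, a2, a3) are the semi-axis lengths.
    eps = (e1, e2) are the shape exponents; (1, 1) gives an ellipsoid.
    """
    a1, a2, a3 = scale
    e1, e2 = eps
    x, y, z = np.abs(points).T  # absolute values keep fractional powers real
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)
```

Small exponents push the shape toward a box, while exponents near 1 keep it ellipsoidal, which is part of what makes a handful of such primitives a compact and readable summary of an object's parts.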

**Fitted Superquadrics Results on AiP-synth**. Red arrows denote predicted joint axes.

**Fitted Superquadrics Results on AiP-real**. Red arrows denote predicted joint axes.


References

[1] Le, L., Xie, J., Liang, W., Wang, H.J., Yang, Y., Ma, Y.J., Vedder, K., Krishna, A., Jayaraman, D., Eaton, E.: Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model. In: International Conference on Learning Representations (2024)
[2] Werby, A., Buechner, M., Roefer, A., Huang, C., Burgard, W., Valada, A.: Articulated Object Estimation in the Wild. In: CoRL (2025)
[3] Peng, W., Lv, J., Lu, C., Savva, M.: iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos. In: International Conference on 3D Vision (2026)
[4] Liu, S., Gupta, S., Wang, S.: Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds. In: Conference on Computer Vision and Pattern Recognition (2023)

BibTeX

@article{artykov2026AiP,
  title={Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video},
  author={Artykov, Arslan and Ravaud, Tom and Violante, Nicolás and Lepetit, Vincent},
  journal={arXiv preprint},
  year={2026}
}