Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods.
Given a sequence of depth maps and camera poses extracted from a video, we backproject the frames into a sequence of partial point clouds in world coordinates. We jointly optimize a set of superquadric primitives and part assignments, where each primitive is softly assigned to a part via a differentiable allocation, and each part carries its own joint parameters (revolute or prismatic) and per-timestep motion amounts. The optimization is driven by reconstruction and scene flow losses, with regularization encouraging a sparse part and primitive decomposition.
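The two geometric ingredients of this pipeline, backprojecting depth into world coordinates and moving a part along its joint, can be sketched in NumPy as below. This is a minimal illustration under assumptions, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, depth stored as distance along the optical axis, and a 4x4 camera-to-world pose. The function names `backproject`, `apply_revolute`, and `apply_prismatic` are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (M, 3) world-space point cloud.

    Assumes z-depth and a pinhole model; pixels with depth <= 0 are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    z = depth.reshape(-1)
    valid = z > 0
    pix = np.stack([u.reshape(-1), v.reshape(-1), np.ones(h * w)], axis=0)  # (3, N)
    rays = np.linalg.inv(K) @ pix              # camera-frame rays with z = 1
    pts_cam = (rays * z)[:, valid]             # scale each ray by its depth
    pts_h = np.vstack([pts_cam, np.ones(pts_cam.shape[1])])  # homogeneous (4, M)
    return (cam_to_world @ pts_h)[:3].T

def apply_revolute(points, axis, pivot, angle):
    """Rotate (N, 3) points by `angle` radians about the line through `pivot` along `axis`."""
    k = axis / np.linalg.norm(axis)
    kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])        # cross-product matrix of the unit axis
    # Rodrigues' rotation formula
    rot = np.eye(3) + np.sin(angle) * kx + (1.0 - np.cos(angle)) * (kx @ kx)
    return (points - pivot) @ rot.T + pivot

def apply_prismatic(points, axis, displacement):
    """Translate (N, 3) points by `displacement` along the unit-normalized `axis`."""
    k = axis / np.linalg.norm(axis)
    return points + displacement * k
```

In an optimization setting, the joint axis, pivot, and per-timestep `angle` or `displacement` would be the free variables, and the same operations would be written in an autodiff framework so that gradients can flow through them.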
We evaluate Articulation in Prime (AiP) on the AiP-real and Arti4D datasets, comparing against four baselines: Articulate-Anything [1], Artipoint [2], Video2Articulation [3], and ReArt [4].
We evaluate Articulation in Prime (AiP) on the AiP-synth and Video2Articulation-S datasets.
Beyond serving as proxies to group object points into geometrically and dynamically consistent regions, primitives also provide interpretability. We present primitive decompositions for AiP-synth and AiP-real objects.
@article{artykov2026AiP,
title={Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video},
author={Artykov, Arslan and Ravaud, Tom and Violante, Nicolás and Lepetit, Vincent},
journal={arXiv preprint},
year={2026}
}