Articulated Object Understanding from a Single Video Sequence

LIGM, École des Ponts, IP Paris, CNRS, France

Existing methods model articulated objects either from multi-view images or from demonstration videos paired with 3D scans, which is impractical in real settings. We propose a method that jointly models 3D shape, motion, and joint parameters from a single casual input video.

Abstract

We introduce a novel method for estimating the structure and joint parameters of articulated objects from a single casual video, captured by a potentially moving camera. Unlike previous works that rely on multiple static views or a priori knowledge of the object category, our approach leverages 2D point tracking and depth map prediction to generate 3D trajectories of points on the object. By analyzing these trajectories, we generate and evaluate hypotheses about joint parameters, selecting the best combination using the Bayesian Information Criterion (BIC) to avoid overfitting. We then optimize a dense 3D model of the object using Gaussian Splatting, guided by the selected joint hypotheses. Our method accurately recovers the geometry, segmentation into parts, joint parameters, and motion of each part, enabling the rendering of the object from new viewpoints and under new articulation states. Extensive evaluations on several datasets demonstrate the effectiveness of our approach.

Method Overview

Given the RGBD frames of the input video sequence, we first compute 3D trajectories of surface points on the target object with a state-of-the-art point tracker. Our method then randomly selects a trajectory T_i, randomly picks a joint type, and computes the joint parameters that explain this trajectory.

Figure: 3D trajectories of surface points (left); joint parameter hypotheses for a prismatic joint (center) and a revolute joint (right).
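
To make the joint fitting concrete, here is a minimal sketch (not the authors' code) of how joint parameters could be computed from a single 3D trajectory: a prismatic joint reduces to a line fit, and a revolute joint to a circle fit in the plane of motion. The function names and the Kasa circle fit are our own illustrative choices.

```python
import numpy as np

def fit_prismatic(traj):
    """Fit a prismatic joint to an (F, 3) trajectory: the sliding
    direction is the principal axis of the tracked points."""
    center = traj.mean(axis=0)
    _, _, vt = np.linalg.svd(traj - center)
    axis = vt[0]                                   # unit translation axis
    offsets = traj - center
    # Perpendicular distance of each point to the fitted line.
    residuals = np.linalg.norm(offsets - np.outer(offsets @ axis, axis), axis=1)
    return axis, residuals.mean()

def fit_revolute(traj):
    """Fit a revolute joint: the rotation axis is normal to the plane
    of motion, and a circle fit locates the axis within that plane."""
    center = traj.mean(axis=0)
    _, _, vt = np.linalg.svd(traj - center)
    u, v, axis = vt[0], vt[1], vt[2]               # plane basis + normal
    # 2D coordinates of the trajectory in the plane of motion.
    p2d = np.stack([(traj - center) @ u, (traj - center) @ v], axis=1)
    # Kasa least-squares circle fit: x^2 + y^2 = 2*cx*x + 2*cy*y + c.
    A = np.hstack([2.0 * p2d, np.ones((len(p2d), 1))])
    b = (p2d ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    radius = np.sqrt(c + cx ** 2 + cy ** 2)
    pivot = center + cx * u + cy * v               # point on the axis
    residuals = np.abs(np.linalg.norm(p2d - np.array([cx, cy]), axis=1) - radius)
    return axis, pivot, residuals.mean()
```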

We then check whether other trajectories can also be explained by these joint parameters. When this is the case, we keep the computed joint parameters as a good hypothesis. By iterating this process, we build a hypothesis set of joint parameters, H.
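
This hypothesis-generation loop can be sketched in a RANSAC-like form, reusing the fitting routines above. The `joint_residual` helper, the threshold `tau`, and the inlier count `min_inliers` are hypothetical stand-ins for the paper's actual verification criteria.

```python
import random

def generate_hypotheses(trajectories, joint_residual,
                        n_iters=1000, tau=0.01, min_inliers=10):
    """Sample a trajectory and a joint type, fit joint parameters, and
    keep the hypothesis if enough other trajectories agree with it."""
    H = []
    for _ in range(n_iters):
        traj = random.choice(trajectories)             # random trajectory T_i
        joint_type = random.choice(["prismatic", "revolute"])
        fit = fit_prismatic if joint_type == "prismatic" else fit_revolute
        params = fit(traj)
        # Keep trajectories whose motion is consistent with these parameters.
        inliers = [t for t in trajectories
                   if joint_residual(joint_type, params, t) < tau]
        if len(inliers) >= min_inliers:                # keep good hypotheses
            H.append((joint_type, params, inliers))
    return H
```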

Figure: Visualization of the joint parameters in the hypothesis set H.

We select the correct number of hypotheses from the hypothesis set H using the Bayesian Information Criterion (BIC). We keep the combination C that yields the lowest BIC(C) value:

BIC(C) = k(C) · ln(n) + λ · L(C),

where k(C) is the number of free parameters in the combination C, n is the number of data points, and L(C) measures how well the combination fits the 3D trajectories, weighted by λ.
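
A sketch of this selection step, exhaustively scoring small combinations of hypotheses; `num_params` (per-hypothesis contribution to k(C)), `fit_loss` (L(C)), and the `max_joints` cap are hypothetical helpers for illustration.

```python
import itertools
import numpy as np

def select_combination(H, n, lam, num_params, fit_loss, max_joints=4):
    """Score every combination of up to `max_joints` hypotheses and
    return the one with the lowest BIC."""
    best_C, best_bic = None, np.inf
    for size in range(1, max_joints + 1):
        for C in itertools.combinations(H, size):
            k = sum(num_params(h) for h in C)   # k(C): free parameter count
            bic = k * np.log(n) + lam * fit_loss(C)
            if bic < best_bic:
                best_C, best_bic = C, bic
    return best_C
```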

Finally, we run a Gaussian Splatting optimization, initializing the Gaussian centers with the points of the trajectories.
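
A minimal sketch of this initialization, assuming one Gaussian per tracked point placed at its position in a reference frame; the full splatting optimization (scales, rotations, opacities, rendering loss) is omitted and follows standard 3DGS pipelines.

```python
import numpy as np

def init_gaussian_centers(trajectories, ref_frame=0):
    """Place one Gaussian center at each tracked point's position in a
    reference frame. Part labels from the selected joint hypotheses can
    then drive per-part rigid motion during the splatting optimization."""
    return np.stack([traj[ref_frame] for traj in trajectories])
```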

Qualitative Results

Below we show part segmentation and joint parameter prediction results on multi-part objects from the PartNet dataset and on real casual videos captured with an iPhone camera.

Figure: qualitative comparison of part segmentation and predicted joint parameters (Ground Truth, DTA, Ours).

BibTeX

@inproceedings{artykov2025articulated,
  title     = {Articulated Object Understanding from a Single Video Sequence},
  author    = {Artykov, Arslan and Boittiaux, Clémentin and Lepetit, Vincent},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
  year      = {2025}
}