This document describes the full pipeline for preparing custom reference motions for TM2M (Text/Motion-to-Motion) inference, including video generation, motion extraction, format conversion, and quality gating.
The pipeline consists of four main steps:
- Generate Reference Video - Use a text-to-video model to generate human motion videos
- Extract Motion from Video - Use an HMR (Human Mesh Recovery) model to extract SMPL-X parameters
- Convert to Motion Representation - Convert extracted parameters to our 276-dim motion format
- Quality Gating - Determine whether each motion is suitable for reference conditioning
Generate a reference video using any text-to-video model that produces human motion videos.
Wan 2.2 is a state-of-the-art text-to-video generation model:
# Single-GPU inference (requires 80GB+ VRAM)
python generate.py --task t2v-A14B --size 1280*720 \
--ckpt_dir ./Wan2.2-T2V-A14B \
--offload_model True --convert_model_dtype \
--prompt "A person performs a jumping jack" \
--save_file output_video.mp4

Other text-to-video models (e.g., CogVideoX, Open-Sora) can also be used.
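Before running motion extraction, it can help to confirm the generated video's basic properties (frame count, FPS, resolution). Below is a minimal sketch, assuming OpenCV (cv2) is installed and that the file name matches the --save_file argument used above:

```python
import cv2

# Placeholder path: matches the --save_file argument used above
cap = cv2.VideoCapture("output_video.mp4")
assert cap.isOpened(), "failed to open the generated video"

num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

print(f"{num_frames} frames @ {fps:.1f} fps, {width}x{height}")
```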
Extract SMPL-X parameters from the video using a visual motion capture (HMR) model.
SMPLest-X is a state-of-the-art expressive human pose and shape estimation model:
# Run inference on video (outputs to ./demo/)
sh scripts/inference.sh smplest_x_h output_video.mp4 30

Other HMR models (e.g., CameraHMR, 4DHumans, WHAM) can also be used.
⚠️ Important: By default, SMPLest-X outputs rendered mesh overlays but does not save the raw SMPL-X parameters. Use the drop-in scripts in motion_rep/smplest_x_scripts/ (see Modifying SMPLest-X to Export Parameters) to export the required parameters.
Regardless of which HMR model you use, the output .pt file should contain:
Required:
| Key | Shape | Description |
|---|---|---|
| global_orient | (T, 3) | Root orientation in axis-angle |
| body_pose | (T, 63) | Body pose (21 joints × 3) in axis-angle |
| transl | (T, 3) | Root translation |
Optional (for reprojection to original video):
| Key | Shape | Description |
|---|---|---|
| focal_length | scalar or (T, 1) | Camera focal length |
| width | scalar or (T, 1) | Image width |
| height | scalar or (T, 1) | Image height |
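A quick way to check that an HMR export matches this format is to load the .pt file and verify key names and shapes. Below is a minimal sketch, assuming the file is a plain dict saved with torch.save; the path is a placeholder:

```python
import torch

# Placeholder path to an HMR export
params = torch.load("/path/to/hmr_output.pt", map_location="cpu")

T = params["global_orient"].shape[0]
assert params["global_orient"].shape == (T, 3)   # root orientation, axis-angle
assert params["body_pose"].shape == (T, 63)      # 21 body joints x 3, axis-angle
assert params["transl"].shape == (T, 3)          # root translation

# Optional camera info (scalar or per-frame), only needed for reprojection
for key in ("focal_length", "width", "height"):
    if key in params:
        print(key, getattr(params[key], "shape", "scalar"))
```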
Convert HMR output to our 276-dim motion format:
python motion_rep/convert_hmr_to_motion.py \
--input /path/to/hmr_output.pt \
--output /path/to/motion.pt

The output .pt file contains:
- motion: Tensor of shape [T-1, 276] (per-frame motion features)
- intrinsic: Camera intrinsic matrix (3×3)
- extrinsic: Camera extrinsic matrix (4×4)
See motion_rep/README.md for the detailed 276-dim layout.
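To sanity-check a converted file, you can load it and confirm the shapes described above. Below is a minimal sketch, assuming the output of convert_hmr_to_motion.py is a dict saved with torch.save; the path is a placeholder:

```python
import torch

data = torch.load("/path/to/motion.pt", map_location="cpu")  # placeholder path

motion = data["motion"]        # per-frame motion features
intrinsic = data["intrinsic"]  # camera intrinsic matrix
extrinsic = data["extrinsic"]  # camera extrinsic matrix

assert motion.ndim == 2 and motion.shape[1] == 276  # [T-1, 276]
assert tuple(intrinsic.shape) == (3, 3)
assert tuple(extrinsic.shape) == (4, 4)
print(f"motion: {tuple(motion.shape)}")
```

You can also verify the converted motion visually with motion_rep/motion_checker.py: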
# Render depth video only
python motion_rep/motion_checker.py \
--motion_file /path/to/motion.pt \
--output_dir /path/to/output
# Render overlay on original video
python motion_rep/motion_checker.py \
--motion_file /path/to/motion.pt \
--output_dir /path/to/output \
--video_file /path/to/original_video.mp4

Not all extracted motions are suitable for conditioning. We provide a motion gating pipeline to determine whether each motion should be used as a reference (use_ref_motion=true) or only as weak guidance (use_ref_motion=false).
Create a JSON file with your motion entries (see data_samples/example_archive_wi_ref.json for format):
[
{
"id": 0,
"prompt": "A person performs a jumping jack",
"motion_path": "path/to/motion.pt",
...
}
]

# Set PYOPENGL_PLATFORM for offscreen rendering
export PYOPENGL_PLATFORM=osmesa # or egl
python motion_gating/render_mbench_videos.py \
--meta-json data_samples/your_archive.json \
--output-json data_samples/your_archive_eval.json \
--output-dir data_samples/your_render

This renders each motion as an MP4 video and updates the JSON with video_path and mbench_eval_path.
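After rendering, you can confirm that every entry was updated and that the rendered files exist on disk. Below is a minimal sketch, assuming the output JSON keeps the list-of-dicts layout shown above:

```python
import json
import os

with open("data_samples/your_archive_eval.json") as f:
    entries = json.load(f)

for entry in entries:
    video = entry.get("video_path")
    assert video and os.path.isfile(video), f"missing render for id {entry['id']}"
    assert "mbench_eval_path" in entry

print(f"{len(entries)} motions rendered")
```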
python motion_gating/apply_quality_gate.py \
--meta-json data_samples/your_archive_eval.json \
--quality-report data_samples/your_archive_quality.json \
--gemini-api-key "YOUR_GEMINI_API_KEY" \
--jitter-threshold 0.04

This script:
- Computes Jitter Degree: Measures motion smoothness (lower is better)
- Runs VLM Analysis: Uses Gemini to check whether the rendered motion matches the text description
- Sets use_ref_motion: true if jitter < threshold AND the VLM reports a match; false otherwise (sketched below)
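For reference, the gating rule is easy to restate in code. The sketch below is illustrative only: jitter_degree here assumes a mean second-difference (acceleration) magnitude over joint positions, which may differ from the exact metric implemented in apply_quality_gate.py, and the VLM verdict is passed in as a plain boolean.

```python
import torch

def jitter_degree(joints: torch.Tensor) -> float:
    """Rough smoothness proxy: mean norm of per-frame acceleration.

    joints: (T, J, 3) joint positions. This is an assumed definition,
    not necessarily the one used by apply_quality_gate.py.
    """
    accel = joints[2:] - 2 * joints[1:-1] + joints[:-2]  # second differences
    return accel.norm(dim=-1).mean().item()

def gate(jitter: float, vlm_matches: bool, threshold: float = 0.04) -> bool:
    # use_ref_motion is true only if the motion is smooth AND matches the prompt
    return jitter < threshold and vlm_matches
```

The quality report written via --quality-report contains entries like: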
[
{
"id": 0,
"jitter_degree": 0.0039,
"vlm_analysis": "The motion shows a clear jumping jack pattern.",
"vlm_matches": true,
"use_ref_motion": true
}
]

SMPLest-X by default outputs rendered mesh overlays but does not save the raw SMPL-X parameters. To avoid manual code edits, we provide ready-to-use scripts under motion_rep/smplest_x_scripts/ that you can copy into your SMPLest-X checkout.
Assume your SMPLest-X repo is at $SMPLestX_ROOT:
# In ViMoGen repo
SMPLestX_ROOT=/path/to/SMPLest-X
# Optional: backup originals
cp "$SMPLestX_ROOT/main/inference.py" "$SMPLestX_ROOT/main/inference.py.bak"
cp "$SMPLestX_ROOT/scripts/inference.sh" "$SMPLestX_ROOT/scripts/inference.sh.bak"
# Install drop-in scripts
cp motion_rep/smplest_x_scripts/inference.py "$SMPLestX_ROOT/main/inference.py"
cp motion_rep/smplest_x_scripts/inference.sh "$SMPLestX_ROOT/scripts/inference.sh"
chmod +x "$SMPLestX_ROOT/scripts/inference.sh"

Then run SMPLest-X inference as usual (in the SMPLest-X repo):
cd "$SMPLestX_ROOT"
sh scripts/inference.sh smplest_x_h output_video.mp4 30

The exported parameters will be saved to:
$SMPLestX_ROOT/demo/<video_basename>_params.pt
- motion_rep/smplest_x_scripts/inference.sh enables --retarget_cam by default, which retargets transl so all frames share a fixed camera (focal length from the first frame, principal point at image center). The raw per-frame camera values are kept in *_raw fields in the exported .pt.
- For multi-person videos, only the first detected person (bbox_id == 0) is exported to ensure a fixed sequence length.
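To see exactly what a given export contains, including which raw per-frame camera values were preserved, you can list the keys of the exported file. A minimal sketch; the path is a placeholder following the $SMPLestX_ROOT/demo/<video_basename>_params.pt pattern above:

```python
import torch

# Placeholder path following the <video_basename>_params.pt pattern
params = torch.load("demo/output_video_params.pt", map_location="cpu")

for key, value in params.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else value
    flag = "  (raw per-frame camera value)" if key.endswith("_raw") else ""
    print(f"{key}: {shape}{flag}")
```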