DiffProxy
Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

¹PCA Lab, Nanjing University of Science and Technology   ²Nanjing University, School of Intelligent Science and Technology
*Corresponding authors
DiffProxy teaser

Existing human mesh recovery methods face distinct challenges. (A) End-to-end methods (SMPLest-X, HeatFormer) produce imprecise image-mesh alignment; (B) keypoint-based fitting (EasyMoCap) is sparse and unstable under challenging conditions; (C) prior dense correspondence predictors (DensePose) produce noisy and incomplete surface mappings. We introduce DiffProxy, which uses a diffusion model trained on synthetic data to predict accurate dense pixel-to-surface proxies, then fits SMPL-X against them via a unified reprojection objective, achieving precise alignment across diverse real-world scenarios.

Method Overview

Method overview. The proxy generator first produces initial body proxies; the segmentation component $\mathbf{P}_{v,\mathrm{init}}^{\mathrm{seg}}$ is then used to crop and enlarge hand regions for a second pass (hand refinement). Test-time scaling then draws $K$ stochastic samples for both body and hand proxies, aggregates them, and derives per-pixel uncertainty weight maps $\mathbf{W}_v$. Finally, SMPL-X parameters are recovered by minimizing an uncertainty-weighted reprojection objective $\mathcal{L}_{\mathrm{reproj}}$.
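The test-time scaling and uncertainty weighting described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the aggregation by per-pixel averaging, and the inverse-variance weighting rule are all assumptions made for the sketch.

```python
import numpy as np

def aggregate_proxies(samples, eps=1e-6):
    """Aggregate K stochastic proxy samples into a mean proxy map and
    per-pixel uncertainty weights W_v (hypothetical sketch).

    samples: (K, H, W, C) array of dense pixel-to-surface proxies.
    """
    mean = samples.mean(axis=0)               # aggregated proxy, (H, W, C)
    var = samples.var(axis=0).mean(axis=-1)   # per-pixel sample variance, (H, W)
    weights = 1.0 / (var + eps)               # low variance -> high confidence
    weights /= weights.max()                  # normalize weights to (0, 1]
    return mean, weights

def reproj_loss(projected, proxy, weights):
    """Uncertainty-weighted reprojection objective L_reproj (sketch).

    projected: (H, W, C) projection of the current SMPL-X fit into the view.
    """
    residual = np.linalg.norm(projected - proxy, axis=-1)  # (H, W)
    return (weights * residual).mean()
```

In this sketch, pixels where the K diffusion samples disagree receive small weights, so unreliable proxy regions contribute little to the SMPL-X fit.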

Architecture

Diffusion-based proxy generator architecture. Our model is built on Stable Diffusion~2.1 with a frozen UNet backbone, equipped with three conditioning signals ($\mathbf{c}_{\text{text}}$, $\mathbf{c}_{\text{T2I}}$, $\mathbf{c}_{\text{DINO}}$) and four trainable attention modules ($\mathcal{A}_{\mathrm{text}}$, $\mathcal{A}_{\mathrm{img}}$, $\mathcal{A}_{\mathrm{cm}}$, $\mathcal{A}_{\mathrm{epi}}$) for multi-view consistent proxy generation.
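The cross-view attention idea behind $\mathcal{A}_{\mathrm{cm}}$ can be illustrated with a minimal sketch: queries from each view attend over the key/value tokens of all views jointly, which couples the per-view predictions. This NumPy toy version is an assumption for illustration only; the actual module operates inside the UNet and the epipolar attention $\mathcal{A}_{\mathrm{epi}}$ is not sketched here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, Wq, Wk, Wv):
    """Minimal cross-view attention sketch (hypothetical).

    feats: (V, N, D) tokens for V views, N tokens per view, dim D.
    Each view's queries attend over the concatenated tokens of all views,
    so information is shared across views in a single attention pass.
    """
    V, N, D = feats.shape
    q = feats @ Wq                                  # (V, N, D)
    kv = feats.reshape(V * N, D)                    # tokens of all views
    k, v = kv @ Wk, kv @ Wv                         # (V*N, D) each
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (V, N, V*N)
    return attn @ v                                 # (V, N, D)
```

A useful sanity check of the design: if all views carry identical features, the attended output is identical across views, i.e. the module is consistent by construction in the degenerate case.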

Synthetic Data

Our synthetic dataset contains 105,487 clothed SMPL-X subjects rendered into 843,896 images across eight randomized views, featuring HDR lighting, realistic occlusions, and physics-based clothing. The dataset provides pixel-perfect SMPL-X proxy annotations, eliminating annotation noise inherent in real-world datasets and enabling strong generalization to real-world scenarios.

Our Results

Quantitative Comparisons

Column groups, left to right: 3dhp · rich · behave. Four metrics per dataset; all values in mm (lower is better).

| Method | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMPLest-X | 32.5* | 48.1* | 46.6* | 62.3* | 31.8* | 67.6* | 40.3* | 81.1* | 30.0* | 48.6* | 43.5* | 63.6* |
| Human3R | 53.1 | 84.3 | 67.7 | 104.1 | 52.1 | 101.1 | 64.4 | 116.9 | 38.3 | 85.0 | 53.8 | 102.3 |
| SAM3DBody | – | – | – | – | – | – | – | – | – | – | – | – |
| U-HMR | 67.3* | 142.9* | 78.5* | 164.2* | 55.5 | 134.9 | 67.8 | 159.3 | 46.1 | 118.7 | 51.9 | 135.0 |
| MUC | 40.2 | – | 51.3 | – | 35.7* | – | 42.7* | – | 28.7 | – | 41.8 | – |
| HeatFormer | 29.2* | 49.9* | 34.7* | 53.3* | 40.1 | 80.1 | 52.9 | 93.8 | 33.4 | 70.2 | 47.7 | 80.5 |
| EasyMoCap | 34.3 | 43.7 | 43.2 | 51.5 | 31.2 | 49.8 | 44.2 | 60.3 | 21.8 | 29.6 | 35.1 | 40.4 |
| Ours | 32.8 | 41.1 | 44.7 | 50.9 | 24.2 | 33.7 | 28.9 | 35.1 | 22.8 | 32.7 | 31.3 | 39.9 |
Column groups, left to right: moyo · 4ddress · 4ddress-partial. Four metrics per dataset; all values in mm (lower is better).

| Method | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMPLest-X | 45.6* | 63.3* | 62.1* | 81.4* | 34.8 | 52.1 | 52.4 | 70.2 | 72.2 | 101.7 | 111.3 | 140.6 |
| Human3R | 65.0 | 87.8 | 81.6 | 107.6 | 30.0 | 52.4 | 42.4 | 66.7 | 72.3 | 117.6 | 103.5 | 153.9 |
| SAM3DBody | – | – | – | – | – | – | – | – | – | – | – | – |
| U-HMR | 88.3 | 177.7 | 104.9 | 209.7 | 42.1 | 80.3 | 52.6 | 98.8 | 65.2 | 136.1 | 87.2 | 175.2 |
| MUC | 62.8 | – | 76.0 | – | 33.2 | – | 49.2 | – | 60.1 | – | 92.9 | – |
| HeatFormer | 67.3 | 126.9 | 79.6 | 123.6 | 43.4 | 73.3 | 61.6 | 90.2 | 143.8 | 291.6 | 175.2 | 327.0 |
| EasyMoCap | 31.4 | 41.2 | 44.8 | 51.6 | 19.9 | 24.9 | 31.8 | 35.3 | 48.2 | 94.8 | 69.8 | 119.4 |
| Ours | 28.3 | 32.4 | 35.2 | 38.8 | 16.6 | 20.5 | 23.3 | 25.6 | 27.6 | 31.6 | 41.3 | 43.1 |
* Method was trained on the corresponding dataset.

Qualitative Comparisons

Benefiting from the strong visual priors of diffusion models and training exclusively on synthetic data, our method achieves substantial improvements over prior approaches. In many cases, our results exhibit tighter image-mesh alignment and stronger robustness under challenging real-world conditions.

BibTeX

@article{wang2025diffproxy,
  title={DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies},
  author={Wang, Renke and Zhang, Zhenyu and Tai, Ying and Yang, Jian},
  journal={arXiv preprint; identifier to be added},
  year={2025}
}