DiffProxy
Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

¹PCA Lab, Nanjing University of Science and Technology   ²Nanjing University, School of Intelligent Science and Technology
*Corresponding authors
DiffProxy teaser

Existing human mesh recovery methods face distinct challenges. (A) End-to-end methods (SMPLest-X, HeatFormer) produce imprecise image-mesh alignment; (B) keypoint-based fitting (EasyMoCap) is sparse and unstable under challenging conditions; (C) prior dense correspondence predictors (DensePose) produce noisy and incomplete surface mappings. We introduce DiffProxy, which uses a diffusion model trained on synthetic data to predict accurate dense pixel-to-surface proxies, then fits SMPL-X against them via a unified reprojection objective, achieving precise alignment across diverse real-world scenarios.

Method Overview

Method overview. The proxy generator first produces initial body proxies; the segmentation component $\mathbf{P}_{v,\mathrm{init}}^{\mathrm{seg}}$ is then used to crop and enlarge hand regions for a second pass (hand refinement). Test-time scaling then draws $K$ stochastic samples for both body and hand proxies, aggregates them, and derives per-pixel uncertainty weight maps $\mathbf{W}_v$. Finally, SMPL-X parameters are recovered by minimizing an uncertainty-weighted reprojection objective $\mathcal{L}_{\mathrm{reproj}}$.
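The test-time scaling and uncertainty weighting described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the aggregation by per-pixel averaging, and the inverse-variance weighting rule are all assumptions made for the sketch.

```python
import numpy as np

def aggregate_proxies(samples, eps=1e-6):
    """Aggregate K stochastic proxy samples into a mean proxy map and
    per-pixel uncertainty weights W_v (hypothetical sketch).

    samples: (K, H, W, C) array of dense pixel-to-surface proxies.
    """
    mean = samples.mean(axis=0)               # aggregated proxy, (H, W, C)
    var = samples.var(axis=0).mean(axis=-1)   # per-pixel sample variance, (H, W)
    weights = 1.0 / (var + eps)               # low variance -> high confidence
    weights /= weights.max()                  # normalize weights to (0, 1]
    return mean, weights

def reproj_loss(projected, proxy, weights):
    """Uncertainty-weighted reprojection objective L_reproj (sketch).

    projected: (H, W, C) projection of the current SMPL-X fit into the view.
    """
    residual = np.linalg.norm(projected - proxy, axis=-1)  # (H, W)
    return (weights * residual).mean()
```

In this sketch, pixels where the K diffusion samples disagree receive small weights, so unreliable proxy regions contribute little to the SMPL-X fit.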

Architecture

Diffusion-based proxy generator architecture. Our model is built on Stable Diffusion~2.1 with a frozen UNet backbone, equipped with three conditioning signals ($\mathbf{c}_{\text{text}}$, $\mathbf{c}_{\text{T2I}}$, $\mathbf{c}_{\text{DINO}}$) and four trainable attention modules ($\mathcal{A}_{\mathrm{text}}$, $\mathcal{A}_{\mathrm{img}}$, $\mathcal{A}_{\mathrm{cm}}$, $\mathcal{A}_{\mathrm{epi}}$) for multi-view consistent proxy generation.
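The cross-view attention idea behind $\mathcal{A}_{\mathrm{cm}}$ can be illustrated with a minimal sketch: queries from each view attend over the key/value tokens of all views jointly, which couples the per-view predictions. This NumPy toy version is an assumption for illustration only; the actual module operates inside the UNet and the epipolar attention $\mathcal{A}_{\mathrm{epi}}$ is not sketched here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, Wq, Wk, Wv):
    """Minimal cross-view attention sketch (hypothetical).

    feats: (V, N, D) tokens for V views, N tokens per view, dim D.
    Each view's queries attend over the concatenated tokens of all views,
    so information is shared across views in a single attention pass.
    """
    V, N, D = feats.shape
    q = feats @ Wq                                  # (V, N, D)
    kv = feats.reshape(V * N, D)                    # tokens of all views
    k, v = kv @ Wk, kv @ Wv                         # (V*N, D) each
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (V, N, V*N)
    return attn @ v                                 # (V, N, D)
```

A useful sanity check of the design: if all views carry identical features, the attended output is identical across views, i.e. the module is consistent by construction in the degenerate case.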

Synthetic Data

Our synthetic dataset contains 105,487 clothed SMPL-X subjects rendered into 843,896 images across eight randomized views, featuring HDR lighting, realistic occlusions, and physics-based clothing. The dataset provides pixel-perfect SMPL-X proxy annotations, eliminating annotation noise inherent in real-world datasets and enabling strong generalization to real-world scenarios.

Our Results

Quantitative Comparisons

Column groups, left to right: 3dhp · rich · behave. Four metrics per dataset; all values in mm (lower is better).

| Method | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMPLest-X | 32.5* | 48.1* | 46.6* | 62.3* | 31.8* | 67.6* | 40.3* | 81.1* | 30.0* | 48.6* | 43.5* | 63.6* |
| Human3R | 53.1 | 84.3 | 67.7 | 104.1 | 52.1 | 101.1 | 64.4 | 116.9 | 38.3 | 85.0 | 53.8 | 102.3 |
| SAM3DBody | – | – | – | – | – | – | – | – | – | – | – | – |
| U-HMR | 67.3* | 142.9* | 78.5* | 164.2* | 55.5 | 134.9 | 67.8 | 159.3 | 46.1 | 118.7 | 51.9 | 135.0 |
| MUC | 40.2 | – | 51.3 | – | 35.7* | – | 42.7* | – | 28.7 | – | 41.8 | – |
| HeatFormer | 29.2* | 49.9* | 34.7* | 53.3* | 40.1 | 80.1 | 52.9 | 93.8 | 33.4 | 70.2 | 47.7 | 80.5 |
| EasyMoCap | 34.3 | 43.7 | 43.2 | 51.5 | 31.2 | 49.8 | 44.2 | 60.3 | 21.8 | 29.6 | 35.1 | 40.4 |
| Ours | 32.8 | 41.1 | 44.7 | 50.9 | 24.2 | 33.7 | 28.9 | 35.1 | 22.8 | 32.7 | 31.3 | 39.9 |
Column groups, left to right: moyo · 4ddress · 4ddress-partial. Four metrics per dataset; all values in mm (lower is better).

| Method | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMPLest-X | 45.6* | 63.3* | 62.1* | 81.4* | 34.8 | 52.1 | 52.4 | 70.2 | 72.2 | 101.7 | 111.3 | 140.6 |
| Human3R | 65.0 | 87.8 | 81.6 | 107.6 | 30.0 | 52.4 | 42.4 | 66.7 | 72.3 | 117.6 | 103.5 | 153.9 |
| SAM3DBody | – | – | – | – | – | – | – | – | – | – | – | – |
| U-HMR | 88.3 | 177.7 | 104.9 | 209.7 | 42.1 | 80.3 | 52.6 | 98.8 | 65.2 | 136.1 | 87.2 | 175.2 |
| MUC | 62.8 | – | 76.0 | – | 33.2 | – | 49.2 | – | 60.1 | – | 92.9 | – |
| HeatFormer | 67.3 | 126.9 | 79.6 | 123.6 | 43.4 | 73.3 | 61.6 | 90.2 | 143.8 | 291.6 | 175.2 | 327.0 |
| EasyMoCap | 31.4 | 41.2 | 44.8 | 51.6 | 19.9 | 24.9 | 31.8 | 35.3 | 48.2 | 94.8 | 69.8 | 119.4 |
| Ours | 28.3 | 32.4 | 35.2 | 38.8 | 16.6 | 20.5 | 23.3 | 25.6 | 27.6 | 31.6 | 41.3 | 43.1 |
* Method was trained on the corresponding dataset.

Qualitative Comparisons

Benefiting from the strong visual priors of diffusion models and training exclusively on synthetic data, our method achieves substantial improvements over prior approaches. In many cases, our results exhibit tighter image-mesh alignment and stronger robustness under challenging real-world conditions.

BibTeX

@article{wang2025diffproxy,
  title={DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies},
  author={Wang, Renke and Zhang, Zhenyu and Tai, Ying and Yang, Jian},
  journal={arXiv preprint; identifier to be added},
  year={2025}
}