Keypoint Extraction Exploration Project Based on Zero123Plus Model - Stage 1 Research Report.
Stable123Keypoints aims to explore the application potential of the sudo-ai/zero123plus-v1.2 model in keypoint detection tasks. This stage focuses on evaluating the direct usability of Zero123Plus pre-trained weights under the same architecture as StableImageKeypoints v1.5.
- Baseline Model: sd-legacy/stable-diffusion-v1-5
- Test Model: sudo-ai/zero123plus-v1.2
- Network Architecture: Kept basically consistent with StableImageKeypoints v1.5
- Comparison Dimensions:
  - Loss function convergence
  - Attention mechanism activation patterns
  - Keypoint extraction effectiveness
Please refer to the environment configuration requirements of StableImageKeypoints v1.5.
- Clone the Project
    git clone https://github.com/SoarCraft/Stable123Keypoints.git
    cd Stable123Keypoints
- Install Dependencies
  Follow the dependency installation process of StableImageKeypoints v1.5.
- Preprocess Data
  Run the following command to generate image matting results for Zero123Plus:
    python -m datasets.cub_preprocess
- Training/Inference
  The remaining operation steps are consistent with the StableImageKeypoints project.
As shown in the figure, the loss function converges normally when training with Zero123Plus model weights, which initially suggests that the model is capable of learning.
However, through visualization analysis of the attention maps after the model is activated by context, we discovered a critical issue: the attention distribution exhibits a divergent state, failing to form the expected concentrated response pattern at keypoint locations.
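One way to put a number on "divergent" versus "concentrated" attention is the normalized entropy of an attention map. The sketch below is illustrative only and is not taken from this project's code: the function name `normalized_entropy` and the toy 4x4 maps are assumptions made for the example. A value near 1 means the attention is spread almost uniformly over the image, while a value near 0 means it is focused on very few locations.

```python
import math


def normalized_entropy(attn_map):
    """Return the entropy of a 2D attention map, scaled to the range [0, 1].

    attn_map is a list of rows holding non-negative attention weights.
    (Hypothetical helper for illustration; not part of the repository.)
    """
    flat = [value for row in attn_map for value in row]
    if len(flat) < 2:
        raise ValueError("attention map needs at least two cells")
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must contain positive weight")
    # Normalize the weights so they form a probability distribution.
    probs = [value / total for value in flat]
    # Shannon entropy; cells with zero weight contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by the maximum possible entropy (uniform distribution).
    return entropy / math.log(len(flat))


# Toy maps: one fully spread out, one focused on a single cell.
diffuse_map = [[1.0] * 4 for _ in range(4)]
focused_map = [[0.0] * 4 for _ in range(4)]
focused_map[1][2] = 1.0

diffuse_score = normalized_entropy(diffuse_map)
focused_score = normalized_entropy(focused_map)
```

Comparing this score between the two sets of weights on the same input would give a rough, single-number check to accompany the visual inspection of the attention maps.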
To rule out the influence of loading methods, we conducted the following comparative tests:
- Full Zero123Plus Pipeline Loading: Attention divergence ❌
- Zero123Plus Weights Only (without Pipeline): Attention divergence ❌
- Using stable-diffusion-v1-5 Weights (same architecture and configuration): Keypoint extraction normal ✅
Without targeted code modifications, the Zero123Plus pre-trained weights cannot be directly applied to keypoint extraction tasks.
Although the loss function converges normally during model training, the model does not produce the expected response to pure context. Specifically:
- ✅ Training Feasibility: Loss function convergence is normal
- ❌ Functional Effectiveness: Attention mechanism not activated at keypoint locations
- ✅ Code Correctness: SD-1.5 weights work normally with the same code
Considering the minimal structural differences between Zero123Plus and Stable Diffusion v1.5, we infer:
The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection, reference image attention, etc.) have fundamentally changed how the model's internal weights process encoder_hidden_states.
This change is not a simple feature extraction difference, but involves deep reconstruction of the attention mechanism, making it difficult for the model to produce spatially localized responses to pure text context like the original SD model.
Caution
Do not use FP16 precision
Using half-precision floating-point numbers will cause significant precision loss, which will prevent the model from converging.
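The following is a minimal, standard-library sketch of one general way half precision can lose information. It is an illustration, not a diagnosis of this project: the report does not state which mechanism prevents convergence, and the helper names and the values `1.0` and `1e-4` are assumptions chosen for the example. It shows that a small update added to a weight near 1.0 disappears entirely in FP16 but survives in FP32.

```python
import struct


def round_to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_update(weight, update, round_fn):
    """Add an update to a weight, rounding inputs and result with round_fn."""
    return round_fn(round_fn(weight) + round_fn(update))


# Hypothetical values: a weight near 1.0 and a small update step.
weight = 1.0
update = 1e-4

# Near 1.0 the spacing between FP16 values is about 0.001,
# so the update is rounded away and the weight does not change.
fp16_result = apply_update(weight, update, round_to_fp16)

# FP32 has far finer spacing near 1.0, so the update is retained.
fp32_result = apply_update(weight, update, round_to_fp32)

print("FP16 weight changed:", fp16_result != weight)
print("FP32 weight changed:", fp32_result != weight)
```

When updates vanish like this, training can stall even though no error is raised, which is consistent with the caution above about staying in full precision.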

