DiT360 is a framework for high-quality panoramic image generation, leveraging both perspective and panoramic data in a hybrid training scheme. It adopts a two-level strategy—image-level cross-domain guidance and token-level hybrid supervision—to enhance perceptual realism and geometric fidelity.
- High Perceptual Realism: It produces panoramic images with high resolution and clear details for lifelike visual quality.
- Precise Geometric Fidelity: It correctly models multi-scale distortions in panoramic images with smooth, continuous edges.
- Versatile Applications: It applies robustly across multiple tasks, with seamless support for both inpainting and outpainting.
- 30/10/2025: Release mix-training code.
- 17/10/2025: Release the inpainting and outpainting code.
- 15/10/2025: Release the training code.
- 14/10/2025: Release our paper on arXiv.
- 11/10/2025: Release the refined Matterport3D dataset.
- 10/10/2025: Release the pretrained model and inference code.
Clone the repo first:
```bash
git clone https://github.com/Insta360-Research-Team/DiT360.git
cd DiT360
```
(Optional) Create a fresh conda env:
```bash
conda create -n dit360 python=3.12
conda activate dit360
```
Install the necessary packages (torch > 2):
```bash
# pytorch (select the correct CUDA version; we test our code on torch==2.6.0 and torchvision==0.21.0)
pip install torch==2.6.0 torchvision==0.21.0
# other dependencies
pip install -r requirements.txt
```
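To verify that a CUDA-enabled PyTorch build was installed, you can run:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```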
We have uploaded the dataset to Hugging Face. For more details, please visit Insta360-Research/Matterport3D_polished.
For a quick start, you can try:
```python
from datasets import load_dataset

ds = load_dataset("Insta360-Research/Matterport3D_polished")
# check the data
print(ds["train"][0])
```
If you encounter any issues, please refer to the official Hugging Face documentation.
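If you want to go beyond printing a raw record, a hypothetical follow-up (the field names here are assumptions; check ds["train"].features for the real schema):
```python
# Continues from the snippet above; field names are assumptions.
print(ds["train"].features)  # inspect the real schema first
sample = ds["train"][0]
if "image" in sample:
    sample["image"].save("sample_pano.jpg")  # HF image columns decode to PIL.Image
```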
For quick use, you can simply try:
```bash
python inference.py
```
We provide a training pipeline based on Insta360-Research/Matterport3D_polished, along with corresponding launch scripts. You can start training with a single command:
```bash
bash train.sh
```
After training is completed, you will find a checkpoint file saved under the output directory, typically like:
```
model_saved/lightning_logs/version_x/checkpoints/vsclip_epoch=xxx.ckpt/checkpoint/mp_rank_00_model_states.pt
```
You can extract the LoRA weights from the full .pt checkpoint by running:
```bash
python get_lora_weights.py <path_to_your_pt_file> <output_dir>
```
If you don't specify output_dir, the extracted weights will be saved by default to:
```
lora_output/
```
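Conceptually, the extraction filters the LoRA tensors out of the full state dict. A hypothetical sketch of what this might look like (the nesting under "module", the "lora" key filter, and the output filename are all assumptions; the actual get_lora_weights.py may differ):
```python
# Hypothetical sketch; refer to get_lora_weights.py for the actual logic.
import os
import sys

import torch

# DeepSpeed checkpoints can contain non-tensor state, hence weights_only=False.
ckpt = torch.load(sys.argv[1], map_location="cpu", weights_only=False)
state = ckpt.get("module", ckpt)  # DeepSpeed typically nests weights under "module"

# Keep only the LoRA tensors (assumes "lora" appears in their parameter names).
lora = {k: v for k, v in state.items() if "lora" in k.lower()}

out_dir = sys.argv[2] if len(sys.argv) > 2 else "lora_output"
os.makedirs(out_dir, exist_ok=True)
torch.save(lora, os.path.join(out_dir, "pytorch_lora_weights.bin"))  # placeholder name
```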
After that, you can directly use your trained LoRA in the inference script. Simply replace the default model path "fenghora/DiT360-Panorama-Image-Generation" in inference.py with your output directory (e.g., "lora_output"), and then run:
```bash
python inference.py
```
Mix training aims to leverage both panoramic images and perspective images to improve the model's generalization across different viewpoints.
You need to prepare two .jsonl files:
- One for panoramic images
- One for perspective images
Each line in a .jsonl file should represent a single training sample with the following format:
{"image": "path/to/image.jpg", "caption": "a description of the scene", "mask": "path/to/mask.png"}The mask is a PNG (or similar) image used to specify which regions should be supervised during training:
- White regions (255, 255, 255) indicate areas that are supervised.
- Black regions (0, 0, 0) indicate areas that are ignored.
Specifically:
- For panoramic images, the mask is typically an all-white image (meaning the entire image is supervised); see the snippet after this list.
- For perspective images, the mask corresponds to the valid projected area derived from the panoramic-to-perspective mapping.
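For instance, a minimal snippet that writes one panoramic entry with an all-white mask (paths, resolution, and caption are placeholders) could look like this:
```python
import json

import numpy as np
from PIL import Image

# All-white mask: the entire panorama is supervised (resolution is a placeholder).
mask = np.full((1024, 2048), 255, dtype=np.uint8)
Image.fromarray(mask).save("masks/pano_0001.png")

# Append one training sample to the panoramic .jsonl file.
sample = {
    "image": "images/pano_0001.jpg",
    "caption": "an indoor panorama of a living room",
    "mask": "masks/pano_0001.png",
}
with open("pano_train.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```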
The perspective images and their corresponding masks can be generated from panoramic images using an equirectangular-to-perspective projection.
We highly recommend using the excellent open-source library below for this purpose:
This library provides high-quality conversions between panoramic and perspective views, making it easy to generate consistent training data for mixed-view learning.
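As a purely illustrative sketch (py360convert is our assumption here, not necessarily the library recommended above; the paths, angles, and mask derivation are also placeholders), a perspective crop and its validity mask could be produced like this:
```python
# Hypothetical sketch using py360convert; the recommended library, field of
# view, and mask derivation in the official pipeline may differ.
import numpy as np
from PIL import Image
import py360convert

pano = np.array(Image.open("images/pano_0001.jpg"))  # equirectangular, H x W x 3

# Equirectangular -> perspective: 90-degree FOV, yaw 30, pitch 0, 512x512 output.
persp = py360convert.e2p(pano, fov_deg=90, u_deg=30, v_deg=0, out_hw=(512, 512))
Image.fromarray(persp.astype(np.uint8)).save("images/persp_0001.jpg")

# Mask for the valid projected area: project an all-white panorama through the
# same mapping, so only pixels that sampled valid panorama content stay white.
white = np.full_like(pano, 255)
mask = py360convert.e2p(white, fov_deg=90, u_deg=30, v_deg=0, out_hw=(512, 512))
Image.fromarray(mask.astype(np.uint8)).save("masks/persp_0001.png")
```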
To start training, please refer to the provided scripts:
train_mix_staged_lora_dynamic.sh and train_mix_staged_lora_dynamic.py.
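For example, assuming the shell script wraps the Python entry point the same way train.sh does, it can be launched directly:
```bash
bash train_mix_staged_lora_dynamic.sh
```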
We treat both inpainting and outpainting as image completion tasks, where the key lies in how the mask is defined. A simple example is already provided in our codebase.
For a quick start, you can simply run:
```bash
python editing.py
```
In our implementation, regions with a mask value of 1 correspond to the parts preserved from the source image. Therefore, in our example, you can invert the mask as follows for inpainting:
```python
mask = 1 - mask  # for inpainting
```
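Conversely, a hypothetical outpainting setup keeps a central region (mask value 1) and lets the model generate the borders; the shapes here are placeholders:
```python
import torch

# Hypothetical outpainting mask: 1 = preserve from the source image,
# 0 = regions for the model to generate. Resolution is a placeholder.
H, W = 512, 1024
mask = torch.zeros(H, W)
mask[H // 4 : 3 * H // 4, W // 4 : 3 * W // 4] = 1.0
```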
This part is built upon Personalize Anything.

We appreciate the open-source contributions of the following projects:

If you find our work useful, please cite our paper:
```bibtex
@misc{dit360,
  title={DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training},
  author={Haoran Feng and Dizhe Zhang and Xiangtai Li and Bo Du and Lu Qi},
  year={2025},
  eprint={2510.11712},
  archivePrefix={arXiv},
}
```
If you find our dataset useful, please include a citation for Matterport3D:
```bibtex
@article{Matterport3D,
  title={Matterport3D: Learning from RGB-D Data in Indoor Environments},
  author={Chang, Angel and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda},
  journal={International Conference on 3D Vision (3DV)},
  year={2017}
}
```
If you find our inpainting & outpainting useful, please include a citation for Personalize Anything:
```bibtex
@inproceedings{feng2026personalize,
  title={Personalize anything for free with diffusion transformer},
  author={Feng, Haoran and Huang, Zehuan and Li, Lin and Sheng, Lu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={5},
  pages={3921--3929},
  year={2026}
}
```

