School of Computer Science, Peking University,
We introduce PixelGen, a simple pixel diffusion framework with perceptual loss. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79.
- We achieve 5.11 FID on ImageNet 256×256 without CFG at 80 epochs, surpassing REPA's 5.90 FID at 800 epochs.
- We achieve 1.83 FID on ImageNet 256×256 with CFG at 160 epochs, competitive with latent diffusion models.
- We achieve 0.79 overall score on GenEval Benchmark with PixelGen-XXL/16.
- If you like our project, please kindly give us a star ⭐ on GitHub. We hope to collaborate with you on building better pixel diffusion models, particularly on better samplers, CFG strategies, architectures, and refined loss designs. Please feel free to reach out if you'd like to discuss ideas.
Illustration of different manifolds within the pixel space. The image manifold is a large manifold containing both perceptually significant information and imperceptible signals. The perceptual manifold contains the perceptually important signals, providing a better target for pixel-space diffusion. P-DINO and LPIPS are the two complementary perceptual supervision signals used in PixelGen.
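The combination of losses described above can be sketched schematically. The following is a minimal NumPy illustration, not the repository's implementation: `patch_features` and `global_feature` are toy stand-ins for the LPIPS and DINO feature extractors, and the loss weights are hypothetical.

```python
import numpy as np

def patch_features(img, patch=8):
    """Toy stand-in for LPIPS features: mean intensity per non-overlapping patch."""
    h, w = img.shape[:2]
    return img[:h - h % patch, :w - w % patch].reshape(
        h // patch, patch, w // patch, patch, -1).mean(axis=(1, 3))

def global_feature(img):
    """Toy stand-in for a DINO embedding: channel-wise global statistics."""
    return np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])

def perceptual_loss(pred, target, w_lpips=1.0, w_dino=0.5):
    # Pixel-space diffusion reconstruction term.
    mse = np.mean((pred - target) ** 2)
    # Local perceptual term (LPIPS-like): distance between patch features.
    local = np.mean((patch_features(pred) - patch_features(target)) ** 2)
    # Global semantic term (DINO-like): cosine distance between embeddings.
    f_p, f_t = global_feature(pred), global_feature(target)
    cos = np.dot(f_p, f_t) / (np.linalg.norm(f_p) * np.linalg.norm(f_t) + 1e-8)
    return mse + w_lpips * local + w_dino * (1.0 - cos)
```

In the actual framework, the two perceptual terms supervise the model's clean-image prediction alongside the standard diffusion objective; here they simply show how local and global feature distances are weighted and summed.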
- Effectiveness of the perceptual losses in PixelGen.
- Visualization of more images generated by our text-to-image PixelGen.
- Visualization of 256×256 images generated by our class-to-image PixelGen.
| Dataset | Epoch | Model | Params | FID | HuggingFace |
|---|---|---|---|---|---|
| ImageNet256 | 80 | PixelGen-XL/16 | 676M | 5.11 (w/o CFG) | 🤗 |
| ImageNet256 | 160 | PixelGen-XL/16 | 676M | 1.83 (w/ CFG) | 🤗 |
| Dataset | Model | Params | GenEval | HuggingFace |
|---|---|---|---|---|
| Text-to-Image | PixelGen-XXL/16 | 1.1B | 0.79 | 🤗 |
We provide an online demo for PixelGen-XXL/16 (text-to-image) on HuggingFace Spaces.
HF spaces: https://dd0d187fc54e4b00ee.gradio.live
To host the local Gradio demo, run the following command:

```shell
# for text-to-image applications
python app.py --config ./configs_t2i/sft_res512.yaml --ckpt_path=./ckpts/PixelGen_XXL_T2I.ckpt
```

In class-to-image (ImageNet) experiments, we use the ADM evaluation suite to report FID. In text-to-image experiments, we use the BLIP3o dataset as the training set and GenEval to collect metrics.
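For reference, the ADM evaluation suite is typically invoked by comparing an `.npz` batch of generated samples against a precomputed reference batch. A hedged sketch (the script lives in OpenAI's guided-diffusion repository, and the sample file name is a placeholder):

```shell
# Sketch: compute FID with the ADM (guided-diffusion) evaluator.
# samples_50000x256x256x3.npz is a placeholder for your generated batch.
python evaluations/evaluator.py VIRTUAL_imagenet256_labeled.npz samples_50000x256x256x3.npz
```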
- Environments

```shell
# for installation (Python 3.10 recommended)
pip install -r requirements.txt
```

- Inference
```shell
# for inference without CFG using the 80-epoch checkpoint
python main.py predict -c ./configs_c2i/PixelGen_XL_without_CFG.yaml --ckpt_path=./ckpts/PixelGen_XL_80ep.ckpt

# for inference with CFG using the 160-epoch checkpoint
python main.py predict -c ./configs_c2i/PixelGen_XL.yaml --ckpt_path=./ckpts/PixelGen_XL_160ep.ckpt
```

- Train
```shell
# for c2i training
# Please modify the ImageNet1k path in the config file before training.
python main.py fit -c ./configs_c2i/PixelGen_XL.yaml
```

```shell
# multi-node training in Lightning style, e.g., 4 nodes
export MASTER_ADDR={Your Config}
export MASTER_PORT={Your Config}
export NODE_RANK={Your Config}
export NNODES={Your Config}
export NGPUS_PER_NODE={Your Config}
python main.py fit -c ./configs_c2i/PixelGen_XL.yaml --trainer.num_nodes=4
```

```shell
# for t2i training
python main.py fit -c ./configs_t2i/pretraining_res256.yaml
python main.py fit -c ./configs_t2i/pretraining_res512.yaml --ckpt_path=./ckpts/pretrain256.ckpt
python main.py fit -c ./configs_t2i/sft_res512.yaml --ckpt_path=./ckpts/pretrain512.ckpt
```

This repository is built on DeCo. Thanks to its authors for their contributions and to Shuai Wang for his support!
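As a concrete illustration of the multi-node environment variables above, here is a hypothetical configuration for the rank-0 node of a 4-node, 8-GPU-per-node run; the address and port values are examples and must match your own cluster:

```shell
# Example values (not real defaults); NODE_RANK ranges over 0..3, one value per node.
export MASTER_ADDR=10.0.0.1      # IP of the rank-0 node
export MASTER_PORT=29500         # any free TCP port, identical on all nodes
export NODE_RANK=0               # this node's rank
export NNODES=4
export NGPUS_PER_NODE=8
python main.py fit -c ./configs_c2i/PixelGen_XL.yaml --trainer.num_nodes=4
```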
If you find PixelGen useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry.
@article{ma2026pixelgen,
title={PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss},
author={Zehong Ma and Ruihan Xu and Shiliang Zhang},
year={2026},
eprint={2602.02493},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.02493},
}