¹ Beijing University of Posts and Telecommunications (BUPT)
² Shanghai Jiao Tong University (SJTU)
³ University of Shanghai for Science and Technology (USST)
⁴ Tsinghua University (THU)
⁵ Hong Kong University of Science and Technology (HKUST)
This repository accompanies our IEEE tutorial paper, serving as a living resource for researchers at the intersection of generative AI and wireless communications. As semantic communications emerge as a paradigm shift from bit-accurate transmission toward meaning-centric communication, diffusion models have become a cornerstone technology enabling receivers to reconstruct high-quality content from minimal semantic cues. This repository provides curated collections of representative works, popular implementations, educational resources, and practical guidelines to help researchers continuously acquire knowledge in this rapidly evolving interdisciplinary field.
📋 TL;DR
What is this article about?
To the best of our knowledge, this is the first tutorial paper on diffusion models for generative semantic communications. It provides a unified resource for researchers to efficiently begin their work in this interdisciplinary area, without having to separately navigate the scattered literature across generative AI and wireless communications.
🎯Mathematical Fundamentals: From score matching and Langevin dynamics to stochastic differential equations (SDEs) and probability flow ordinary differential equations (PF ODEs), we present the theoretical foundations of score-based diffusion models.
🎨 Conditioning Mechanisms: We examine how to steer diffusion models toward task-specific objectives through two complementary paradigms — inference-time conditioning that injects guidance during sampling while preserving pre-trained models, and training-time conditioning that jointly optimizes conditional and unconditional scores for tighter control, meeting the fundamental controllability requirement in semantic communications.
⚡ Sampling Acceleration: Recognizing that iterative sampling (often requiring hundreds to thousands of neural network evaluations) presents significant computational challenges for real-time deployment, we review five primary acceleration strategies: dimensionality reduction, knowledge distillation, structure pruning, cache reuse, and flow matching.
🔬 Task Generalization: We explore how diffusion models, initially conceived for specific data modalities and domains, can be extended across diverse scenarios through three fundamental aspects — modality expansion, domain adaptation, and task generalization — addressing the requirements of task-specific multi-modal semantic communications.
📡 Application Scenarios: Through analysis of three distinct use cases, we illustrate how diffusion models enable extreme compression while maintaining semantic fidelity:
Fidelity-oriented human semantic communications balancing consistency-realism trade-offs for perceptually realistic reconstruction
Task-specific machine semantic communications optimizing effectiveness-efficiency trade-offs for downstream task execution under bandwidth constraints
Intent-driven agent semantic communications managing centralization-distribution trade-offs for multi-agent coordination through shared probabilistic representations
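The score-based foundations listed above admit a compact summary. As a reference (written in the standard notation of score-based generative modeling, which may differ slightly from the paper's own), the forward and reverse SDEs, the probability flow ODE, and the denoising score matching objective are:

```latex
% Forward (noising) SDE and its reverse-time counterpart
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w},
\qquad
\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}

% Deterministic probability flow ODE sharing the same marginals p_t
\mathrm{d}\mathbf{x} = \Big[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\, g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\Big]\,\mathrm{d}t

% Denoising score matching objective for the score network s_theta
\min_\theta \; \mathbb{E}_{t,\, \mathbf{x}_0,\, \mathbf{x}_t}
\Big[ \lambda(t)\, \big\| \mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{x}_0) \big\|_2^2 \Big]
```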
Why is this article needed?
As wireless systems approach Shannon capacity limits, semantic communications represent a paradigm shift from bit-accurate transmission toward meaning-centric communication. The emergence of diffusion models as powerful generative priors has catalyzed generative semantic communications, where receivers reconstruct high-quality content from minimal semantic cues. However, the field currently lacks systematic guidance connecting diffusion model techniques to semantic communication system design. This article fills that critical gap by:
Eliminating barriers between machine learning and communication communities
Providing depth beyond existing surveys and magazines through rigorous mathematical treatment and implementation details
Establishing connections via an inverse problem perspective that reformulates semantic decoding as posterior inference
Offering practical resources including open-source implementations and deployment guidelines
Who should read this?
We believe this article may be helpful to the following groups of people:
Researchers in semantic communications seeking to leverage diffusion models
Machine learning practitioners interested in wireless communication applications
Graduate students entering the interdisciplinary field of AI-native wireless networks
Engineers designing next-generation communication systems with semantic awareness
Seminal works establishing the theoretical and practical foundations of diffusion models.
| # | Method | Venue | Key Contribution | Links |
|---|--------|-------|------------------|-------|
| 1 | Deep Unsupervised Learning using Nonequilibrium Thermodynamics | ICML'15 | First diffusion model using thermodynamic principles | |
| 2 | NCSN - Generative Modeling by Estimating Gradients | NeurIPS'19 | Score matching with Langevin dynamics (SMLD) | |
| 3 | DDPM - Denoising Diffusion Probabilistic Models | NeurIPS'20 | Simplified training objective and high-quality generation | |
| 4 | DDIM - Denoising Diffusion Implicit Models | ICLR'21 | Non-Markovian sampling for accelerated generation | |
| 5 | Score SDE - Score-Based Generative Modeling through SDEs | ICLR'21 | Unified SDE framework connecting score matching and diffusion | |
| 6 | LDM - High-Resolution Image Synthesis with Latent Diffusion Models | CVPR'22 | Diffusion in learned latent spaces (Stable Diffusion) | |
🎨 Conditional Diffusion Models
Conditional diffusion models enable controlled generation by incorporating external guidance. This section covers two main categories based on when conditioning is applied.
Inference-Time Conditional Diffusion Models
These methods introduce guidance during sampling without modifying the pre-trained model.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | CG - Classifier Guidance | NeurIPS'21 | Adds classifier gradients to steer generation | |
| 2 | ILVR | ICCV'21 | Iterative refinement toward a reference image | |
| 3 | SDEdit | ICLR'22 | Structure-preserving editing via controlled denoising | |
| 4 | RePaint | CVPR'22 | Inpainting by alternating denoising and re-noising | |
| 5 | Prompt-to-Prompt | arXiv'22 | Cross-attention editing guided by text prompts | |
| 6 | DDRM | NeurIPS'22 | Linear inverse problem solver using diffusion priors | |
| 7 | MCG | NeurIPS'22 | Adds manifold consistency during sampling | |
| 8 | DDNM - Denoising Diffusion Null-space Model | ICLR'23 | Null-space projection for zero-shot restoration | |
| 9 | DPS - Diffusion Posterior Sampling | ICLR'23 | Posterior sampling with measurement guidance | |
| 10 | πGDM - Pseudoinverse-Guided DM | ICLR'23 | Pseudoinverse-based conditioning for inverse tasks | |
| 11 | Null-Text Inversion | CVPR'23 | Real-image editing via null-text optimization | |
| 12 | BlindDPS | CVPR'23 | Jointly samples unknown operator and clean signal | |
| 13 | DiffPIR | CVPRW'23 | Plug-and-play restoration with diffusion priors | |
| 14 | DiffusionMBIR | CVPR'23 | Uses 2D diffusion priors for 3D reconstruction | |
| 15 | FreeDoM | ICCV'23 | Training-free diffusion adaptation for new tasks | |
| 16 | DG - Discriminator Guidance | ICML'23 | Introduces a discriminator providing explicit supervision along the denoising sample path | |
| 17 | SMRD | MICCAI'23 | MRI reconstruction via diffusion priors | |
| 18 | PSLD | NeurIPS'23 | Posterior sampling in latent diffusion space | |
| 19 | RED-diff | ICLR'24 | Variational regularization with diffusion denoisers | |
| 20 | ControlVideo | ICLR'24 | Video editing with spatial/temporal control via fine-tuning | |
| 21 | DeqIR | CVPR'24 | Fixed-point solver for diffusion restoration | |
| 22 | SparseCtrl | ECCV'24 | Adds sparse keyframe controls to text-to-video diffusion | |
| 23 | DiffBIR | ECCV'24 | Blind image restoration with generative diffusion priors | |
| 24 | DMPlug | NeurIPS'24 | Plug-in solver for general inverse problems | |
| 25 | DGSolver | NeurIPS'25 | Diffusion generalist solver with universal posterior sampling | |
| 26 | DAPS | CVPR'25 | Annealed posterior sampling for inverse problems | |
| 27 | SITCOM | ICML'25 | Iterative constrained optimization during sampling | |
| 28 | DiffStateGrad | ICLR'25 | Gradient projection in diffusion latent space | |
| 29 | RF-Inversion | ICLR'25 | Semantic image inversion and editing using rectified SDEs | |
| 30 | FlowDPS | ICCV'25 | Posterior sampling within flow-matching ODEs | |
Key Formula:
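The shared starting point of these inference-time methods is the Bayes decomposition of the conditional score, where the first term comes from the pre-trained model and the second injects the measurement or guidance signal; methods such as CG and DPS differ mainly in how they approximate the second term. Its standard statement is:

```latex
\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y})
  = \underbrace{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)}_{\text{pre-trained score}}
  + \underbrace{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{y} \mid \mathbf{x}_t)}_{\text{guidance term}}
```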
Training-Time Conditional Diffusion Models
These methods incorporate conditioning directly during model training.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 4 | | | Personalizes a concept via learned token embeddings | |
| 5 | DreamBooth | CVPR'23 | Subject-driven personalization via fine-tuning | |
| 6 | GLIGEN | CVPR'23 | Grounded language-to-image generation | |
| 7 | InstructPix2Pix | CVPR'23 | Instruction-based image editing | |
| 8 | ControlNet | ICCV'23 | Fine-grained spatial control | |
| 9 | IP-Adapter | arXiv'23 | Image prompt adapter for identity/style conditioning | |
| 10 | MoD - Mixture of Diffusers | arXiv'23 | Conditional diffusion with learned mixture experts | |
| 11 | DiT - Diffusion Transformer | ICCV'23 | Transformer-based diffusion | |
| 12 | MDT - Masked Diffusion Transformer | ICCV'23 | Masked diffusion transformers | |
| 13 | SDXL - Stable Diffusion XL | ICLR'24 | High-res text-to-image diffusion with multi-aspect conditioning | |
| 14 | T2I-Adapter | AAAI'24 | Lightweight adapters for control | |
| 15 | AnimateDiff | ICLR'24 | Motion module for animation | |
| 16 | LVD - LLM-grounded Video Diffusion | ICLR'24 | LLM-guided video generation | |
| 17 | SEINE | ICLR'24 | Short-to-long video diffusion | |
| 18 | VideoCrafter2 | CVPR'24 | Open-source text-to-video / video editing diffusion pipeline | |
| 19 | HunyuanDiT | CVPR'24 | Large-scale DiT-based text-to-image diffusion with strong conditioning | |
| 20 | S-CFG - Rethinking Spatial Inconsistency in CFG | CVPR'24 | Analyzes and improves spatial consistency in CFG-based generation | |
| 21 | D3PO | CVPR'24 | RLHF-style preference finetuning for diffusion without reward model | |
| 22 | DreamMatcher | CVPR'24 | Appearance matching self-attention for semantically-consistent text-to-image personalization | |
| 23 | PixArt-Σ | ECCV'24 | High-resolution text-to-image | |
| 24 | Follow-Your-Emoji | SIGGRAPH Asia'24 | Fine-controllable and expressive freestyle portrait animation with diffusion | |
| 25 | HunyuanVideo | arXiv'24 | High-res text-to-video diffusion with multi-scale DiT backbone | |
| 26 | DDO - Direct Discriminative Optimization | ICML'25 | Direct optimization for preference alignment | |
| 27 | CFG++ | ICLR'25 | Refines CFG via dynamic gradient weighting | |
| 28 | Ctrl-Adapter | ICLR'25 | Unified adapter to inject diverse spatial/temporal controls into image/video diffusion | |
| 29 | T2V-Turbo-v2 | ICLR'25 | Fast text-to-video generation | |
| 30 | β-CFG | arXiv'25 | Dynamic guidance method for text-to-image diffusion models | |
Key Formula:
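Most of the training-time methods above build on classifier-free guidance (CFG), which jointly trains conditional and unconditional predictions and combines them at sampling time with a guidance scale $w$; its standard form is:

```latex
\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{c})
  = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)
  + w \big( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c})
  - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing) \big)
```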
⚡ Efficient Diffusion Models
Efficient diffusion models aim to reduce computational cost and sampling time through various acceleration strategies.
Dimensionality Reduction
Operating in compressed latent spaces reduces computational overhead.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | LDM - Latent Diffusion Model | CVPR'22 | Stable Diffusion foundation | |
| 2 | WSGM - Wavelet Score-based GM | NeurIPS'22 | Wavelet-based score models | |
| 3 | DiT - Diffusion Transformer | ICCV'23 | Transformer-based diffusion | |
| 4 | WaveDiff | CVPR'23 | Wavelet-based diffusion | |
| 5 | LMD - Latent Masking Diffusion | AAAI'24 | Combines the advantages of MAEs and diffusion | |
Knowledge Distillation
Distilling multi-step diffusion into fewer steps or single-step models.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | PD - Progressive Distillation | ICLR'22 | 4-8 steps with minimal quality loss | |
| 2 | CM - Consistency Model | ICML'23 | Single-step generation | |
| 3 | LCM - Latent Consistency Model | arXiv'23 | Distills diffusion into few-step latent consistency models | |
| 4 | DMD2 - Distribution Matching Distillation v2 | NeurIPS'24 | Improved distribution matching | |
| 5 | CTM - Consistency Trajectory Model | ICLR'24 | Trajectory consistency modeling | |
| 6 | iCT - Improved Consistency Training | ICML'24 | Improved consistency training without teacher models | |
Structure Pruning
Reducing model parameters through structured pruning.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | Diff-Pruning | NeurIPS'23 | Structural pruning for diffusion | |
| 2 | TDPM - Truncated DPM | ICLR'23 | Truncated diffusion models | |
| 3 | LD-Pruner | CVPR'24 | Latent diffusion pruning | |
| 4 | DiP-GO | NeurIPS'24 | Diffusion pruning with gradient optimization | |
| 5 | AdaDiff | ECCV'24 | Adaptive diffusion pruning | |
| 6 | SnapFusion | NeurIPS'23 | Mobile diffusion via architecture evolution and data distillation | |
Cache Reuse
Reusing intermediate computations across sampling steps.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | DeepCache | CVPR'24 | Deep feature caching | |
| 2 | BlockCaching | CVPR'24 | Block-wise caching strategy | |
| 3 | L2C - Learning to Cache | NeurIPS'24 | Learned caching policies | |
| 4 | ToCa - Token-wise Caching | ICLR'25 | Token-wise feature caching for DiT acceleration | |
| 5 | ClusCa - Clustered Caching | MM'25 | Compute-efficient clustering cache | |
| 6 | TaylorSeer | ICCV'25 | Taylor expansion-based feature forecasting for DiT acceleration | |
Flow Matching
Transforming diffusion into deterministic flows for faster sampling.
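In the rectified-flow formulation used by several of the methods below, training reduces to regressing a velocity field onto straight-line interpolations between data and noise, which is what enables few-step or one-step sampling. A standard statement of the conditional flow matching objective (in common notation, not necessarily the paper's) is:

```latex
\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\mathbf{x}_1,
\qquad
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\, \mathbf{x}_0 \sim p_0,\, \mathbf{x}_1 \sim p_1}
    \big\| \mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0) \big\|_2^2
```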
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | Flow Matching | ICLR'23 | Continuous normalizing flows | |
| 2 | Rectified Flow | ICLR'23 | Straightening probability flows | |
| 3 | PeRFlow - Piecewise Rectified Flow | NeurIPS'24 | Piecewise rectification for accelerating diffusion models | |
| 4 | InstaFlow | ICLR'24 | One-step generation via rectified flow | |
| 5 | MeanFlow | NeurIPS'25 | Mean-field flow matching | |
| 6 | Stable Diffusion 3 | arXiv'24 | Scaling rectified flow transformers for high-resolution image synthesis (MMDiT) | |
| 7 | FLUX | arXiv'25 | High-quality flow matching-based text-to-image model with hybrid transformer architecture | |
🌐 Generalized Diffusion Models
Generalized diffusion models extend the framework to diverse modalities, domains, and tasks.
Modality Expansion
Extending diffusion to multiple modalities beyond images.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | MonoFormer | arXiv'24 | One transformer for both diffusion and autoregression | |
| 2 | Diffusion Forcing | NeurIPS'24 | Full-sequence diffusion forcing | |
| 3 | Show-o | ICLR'25 | Unified image and text generation | |
| 4 | Transfusion | ICLR'25 | Combining diffusion and autoregression | |
| 5 | UniDisc | arXiv'25 | Unified discrete-continuous diffusion | |
| 6 | OmniGen2 | arXiv'25 | Unified image generation model with multi-modal conditioning | |
Domain Adaptation
Adapting diffusion models to specialized domains.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | DSB - Diffusion Schrödinger Bridge | NeurIPS'21 | Domain transfer via Schrödinger bridge | |
| 2 | Composable Diffusion | ECCV'22 | Compositional visual generation | |
| 3 | DreamBooth | CVPR'23 | Personalization with few examples | |
| 4 | I2SB - Image-to-Image Schrödinger Bridge | ICML'23 | Image-to-image translation | |
| 5 | P2P-Bridge | ECCV'24 | Point-to-point bridging | |
| 6 | OT-CFM | ICLR'23 | Optimal transport conditional flow matching for efficient domain coupling | |
Task Generalization
Generalizing diffusion models across multiple tasks.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | Diffuser | ICML'22 | Planning with diffusion models | |
| 2 | Diffusion Policy | RSS'23 | Visuomotor policy learning | |
| 3 | DDPO - Denoising Diffusion Policy Optimization | ICLR'24 | RL fine-tuning for diffusion | |
| 4 | C-LoRA - Continual LoRA | TMLR'24 | Continual learning for diffusion | |
| 5 | Diffusion-ES | CVPR'24 | Evolutionary search with diffusion for black-box trajectory optimization | |
| 6 | B²-DiffuRL | CVPR'25 | Bidirectional diffusion for RL | |
| 7 | DPPO - Diffusion Policy Policy Optimization | ICLR'25 | PPO fine-tuning for diffusion policies in robotics | |
🛜 Diffusion Models for Semantic Communications
This section presents applications of diffusion models in semantic communications.
[Preliminary] Diffusion Models for Data Compression
Representative works using diffusion models for data compression across image, video, and audio modalities.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | CDC | NeurIPS'23 | Conditional diffusion decoder for end-to-end optimized lossy image compression | |
| 2 | HFD | arXiv'23 | High-fidelity compression with score-based generative models | |
| 3 | Multi-Band Diffusion | NeurIPS'23 | High-fidelity audio generation from low-bitrate discrete representations | |
| 4 | PerCo | ICLR'24 | Ultra-low bitrate image compression with diffusion models (0.003 bpp) | |
| 5 | IPIC (Idempotence) | ICLR'24 | Perceptual compression via idempotence constraints without training new models | |
| 6 | CorrDiff | ICML'24 | Correcting diffusion compression with privileged end-to-end decoder | |
| 7 | Foundation Diffusion | ECCV'24 | Lossy compression using pre-trained foundation models without fine-tuning | |
| 8 | Extreme Video Compression | WCSP'24 | Extreme video compression with diffusion-based predictive generation (0.02 bpp) | |
| 9 | UQDM | ICLR'25 | Progressive compression with universally quantized diffusion models | |
| 10 | DiffC | ICLR'25 | Zero-shot lossy compression using pretrained Stable Diffusion models | |
| 11 | PICD | CVPR'25 | Versatile perceptual image compression with diffusion rendering for screen and natural images | |
Fidelity-Oriented Human Semantic Communications
Diffusion models for high-quality semantic image, video, and audio transmission prioritizing perceptual fidelity for human consumption.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | DM4ASC | ICASSP'24 | First diffusion framework for audio semantic communication as inverse problem | |
| 2 | CommIN | ICASSP'24 | INN-guided diffusion for wireless image transmission as inverse problem | |
| 3 | DiffSC | ICASSP'24 | DDPM with Multi-Dimensional Feature Extraction for high-noise environments | |
| 4 | CDDM | TWC'24 | Channel denoising diffusion models adapting to AWGN/Rayleigh channels | |
| 5 | Gen-SC | WCSP'24 | Transmits images efficiently by sending text descriptions and reconstructing images via a text-to-image diffusion model | |
| 6 | CDM-JSCC | WCL'24 | Enhances the perceptual quality of transmitted images by utilizing a rate-adaptive conditional diffusion model | |
| 7 | Img2Img-SC | MLSP'24 | Language-oriented semantic communication framework that transmits both textual descriptions and compressed image embeddings | |
| 8 | MU-GSC | arXiv'24 | Swin Transformer JSCC with diffusion decoder, 17.75% PSNR improvement | |
| 9 | DiffJSCC | TMLCN'25 | Pre-trained Stable Diffusion with Deep JSCC achieving <0.008 symbols/pixel | |
| 10 | DiffCom | JSAC'25 | Probabilistic sampling using channel signals as fine-grained conditions | |
| 11 | GVSC | TVT'25 | First generative video semantic communication at low bandwidth ratio | |
| 12 | Wang et al. | arXiv'25 | Receiver-driven retransmission with caption-guided latent diffusion inpainting | |
| 13 | SGD-JSCC | arXiv'25 | DiT-based diffusion with semantic side information for channel denoising | |
| 14 | WVSC-D | arXiv'25 | Wireless video semantic communication framework with decoupled diffusion multi-frame compensation | |
| 15 | DiT-JSCC | arXiv'26 | A DiT-based generative JSCC that ensures high semantic consistency for image transmission under extreme channel conditions | |
Task-Specific Machine Semantic Communications
Resource-efficient diffusion models optimized for machine semantic communications and edge computing scenarios.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | GESCO | arXiv'23 | Pioneering diffusion-based machine semantic communication transmitting compressed semantic maps | |
| 2 | Qiao et al. | WCL'24 | Latency-aware generative semantic communications with pre-trained diffusion models | |
| 3 | SCGSC | WCNC'24 | Semantic change driven generative machine semantic communication framework | |
| 4 | LDM-SemCom | TWC'25 | Real-time edge computing with end-to-end consistency distillation | |
| 5 | Guo et al. | TWC'25 | Treating wireless transmission as forward diffusion process with VAE modules | |
| 6 | Q-GESCO | WCL'25 | Quantized models reducing memory 75% and FLOPs 79% for resource-constrained devices | |
| 7 | CASC | ICC'25 | Latent diffusion with Condition-Aware NN, 51.7% inference time reduction | |
| 8 | SC-Diffusion | TMLCN'25 | Parameter generation for task-oriented semantic communications via conditional diffusion model | |
| 9 | Khalid et al. | ICML'25 | Semantic image communication via Stable Cascade with compact latent embeddings | |
| 10 | Wang et al. | arXiv'25 | Training-free LDM receiver with SDE-derived SNR-to-timestep mapping for zero-shot generalization | |
| 11 | DiffSem | arXiv'25 | Task-oriented with privacy, notable accuracy improvement on MNIST | |
| 12 | SS-MGSC | arXiv'25 | A multi-user generative semantic communication framework utilizing semantic-splitting and diffusion models for personalized vehicular networks | |
Intent-Driven Agent Semantic Communications
AI agents with diffusion models for intent-driven semantic communications.
| # | Method | Venue | Description | Links |
|---|--------|-------|-------------|-------|
| 1 | A-GSC | TWC'24 | Agent-driven generative semantic communications with cross-modality and prediction based on diffusion RL | |
| 2 | Semantic Collaboration | CNIOT'24 | A multi-agent collaboration framework based on semantic communication for search and rescue tasks | |
| 3 | CSCA | TMC'26 | A diffusion policy-empowered cognitive SemCom agent for intent-driven multimodal communication planning at the edge | |
📊 Benchmarks and Datasets
Benchmarks
Widely-used open-source benchmarks for evaluating diffusion model generation quality, prompt fidelity, and compositional capabilities.
Text-to-Image Benchmarks
| # | Benchmark | Description | Source |
|---|-----------|-------------|--------|
| 1 | DrawBench | 200 challenging prompts across 11 categories (counting, colors, spatial, text rendering, etc.) introduced by Imagen for qualitative human evaluation of T2I models. | |
| 2 | PartiPrompts (P2) | 1,600 diverse English prompts spanning 12 categories and 11 challenge aspects for holistic T2I evaluation. Released with the Parti model. | |
| 3 | TIFA | VQA-based automatic evaluation measuring T2I faithfulness by generating question-answer pairs from prompts and verifying against images. 4K prompts, 25K questions across 12 categories. | |
| 4 | T2I-CompBench | Comprehensive compositional T2I benchmark evaluating attribute binding, spatial relationships, and complex compositions with detection-based metrics. | |
| 5 | GenEval | Compositional generation benchmark evaluating object count, spatial relations, attribute binding, and co-occurrence accuracy via object detection pipelines. | |
| 6 | DPG-Bench | Dense prompt generation benchmark with long, detailed prompts synthesized from multi-annotation sources for evaluating models on complex, attribute-rich descriptions. | |
| 7 | MJHQ-30K | 30K high-quality Midjourney images across 10 categories for automatic FID-based aesthetic quality evaluation. Curated with aesthetic and CLIP score filtering. | |
| 8 | GenAI-Bench | 1,600 compositional prompts from professional designers, evaluating advanced reasoning (counting, comparison, logic) with human ratings across 10 leading T2I/T2V models. | |
Video Generation Benchmarks
| # | Benchmark | Description | Source |
|---|-----------|-------------|--------|
| 1 | VBench | Comprehensive video generation benchmark evaluating 16 dimensions including temporal consistency, motion quality, aesthetic fidelity, and subject identity. | |
| 2 | EvalCrafter | Benchmark and pipeline for evaluating video generation models across visual quality, text-video alignment, motion quality, and temporal consistency. | |
Datasets
Audio
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | LibriSpeech | Large-scale corpus of read English speech derived from audiobooks. Clean and noisy subsets available. | 1000 hours | ASR, Speech Recognition | |
| 2 | VCTK | English multi-speaker corpus with 110 speakers reading newspapers. High-quality recordings. | 44 hours | TTS, Voice Conversion, Speaker Recognition | |
| 3 | AudioSet | Large-scale dataset of 2M 10-second audio clips with 527 sound event classes from YouTube. | 2M clips | Audio Classification, Sound Event Detection | |
Image
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | ImageNet | Large-scale image classification dataset with 1000 object categories. Standard benchmark for computer vision. | 1.4M images | Classification, Object Recognition | |
| 2 | COCO | Common Objects in Context. Object detection, segmentation, and captioning with 80 categories. | 330K images | Detection, Segmentation, Captioning | |
| 3 | FFHQ | Flickr-Faces-HQ. High-quality face dataset at 1024×1024 resolution with diverse variations. | 70K images | Face Generation, GAN, Style Transfer | |
| 4 | CLIC | Challenge on Learned Image Compression dataset. Professional quality images for compression research. | 2000+ images | Image Compression, Quality Assessment | |
| 5 | Kodak | Kodak PhotoCD dataset. Standard benchmark with 24 high-quality uncompressed images. | 24 images | Image Compression, Quality Evaluation | |
| 6 | Places365 | Scene recognition dataset with 365 scene categories. Focuses on environmental context. | 10M images | Scene Recognition, Classification | |
| 7 | CelebA | Large-scale face attributes dataset with 40 attribute annotations per image. | 202K images | Face Recognition, Attribute Prediction | |
Video
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | Kinetics-400/600/700 | Large-scale human action video dataset from YouTube. Standard for action recognition. | 650K videos | Action Recognition, Video Classification | |
| 2 | UCF101 | Action recognition dataset with 101 action categories from realistic web videos. | 13K videos | Action Recognition, Video Understanding | |
| 3 | ActivityNet | Large-scale video dataset for human activity understanding with temporal annotations. | 20K videos | Activity Detection, Temporal Localization | |
| 4 | YouTube-8M | Large-scale video understanding dataset with 8M videos and 3862 visual entity classes. | 8M videos | Video Classification, Multi-label | |
| 5 | MSR-VTT | Video captioning dataset with 10K video clips and 200K natural language descriptions. | 10K videos | Video Captioning, Video-Text Retrieval | |
Volume (3D/4D)
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | D-NeRF | Dynamic Neural Radiance Fields dataset with synthetic and real dynamic scenes for 4D reconstruction. | 9 scenes | Dynamic Novel View Synthesis, 4D Reconstruction | |
| 2 | Neu3D | Neural 3D video synthesis dataset with multi-view videos of human performances. | 200+ sequences | 3D Human Reconstruction, Neural Rendering | |
| 3 | ShapeNet | Large-scale 3D shape dataset with 55 object categories and 51,300 3D CAD models. | 51K models | 3D Reconstruction, Shape Analysis | |
| 4 | ScanNet | Richly-annotated indoor RGB-D scans with 3D semantic segmentation labels for 1513 scenes. | 1513 scans | 3D Segmentation, Indoor Scene Understanding | |
| 5 | ModelNet | 3D CAD model dataset with ModelNet40 (40 classes) and ModelNet10 (10 classes) versions. | 12K models | 3D Classification, Point Cloud Processing | |
| 6 | NeRF Synthetic | Blender-rendered synthetic scenes with known camera poses and lighting for NeRF evaluation. | 8 scenes | Novel View Synthesis, 3D Reconstruction | |
Domain-Specific
Autonomous Driving
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | nuScenes | Full 3D sensor suite with LiDAR, radar, and cameras. 1000 scenes with 3D bounding boxes. | 1000 scenes | 3D Detection, Tracking, Prediction | |
| 2 | KITTI | Benchmark suite for stereo, optical flow, visual odometry, and 3D object detection from driving scenarios. | 200K images | 3D Detection, Depth, Odometry | |
| 3 | Waymo Open Dataset | High-resolution sensor data with LiDAR and camera from Waymo vehicles. Large-scale 3D annotations. | 1000 segments | 3D Detection, Tracking, Motion Prediction | |
| 4 | Cityscapes | Urban street scenes with dense pixel-level semantic and instance segmentation annotations. | | | |
Medical Imaging
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| | | Large chest X-ray dataset with free-text radiology reports. Largest publicly available CXR dataset. | 377K images | Disease Classification, Report Generation | |
| 3 | ChestX-ray14 | Large-scale chest X-ray dataset with 14 common disease labels for multi-label classification. | 112K images | Disease Classification, Localization | |
| 4 | Medical Segmentation Decathlon | Multi-organ segmentation covering 10 different medical imaging tasks (CT, MRI). | 2600+ cases | Multi-task 3D Segmentation | |
Depth Estimation
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | NYU Depth V2 | Indoor RGB-D dataset with dense depth maps from Microsoft Kinect. 1449 labeled scenes. | 1449 scenes | Depth Estimation, Indoor Scene Understanding | |
| 2 | DIODE | Dense Indoor and Outdoor DEpth dataset with high-quality depth from laser scanner. | 25K images | Depth Estimation, Normal Estimation | |
| 3 | Middlebury Stereo | Standard stereo matching benchmark with high-resolution calibrated image pairs and ground truth. | 30+ pairs | Stereo Matching, Depth Estimation | |
| 4 | SceneFlow | Large synthetic dataset with optical flow and disparity ground truth for 3D scene understanding. | 39K images | Optical Flow, Stereo Matching, Depth | |
Remote Sensing
| # | Dataset | Description | Size | Tasks | Source |
|---|---------|-------------|------|-------|--------|
| 1 | SpaceNet | High-resolution satellite imagery with building footprints, road networks across multiple cities. | 1M+ buildings | Building Detection, Road Extraction | |
| 2 | xView | One of the largest overhead imagery datasets with 1M object instances across 60 classes. | 1M objects | Object Detection, Classification | |
| 3 | DOTA | Dataset for Object deTection in Aerial images with oriented bounding boxes. 15 categories. | 188K instances | Oriented Object Detection, Aerial Imagery | |
| 4 | LEVIR-CD | Large-scale building change detection dataset from Google Earth with 637 image pairs. | 637 pairs | Change Detection, Building Analysis | |
📏 Evaluation Metrics
Perception Metrics
Full-Reference Metrics
| # | Metric | Description | Source |
|---|--------|-------------|--------|
| 1 | PSNR | Peak Signal-to-Noise Ratio. Measures the ratio between the maximum possible power of a signal and the power of corrupting noise. Calculated as PSNR = 10·log₁₀(MAX²/MSE). | |
| 2 | SSIM | Structural Similarity Index. Assesses image quality based on luminance, contrast, and structure. Designed to improve on PSNR by considering structural information. | |
| 3 | LPIPS | Learned Perceptual Image Patch Similarity. Uses deep neural network features to compute perceptual distance between images, better aligned with human perception. | |
| 4 | DISTS | Deep Image Structure and Texture Similarity. Combines structure and texture similarity using deep features for better perceptual quality assessment. | |
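The PSNR definition above translates directly into code. As a minimal sketch (operating on flat pixel lists rather than image arrays; the `psnr` helper and its signature are ours, not from any particular library):

```python
import math

def psnr(reference, distorted, max_val=255.0):
    """Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE).

    Higher is better; identical inputs give infinity (MSE = 0).
    """
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same length")
    # Mean squared error over corresponding pixels
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For example, two 8-bit signals that differ by 16 at every pixel have MSE = 256, giving a PSNR of roughly 24 dB; library implementations (e.g. in image-processing toolkits) follow the same formula on 2D/3D arrays.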
Reduced-Reference Metrics
| # | Metric | Description | Source |
|---|--------|-------------|--------|
| 1 | RRED | Reduced-Reference Entropic Differencing. Uses entropic differences between wavelet coefficients, requiring only partial statistical features from reference. | |
| 2 | RR-SSIM | Reduced-Reference SSIM. Extracts and transmits only key structural features (edge information, local statistics) from reference image. | |
No-Reference Metrics
| # | Metric | Description | Source |
|---|--------|-------------|--------|
| 1 | NIQE | Natural Image Quality Evaluator. Measures deviation from statistical regularities in natural images using natural scene statistics (NSS). Completely blind quality assessment. | |
| 2 | FID | Fréchet Inception Distance. Calculates Fréchet distance between feature distributions of real and generated images in Inception-v3 space. Lower FID indicates better quality and diversity. | |
| 3 | KID | Kernel Inception Distance. Unbiased alternative to FID using polynomial kernel on Inception features. More reliable for small sample sizes. | |
| 4 | IS | Inception Score. Evaluates both quality (classification confidence) and diversity (marginal class distribution). | |
| 5 | MUSIQ | Multi-scale Image Quality Transformer. Handles native-resolution images via multi-scale patch embedding without fixed-size cropping, enabling more robust no-reference quality assessment. | |
| 6 | CLIP-IQA | Leverages CLIP's vision-language representations for no-reference image quality and aesthetic assessment via prompt-based antonym pairing. | |
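For reference, the FID mentioned above has a closed form: with Gaussian approximations $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ fitted to the Inception-v3 features of real and generated images, it is the squared 2-Wasserstein distance between the two Gaussians:

```latex
\mathrm{FID}
  = \left\| \boldsymbol{\mu}_r - \boldsymbol{\mu}_g \right\|_2^2
  + \mathrm{Tr}\!\left( \boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g
  - 2 \left( \boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g \right)^{1/2} \right)
```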
Semantic Metrics
| # | Metric | Description | Source |
|---|--------|-------------|--------|
| 1 | CLIPScore | Measures text-image alignment using CLIP embeddings. Computed as cosine similarity between CLIP image and text features. | |
| 2 | ViTScore | Uses Vision Transformer features to evaluate semantic similarity between images. Captures high-level semantic content beyond pixel-level differences. | |
| 3 | SeSS | Semantic Similarity Score. Based on Scene Graph Generation and graph matching, shifts image similarity scores into semantic-level graph matching scores. | |
| 4 | DreamSim | Learned perceptual metric trained on synthetic triplet judgments from diffusion models, capturing mid-level semantic similarity beyond low-level texture. | |
| 5 | ImageReward | Text-image alignment metric learned from human preference rankings via reward modeling, designed to evaluate text-to-image generation quality. | |
| 6 | HPSv2 | Human Preference Score v2. Fine-tuned CLIP model predicting human aesthetic preferences for generated images, trained on large-scale human choice data. | |
| 7 | PickScore | Preference-based scoring model trained on the Pick-a-Pic dataset of human pairwise preferences for text-to-image generation. | |
🔗 Other Resources
📚 Comprehensive Books, Surveys & Tutorials
Diffusion Models
| # | Paper | Authors | Year | Links |
|---|-------|---------|------|-------|
| 1 | Understanding Diffusion Models: A Unified Perspective | Luo et al. | 2022 | |
| 2 | Diffusion Models: A Comprehensive Survey of Methods and Applications | Yang et al. | 2022 | |
| 3 | Diffusion Models in Vision: A Survey | Croitoru et al. | 2022 | |
| 4 | A Survey on Generative Diffusion Models | Cao et al. | 2022 | |
| 5 | A Survey on Video Diffusion Models | Xing et al. | 2023 | |
| 6 | Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey | Li et al. | 2023 | |
| 7 | Efficient Diffusion Models: A Comprehensive Survey From Principles to Practices | Ma et al. | 2024 | |
| 8 | Diffusion Model-Based Image Editing: A Survey | Huang et al. | 2024 | |
| 9 | Diffusion Models in Low-Level Vision: A Survey | He et al. | 2024 | |
| 10 | Diffusion Models in 3D Vision: A Survey | Wang et al. | 2024 | |
| 11 | Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review | Uehara et al. | 2024 | |
| 12 | Efficient Diffusion Models: A Survey | Shen et al. | 2025 | |
| 13 | A Survey on Diffusion Language Models | Li et al. | 2025 | |
| 14 | The Principles of Diffusion Models | Lai et al. | 2025 | |
| 15 | Flow Matching Guide and Code | Lipman et al. | 2024 | |
| 16 | An Introduction to Flow Matching and Diffusion Models | Holderrieth & Erives | 2025 | |
Semantic Communications

| # | Paper | Authors | Year | Links |
|---|-------|---------|------|-------|
| 1 | Toward Wisdom-Evolutionary and Primitive-Concise 6G: A New Paradigm of Semantic Communication Networks | Zhang et al. | 2022 | |
| 2 | Semantic Communications for Future Internet: Fundamentals, Applications, and Challenges | Yang et al. | 2022 | |
| 3 | Beyond Transmitting Bits: Context, Semantics, and Task-Oriented Communications | Gunduz et al. | 2022 | |
| 4 | Semantics-Empowered Communications: A Tutorial-Cum-Survey | Lu et al. | 2022 | |
| 5 | Less Data, More Knowledge: Building Next Generation Semantic Communication Networks | Chaccour et al. | 2022 | |
| 6 | Enhancing Deep Reinforcement Learning: A Tutorial on Generative Diffusion Models in Network Optimization | Du et al. | 2023 | |
| 7 | A Survey on Semantic Communication Networks: Architecture, Security, and Privacy | Guo et al. | 2024 | |
| 8 | Resource Management, Security, and Privacy Issues in Semantic Communications: A Survey | Won et al. | 2024 | |
| 9 | Generative AI-Driven Semantic Communication Networks: Architecture, Technologies, and Applications | Liang et al. | 2024 | |
| 10 | A Contemporary Survey on Semantic Communications: Theory of Mind, Generative AI, and Deep Joint Source-Channel Coding | Nguyen et al. | 2025 | |
| 11 | Generative Diffusion Models for Wireless Networks: Fundamental, Architecture, and State-of-the-Art | Fan et al. | 2025 | |
| 12 | Resource Allocation in Wireless Semantic Communications: A Comprehensive Survey | Zhang et al. | 2025 | |
📺 Courses & Video Lectures

| # | Title | Source | Type | Links |
|---|-------|--------|------|-------|
| 1 | Stanford CS236: Deep Generative Models | Stefano Ermon et al. | University Course | |
| 2 | MIT 6.S978: Deep Generative Models | Kaiming He et al. | University Course | |
| 3 | MIT 6.S184: Introduction to Flow Matching and Diffusion Models | Peter Holderrieth & Ezra Erives | University Course | |
| 4 | Diffusion Models Course | Hugging Face | Online Course | |
| 5 | NeurIPS 2023 Workshop: Diffusion Models | NeurIPS | Workshop | |
| 6 | Diffusion and Score-Based Generative Models | Yang Song | Lecture | |
| 7 | Two Minute Papers – Diffusion Series | Two Minute Papers | YouTube Series | |
| 8 | Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song | Blog Post | |
| 9 | What are Diffusion Models? | Lilian Weng | Blog Post | |
🧰 Interactive Demos & Tools

| # | Tool | Type | What it’s great for | Links |
|---|------|------|---------------------|-------|
| 1 | Stable Diffusion WebUI (AUTOMATIC1111) | UI + Extensions | Local UI with a huge plugin ecosystem | |
| 2 | InvokeAI | Pro UI | Studio-style creative workflow & editing | |
| 3 | 🤗 Diffusers | Library | Clean Python API for diffusion inference & training | |
| 4 | Diffusers Playground (Hugging Face Spaces) | Web demo | Try many pipelines online (no local install) | |
| 5 | ComfyUI | Node-graph UI | Modular node-based pipelines for reproducible workflows | |
| 6 | StableStudio (Stability AI) | Official UI | Frontend for SDXL / Stability AI models | |
| 7 | Fooocus | Simple UI | One-click text→image with SDXL support | |
| 8 | kohya-ss / sd-scripts | Training / Finetune | LoRA, DreamBooth, and fine-tuning helpers | |
| 9 | ControlNet | Conditioning model | Pose-, edge-, and depth-guided generation | |
| 10 | sd-webui-controlnet | WebUI Extension | Easy ControlNet integration for the WebUI | |
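Under the hood, all of these UIs and libraries drive some variant of the same reverse-diffusion sampling loop. A minimal DDPM-style ancestral sampler is sketched below in NumPy, with a dummy noise predictor standing in for a trained network; the function names and schedule values are illustrative, not from any of the tools above:

```python
import numpy as np

def ddpm_sample(eps_model, shape, T=50, beta_min=1e-4, beta_max=0.02, seed=0):
    """Minimal DDPM ancestral sampling loop (Ho et al., 2020).
    `eps_model(x, t)` must return the predicted noise eps_theta(x_t, t);
    here we later plug in a dummy predictor for demonstration."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, T)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=shape)                   # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, t)                    # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])   # posterior mean
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise     # ancestral step (sigma_t^2 = beta_t)
    return x

# Dummy "model" that simply predicts the current state as noise.
sample = ddpm_sample(lambda x, t: x, shape=(4, 4))
print(sample.shape, np.isfinite(sample).all())
```

Conditioning mechanisms such as ControlNet modify only the `eps_model` call in this loop, injecting guidance into the predicted noise while the schedule and ancestral update stay unchanged.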
📝 Citation

If you find this article or repository helpful, please consider citing:

```bibtex
@article{qin-diffcomm,
  author  = {H. L. Qin and J. Dai and G. Lu and S. Shao and S. Wang and T. Xu and W. Zhang and P. Zhang and K. B. Letaief},
  title   = {Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications},
  journal = {arXiv preprint arXiv:2511.08416},
  year    = {2025}
}
```
Related Papers from Our Group

```bibtex
@article{dai-gaicomm,
  author  = {J. Dai and X. Qin and S. Wang and L. Xu and K. Niu and P. Zhang},
  title   = {Deep Generative Modeling Reshapes Compression and Transmission: From Efficiency to Resiliency},
  journal = {IEEE Wireless Commun.},
  volume  = {31},
  number  = {4},
  pages   = {48--56},
  year    = {2024}
}

@article{wang-diffcom,
  author  = {S. Wang and J. Dai and K. Tan and X. Qin and K. Niu and P. Zhang},
  title   = {DiffCom: Channel Received Signal is a Natural Condition to Guide Diffusion Posterior Sampling},
  journal = {IEEE J. Sel. Areas Commun.},
  volume  = {43},
  number  = {7},
  pages   = {2651--2666},
  year    = {2025}
}

@article{qin-semcod,
  author  = {H. L. Qin and J. Dai and S. Wang and X. Qin and S. Shao and K. Niu and W. Xu and P. Zhang},
  title   = {Neural Coding is Not Always Semantic: Toward the Standardized Coding Workflow in Semantic Communications},
  journal = {IEEE Commun. Stand. Mag.},
  volume  = {9},
  number  = {4},
  pages   = {24--33},
  year    = {2025}
}

@article{tan-ditjscc,
  author  = {K. Tan and J. Dai and S. Wang and G. Lu and S. Shao and K. Niu and W. Zhang and P. Zhang},
  title   = {DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations},
  journal = {arXiv preprint arXiv:2601.03112},
  year    = {2026}
}
```
🌟 Acknowledgments
We thank the diffusion models and semantic communications research communities for their groundbreaking work. Special thanks to all current and future contributors to this repository.
⭐ Star this repo if you find it useful! ⭐
Maintained with ❤️ by the community.