# AI Models

Explore and deploy top AI models built by the community, accelerated by [NVIDIA’s AI inference platform](https://developer.nvidia.com/topics/ai/ai-inference?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.inference%3Adesc%2Ctitle%3Aasc) and run on NVIDIA-accelerated infrastructure.

[Explore Models](https://build.nvidia.com/models) | [View Performance](/deep-learning-performance-training-inference/ai-inference)

* * *

## ![AI Model - DeepSeek logo](https://developer.download.nvidia.com/images/pretrained-ai-models/deepseek.svg)DeepSeek

DeepSeek is a family of open-source models, several of which use a [mixture-of-experts (MoE) architecture](https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/) and provide advanced reasoning capabilities. DeepSeek models can be optimized for performance with TensorRT-LLM for data center deployments. You can use NIM to try out the models for yourself or customize them with the open-source NeMo framework.

#### Explore

Explore sample applications to learn about different use cases for DeepSeek models.
- [Explore how DeepSeek-R1 8K/1K results show a 15x performance benefit and revenue opportunity for NVIDIA Blackwell GB200 NVL72 over Hopper H200](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [Community Tutorial: Journey With DeepSeek R1 on NVIDIA Jetson Orin Nano™ Super Using Docker and Ollama](https://dev.to/ajeetraina/my-journey-with-deepseek-r1-on-nvidia-jetson-orin-nano-super-using-docker-and-ollama-1k2m)
- [NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action](https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html)

#### Integrate

Get started with the right tools and frameworks for your development environment.

- [Download Containers From the Jetson AI Lab](https://www.jetson-ai-lab.com/models.html)
- [Customize DeepSeek v3 With Your Own Data Using the NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/deepseek_v3.html)

#### Optimize

Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using DeepSeek in TensorRT-LLM.

- [How to Get the Best Performance on DeepSeek-R1 in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
- [Quantize DeepSeek-R1 to FP4 With TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/deepseek)
- [Automating GPU Kernel Generation With DeepSeek-R1 and Inference Time Scaling](https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/)

TensorRT Model Optimizer now has an experimental feature to deploy to vLLM. [Check out the workflow](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm).

Get started with the models for your development environment.

Model

### Get Production-Ready DeepSeek Models With NVIDIA NIM

Rapid prototyping is just an API call away.

[Deploy Production-Ready DeepSeek Models](https://build.nvidia.com/search?q=deepseek)

Model

### NVIDIA DeepSeek R1 FP4

The NVIDIA DeepSeek R1 FP4 model is the quantized version of the DeepSeek R1 model, an autoregressive language model that uses an optimized transformer architecture. It is quantized with TensorRT Model Optimizer.

[Deploy With TensorRT-LLM on Hugging Face](https://huggingface.co/nvidia/DeepSeek-R1-FP4#deploy-with-tensorrt-llm)

Model

### DeepSeek on Ollama

Ollama lets you deploy DeepSeek quickly to all your GPUs.

[Deploy Optimized Models Using Ollama](https://ollama.com/library/deepseek-r1)

[View More Family Models](https://build.nvidia.com/search?q=deepseek)

## ![AI Model - Google DeepMind’s Gemma logo](https://developer.download.nvidia.com/images/pretrained-ai-models/gemma.svg)Gemma

Gemma is Google DeepMind’s family of lightweight, open models. Gemma models span a variety of sizes and specialized domains to meet each developer's unique needs. NVIDIA has worked with Google to enable these models to run optimally across NVIDIA’s platforms, ensuring you get maximum performance on your hardware, from data center GPUs based on the NVIDIA Blackwell and NVIDIA Hopper architectures to Windows RTX and Jetson devices. Enterprise customers can deploy optimized containers using NVIDIA NIM microservices for production-grade support and customize them using the end-to-end NeMo framework. With the latest release of Gemma 3n, these models are now natively multilingual and multimodal for your text, image, video, and audio data.
#### Explore

Explore sample applications to learn about different use cases for Gemma models.

- [Watch Gemma3 on Jetson Orin Nano: Live Demo Running Visual Language Models at 15 TPS (With Examples)](https://www.youtube.com/watch?v=jSKHeYVcAB8)
- [Watch Google’s Gemma2 SLM on NVIDIA Jetson Orin Nano: The Future of Conversational Edge AI](https://www.youtube.com/watch?v=mgUrthfw3ys)

#### Integrate

Use Gemma on your devices and make it your own.

- [Download Gemma Containers From the Jetson AI Lab](https://www.jetson-ai-lab.com/models.html)
- [Download Gemma Through Chat With RTX GitHub](https://github.com/NVIDIA/ChatRTX)
- [Customize Gemma for Your Data Using the NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/24.09/llms/gemma.html)
- Read the Blog: [Run Google DeepMind’s Gemma 3n on NVIDIA Jetson and RTX](https://developer.nvidia.com/blog/run-google-deepminds-gemma-3n-on-nvidia-jetson-and-rtx/)

#### Optimize

Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Gemma in TensorRT-LLM.

- Read the Blog: [NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-revs-up-inference-for-google-gemma/)
- [Serve Gemma Open Models Using GPUs on GKE With Dynamo-Triton™ and TensorRT-LLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tensortllm)
- [TensorRT-Model-Optimizer Post-Training Quantization Guide Compatible With vLLM and SGLang](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)

Get started with the models for your development environment.

Model

### Get Started With Gemma Models With NVIDIA NIM

Gemma 3 is now featured on the NVIDIA API Catalog, enabling rapid prototyping with just an API call.
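As a sketch, such a prototype call against the catalog's OpenAI-compatible chat-completions endpoint might look like the following. The model id shown is an assumption for illustration; verify the exact id on the model's page at build.nvidia.com.

```python
import json
import os
import urllib.request

# NVIDIA API Catalog models sit behind an OpenAI-compatible endpoint.
# MODEL_ID is a hypothetical id -- check the exact string on the catalog page.
API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL_ID = "google/gemma-3-27b-it"  # assumption; verify on build.nvidia.com

def build_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }

def call_nim(prompt: str, api_key: str) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    key = os.environ.get("NVIDIA_API_KEY")
    if key:  # only hit the network when a key is configured
        print(call_nim("Summarize what Gemma is in one sentence.", key))
```

Because the endpoint is OpenAI-compatible, the same payload shape works with the official `openai` Python client by pointing its `base_url` at the catalog.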
[Experiment and Deploy Gemma Models](https://build.nvidia.com/search?q=gemma)

Model

### Gemma 3 Models on Ollama

Ollama lets you start experimenting in seconds with the most capable Gemma model that runs on a single NVIDIA H100 Tensor Core GPU.

[Download Gemma 3 on Ollama](https://ollama.com/library/gemma3)

Model

### Gemma-2b-it ONNX INT4

The Gemma-2b-it ONNX INT4 model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). Easily fine-tune and adapt the model to your unique requirements with Hugging Face’s Transformers library or your preferred development environment.

[Download on Hugging Face](https://huggingface.co/nvidia/Gemma-2b-it-ONNX-INT4)

[View More Family Models](https://build.nvidia.com/search?q=gemma)

## ![AI Model - OpenAI logo](https://developer.download.nvidia.com/images/logos/logo-openai.svg)gpt-oss

NVIDIA and OpenAI began pushing the boundaries of AI with the launch of NVIDIA DGX™ back in 2016. The collaboration continues with the launch of the OpenAI gpt-oss-20b and gpt-oss-120b open-weight models. NVIDIA has optimized both models for [10x inference performance](https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/) on the NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second (TPS) on an NVIDIA GB200 NVL72 system.

#### Explore

Explore open models and samples to learn about different use cases for NVIDIA-optimized gpt-oss models.
- NVIDIA Launchable: [Optimizing Inference With NVIDIA TensorRT-LLM](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)
- [How to Build a Simple AI Agent With OpenAI’s gpt-oss-20b](https://www.youtube.com/watch?v=e2sgwsC92Bc)
- Read the Blog: [Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72, NVIDIA Accelerates OpenAI gpt-oss Models From Cloud to Edge](https://developer.nvidia.com/blog/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge/)
- [Explore OpenAI gpt-oss Models in SGLang Collaboration Between Eigen AI, NVIDIA, SGLang, and Open-Source Community](https://github.com/sgl-project/sglang/issues/8833)

#### Integrate

Get started with the right tools and frameworks for your development environment, leveraging open gpt-oss models.

- [Ollama RTX Getting Started: 3-Step Infographic](https://developer.download.nvidia.com/images/pretrained-ai-models/rtx-ai-garage-3-steps-20b.png)
- [GGML/llama.cpp GitHub README](https://github.com/ggml-org/llama.cpp)
- [GitHub: Codex CLI – Getting Started](https://github.com/openai/codex)
- [vLLM Supports gpt-oss](https://blog.vllm.ai/2025/08/05/gpt-oss.html)

#### Optimize

NVIDIA has optimized both new open-weight models for accelerated inference performance on the NVIDIA Blackwell architecture.
- [Using NVIDIA TensorRT-LLM to Run gpt-oss-20b in OpenAI’s Cookbook](https://cookbook.openai.com/articles/run-nvidia)
- [NVIDIA TensorRT-LLM Performance Over Time for the gpt-oss-120b Model](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [Dynamo Deployment Guide: Running gpt-oss-120b Disaggregated With TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/trtllm/gpt-oss.md)
- [Optimized Attention and MoE Routing Kernels Are Available Through the FlashInfer Kernel-Serving Library for LLMs](https://github.com/flashinfer-ai/flashinfer)
- Guide to [speculative decoding](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/) for gpt-oss-120b using the [gpt-oss-120b-Eagle3](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3) model

Get started with the models for your development environment.

Model

### Get Production-Ready gpt-oss Models With NVIDIA NIM

Download and deploy pre-packaged, portable, optimized NIM microservices:

- gpt-oss-120b: available for download [Link](https://build.nvidia.com/openai/gpt-oss-120b/deploy) | [Docs](https://docs.api.nvidia.com/nim/reference/openai-gpt-oss-120b)
- gpt-oss-20b: available for download [Link](https://build.nvidia.com/openai/gpt-oss-20b/deploy) | [Docs](https://docs.api.nvidia.com/nim/reference/openai-gpt-oss-20b)

Model

### Explore gpt-oss Models on Hugging Face

NVIDIA worked with several top open-source frameworks such as [Hugging Face Transformers](https://huggingface.co/blog/welcome-openai-gpt-oss), Ollama, and vLLM, in addition to NVIDIA TensorRT-LLM, on optimized kernels and model enhancements.

[Explore 120b model](https://huggingface.co/openai/gpt-oss-120b)

Model

### Explore gpt-oss on Ollama

Developers can experience these models through their favorite apps and SDKs using Ollama, Llama.cpp, or Microsoft AI Foundry Local.
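As a minimal sketch, querying a locally running Ollama server via its REST API might look like this. It assumes the model has already been pulled (for example, `ollama pull gpt-oss:20b`); the model tag is an assumption to check against the Ollama library.

```python
import json
import urllib.request

# Ollama serves a local REST API on port 11434. The model tag below is an
# assumption -- confirm it with `ollama list` after pulling the model.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_ollama_request(prompt: str, model: str = "gpt-oss:20b") -> dict:
    """Build a non-streaming payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_ollama_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

With `"stream": False`, Ollama returns one JSON object whose `response` field holds the full completion; omit it to receive newline-delimited streaming chunks instead.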
[Explore on Ollama](https://ollama.com/library/gpt-oss)

## ![AI Model - Moonshot AI Kimi logo](https://developer.download.nvidia.com/images/pretrained-ai-models/kimi-color-1.svg)Kimi

Kimi is Moonshot AI’s family of open-weight models, including MoE, thinking, and specialized variants. Kimi K2 is a state-of-the-art MoE language model with 32 billion activated parameters and 1 trillion total parameters. The Kimi K2 Thinking MoE model, ranked as the most intelligent open-source model on the Artificial Analysis leaderboard, saw a [10x performance leap](https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/) on the NVIDIA GB200 NVL72 rack-scale system compared with NVIDIA HGX™ H200. Fireworks AI has deployed Kimi K2 on the NVIDIA B200 platform to achieve the highest performance on the Artificial Analysis leaderboard.

#### Explore

Explore open models and samples to learn about different use cases for NVIDIA-optimized Kimi models.

- [Breakthrough Performance: Cost-Effective Kimi K2 Training for Everyone](https://developer.nvidia.com/blog/accelerating-large-scale-mixture-of-experts-training-in-pytorch/)
- [How to Deploy the Kimi K2 Model on the Cloud With NVIDIA Hopper](https://www.gmicloud.ai/blog/how-to-deploy-the-kimi-k2-model-on-the-cloud)

#### Integrate

Get started with the right tools and frameworks for your development environment, leveraging open Kimi models.

- [Lambda's Tutorial on Serving a One Trillion-Parameter Model on 8x NVIDIA Blackwell GPUs With vLLM](https://lambda.ai/blog/how-to-serve-kimi-k2-instruct-on-lambda-with-vllm?utm_source=twitter&utm_medium=organic-social&utm_campaign=2025-12-kimi-k2-tutorial&utm_content=post-1)

#### Optimize

Learn how NVIDIA has optimized open-weight models for accelerated inference performance on the NVIDIA Blackwell architecture.
- [Kimi K2 Thinking Runs 10x Faster to Enable One-Tenth the Cost per Token on NVIDIA GB200 NVL72](https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/)
- [NVIDIA Kimi K2 Thinking NVFP4 Model Is Quantized With TensorRT Model Optimizer](https://huggingface.co/nvidia/Kimi-K2-Thinking-NVFP4)
- [Kimi K2 Thinking at 140+ TPS on NVIDIA Blackwell With Baseten](https://www.baseten.co/blog/kimi-k2-thinking-at-140-tps-on-nvidia-blackwell/)

Get started with the models for your development environment.

Model

### Get Production-Ready Kimi K2 Models With NVIDIA NIM

Download and deploy pre-packaged, portable, optimized NIM microservices:

- Kimi K2 Instruct: [Link](https://build.nvidia.com/moonshotai/kimi-k2-instruct/modelcard) | [Docs](https://docs.api.nvidia.com/nim/reference/moonshotai-kimi-k2-instruct)
- Kimi K2 Instruct 0905: [Link](https://build.nvidia.com/moonshotai/kimi-k2-instruct-0905/modelcard) | [Docs](https://docs.api.nvidia.com/nim/reference/moonshotai-kimi-k2-instruct-0905)

Model

### Explore Kimi K2 Thinking NVFP4 on Hugging Face

The NVIDIA Kimi K2 Thinking NVFP4 model is the quantized version of Moonshot AI's Kimi K2 Thinking model, an autoregressive language model that uses an optimized transformer architecture. It is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
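Conceptually, NVFP4-style quantization stores weights in a 4-bit floating-point format with shared per-block scale factors. The toy below sketches that quantize/dequantize round trip in pure Python; the grid mirrors the representable magnitudes of an FP4 E2M1 value, but this is an illustration only, not the TensorRT Model Optimizer API.

```python
# Toy illustration of block-scaled low-precision quantization (the idea behind
# formats like NVFP4): one scale per block, values snapped to a tiny grid.
# Illustrative only -- not the TensorRT Model Optimizer API.

# Non-negative magnitudes representable by a 4-bit E2M1 float.
GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale the block so its max magnitude maps to the top grid value,
    then snap each entry to the nearest representable grid point."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / GRID[-1]            # one shared scale per block
    q = []
    for x in block:
        mag = min(GRID, key=lambda g: abs(abs(x) / scale - g))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate values from grid points and the block scale."""
    return [v * scale for v in q]

weights = [0.03, -1.2, 0.7, 2.5, -0.01, 0.9, -2.4, 1.1]
q, s = quantize_block(weights)
restored = dequantize_block(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error stays bounded by the block scale times the widest grid gap, which is why per-block (rather than per-tensor) scaling preserves accuracy at 4 bits.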
[Explore Kimi K2 Thinking NVFP4](https://huggingface.co/nvidia/Kimi-K2-Thinking-NVFP4)

## ![AI Model - Meta’s Llama logo](https://developer.download.nvidia.com/images/pretrained-ai-models/meta.svg)Llama

Llama is Meta’s collection of open foundation models, most recently made multimodal with the 2025 release of Llama 4. NVIDIA worked with Meta to advance inference of these models with [NVIDIA TensorRT™-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0) (TRT-LLM) to get maximum performance from data center GPUs like NVIDIA Blackwell and NVIDIA Hopper™ architecture GPUs. Optimized versions of several Llama models are available as [NVIDIA NIM™ microservices](/nim) for an easy-to-deploy experience. You can also customize Llama with your own data using the end-to-end [NVIDIA NeMo™ framework](https://docs.nvidia.com/nemo-framework/index.html).

#### Explore

Explore sample applications to learn about different use cases for Llama models.

- [Explore how NVIDIA Blackwell B200 achieves up to 4x more throughput versus Hopper H200 on the Llama 3.3 70B 1K/1K benchmark](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [Llama 3 8B as a Voice Agent on NVIDIA Jetson™](https://www.youtube.com/watch?v=7lKBJPpasAQ)
- [Retrieval-Augmented Generation (RAG) Example Application Using Llama 3 and LlamaIndex](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RAG/examples/basic_rag/llamaindex)
- [Build a Simple AI Agent in Five Minutes with the Llama 3.1 405B NVIDIA NIM](https://www.youtube.com/watch?v=mg0kwpmUhPU)

#### Integrate

Get started with the right tools and frameworks for your AI model development environment.
- [Deploying Llama for Multi-LoRA vLLM Backend in Dynamo-Triton™](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/vllm_backend/docs/llama_multi_lora_tutorial.html)
- [Set Up Your NVIDIA RTX™ for Llama With Hugging Face Transformers and PyTorch](https://www.youtube.com/watch?v=af7XjGekm4g)
- [Accelerate Hugging Face Llama 3 With Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html)
- [Customize Llama for Your Data Using the NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/24.09/llms/llama.html)

#### Optimize

Optimize inference workloads for large language models (LLMs) with TensorRT-LLM. Learn how to set up and get started using Llama in TRT-LLM.

- [Performance and Functionality Improvements for Llama 3.3 Models With vLLM](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [Benchmarking the Llama 3 NVIDIA NIM](https://developer.nvidia.com/blog/llm-performance-benchmarking-measuring-nvidia-nim-performance-with-genai-perf/#setting_up_an_openai-compatible_llama-3_inference_service_with_nim)
- [Boost Llama 3.3 70B by 3x With TensorRT-LLM Speculative Decoding](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)
- [Power Generative AI With Performance-Optimized Llama 3.1 NVIDIA NIM Microservices](https://www.youtube.com/watch?v=_rtfR5MXjUc)
- [TensorRT-Model-Optimizer Post-Training Quantization Guide Compatible With vLLM and SGLang](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)

Get started with the models for your development environment.

Model

### Get Production-Ready Llama Models With NVIDIA NIM

The NVIDIA API Catalog enables rapid prototyping with just an API call.
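The speculative decoding technique referenced in the Optimize resources above can be sketched as a toy accept/verify loop: a cheap draft model proposes several tokens, the expensive target model checks them, and the longest agreeing prefix is accepted in one step. Both "models" below are stand-in functions, not TensorRT-LLM.

```python
import random

# Toy sketch of speculative decoding. The draft proposes k tokens cheaply;
# the target verifies them and supplies one correction token, so every step
# yields at least one token and sometimes up to k+1.

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context, k=4):
    """Cheap stand-in model: propose the next k tokens pseudo-randomly."""
    rng = random.Random(len(context))  # deterministic for the demo
    return [rng.choice(VOCAB) for _ in range(k)]

def target_model(context):
    """Expensive stand-in model: the token it would emit next (fixed rule)."""
    return VOCAB[len(context) % len(VOCAB)]

def speculative_step(context, k=4):
    """Accept draft tokens while they match the target's choice, then append
    the target's own token as a correction."""
    proposed = draft_model(context, k)
    accepted = []
    for tok in proposed:
        if tok == target_model(context + accepted):
            accepted.append(tok)
        else:
            break
    accepted.append(target_model(context + accepted))  # target's correction
    return context + accepted
```

The accepted output is identical to what the target model alone would produce; the draft only changes how many target evaluations are amortized per step, which is where the latency win comes from.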
[Deploy Production-Ready Llama Models](https://build.nvidia.com/search?q=llama)

Model

### Llama 4 on Ollama

Ollama enables you to deploy Llama 4 quickly to all your GPUs.

[Deploy Optimized Models Using Ollama](https://ollama.com/library/llama4)

Model

### Quantized Llama 3.1 8B on Hugging Face

NVIDIA Llama 3.1 8B Instruct is optimized by quantization to FP8 using the open-source [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) library. Compatible with data center and consumer devices.

[Download From Hugging Face](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8)

[View More Family Models](https://build.nvidia.com/search?filters=publisher%3Ameta&q=llama)

## ![AI Model - NVIDIA Nemotron logo](https://developer.download.nvidia.com/images/logos/m48-containerized-model-76b900(1).svg)NVIDIA Nemotron

The NVIDIA Nemotron™ family of open models, including Llama Nemotron, excels at reasoning along with a diverse set of agentic tasks. The models are optimized for various use cases: Nano offers cost-efficiency, Super balances accuracy and compute, and Ultra delivers maximum accuracy. With an open license, these models ensure commercial viability and data control.

#### Explore

Explore models, datasets, and sample applications to learn about different use cases for Nemotron models.

- [Build More Accurate and Efficient AI Agents With the New NVIDIA Llama Nemotron Super v1.5](https://developer.nvidia.com/blog/build-more-accurate-and-efficient-ai-agents-with-the-new-nvidia-llama-nemotron-super-v1-5/)
- [NVIDIA Nemotron-Personas Dataset Explained](https://youtube.com/shorts/47IayEsgtLQ?feature=shared)

#### Integrate

Get started with the right tools and frameworks for your development environment, leveraging open Nemotron models and datasets for agentic AI.
- [Nemotron Post-Training Dataset](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1)
- [Llama Nemotron Super v1.5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5?linkId=100000375720466)
- [Llama Nemotron VLM Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-VLM-Dataset-v1)

#### Optimize

Optimize Nemotron with NVIDIA NeMo, and build AI agents with NVIDIA NIM and NVIDIA Blueprints with customizable reference workflows.

- [How to Enhance RAG Pipelines With Reasoning Using NVIDIA Llama Nemotron Models](https://developer.nvidia.com/blog/how-to-enhance-rag-pipelines-with-reasoning-using-nvidia-llama-nemotron-models/?linkId=100000376536389&ncid=so-nvsh-832804)
- [Building AI Agents at the Edge With NVIDIA Llama Nemotron Nano 4B](https://www.youtube.com/watch?v=LnSt5jt-DkQ)
- [Multimodal Document Intelligence With NVIDIA Llama Nemotron Nano VL](https://www.youtube.com/watch?v=FHc5KxgJ61g)

Get started with the models for your development environment.

Model

### Nemotron Nano

Provides superior accuracy for PC and edge devices. The newly announced Nemotron Nano 2 supports a configurable thinking budget, enabling enterprises to control token generation to reduce cost and deploy optimized agents on edge devices.

[Get Started With Nemotron Nano](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)

Model

### Llama Nemotron Super

Offers the highest accuracy and throughput on a single NVIDIA H100 Tensor Core GPU. Quantized to the NVFP4 format for the NVIDIA Blackwell architecture, Llama Nemotron Super 1.5 delivers up to 6x higher throughput on NVIDIA B200 compared with FP8 on NVIDIA H100.

[Get Started With Llama Nemotron Super](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5?linkId=100000375720466)

Model

### Llama Nemotron Ultra

Delivers the leading agentic AI accuracy for complex systems, optimized for multi-GPU data centers.
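The configurable thinking budget mentioned for Nemotron Nano 2 above caps how many tokens the model may spend inside its reasoning trace before producing the final answer. The toy below sketches budget enforcement on a token stream; the `<think>` tag names and the cutoff behavior are illustrative assumptions, not the Nemotron API.

```python
# Toy sketch of a "thinking budget": truncate the reasoning span of a token
# stream to at most `budget` tokens, closing the span at the cutoff so the
# final answer still follows. Tag names and behavior are assumptions.

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def apply_thinking_budget(tokens, budget):
    """Pass tokens through, truncating the <think>...</think> span."""
    out, in_think, spent = [], False, 0
    for tok in tokens:
        if tok == THINK_OPEN:
            in_think, spent = True, 0
            out.append(tok)
        elif tok == THINK_CLOSE:
            in_think = False
            if not out or out[-1] != THINK_CLOSE:  # close unless already cut
                out.append(tok)
        elif in_think:
            if spent < budget:
                out.append(tok)
                spent += 1
            elif out[-1] != THINK_CLOSE:
                out.append(THINK_CLOSE)  # budget exhausted: cut reasoning short
            # reasoning tokens past the budget are dropped
        else:
            out.append(tok)
    return out

stream = ["<think>", "a", "b", "c", "d", "</think>", "answer"]
capped = apply_thinking_budget(stream, budget=2)
```

In a real deployment the budget is enforced inside the serving stack at generation time rather than by post-filtering, but the cost model is the same: fewer reasoning tokens means fewer decode steps billed per request.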
[Get Started With Llama Nemotron Ultra](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)

## ![AI Model - Microsoft Phi logo](https://developer.download.nvidia.com/images/pretrained-ai-models/windows.svg)Phi

Microsoft Phi is a family of small language models (SLMs) that provide efficient performance for commercial and research tasks. These models are trained on high-quality data and excel at mathematical reasoning, code generation, advanced reasoning, summarization, long-document QA, and information retrieval. Due to their small size, Phi models can be deployed on devices in single-GPU environments, such as Windows RTX and Jetson. With the launch of the Phi-4 series of models, Phi has expanded to include advanced reasoning and multimodality.

#### Explore

Explore sample applications to learn about different use cases for Phi models.

- [AI Podcast Assistant Demo Notebook](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/community/ai-podcast-assistant)
- [Read About the Latest Multimodal Phi 4 Launch](https://developer.nvidia.com/blog/latest-multimodal-addition-to-microsoft-phi-slms-trained-on-nvidia-gpus/)
- [Read Reasoning Reimagined: Introducing Phi-4-mini-flash-reasoning](https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/)

#### Integrate

Get started with the right tools and frameworks for your development environment.

- [Download Containers From the Jetson AI Lab](https://www.jetson-ai-lab.com/models.html)
- [Customize Phi 3 With Your Own Data Using the NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/phi3.html)
- [PhiCookBook on GitHub](https://github.com/microsoft/PhiCookBook)

#### Optimize

Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Phi in TRT-LLM.
- [Optimize Phi 3 With the TensorRT-LLM Open-Source Library](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
- [Deploy Phi 3 With Triton and TensorRT-LLM](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/llm.html)

Get started with the models for your development environment.

Model

### Get Production-Ready Phi Models With NVIDIA NIM

The NVIDIA API Catalog enables rapid prototyping with just an API call.

[Deploy Production-Ready Phi Models](https://build.nvidia.com/search?q=phi)

Model

### Phi on Ollama

Ollama lets you deploy Phi quickly to all your GPUs.

[Deploy Optimized Models Using Ollama](https://ollama.com/library/phi4)

Model

### Phi-3.5-mini-Instruct INT4 ONNX

The Phi-3.5-mini-Instruct INT4 ONNX model is the quantized version of the Microsoft Phi-3.5-mini-Instruct model, which has 3.8 billion parameters.

[Download From the NVIDIA Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/models/phi-3.5-mini-instruct-onnx-int4-rtx)

[View More Family Models](https://build.nvidia.com/search?q=phi)

## ![AI Model - Alibaba Qwen logo](https://developer.download.nvidia.com/images/logos/logo-qwen.svg)Qwen

Alibaba has released Tongyi Qwen3, a family of open-source hybrid-reasoning large language models (LLMs). The Qwen3 family consists of two MoE models, 235B-A22B (235B total parameters and 22B active parameters) and 30B-A3B, and six dense models: the 0.6B, 1.7B, 4B, 8B, 14B, and 32B versions. With ultra-fast token generation, developers can efficiently integrate and deploy Qwen3 models into production applications on NVIDIA GPUs, using frameworks such as NVIDIA TensorRT-LLM, Ollama, SGLang, and vLLM.

#### Explore

Explore sample applications to learn about different use cases for Qwen models.
- Deploy Launchable: [Running Qwen 3 Next With SGLang on NVIDIA GPUs](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt3nCJIIjT7omuDmbjJQcb0pW)
- Deploy Launchable: [Running Qwen 3 Next With vLLM on NVIDIA GPUs](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)
- Deploy Launchable: [Build an AI Agent With Qwen3 Next With NVIDIA NIM](https://brev.nvidia.com/launchable/deploy?launchableID=env-32qPA9uwn9WF7aFkmtZrPeltQJL)
- Watch Video: [Build an AI Agent With Optimized Qwen3-Next Powered by NVIDIA NIM](https://www.youtube.com/watch?v=5yzSgKu8hiI)
- [Canary-Qwen-2.5B Sets New Speech AI Benchmark](https://www.youtube.com/watch?v=p3RbbtVVgvk)

#### Integrate

Get started with the right tools and frameworks for your development environment.

- [Customize Qwen3 Using NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/qwen3.html)
- [Deploy Qwen on NVIDIA GPUs Using SGLang](https://docs.sglang.ai/basic_usage/qwen3.html)
- [New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture Delivering Improved Accuracy and Accelerated Parallel Processing Across NVIDIA Platform](https://developer.nvidia.com/blog/new-open-source-qwen3-next-models-preview-hybrid-moe-architecture-delivering-improved-accuracy-and-accelerated-parallel-processing-across-nvidia-platform/?linkId=100000382638998)
- Read the Blog: [Integrate and Deploy Tongyi Qwen3 Models Into Production Applications With NVIDIA](https://developer.nvidia.com/blog/integrate-and-deploy-tongyi-qwen3-models-into-production-applications-with-nvidia/)

#### Optimize

Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Qwen in TRT-LLM.
- [Performance Evaluation of Qwen3 Using TensorRT-LLM Disaggregation on GB200](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html#qwen-3)
- [Build and Run a Qwen Model With TensorRT-LLM for Single Node, Single GPU, or Multi-GPU](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/qwen/README.md)
- [Benchmarking Qwen3 With NVIDIA Dynamo Community](https://www.youtube.com/watch?v=BUVOCqbmy3U)

Get started with the models for your development environment.

Model

### Qwen Models on NVIDIA API Catalog

Try out these powerful thinking-and-reasoning models, which achieve significantly enhanced performance on downstream tasks, especially hard problems.

[Explore Model](https://build.nvidia.com/qwen)

Model

### NVIDIA NeMo canary-qwen-2.5b

NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks.

[Download From Hugging Face](https://huggingface.co/nvidia/canary-qwen-2.5b)

Model

### Qwen on Ollama

Ollama enables you to deploy a variety of Qwen models quickly to all your NVIDIA GPUs. Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models.

[Deploy Qwen Locally With Ollama](https://ollama.com/library/qwen) | [Download the Latest Version of Qwen3 From Ollama](https://ollama.com/library/qwen3)

[View More Family Models](https://build.nvidia.com/search?q=qwen)

#### NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Cost for Agentic AI

Built to accelerate the next generation of agentic AI, NVIDIA Blackwell Ultra delivers breakthrough inference performance with dramatically lower cost. Cloud providers such as Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases, such as agentic coding and coding assistants.
This is enabled by deep co-design across NVIDIA Blackwell, NVLink™, and NVLink Switch for scale-out; NVFP4 for low-precision accuracy; and NVIDIA Dynamo and TensorRT-LLM for speed and flexibility, along with development with community frameworks such as SGLang and vLLM.

[Explore technical results](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference)

![Data center illustration showing multi-modal AI tokens for image, audio, visual and more as part of the NVIDIA “Think SMART” framework.](https://developer.download.nvidia.com/images/dgx-press-gb300-1920x1080.jpg)

* * *

## More Resources

![Decorative image representing Developer Community](https://developer.download.nvidia.com/icons/m48-developer-1.svg)

### Join the NVIDIA Developer Program

![Decorative image representing Training and Certification](https://developer.download.nvidia.com/icons/m48-certification-ribbon-2.svg)

### Get Training and Certification

![Decorative image representing Inception for Startups](https://developer.download.nvidia.com/images/isaac/m48-ai-startup-256px-blk.png)

### Accelerate Your Startup

* * *

## Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When a model is downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

Try top community models today. [Contact Us](mailto:ContactDesignWorks@nvidia.com)