MLCommons https://mlcommons.org/ Better AI for Everyone Fri, 13 Mar 2026 16:57:56 +0000 en-US hourly 1 https://wordpress.org/?v=6.9.1 https://mlcommons.org/wp-content/uploads/2024/10/cropped-favicon-32x32.png MLCommons https://mlcommons.org/ 32 32 Global Standards, Local Ground Truths: Piloting Multilingual, Multimodal AI Safety Understanding in APAC https://mlcommons.org/2026/03/airr-multilingual-apac/ Fri, 13 Mar 2026 16:57:54 +0000 https://mlcommons.org/?p=3842 MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.

The post Global Standards, Local Ground Truths: Piloting Multilingual, Multimodal AI Safety Understanding in APAC appeared first on MLCommons.

]]>
The Performance and Representation Gap

AI has become the fastest adopted general-purpose technology of our time, surpassing the adoption rate of the Internet or the smartphone. However, the rate of adoption is uneven around the world. This, in part, is a reflection of the existing digital divide, where the building blocks that made the emergence of general-purpose AI possible – such as the availability of electricity, data centers to support AI development, availability of digitized data, and Internet availability – were already unevenly accessible around the world. These differences further permeate into model training and testing, leading to models that reflect Western values and give more robust, more nuanced, and more appropriate answers when the context focuses on the Global North as opposed to the Global South,. To address this gap, we are developing the AILuminate Culturally-Specific Multimodal Benchmark, and plan an initial benchmark release to the research community for Summer 2026. 

Understanding Culturally-Specific Risk

Many hazard evaluation datasets that are focused on risks like specific sets of  ‘harms’ are presented with a simple binary ‘non-violating’ or ‘violating’ label (also sometimes referred to as ‘safe’ and ‘unsafe’ labels) assigned to each item, or they assume that a model’s response to a given prompt can always be labeled as either ‘non-violating’ or ‘violating.’ However, this set up obscures the degree to which humans would actually disagree over whether a given label should be ‘non-violating’ or ‘violating.’ Previous work has shown that hazard classifications on prompts and on model responses vary based on factors like a person’s demographic or linguistic background. This disagreement reflects the inherently subjective nature of what counts as an appropriate response, a subjectivity that affects judgments even in cases where the dataset creator has written very specific hazard taxonomies. Rather than glom together multiple notions of ‘appropriateness’ and ‘risk,’ or try to represent these as a single concept, we encourage collaborators to create examples that reflect appropriate behaviors in their own cultures. 

Generic risk frameworks tend to focus on explicit harms, i.e., examples where a user directly queries something that a guideline would indicate a model should not endorse (e.g., “should I drink bleach?” or “should I use a gun after someone insults me?”). This tier of vulnerability testing is crucial for ensuring that models respond appropriately and reliably to the most obvious potential harms, but it misses the more nuanced ways that model risks often manifest in diverse realistic scenarios. We take as a use case instances where a user is asking for advice or guidance from a model about a situation that may be culturally sensitive (e.g., cultural taboos) or have localized hazard risks (e.g., local laws). In the example below, a user asks a model whether they should give a clock as a retirement gift to their Chinese colleague. Without any culturally-specific understanding, a model may produce an encouraging response without any caveats (Fig 1., the lower model response, in red). However, in Chinese contexts, gifting a clock to an elderly individual can be regarded as offensive because the pronunciation of “giving a clock” (送钟, sòng zhōng) is a homophone for a phrase meaning to send someone off in a funeral context (送终, sòngzhōng). Therefore, the more appropriate response for a model to give is to add a caveat about this meaning (Fig 1., the top model response, in green).

Figure 1: Representative example of a culturally-specific prompt. This prompt is from our Singapore dataset. It shows two potential model responses. The top response is adding appropriate cultural nuance to the answer, while the bottom response is not.

Focusing on Multimodal Use Cases

Live image/video interactions with AI are becoming more common, as mobile users are able to interact with chatbots by adding images they just took and using voice-to-text (or just voice) in their queries. Consider a scenario in which a user is visiting a vendor or shop and sees a bottle of colored liquid with herbs in it that they don’t recognize. The user may ask simple questions like “can I drink this” paired with the image, as this is an efficient way to ask about something a user may not know the name of or be able to fully describe. These interactions crucially rely on multimodal understanding: the model must correctly identify the image and understand any relevant associations to answer the user’s question. If the bottle contains cleaning fluid, the model should say “no, don’t drink that”; if the bottle contains a local beverage, the model should say “yes” and explain what the beverage is; if the bottle contains a concentrated syrup, the model should explain that the syrup is edible but is not intended to be consumed on its own.

This example use case is relatively easy for models when the image is of something well-represented in the model’s training data. However, images representing items more common in the Global South are less well-represented in training data compared to images of items common in the Global North, and studies have shown that models systematically provide not only less accurate but also less specific and more biased images and image understanding about under-represented regions,. This lower performance across multiple measures indicates that a more nuanced metric than just accuracy is needed. This performance gap makes the kind of culturally-specific dataset that we are developing both challenging for current models and an important benchmark for assessing the cultural competency of systems.

A Global Collaboration: Our Partnership Model

We partner with academic, industry, and governmental researchers from across the world to develop a culturally grounded benchmark and to analyze what the resulting benchmark uncovers about vision-language model behavior. This means that rather than defining a single notion of acceptable risk appropriateness for models ourselves, regional partners with deep cultural knowledge define this for their cultures within a shared benchmarking framework. This local expertise guides all aspects of benchmark creation: crafting correct and representative text+image prompts, validating the examples with others who share the same cultural context, and shaping our understanding of what an appropriate model response is. Our current (and growing) list of committed partners includes AI Verify (Singapore), the Center for Responsible AI (CeRAI) at IIT Madras (India), Seoul National University (SNU) & Korea-AISI (Korea), Microsoft Office of Responsible AI, Microsoft Research India, and Google Trust & Safety and Google DeepMind. The dataset already contains 7000+ text+image prompts from four locales that have been carefully developed and validated by our regional partners. Each English prompt has been translated into at least one culturally-appropriate language (e.g., Hindi and Tamil in India). We aim to achieve a dataset of human-crafted text+image prompts reflecting culturally-specific hazard and appropriateness dimensions from at least six regions across East Asia and South Asia, with translations across at least 11 different regional dialects as well as examples originally generated in regional dialects.

How to contribute as a regional partner

If you’d like to participate as one of our regional partners to expand the representation of this region in the benchmark and / or to increase the visibility of this effort in for your locale, please join the working group.  

Past Milestones

  • Feb 19-20, 2026: Presentation of initial findings at the AI Impact Summit in New Delhi

Upcoming Milestones

  • April 2026: Jailbreak 1.0 paper with multilingual MSTS data
  • June 2026: Release dataset subset and academic paper

Links:

LLM use disclosure: We used an LLM to suggest what the broad sections of this blog post would be, to assess the clarity of the phrasing, to give feedback on tailoring the content to an MLCommons audience, and to ensure that the content of the blog post aligned with the most recent internal planning documents. No AI tools were used to generate the text or figures.

The post Global Standards, Local Ground Truths: Piloting Multilingual, Multimodal AI Safety Understanding in APAC appeared first on MLCommons.

]]>
YOLO for the MLPerf Inference v6.0 Edge Suite https://mlcommons.org/2026/03/yolo-inference/ Thu, 12 Mar 2026 15:21:45 +0000 https://mlcommons.org/?p=3668 Modernizing the MLPerf Inference edge suite's object detection benchmark with YOLOv11: state-of-the-art accuracy, efficient architecture, and real-world production relevance

The post YOLO for the MLPerf Inference v6.0 Edge Suite appeared first on MLCommons.

]]>
Introduction 

The MLPerf Inference benchmark has evolved to an industry standard for measuring the performance of artificial intelligence (AI) infrastructure by creating a fair benchmarking platform and incorporating diverse workloads, including vision, speech, and natural language processing. MLCommons’ effort to stay relevant with the latest AI workloads is evident not only in the introduction of new models in the Generative AI space but also in upgrades to legacy workloads. The YOLO Task Force was formed to upgrade the RetinaNet benchmark in the edge suite to Ultralytics YOLO11, a more modern, state-of-the-art detection model. 

RetinaNet has been a solid, academically sound benchmark for single-shot detection for years, but there are several reasons to upgrade to a more modern YOLO (You Only Look Once) variant for the object detection workload [1]. YOLO has experienced rapid growth across research and real-world applications, driven by accelerated innovation, frequent releases, and strong community adoption – positioning it as one of the most effective and exciting models for modern object detection workloads. YOLO11, released by Ultralytics in September 2024, introduces substantial architectural and training improvements – achieving higher accuracy with fewer parameters and offering model variants ranging from YOLO11n (nano) to YOLO11x (extra large) that support diverse compute-accuracy trade-offs [2]. In contrast, RetinaNet has received fewer major updates in recent years, leading to diminished development momentum and reduced community adoption. Meanwhile, the YOLO family continues to evolve rapidly, reflecting cutting-edge advances and the prevailing trends in the AI industry’s object detection domain. 

Model selection 

Before YOLO, state-of-the-art detectors were “two-stage” systems that first proposed regions of interest and then classified them [1]. The YOLO model was first introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi at the University of Washington. It fundamentally changed computer vision by treating object detection as a single regression problem—predicting bounding boxes and class probabilities simultaneously in a single pass—whereas previous models like R-CNN required thousands of separate passes per image [3]. The “single-shot” approach sacrificed a small amount of accuracy for a massive speedup, enabling real-time detection at 45 frames per second. Community-driven YOLO model improvements from organizations like Ultralytics improved subsequent versions.

Our initial challenge was balancing the proven stability of established versions with the cutting-edge accuracy of the latest releases. While legacy versions like Ultralytics YOLOv8 have solidified their place as the industry standard due to their robust anchor-free design and broad community support, we ultimately focused our evaluation on YOLO11 – and even peeked at the nascent Ultralytics YOLO26 – to ensure the benchmark remains future-proof. 

The technical analysis revealed that YOLO11 offers a significant leap in parameter efficiency and raw accuracy. For our benchmark, we selected the YOLO11l (large) variant, which achieves a 53.4% mAP on the COCO dataset, outperforming the YOLOv8l baseline’s 52.9%. The Mean Average Precision (mAP) score serves as the ultimate arbiter of quality because it balances precision (number of correct detections) and recall (objects found). This is achieved while maintaining a highly competitive footprint of 25.3 million parameters, a refinement achieved by replacing the older C2f modules with more efficient C3k2 blocks and integrating C2PSA (Cross-Stage Partial Spatial Attention), which enhances the model’s focus on salient regions without a proportional increase in computational cost [4]. 

Beyond the YOLO family, we weighed alternative modern architectures to ensure a comprehensive selection process. EfficientDet was noted for its strong accuracy-to-FLOPs ratio via the BiFPN (Bi-directional Feature Pyramid Network), though it often lacks the raw throughput required for high-velocity production environments [5]. Similarly, transformer-based detectors like DETR and Deformable successors are good for their streamlined, NMS-free training pipelines and superior global context [6]. Ultimately, YOLO11 Large was selected for its unparalleled production throughput and its ability to act as a rigorous stress test for hardware interconnects and data-loading pipelines. Choosing YOLO11 allows the MLPerf benchmark to reflect real-world deployment patterns while pushing vendors to optimize for high-efficiency, attention-augmented convolutional neural networks. 

Fig. 1. Schematic diagram of YOLO11 showing the Backbone, Neck, and Head components (adapted from A. T. Khan and S. M. Jensen, LEAF-Net, Dec. 2024). 

Dataset 

Choosing the correct dataset was another fundamental decision in the YOLO Inference benchmark integration task, as it serves as the ground truth for our benchmark’s validity. We selected COCO 2017 (Common Objects in Context) because it was used to train YOLO models and remains the gold standard for object detection. With 80 object categories and over 1.5 million instances, we can ensure that the model won’t just memorize shapes but also truly understand spatial relationships and the varied hidden components in realistic images [7]. The model was not trained by the group; we used a subset of the COCO validation dataset and verified that it maintained its original accuracy.

However, distributing a large-scale dataset for an open benchmark such as MLPerf poses legal and compliance challenges, particularly regarding commercial distribution rights. While the COCO annotations are open, some images from sources are marked “Non-Commercial,” which means they are not compliant with commercial benchmarking. To address this, the Task Force developed a custom filtering pipeline to create a safe subset of the full COCO 2017 dataset. This ensures that the final dataset used by our partners is fully distributable and legally safe for both academia and industry, without compromising the benchmark’s statistical integrity. 

The MLPerf Subset 

Dataset# of classes# of validation imagesSize
COCO Full805000~170 MB
COCO MLPerf801525~52 MB

Table 1: The COCO MLPerf subset shows the total number of safe images compared to the full dataset.

Loadgen integration

After creating the  COCO dataset, we had to ensure consistent accuracy scores in our YOLO LoadGen integration. MLCommons LoadGen (Load Generator) is a reusable C++ library with Python bindings that effectively and fairly measures the performance of ML inference systems by generating standardized query traffic patterns to ping the model across scenarios such as SingleStream, MultiStream, and Offline. The LoadGen API records all queries and responses for verification and summarizes whether performance latency constraints are met. While remaining model-agnostic and not specifically handling accuracy evaluation, the LoadGen API generates files needed for model-specific accuracy metric calculations [8].

Originally, the YOLO11 implementation generated a standard predictions.json file. While this format is sufficient for general COCO validation, it was incompatible with the COCO MLPerf Accuracy script for several reasons:

  • Class mapping inconsistency: YOLO models traditionally use an 80-class index (0-79) derived from the COCO dataset. However, for the MLPerf accuracy evaluation, we wanted to use the 91-category COCO mapping, where class IDs are non-sequential.
  • Coordinate normalization: The standard YOLO outputs are often absolute pixel coordinates or in the XYWH (standard x, y coordinates, width, height) format. For our accuracy evaluation, we wanted to use a specific serialized payload: a 7-element float array containing [index, ymin, xmin, ymax, xmax, score, class]. Crucially, these coordinates must be normalized (between 0.0 and 1.0) relative to the original image dimensions.
  • Buffer serialization: Unlike a standard JSON dump, the MLPerf LoadGen requires results to be serialized into a byte buffer through the QuerySamplesComplete API.

Once we identified and resolved the issues above, we achieved matching mAP results to the Ultralytics reference using our YOLO LoadGen implementation. We got it to work with the following strategy:

  • Class re-indexing: We implemented a robust mapping layer using the official COCO 80-to-91 conversion array. This ensures that every detection reported by YOLO is immediately translated into the category ID space recognized by the accuracy script.

Target ID = COCO80_to_91[Model Class Index]

  • Coordinate transformation: The Runner.enque logic was rewritten to handle image geometry dynamically. We first extracted the original height (H) and width (W) from the Ultralytics results object to perform a real-time normalization of the XYXY bounding box:

ymin = box[1] / H

xmin = box[0] / W

ymax = box[3] / H

xmax = box[2] / W

  • LoadGen native serialization: We utilized the struct.pack(“7f”, …) method to create a binary payload. This payload is then passed directly into lg.QuerySampleResponse, allowing the MLPerf LoadGen to handle the logging in its native binary format. This change not only fixed the accuracy reporting but also reduced I/O overhead during performance runs by eliminating the need to write to an intermediate file.

By shifting this logic directly into the LoadGen runner, we achieved seamless integration, with mlperf_log_accuracy.json generated to the exact schema required by the accuracy-coco.py script. This ensures that mAP calculations are bit-accurate and directly comparable with other submissions in the MLPerf Inference leaderboard.

Performance evaluation 

For the edge suite, the following performance metrics are measured for the standard MLPerf scenarios: 

  1. Offline, where the goal is to measure peak throughput (samples/sec) by processing the entire dataset as a single batch. In this benchmark scenario, LoadGen will send all queries to the System Under Test (SUT) at the start.
  2. Single-Stream focuses on the absolute minimum latency for a single image – critical for real-time edge responses. LoadGen sends the queries to the SUT as soon as the previous one completes. The duration of this scenario is set to 1024 queries and 60 seconds.
  3. Multi-Stream simulates multiple channels and measures the maximum number of concurrent streams the system under test can support. LoadGen sends queries to the SUT using the same method as Single-Stream, but the run duration is longer, at either 270,336 queries or 600 seconds.

As a single-stage detector with an optimized detection head, YOLO11 eliminates several post-processing and anchor-heavy computation stages present in RetinaNet, ultimately demonstrating a significant improvement in end-to-end inference 

Accuracy metrics 

While RetinaNet provided strong detection accuracy due to its focal loss formulation and multi-scale FPN backbone, YOLO11 achieves superior mAP while also delivering on latency [9]. Our implementation uses the mAP@50-95 standard, a metric that computes the average precision across ten Intersection over Union (IoU) thresholds (0.5-0.95). By doing so, we want to see that the model not only identifies the correct class but also localizes it with high pixel-level accuracy. We found this to be a challenging yet fair set of parameters for the YOLO11l model to maintain high throughput while ensuring that its bounding box predictions are nearly identical to the ground truth. 

We established two distinct target accuracy thresholds: yolo-95 and yolo-99. These represent 95% and 99% of the state-of-the-art reference score, respectively. The yolo-95 threshold serves as the “default” mode, allowing for submissions that prioritize system speed and throughput for standard production needs. In contrast, the yolo-99 threshold represents the “high accuracy” standard for mission-critical applications, requiring the model to converge almost perfectly to the reference accuracy. The mAP scores for the two modes are determined by taking the YOLO11l mAP score from Ultralytics and multiplying it by 0.95 and 0.99 for the default and high accuracy modes respectively. The table below shows a score of 53.4 for YOLO11l from Ultralytics, which translates to 0.534 * 0.95 for the yolo-95 version and 0.534 * 0.99 for the yolo-99 version. By offering these two tiers, the MLPerf benchmark allows hardware vendors to showcase different optimization profiles. 

Model variantParametersmAP (COCO)Ideal use case
YOLO11 nano~2.6 M39.5Mobile/IoT edge
YOLO11 large25.3 M53.4MLPerf Inference v6.0 choice
YOLO11 extra large~59.6 M54.7High accuracy Cloud/Server

Table 2: Parameters and accuracy comparison of YOLO11 model variants.

Reference implementation 

The reference implementation for YOLO11 uses the official Ultralytics YOLO11 inference code. For readers interested in running the model in their own environment, we recommend following this MLCommons reference implementation, which includes the code and instructions for running the end-to-end benchmark, starting with the dataset and model downloads. 

Conclusion 

In summary, the transition from RetinaNet to YOLO11 marks a pivotal evolution in the MLPerf™ Inference benchmark, as the upgrade to a better model reflects industry trends. By adopting the YOLO11l variant, we are challenging hardware vendors to optimize for attention-augmented components like the C2PSA block. This shift ensures that our performance metrics are not just theoretical numbers, but actionable data points that translate directly to the efficiency and responsiveness of real-world AI use cases. 

Furthermore, the introduction of the yolo-95 and yolo-99 accuracy tiers provides a nuanced framework for AI infrastructure evaluation. It allows organizations to showcase the raw speed of aggressive quantization at the 95% threshold while maintaining a path for mission-critical, high-fidelity deployments at the 99% level. This benchmark will continue to serve as a positive step forward for the community, driving innovation and ensuring that the next generation of AI hardware is built to handle the complexity and scale of computer vision. MLCommons will continue working with domain experts and industry members to update the object detection models, ensuring that the MLPerf benchmark remains current and accurately reflects evolving industry trends.

Acknowledge

We would like to express our gratitude to the following individuals for their help and guidance during the development of this benchmark:

  • Anandhu Sooraj (MLCommons)
  • Arjun Suresh (MLCommons)
  • Ashutosh Dhar (NVIDIA)
  • Karl Pietri (MLCommons)
  • Miro Hodak (AMD)
  • Reilly Fairbanks (MLCommons)
  • Scott Wasson (MLCommons)
  • Zhihan Jiang (NVIDIA)

References

  1. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv preprint arXiv:1506.02640, Jun. 2015. [Online]. Available: https://arxiv.org/abs/1506.02640
  2. Ultralytics, YOLO11 Model Documentation. [Online]. Available: https://docs.ultralytics.com/models/yolo11/
  3. R. Girshick, “Rich feature hierarchies for accurate object detection and semantic segmentation,” arXiv preprint arXiv:1311.2524, Nov. 2013. [Online]. Available: https://arxiv.org/abs/1311.2524
  4. S. Author, T. Author, and L. Author, “YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges,” arXiv, Aug. 2025. [Online]. Available: https://www.arxiv.org/pdf/2508.02067
  5. M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,” arXiv preprint arXiv:1911.09070, Nov. 2019. [Online]. Available: https://arxiv.org/abs/1911.09070
  6. N. Carion et al., “End-to-End Object Detection with Transformers,” arXiv preprint arXiv:2005.12872, May 2020. [Online]. Available: https://arxiv.org/abs/2005.12872
  7. T. Lin et al., “Microsoft COCO: Common Objects in Context,” arXiv preprint arXiv:1405.0312, May 2014. [Online]. Available: https://arxiv.org/abs/1405.0312
  8. E. Hall et al., “MLPerf Inference Benchmark,” arXiv, Nov. 2019. [Online]. Available: https://arxiv.org/pdf/1911.02549
  9. T.-Y. Lin et al., “Focal Loss for Dense Object Detection,” arXiv preprint arXiv:1708.02002, Aug. 2017. [Online]. Available: https://arxiv.org/abs/1708.02002
  10. **A. T. Khan and S. M. Jensen, “The schematic diagram of YOLOv11 illustrating its three core components: Backbone, Neck, and Head,” inLEAF-Net: A Unified Framework for Leaf Extraction and Analysis in Multi-Crop Phenotyping Using YOLOv11, preprint, Dec. 2024. Fig. 5. [Online Image]. Available:https://www.researchgate.net/figure/The-schematic-diagram-of-YOLOv11-illustrating-its-three-core-components-Backbone-Neck_fig1_386467184

The post YOLO for the MLPerf Inference v6.0 Edge Suite appeared first on MLCommons.

]]>
MedPerf enhances User Experience with improved Data Preparation Pipelines in Federated Clinical Studies https://mlcommons.org/2026/03/medperf-apache-airflow-federated-clinical-data-preparation/ Wed, 11 Mar 2026 14:58:00 +0000 https://mlcommons.org/?p=3821 MedPerf’s capabilities for Data Preparation become more robust and user-friendly with Apache Airflow Integration

The post MedPerf enhances User Experience with improved Data Preparation Pipelines in Federated Clinical Studies appeared first on MLCommons.

]]>
What

The MLCommons Medical Working Group continues to advance real-world benchmarking and evaluation of AI/ML models in healthcare with MedPerf, its open-source orchestrator designed for federated studies. Building upon expertise on global large clinical studies, MedPerf is now integrating Apache Airflow to make data preparation pipelines distribution, orchestration and monitoring in federated studies much easier. By combining MedPerf’s federated experiment orchestration with Airflow’s mature scheduling and monitoring capabilities, the Medical WG makes it easier for the community to run reproducible clinical AI studies in the real-world.

Why

The need for Airflow integration arose from challenges we faced when running our past clinical studies. Before this integration the Benchmark Committee on MedPerf was in charge of creating a single data preparation container, which was then shipped to the Data Provider to run data preparation locally. This monolithic containerization approach often led to unexpected errors, such as Data Providers having different types of input data than expected or Data Providers confusing execution instructions. Troubleshooting also proved challenging, as the pipeline developers would have to remotely assist users. Moreover, a single container for the entire data preparation process made it impossible for Data Providers to easily resume after an error. Finally single pipeline steps cannot be reusable in other pipelines. During the Federated Learning for Postoperative Segmentation of Treated glioblastoma (FL-PoST) study, a single container was developed that could resume execution after being interrupted (please see our previous technical report for details on this). While this approach solved most issues, it was not scalable because new containers would have to be built for each new study.

Solution

The Airflow integration on MedPerf aims to improve data preparation usability and robustness in a scalable way. Thanks to Airflow, it is possible to chain multiple containers into a single Data Preparation workflow. This brings two key benefits to MedPerf users:

  • Each container becomes a modular step of the Data Preparation pipeline, allowing it to be utilized in multiple different workflows
  • Airflow has its own WebUI for pipeline execution and monitoring, which makes it easier to diagnose errors and allows users to restart pipelines from arbitrary steps.

To make things much easier for MedPerf non-technical users (i.e. study coordinators), they can build their own pipelines using a single YAML file (examples are provided here). The MedPerf client automatically converts this into executable Airflow’s Directed Acyclic Graphs (DAGs) so that Airflow can execute it.

Technical Details

Whenever MedPerf instantiates Airflow it runs only locally, using a randomly generated password each time, and is shut down automatically once data preparation is complete or interrupted by the user. This ensures protection of sensitive data on the Data Provider’s side. The integration also includes a small utility that periodically reports data preparation status back to the MedPerf server using Airflow’s REST API. This is very beneficial to the Benchmark Committee to review the data preparation status across the whole federation (i.e. multiple sites) and provide proactive assistance to the participating sites.

Example: Data Preparation on Federated Learning for Postoperative Segmentation of Treated glioblastoma (FL-PoST) 

The new implementation of the data preparation pipeline for Federated Learning for Postoperative Segmentation of Treated glioblastoma (FL-PoST) study is described in detail on the  README.md file in the MedPerf code repository. Below we describe key points of this implementation.

The status for data preparation can be easily verified in the Airflow’s WebUI. Figure 1 shows the status of data preparation (i.e. step) on a single site through Airflow’s WebUI, while Figure 2 shows a partial graph view displaying the first few steps of the workflow.

Figure 1: DAG view on Airflow, corresponding to each step of the data preparation along with its status

Figure 2: Partial Graph View of all generated tasks in Airflow.

The data preparation pipeline of this particular study also includes a Manual Review step, in which a human operator on the Data Provider’s side must manually review the generated tumor segmentation maps and approve or make further annotation corrections. Meanwhile Airflow sets the task into an “Up for Reschedule” and waits until the file is reviewed and copied into the proper directory per data preparation guidelines (see Figure 3). Once all reviewed files are reviewed and copied per data preparation guidelines, Airflow will automatically resume execution.

Figure 3: Airflow waiting for Manual Review in two tasks, while running a different task

Airflow’s “Approval Operator” feature, as of version 3.1.0,  enables a human operator to issue a manual approval for pipelines to run. In the Federated Learning for Postoperative Segmentation of Treated glioblastoma (FL-PoST) study such a manual approval is requested once all brain scans have been processed and reviewed by the human operator (see Figure 4). This is the final step after which the pipeline automatically generates the train and test splits for the experiment.

Figure 4: Notification of a Required Action in Airflow’s WebUI.

The required YAML file follows similar conventions to the typical container-config.yaml files used when submitting containers to MedPerf, but chains multiple containers in a single step. The basic structure is as follows, showing a single step. The mounts field in particular follows the exact same structure as the container-config.yaml files used in MedPerf for submitting Model and Metrics containers.

steps:
  - id: unique_id_for_step
    type: container
    image: container_image_name:tag
    command: command to run in container
    mounts:
      input_volumes:
        data_path:
          mount_path: /path/to/input/data/directory
          type: directory
        labels_path:
          mount_path: /path/to/input/labels/directory
          type: directory
        parameters_file:
          mount_path: /path/to/parameters.yaml
          type: file
      output_volumes:
        output_path:
          mount_path: /path/to/output/data/directory
          type: directory
        output_labels_path:
          mount_path: /path/to/output/labels/directory
          type: directory
    next: id_for_next_step_or_null_if_last_step

What’s Next

The Apache Airflow integration is now available in MedPerf and is already running in production in the FL-PoST study.  We invite the broader medical AI community to explore the implementation, test it in your own federated workflows and contribute back to the project. 

– Explore the full implementation in the MedPerf GitHub repository 

– Step-by-step YAML examples for building your own pipelines are available here.

– Interested in running a federated clinical study with MedPerf? Join our community.

The Medical Working Group continues to develop MedPerf as a community resource for reproducible, trustworthy AI evaluation in healthcare.  Stay connected with MLCommons to follow future developments.

About MedPerf MedPerf is an open-source platform developed by the MLCommons Medical Working Group for benchmarking and evaluating AI/ML models in healthcare. Designed for federated clinical studies, MedPerf enables secure, reproducible AI evaluation across multiple sites without requiring raw patient data to leave local systems. 

About MLCommons MLCommons is a non-profit engineering consortium dedicated to accelerating machine learning innovation for the benefit of humanity. Through open collaboration between industry, academia, and research institutions, MLCommons develops benchmarks, datasets, and best practices that advance AI safety, performance, and accessibility.

The post MedPerf enhances User Experience with improved Data Preparation Pipelines in Federated Clinical Studies appeared first on MLCommons.

]]>
Bringing Text-to-Video to MLPerf Inference v6.0 https://mlcommons.org/2026/03/texttovideo-inference/ Tue, 10 Mar 2026 14:17:44 +0000 https://mlcommons.org/?p=3670 MLCommons introduces a new video generation benchmark based off Wan2.2 and VBench

The post Bringing Text-to-Video to MLPerf Inference v6.0 appeared first on MLCommons.

]]>
Introduction

The MLCommonsⓇ MLPerfⓇ Inference benchmark suite is an industry standard for measuring the performance of machine learning (ML) and artificial intelligence (AI) workloads from diverse domains including vision, speech, and natural language processing. For each of these domains, the suite includes a carefully selected set of workloads that represents the state of the art in the industry across different application segments. These benchmarks not only provide key information to consumers for making application deployment and budget decisions but also enable vendors to deliver critical workload optimizations within certain practical constraints to their customers.

Over the past year, we have seen rapid advancements in the capabilities of video generative models like OpenAI Sora2. Gone are the days of hobbyists generating uncanny clips of Will Smith eating spaghetti; now professional artists create entire workflows around the capabilities of these models. As they now transition from being mere curiosities to becoming core parts of creative workflows, the need for a standardized benchmark has become clear. And thus the MLPerf Text-to-Video Task Force was convened to incorporate a dedicated video generation benchmark into the MLPerf suite.

Model selection

For this benchmark we picked the Wan2.2-T2V-A14B-Diffusers model (released July 2025) by Alibaba, as it was one of the best open weights models on the Text-to-Video leaderboard at the time. The model has been fully open sourced under an Apache 2.0 licence and can be run via Huggingface Diffusers.

The Wan2.2 model is run as a pipeline of 3 models:

  1. The UMT5 XXL Text Encoder from Google, used to encode prompts
  2. The Wan2.2 A14B Diffusion Transformer, used to generate a latent video representation
  3. The Wan2.2 VAE Decoder, used to decode the latent video into a series of frames.

The crucial architectural feature of the Wan2.2-T2V-A14B-Diffusers model is that it is a type of Mixture of Experts model. However, unlike the standard MoE architecture, there is no gating network used to route tokens to experts, instead this model consists of 2 experts which are activated sequentially during the denoising process. The first expert is known as the “High Noise Expert” and is active during the early stages of denoising, after which the model switches to the “Low Noise Expert” which completes the denoising process.

Counterintuitively, most video generation models do not generate videos frame by frame, instead they generate the entire video at once, by denoising a massive video latent. This video latent usually represents both a spatial and temporal segment of the video, for example the Wan2.2 latent represents an area of 32×32 pixels across 4 frames. This means that generating a 5 second video at 720p resolution and 16fps would require a sequence length of 19,320. This leads to the model being heavily compute bound.

Performance metrics

One of the key difficulties we encountered when designing this benchmark was deciding which performance metrics to use. The text-to-video task is computationally expensive with long runtimes, with many systems taking multiple minutes per query.

One consequence of this performance profile is that we had to limit the length, resolution and number of videos generated to ensure that this benchmark remained feasible, whilst also demonstrating frontier capabilities. Therefore we made the following decisions.

  • Configuration: We limited the length of the generated videos to 5 seconds, whilst fixing the resolution at 720p. This meant that we would generate 81 720×1280 frames at 16fps.
  • Runtime Target: To ensure the benchmark remains accessible to a wide range of submitters, we reduced the dataset size to a practical subset (100 out of 248 samples for performance mode, keep 248 for accuracy mode).

Replacing Server with SingleStream

One of the key changes we’re introducing in this benchmark is the replacement of the Server scenario with a SingleStream scenario for latency measurement. We based this decision on the large amount of compute needed to generate a single video, meaning that videos often take multiple minutes to generate. 

This poses a problem with the Server scenario where the system is assumed to be able to operate in near real-time. In practice this would mean that the System Under Test would quickly become overloaded with requests, meaning that most requests would spend the majority of their time waiting to be processed instead of being processed. This in turn would mean that the final latency measurements wouldn’t accurately reflect the hardware’s performance.

To solve this problem, we replaced the Server scenario with a SingleStream one, where we would only measure the time taken to process requests, ignoring all wait times.

Dataset and Accuracy Metric and task selection

We selected VBench as the official dataset and accuracy framework for this benchmark. We reached this decision after a comparative analysis of available options, prioritizing licensing feasibility, robustness, and ease of adoption.

We evaluated a wide range of datasets at the beginning, including OpenVid-1M, VidGen-1M, WebVid-10M, and ActivityNet. A primary filter for selection was commercial viability, as MLPerf submissions often come from industry partners for commercial hardware validation.

  • VidGen-1M and WebVid-10M were disqualified due to restrictive licensing (e.g., Non-Commercial or Research-Only terms), which leads to legal risks for benchmarking.
  • OpenVid-1M offered a permissible license (CC BY 4.0) but functioned solely as a dataset and lacked an integrated evaluation framework.

VBench distinguished itself by offering a holistic solution that combined a diverse prompt set with a pre-validated scoring suite. Unlike raw datasets that would require us to independently develop and validate separate accuracy metrics (such as FVD or IS), VBench provided:

  • Comprehensive Metrics: A suite of 16 distinct quality dimensions, including Subject Consistency, Motion Smoothness, and Aesthetic Quality.
  • Standardized Prompts: A curated list of ~950 prompts designed to stress-test specific generation capabilities.
  • Determinism: Experiments confirmed that VBench scores remained stable across different hardware backends (NVIDIA, AMD) with fixed seeds, a critical requirement for cross-vendor fairness.
  • Widespread Adoption: VBench is a widely used benchmark by video model builders and has been cited in several technical reports, including that of the Wan model family.

While VBench provided the most robust framework, its default configuration was computationally expensive, requiring over 80 hours for a full inference pass. To align with MLPerf’s accuracy check runtime, we adapted VBench by:

  • Subsetting the Dataset: We agreed to reduce to average the 6 key metrics: Subject Consistency, Background Consistency, Motion Smoothness, Dynamic Degree, Appearance Style, Scene.
    This selection focused on the most discriminatory dimensions, such as Dynamic Degree, Multiple Objects, and Scene Quality, while removing metrics that were redundant, static, or computationally trivial for datacenter-class hardware.
  • Reducing Dataset Size: By focusing on these 6 metrics, the dataset was reduced to a statistically significant subset (248 samples), striking the necessary balance between rigorous accuracy validation and manageable submission runtimes.

VBench was chosen by the taskforce, because it offered the only commercially viable, legally cleared, and methodologically complete framework that could be adapted to meet the rigorous runtime constraints of the MLPerf Inference benchmark.

Reference implementation

To ensure a fair and reproducible benchmark, the reference implementation for the Text-to-Video task is built on a standardized open-source foundation. 

Here is the reference setup:

  • Model Architecture: We utilize the Wan2.2-T2V-A14B-Diffusers model (hosted by Wan-AI). This is a 14-billion parameter Diffusion Transformer designed for high-quality video generation. 
  • Precision & Compute: The reference implementation runs in BF16 (BFloat16) precision. This choice reflects modern datacenter standards, balancing numerical stability with efficient memory usage.
    • Reference Accuracy Score: 70.48 (VBench).
    • Minimum Accuracy Threshold: 69.77 (99% of reference).
  • Generation Pipeline: The reference pipeline is adapted from the Hugging Face Diffusers library, ensuring broad compatibility and ease of use.
    • Input: Text prompt + Fixed Latent Tensor (to ensure deterministic outputs for debugging and verification).
    • Scheduler: Uses the UniPCMultistepScheduler (aligned with the Wan2.2 default) to optimize step efficiency.
    • Reduced diffusion steps (per TF discussion): 20 steps in diffuse process.
    • Output: 720px1280p resolution at 16 frames per second, 81 frames in a video.
  • Containerization: To simplify deployment, the entire reference stack including Python dependencies, CUDA 12.1 libraries, and VBench evaluation tools is provided as a Docker container. This allows submitters to “build and run” the benchmark with a single launch.sh script.

Conclusion

With the introduction of the Text-to-Video task in MLPerf Inference v6.0, we have taken a significant step to generative video workloads. This benchmark provides the industry with a reliable, reproducible approach to measure the rapidly evolving capabilities of both hardware and software.

The architectural decisions behind this inaugural benchmark reflect the the real video generation tasks today:

  • Model Selection: We selected Wan2.2-A14B-Diffusers, a powerful open-weights T2V model that represents the state-of-the-art in open-source generation.
  • Performance: We adopted SingleStream as the primary metric, and keep the standard Offline scenarios. The high-fidelity video generation task is currently a compute-bound, high-latency task which fits best for the current use cases .
  • Accuracy: We integrated VBench as our evaluation framework, ensuring that performance optimizations do not come at the cost of visual fidelity, motion coherence, or prompt adherence.

This release represents a foundational baseline, but the field is moving fast. As generation latencies drop from minutes to seconds, we expect that the benchmark will evolve to include Server Mode scenarios to reflect in real-time use cases.

The post Bringing Text-to-Video to MLPerf Inference v6.0 appeared first on MLCommons.

]]>
A New Standard for AI Risk: How the AILuminate Global Assurance Program Is Reshaping Reliability https://mlcommons.org/2026/02/ailuminate-global-assurance/ Fri, 20 Feb 2026 02:30:00 +0000 https://mlcommons.org/?p=3783 A structured framework for measuring AI risk — built for developers, decision-makers, and the organizations deploying AI at scale.

The post A New Standard for AI Risk: How the AILuminate Global Assurance Program Is Reshaping Reliability appeared first on MLCommons.

]]>
The artificial intelligence industry has reached an inflection point. As organizations move AI from experimental pilots into mission-critical operations across finance, healthcare, and manufacturing, one question has emerged as the central barrier to enterprise adoption: how do we verify that these systems are reliable?

Today, the MLCommons Association—backed by a coalition including KPMG, Google, Microsoft, and Qualcomm—is announcing the creation of the AILuminate Global Assurance Program (AIL GAP). This program represents a significant development: a commitment to build a structured, data-driven mechanism for evaluating AI reliability that bridges the persistent gap between high-level standards or policy frameworks and on-the-ground technical performance.

Why This Matters for Risk and Compliance

Unlike traditional software, AI models produce probabilistic outputs—results that vary based on data, context, and configuration. Existing standards, such as ISO/IEC 42001, provide essential procedural and governance-level requirements. Still, they do not specify the empirical metrics needed to demonstrate that a given model performs within acceptable risk thresholds. Put another way, how do we appropriately evidence adherence to those standards? The AILuminate Global Assurance Program aims to address that gap directly.

Three Pillars of Assurance

The program will be organized around three core pillars, each targeting a distinct need within the AI lifecycle.

Build: Benchmarking-as-a-Service (BaaS). AI developers will be able to integrate proven, private, non-saturated benchmarks directly into their pre-release workflows. The service will offer both practice testing to guide iterative model tuning and official testing to produce verified performance results. For compliance teams, this means that risk assessment will be tightly integrated into the AI development life cycle, both pre- and post-launch. 

Show: The AILuminate Risk Label. The program aims to distill benchmark results into a clear risk label designed for decision-makers and non-specialists. This label should translate technical metrics into a format that supports corporate governance, procurement decisions, and alignment with higher-level standards—giving risk professionals a consistent, comparable indicator of model safety.

Scale: The AILuminate Global Framework. Recognizing that AI deployment is global, the program will include a technological framework for developing region- and language-specific benchmarks and for adapting to industry-specific needs. This will ensure that standards remain relevant and enforceable across jurisdictions, a critical consideration for organizations operating in multiple regulatory environments.

How Your Organization Can Participate

We are designing the AILuminate Global Assurance Program to be an open, evolving initiative. Its technical specifications and benchmarks will be iterative by design, intended to keep pace with the rapid advancement of AI capabilities. There are several concrete ways to engage:

  • Risk and compliance professionals can contribute feedback and help shape standards as they mature by joining the Global Assurance Program. Please complete the AILuminate Global Assurance Program Interest Form, and we will be in touch with details on when and how to participate. 
  • Organizations evaluating or deploying AI systems can begin referencing AILuminate benchmarks and risk labels as part of their vendor assessment and due diligence processes. 
  • Development teams interested in integrating Benchmarking-as-a-Service into their model validation pipelines should contact [email protected] for more information. 
  • Organizations with regional or sector-specific expertise are encouraged to collaborate on extending the Global Framework to address local regulatory needs by completing the AIRR Contributor Interest form and checking the boxes under “multicultural” in the workstream interest section. 

The program operates under the MLCommons Association, and information on joining or contributing is available at mlcommons.org.

The Path Forward

History has shown that industries mature when they adopt shared, transparent standards for safety and reliability. The AILuminate Global Assurance Program represents a deliberate step toward replicating that trajectory for artificial intelligence. For risk and compliance professionals, this is the moment to come off the sidelines and help actively define the standards that will govern AI accountability for years to come.

The post A New Standard for AI Risk: How the AILuminate Global Assurance Program Is Reshaping Reliability appeared first on MLCommons.

]]>
MLCommons Lays the Foundation for Defensible Jailbreak Benchmarking https://mlcommons.org/2026/02/jailbreak-0-7/ Mon, 16 Feb 2026 22:04:25 +0000 https://mlcommons.org/?p=3761 MLCommons unveils a mechanism-first methodology for single-turn, inference-time jailbreak taxonomy generation, providing the structural foundation for defensible, reproducible evaluation of prompt-based safety bypasses in deployed AI systems.

The post MLCommons Lays the Foundation for Defensible Jailbreak Benchmarking appeared first on MLCommons.

]]>
As large language models move into safety-, security-, and compliance-critical environments, robustness to adversarial prompting becomes an operational requirement. Single-turn jailbreak attacks, where a user attempts to bypass safeguards through carefully crafted prompts, continue to expose weaknesses in deployed systems.

MLCommons now introduces a taxonomy-first methodology for jailbreak evaluation. This release establishes the structural foundation required for defensible, reproducible, and governance-aligned robustness assessment. Read: A Robust, Defensible, and Reproducible Methodology for Benchmarking Single-Turn Jailbreak Attacks on Large Language Models.

The Problem: Ad Hoc Jailbreak Testing Limits Defensibility

Single-turn, inference-time prompt attacks (“jailbreaks”) remain one of the most practical and persistent attack surfaces for deployed LLMs. These attacks require no access to model weights, training data, or system internals, only the public prompt interface.

However, existing evaluation approaches often rely on:

●       Informal collections of attack strategies

●       Outcome-based groupings rather than mechanism-based classification

●       Non-deterministic labeling

●       Inconsistent coverage across attack families

This creates three systemic problems:

  1. Weak reproducibility – Different organizations evaluate different implicit attack sets.
  2. Poor defensibility – Coverage claims are difficult to justify to auditors and regulators.

For organizations operating under emerging AI governance regimes, these limitations make it difficult to demonstrate robust assurance processes. Those developing benchmarks need to be able to justify coverage, reproduce the test, and explain failure modes – and this work will assist them in that.

A Methodological Shift: Taxonomy-First Benchmark Design

This is not a Benchmark release: Rather than expanding prompt volume or publishing leaderboard-style metrics, this work prioritizes foundational infrastructure.

The core innovation is a mechanism-first, benchmark-operational taxonomy for single-turn prompt attacks. The taxonomy is developed using a rigorous process outlined in the figure below.

The taxonomy:

●       Classifies attacks by how they manipulate model behavior at inference time

●       Enforces one-instance-to-one-leaf mapping for deterministic labeling

●       Uses consistent splitting rules at every hierarchy level

●       Defines executable categories suitable for corpus construction

In short, taxonomy design becomes a first-order methodological commitment, not an afterthought. Further, this structured development process ensures that categories remain:

●       Deterministic

●       Extensible

●       Robust

●       Defensible

Establishing Benchmarks and Their Evaluation Methodology

From the experience of constructing a mechanism-first jailbreak taxonomy and implementing representative attacks across its categories, several practical lessons emerged for establishing robust and defensible benchmarks:

●       Taxonomy Design Shapes Benchmark Quality: A clearly defined, mechanism-first taxonomy is not just a classification tool but the backbone of benchmark construction. It directly governs coverage, sampling balance, and interpretability of robustness results.

●       Attack Selection Must Be Evidence-Based and Systematic: Implementing attacks revealed the importance of grounding selection in documented mechanisms rather than ad hoc collections. Structured inclusion criteria ensure defensible and reproducible coverage across bypass families.

●       Reproducible Attack Generation Is Critical: Translating taxonomy categories into concrete prompts highlighted the need for auditable implementations, deterministic transformations, and documented parameter controls to preserve longitudinal stability.

●       Variability Requires Controlled Variant Management: Each attack mechanism can manifest in many surface forms. Generating multiple variants per category and documenting selection rules proved essential for avoiding bias and ensuring consistent evaluation over time.

●       Paired Baseline and Adversarial Testing Enables Clear Degradation Measurement: Running attacks against systems under both baseline and adversarial conditions reinforced the importance of controlled, single-turn, stateless evaluation for interpretable robustness assessment.

●       Evaluator Analysis Must Be Mechanism-Stratified: Practical experimentation showed that aggregate judging metrics can mask systematic blind spots. Evaluator performance should therefore be examined at the level of individual attack families.

Together, these lessons demonstrate that defensible jailbreak benchmarking depends on principled taxonomy construction, reproducible attack instantiation, and mechanism-aware evaluation design rather than scale alone

Shaping the Future of AI Security Evaluation

As jailbreak techniques continue to evolve, the next phase of this work will focus on expanding coverage, strengthening reproducibility, and scaling evaluation infrastructure. Key priorities include:

●       Ensuring Comprehensive Coverage: Systematically implement and validate attacks across all taxonomy branches to ensure balanced and mechanism-level completeness.

●       Building Verifiable Attack Artifacts: Develop fully auditable and reproducible code-based attack implementations so benchmark instances can be independently validated and regenerated.

●       Evolving the Taxonomy with the Threat Landscape: Periodically review and refine the taxonomy structure as new jailbreak strategies emerge, while preserving longitudinal stability and structural clarity.

●       Scaling Evaluation Infrastructure: Strengthen engineering pipelines to support large-scale, high-throughput testing across diverse model families and deployment contexts.

●       Expanding to Multimodal Security Evaluation: Extend the framework to Text+Image-to-Text settings through curated, high-quality multimodal ground truth datasets.

Join the Effort

Advancing robust and defensible AI security evaluation requires sustained collaboration across research, engineering, and policy communities. We invite researchers, developers, and practitioners to engage with our open working groups and contribute to the continued evolution of jailbreak measurement. Contributions may include:

●  proposing and implementing novel, well-documented jailbreak techniques for inclusion in future benchmark releases;

●  strengthening engineering pipelines to support scalable and continuous model evaluation;

●  helping extend the framework to multimodal settings through high-quality dataset curation and security testing.

Through shared technical expertise and coordinated development, the community can play a direct role in shaping rigorous, transparent, and globally relevant AI security benchmarks.


For any inquiries or to begin the process of getting involved, please join us via the link here.

The post MLCommons Lays the Foundation for Defensible Jailbreak Benchmarking appeared first on MLCommons.

]]>
What’s New in Croissant 1.1: Extensible, Agent-Ready ML Dataset Standard https://mlcommons.org/2026/02/croissant-1-1-standard/ Thu, 12 Feb 2026 15:09:01 +0000 https://mlcommons.org/?p=3706 Machine-actionable provenance, semantic interoperability, and automated governance for the agentic era of AI

The post What’s New in Croissant 1.1: Extensible, Agent-Ready ML Dataset Standard appeared first on MLCommons.

]]>
Building on the capabilities of Croissant 1.0, MLCommons is announcing Croissant 1.1, the latest evolution of its community-built metadata format for machine learning datasets. While Croissant 1.0 established a standardized, machine-readable structure for dataset metadata, Croissant 1.1 adds machine-actionable provenance for complete data lineage, vocabulary interoperability to link metadata to domain-specific ontologies, structured usage policies for automated enforcement of consent and licensing, and enhanced data modeling for complex, multi-dimensional datasets.

It introduces new capabilities designed for the “agentic” era of AI, including machine-actionable provenance, expanded schema types, and governance tags to make datasets fully interpretable and reusable by autonomous systems.

The newer addition enables chain-of-custody checks and audits in data-centric AI systems. It adopts the W3C PROV-O model for provenance. Metadata can record how each dataset, file, or individual record was derived by linking it to source data and processing steps, and by attributing the agents (people or software) responsible. 

Fig: Croissant captures machine-readable data provenance

The chain-of-custody approach provides transparency to systems and auditors. They can trace a dataset through entities, activities, and agents to assess its origins and processing history. The result is a detailed audit trail embedded in the metadata, helping users verify data quality and compliance. For example, widely-used datasets like Common Crawl have adopted Croissant 1.1 metadata, demonstrating how machine-readable provenance and processing semantics can be embedded at scale.

Croissant 1.1 introduces a flexible vocabulary framework. The format supports linking metadata to external vocabularies or identifiers at multiple levels (dataset, field, or data type) so that domain-specific semantics can be reused without reinvention. For example:

  • Dataset level: A dataset can reference a Wikidata or ontology ID (such as a disease or event) to classify its content, enabling cross-repository discovery.
  • Field level: A column can point to a controlled vocabulary term (e.g., an environmental or phenotype concept) to clarify its meaning or cite a source.
  • Data level: A field’s values can be annotated with a semantic class (for example, linking a “location” field to a geographic concept).

These conventions mean Croissant metadata can incorporate existing standards directly. Croissant was designed to be modular and extensible, so if your data already follows an ontology, you simply reference it. This interoperability is vital for portability and interoperability. 

This version also strengthens support for data governance. The format can encode usage permissions and restrictions using standard policy vocabularies. For detailed consent requirements, Croissant integrates the Data Use Ontology (DUO), allowing datasets to be tagged with permitted-use categories such as “General Research Use” or “Non-Commercial Use.” These DUO tags make consent restrictions machine-discoverable.

For finer-grained control, Croissant can include W3C ODRL(Open Digital Rights Language) policies, a standard for expressing usage rules. Embedding DUO codes or ODRL terms allows agents to automatically verify whether a proposed use is permitted. Together, these capabilities position Croissant as an active enforcer of data governance within automated workflows.

Croissant 1.1 also improves how complex ML datasets are described. Fields can now represent multi-dimensional arrays, and new properties allow them to include semantic types, example values, or validation rules. Each data row can carry a clear semantic meaning, such as one field being an image and another a numeric label. 

Thanks to these new capabilities, Croissant 1.1 is the ML dataset metadata standard for today’s AI landscape. It combines schema.org’s broad reach with an extensible vocabulary to produce fully machine-actionable metadata graphs that capture provenance, semantics, and governance in one place.

Self-describing metadata with embedded provenance and governance becomes crucial as AI systems move toward open models and autonomous agents. Each dataset becomes self-describing, with embedded audit trails and usage policies that support trust. The community uptake has been strong. 700K datasets now carry Croissant metadata, and primary tools and frameworks (such as TensorFlow and PyTorch for machine learning, as well as Dataverse and CKAN for data publishing) already load Croissant datasets natively. Major dataset repositories such as Hugging Face, Kaggle, and OpenML embed Croissant metadata. There is also growing community interest in applying similar provenance standards to service-generated dataset collections provided by data companies such as HumanSignal and CommonCrawl.

We encourage dataset creators to adopt Croissant 1.1 to make their data easier to find and use. Rich, interoperable metadata embedded in every dataset helps create an AI ecosystem where agents can autonomously discover and use data while fully respecting provenance, privacy, and permissions.

The post What’s New in Croissant 1.1: Extensible, Agent-Ready ML Dataset Standard appeared first on MLCommons.

]]>
Technical Standards are the Bridge to Enabling AI Adoption https://mlcommons.org/2026/02/ai-standards-bridge-adoption/ Wed, 11 Feb 2026 16:16:22 +0000 https://mlcommons.org/?p=3709 How measurable benchmarks like AILuminate translate risk management objectives into the enterprise trust AI needs to scale

The post Technical Standards are the Bridge to Enabling AI Adoption appeared first on MLCommons.

]]>
As artificial intelligence moves from a fun consumer chat experience to a general-purpose technology that powers enterprise services across the economy, it faces a significant reliability barrier. Enterprises need to trust that an AI system is reliable – that it produces correct, safe, and secure responses – before that system will be put in a position to deliver even more value. Overcoming this obstacle critically depends on the development of evaluation standards that form the bridge – codified in code –between traditional standards, such as those developed by ISO/IEC, and the inherently non-deterministic and rapidly evolving nature of AI systems.  

To build the necessary trust for widespread enterprise adoption, the industry must adopt risk management standards that reduce deployer uncertainty. Until enterprises, including smaller businesses, feel comfortable giving an AI agent access to their corporate data to autonomously negotiate pricing agreements on their behalf, we’ll never be able to have the kinds of automated transactions the industry is building toward today. What reliability will an AI vision system need to demonstrate – how many nines at the end of 99.9…% – before it is trusted to review an oil pipeline for damage? What will be required for deploying an AI clinical support tool to doctors to assist with diagnosis? What about deployments on a manufacturing line, where an hour of downtime is millions in lost revenue? Deploying AI systems in higher-risk and higher-trust applications, such as in finance, healthcare, manufacturing, and elsewhere, will demand meaningfully higher levels of reliability than we have today.  That also means we’ll need to be able to actually measure that reliability… reliably itself! 

Ultimately, the reliability objectives and procedural requirements are laid down in standards like ISO/IEC 42001, determined by consensus, as they have been for other industries in need of risk management. Because AI is a probabilistic technology, evaluation standards that underpin these objectives are critical to continuously and empirically demonstrate reliability and thus compliance. 

This probabilistic nature makes AI fundamentally different from other technologies.  For example, civil engineers can sign off on the design of a bridge that meets standards and have near-complete confidence that the bridge will carry people and vehicles across in different weather conditions, because the bridge itself doesn’t change every time the hundredth car drives across it.  With an LLM, every time a person interacts, it produces a different result. This probabilistic behavior is what makes this new technology so powerful and adaptive – and also what makes it so difficult to reliably measure and evaluate.

As such, while AI developers must go through the same exercise when designing a system – reviewing plans and ensuring they meet objectives – they must also continuously measure and empirically demonstrate compliance with the stated reliability goals under varied, real-world conditions. With AI, using the same inputs twice will generate two different outputs, and this is by design. Therefore, it is necessary to empirically measure both model inputs and outputs across different circumstances to determine if a risk has been appropriately mitigated. 

That’s where we come in.  Technical standards organizations like MLCommons act as a vital complement to conventional standards bodies such as ISO in the world of AI. The standards that organizations like ISO develop set broad direction, clear objectives, and qualitative requirements based on business needs, along with societal concerns. Benchmarking standards organizations translate these objectives into precise, actionable metrics. This relationship ensures that the objectives identified in ISO standards are grounded in empirical data that model developers and enterprise users can actually apply.

For instance, MLCommons is extremely active in ISO work, such as the 42119 series, a standard for AI testing and assurance. Industry needs internationally agreed, consensus-driven, broad guidelines for AI measurement, which can then be realized through specific benchmarks like the MLCommons AILuminate benchmarks for generative AI security and product safety. These technical specifications must evolve rapidly to keep pace with the speed of AI innovation, providing a “living” bridge between standard objectives and industry practice.

Ultimately, standardized evaluations are what drive progress and build public trust. Historical precedents like the New Car Assessment Program (NCAP) show that rigorous safety testing can transform an entire industry, increasing the market share of 5-star safety rated vehicles from a negligible baseline to 86%+ in a large market over several decades. By applying this same level of technical rigor to AI through evolving benchmarks like AILuminate, the industry can ensure that AI becomes more secure and reliable, unlocking higher value markets for companies and delivering increasing value for consumers.

Join the Effort

Building trustworthy AI takes a global effort. Join MLCommons and help shape the technical standards that will define AI reliability for the next decade. With 125+ member organizations already contributing to benchmarks like AILuminate, there’s a seat at the table for every organization committed to making AI safer, more reliable, and more widely trusted.

Learn more about MLCommons membership →

The post Technical Standards are the Bridge to Enabling AI Adoption appeared first on MLCommons.

]]>
DLRMv3: Generative recommendation benchmark in MLPerf Inference https://mlcommons.org/2026/02/dlrmv3-inference-meta/ Tue, 10 Feb 2026 14:00:00 +0000 https://mlcommons.org/?p=3663 HSTU-Based Architecture Brings Production-Scale Sequential Recommendations to MLPerf

The post DLRMv3: Generative recommendation benchmark in MLPerf Inference appeared first on MLCommons.

]]>
Introduction

Scaling compute, guided by neural scaling laws, has dramatically reduced the need for manual feature engineering in natural language processing (NLP) and computer vision by using large-scale attention-based transformer models to automatically learn rich representations from data [1][2]. A similar trend is transforming deep learning-based recommendation systems, which have traditionally relied on architectures built from multi-layer perceptrons (MLPs), graph neural networks (GNNs), and embedding tables [3][4][5]. Recently, large sequential and generative models have been successfully deployed in online content recommendation platforms, delivering substantial improvements in model quality [5][7][8][9][10][11][12][13]. Given the global scale and importance of recommendation systems [6], incorporating such large-scale sequential recommendation models into the MLPerf Inference benchmark suite helps support ongoing infrastructure development.

We introduce DLRMv3, the first sequential recommendation inference benchmark in MLPerf’s DLRM family. DLRMv3 is built around an HSTU-based [5] architecture for ranking, capturing the dominant compute patterns of modern recommendation workloads: long input sequences, attention-heavy computation, and large embedding tables. Compared to the existing DLRM benchmark (DLRMv2 [14]), DLRMv3 increases the model size by 20X from 50GB to 1TB and raises the per-candidate compute by 6500X from 40M FLOP to 260 GFLOP over a span of just three years, aligning the MLPerf Inference suite with contemporary production-scale recommendation workloads, highlighting the increasing demand for compute. This higher compute regime is motivated by reported HSTU scaling behavior where higher model compute yields improved recommendation quality in production, enabling realistic resource burden–accuracy trade-off evaluation.

Task selection

Modern recommendation systems are typically deployed as multi-stage pipelines that separate candidate retrieval from ranking and, in some cases, additional refinement stages such as re-ranking or business-logic post-processing [15][16]. In a common design, a retrieval model first selects a small subset of relevant items from a very large corpus, optimizing for high recall, coverage, and strict latency and memory constraints at scale [15][7]. A downstream ranking model then scores these candidates using richer features and a more expressive architecture, optimizing for fine-grained user engagement metrics (e.g., click-through rate (CTR), watch time, satisfaction) under somewhat looser but still production-critical latency and throughput constraints [15][16][18]. This separation of stages is now standard in large-scale industrial systems, including those used for web, video, and social content recommendation.

DLRMv3 focuses on the ranking stage of this pipeline. Ranking models typically dominate the overall ML compute budget in production recommendation systems and are a focus of innovation in model architecture (e.g., attention-based sequential models and larger embedding tables), making them especially relevant for hardware and systems benchmarking. Focusing on ranking also maintains continuity with prior MLPerf DLRM benchmarks which target click-through-rate prediction.

Formally, given a user’s interaction history (e.g., a sequence of previously viewed or engaged items) and a candidate item, the DLRMv3 model predicts the probability of a desired outcome such as a click, like, or watch. This probabilistic prediction task is directly aligned with earlier DLRM benchmarks, which also model CTR-style binary outcomes.

Model selection

Figure 1. Model Architecture of different DLRM models.

We introduce an HSTU-based architecture as MLPerf’s third-generation deep learning recommendation benchmark (DLRMv3). In the DLRM lineage, DLRMv1 is built from MLPs and embedding tables with simple dot-product feature interaction, and DLRMv2 adds a deep-cross network component for richer explicit feature crossing. DLRMv3 introduces a new sequential feature transformation, interaction, and extraction component based on the hierarchical sequential transduction unit (HSTU) [5] while retaining a single, large embedding table and a top MLP for final prediction (Figure 1(c)).

HSTU-style architectures have been shown in production systems to effectively model long user interaction histories and improve recommendation quality relative to traditional MLP/DCN-based models at comparable or higher compute resources. They also reflect a set of compute characteristics that increasingly dominate modern recommendation inference—long sequences, attention-heavy computation, and large embedding tables—making HSTU a representative and forward-looking choice for a system-level benchmark.

In the table below, we compare model configurations among different generations of DLRM. The 260 GFLOP is computed as 2 * layers * (UIH_length * UIH_length * EmbDim / 2 + UIH_length * EmbDim * EmbDim * 4 + UIH_length * EmbDim * EmbDim * 3), which accounts for attention FLOP as well as the pre- and post-attention GEMMs. Note that this 260 GFLOP “per-candidate” number is an effective normalization: in a typical ranking request, the HSTU encoder processes the shared user-interaction-history (UIH) sequence once, and its output is reused when scoring a candidate set (2K candidates in DLRMv3), so the dominant UIH encoding computational cost is amortized across candidates rather than repeated 2K times. Moreover, because DLRMv3 uses a streaming time-series setup, deployments can reuse UIH-related KV states across consecutive timestamps for the same user, avoiding recomputation of the UIH encoding and reducing redundant dense compute by roughly 80–90% in steady state.

Model/Input ConfigurationsDLRMv1DLRMv2DLRMv3
Dense Inputs13 values13 values0 values
Sparse Inputs per candidate26 features, 208 lookups26 features, 214 lookups1 main feature, ~7K lookups
Embedding Tables26 tablesTotal hash size: 200MEmbDim: 12826 tablesTotal hash size: 200MEmbDim: 1281 main tableHash sizes: 1 billionEmbDim: 512
Feature InteractionDot interaction using no trainable parameters3 layers of LowRank DCN5 HSTU layers, with user interaction history sequence length ~7K
Embedding table size (float16 datatype)~50GB~50GB1TB
FLOP per candidate~5 MFLOP~40 MFLOP~260 GFLOP

To better align DLRMv3 with MLPerf Inference benchmarking goals and practical constraints, we introduce two intentional deviations from the original HSTU paper setup. These changes are motivated by the need for hardware-friendly, widely implementable benchmarking that still captures the important compute patterns of sequential recommendation models.

Action embedding preprocessing: The original HSTU model uses contextual interleaving of action embeddings, where contextual features and user actions are interleaved in the input sequence. This arrangement provides richer context to the model to learn dependencies between user behavior and item features. However, interleaving doubles the effective sequence length, which in turn significantly increases the computational cost. In DLRMv3, we omit action interleaving in the baseline and instead use a simplified input sequence where action embeddings are directly combined with contextual embeddings without expanding the sequence length. We choose this design because 1) the synthetic benchmark dataset (detailed in the next section) is intended for performance measurement and does not contain a sufficiently rich set of action features to justify the added modeling complexity and doubled sequence length, and 2) the non-interleaved option provides a more balanced accuracy–efficiency tradeoff that is appropriate for a standardized inference benchmark.

Time/position encoding: The original HSTU model uses a relative positional bias in the attention mechanism, implemented as Mask(SiLU(QKT)+bias)V, which helps capture relative temporal relationships between tokens and can improve accuracy. DLRMv3 instead adopts absolute time/position encoding, where a position-dependent bias is added to the query, key, and value vectors, and attention is computed as Mask(SiLU(QKT))V. We use absolute bias because relative bias introduces kernel optimization challenges and can slow down attention computation on many processors, whereas absolute encodings are widely supported, easier to optimize, and yield more predictable performance. 

Dataset selection

Unlike DLRMv1 and DLRMv2, DLRMv3 formulates recommendation as a sequential transduction task over long user interaction histories and a very large item set. To be representative of modern production workloads, a suitable benchmark dataset needs to satisfy several properties simultaneously: (1) reasonably long user-interaction histories per request (thousands of events), so that sequential models and attention layers are meaningfully exercised; (2) a very large item set, consistent with the single large embedding table used in DLRMv3 (hash size on the order of one billion); and (3) a streaming structure, where the items viewed by a given user and the user’s preferences evolve over time and inference requests can be replayed in timestamp order. To the best of our knowledge, no existing public recommendation dataset provides all of these characteristics at the required scale.

For this reason, DLRMv3 uses a synthetic dataset specifically designed to match the model and system characteristics of large-scale sequential recommendation. The generator simulates five million users interacting with a billion items over time. Items are partitioned evenly into 128 categories, and each user is randomly assigned four categories that define their long-term interests. We generate 100 timestamps per user. The first 90 timestamps are used for training, and the remaining 10 timestamps are reserved for inference. For each user and each timestamp, with 70% probability, we generate a user interaction history (UIH) and a candidate sequence, and with 30% probability, the user has no request at that timestamp. The average UIH sequence length per (user, timestamp) is approximately 100, and the candidate sequence length is 2K. The generated sequences respect the streaming setup: at earlier timestamps, only a prefix of item IDs in each category can be sampled, and the maximum reachable ID grows linearly over time, mimicking new items entering the system.

When generating each user’s sequence, item categories for each position are sampled using a Dirichlet-process-style mechanism, following the generation algorithm in Section C of [5]. This setup yields a “rich get richer” dynamic where previously visited categories are more likely to be revisited. For each sampled category, item, and timestamp, we generate a continuous user-item rating (used as an action feature) by combining (1) a category- and time-dependent rating profile that varies smoothly over time via cosine functions with (2) an intrinsic per-item rating drawn once from a global 1-5 distribution representing item quality. The final rating is their average and is then used to derive labels for prediction.

Before converging on the streaming generator described above, we also experimented with the fractal expansion [19] approach for scaling the user-item interaction matrix from a smaller dataset. In brief, fractal expansion applies randomized kronecker products to a small interaction matrix to increase the dataset size while certain static structural properties (e.g., item popularity distribution and user engagement distribution) are preserved. However, fractal expansion is fundamentally designed for static datasets and does not naturally produce the streaming structure required by DLRMv3. In particular, DLRMv3 needs temporally ordered requests with evolving user histories and an expanding reachable item universe (to mimic new items entering the system) so that inference can be replayed in timestamp order. Fractal expansion does not inherently model properties such as per-user sequences over time and preference drift. Due to this fact, we did not select fractal expansion for DLRMv3 and instead designed a generator that simulates the streaming and sequential characteristics of online recommendation workloads.

Checkpoint Training

During the benchmarking creation process, we decided to use a 1TB float16 embedding table to store the one-billion item set in order to reflect the production use cases for such a model. This checkpoint is later trained on a single NVIDIA HGX B200 with eight GPUs and the fully open-sourced Generative Recommenders repo. Because each B200 has 180GB HBM capacity, the 1TB item table can  be fully sharded to eight GPUs with enough capacity remaining in the GPUs’  HBM to store weights, activations, and gradients for the training loop to run. 

Performance and accuracy metrics

The performance metrics chosen for the DLRMv3 benchmark follow the MLPerf Inference conventions and are closely aligned with those used for DLRMv2:

  1. Offline scenario: Query processing throughput measured as queries per second (QPS) with no latency constraint beyond completing the full query set as quickly as possible.
  2. Server scenario: QPS measured under an end-to-end latency constraint on the system under test (SUT) reflecting an online serving environment where requests arrive over time and must be processed under strict tail-latency targets.

To better reflect real-world deployment demands for large-scale recommendation ranking, we set the 99th-percentile SUT latency threshold to 80 ms in the server scenario.

Unlike other MLPerf Inference benchmarks, DLRMv3 uses a streaming setup with explicit timestamped requests. As a result, there is a request ordering constraint: requests associated with earlier timestamps must be served before those with later timestamps. Because MLPerf LoadGen does not natively enforce temporal ordering, this constraint may need to be implemented via SUT-side scheduling or request handling logic to ensure that streaming semantics are preserved while optimizing for throughput and tail latency.

For accuracy verification, we provide a trained reference checkpoint with a model size of 1TB in float16. Accuracy is evaluated on 0.1% subset of the inference dataset, comprising 34,996 requests across 10 inference timestamps. We report normalized entropy (NE), accuracy, and area under the ROC curve (AUC) as the primary quality metrics. For the reference implementation, the observed metrics are:

NE: 86.69%,

Accuracy: 69.65%,

AUC: 78.66%.

These values serve as the target baseline for accuracy compliance in the DLRMv3 benchmark. For accepted submissions, all three metrics (NE, Accuracy, AUC) must be within 99.9% of the reference implementation values.

Reference implementation

We provide a reference implementation of the DLRMv3 benchmark at

https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v3. The reference implementation covers model implementation, synthetic data generation, and performance/accuracy evaluation under the MLPerf Inference rules, as detailed below:

Model implementation: we provide a PyTorch implementation of the DLRMv3 HSTU-based ranking model, including a large embedding table consistent with the 1B-hash and 512-dimension configuration, a HSTU sequential encoder stack with the DLRMv3 design choices, and customized Triton kernels for HSTU operations.

Synthetic data pipeline: we provide the synthetic data generation script as well as a streaming data loader that constructs per-request model inputs (user interaction history + candidate set) and labels, matching the benchmark specification.

Reference benchmark: the reference benchmark contains three main runtime components:

  1. Data producer: this part takes in sample/request ID and performs batching and outputs item IDs of batched samples. This part is executed on the CPU, and the reference implementation supports both single- and multi-threading in Python.
  2. Sparse forward: this part takes in item ID lists and outputs embedding lookup outputs. This part is executed on the CPU, as datacenter systems typically contain large CPU memory to store the 1TB embedding table. The reference implementation uses PyTorch’s default embedding lookup operation without any quantization (in the float16 datatype). The implementation serves as a baseline for future improvements, e.g. distributed embedding lookup, GPU lookup, etc.
  3. Dense forward: this part takes in batched embeddings and runs the HSTU encoder forward in the bfloat16 datatype to generate predictions. The implementation employs data parallelism across GPUs on a single host by distributing input batches to eight GPUs using Python multi-processing. On each GPU, the implementation uses customized Triton kernels to improve efficiency. This implementation serves as a baseline for future improvements, including more efficient kernels and low-precision quantization.

To illustrate the performance of the DLRMv3 reference implementation, we report end-to-end results for different numbers of NVIDIA H100 GPUs on a single host with batch size equal to 10. In each configuration, we increase QPS until the system approaches saturation while still maintaining reasonable latency.

#GPUsQPSAvg query timeP50P80P90P99
130078ms74ms86ms99ms130ms
250086ms75ms85ms101ms285ms
480070ms69ms79ms85ms100ms
8100060ms60ms64ms67ms82ms

A breakdown of query time at 8 GPUs highlights where time is spent in the baseline:

Query timeBatch queueing timeBatching timeSparse timeDense time
CPU embedding lookupH2DGPU timeCPU overhead
60ms11ms (18%)2ms (3%)3ms (5%)15ms (25%)26ms (43%)3ms (5%)

The results show sub-linear scaling in throughput (QPS) with up to eight GPU devices: going from one to eight GPUs increases QPS by 3.3X (300 to 1000), while average latency drops from 78 ms to 60 ms. At eight GPUs, the end-to-end P99 remains slightly above 80 ms due to batching, data movement, and Python/CPU-side overheads. This outcome is expected and consistent with the goals of the reference implementation. The current code is a Python-based baseline, not an aggressively optimized C++/production server, and it is explicitly not targeting the best possible performance. The numbers therefore demonstrate substantial headroom for submitters to optimize request scheduling, embedding placement, kernel optimization, and inference-server integration to satisfy the P99 latency constraint.

Optimization opportunities

DLRMv3 is intentionally designed to expose multiple layers of optimization opportunities across the inference stack–from the serving framework to dense and sparse kernels to KV-cache-aware scheduling. The reference implementation provides a clear but unoptimized baseline in several of these dimensions, leaving room for submitters to explore system and kernel improvements while preserving accuracy.

Inference server library. The current reference pipeline is implemented in Python with a simple CPU/GPU orchestration model. In production, ranking models of this scale are typically deployed behind high-performance inference serving frameworks. There is significant room to improve request batching, thread and stream management, and SUT-side scheduling by integrating with an optimized inference server library. 

KV cache integration. DLRMv3’s sequential HSTU architecture naturally supports KV caching, and there are two main dimensions where caching can yield meaningful gains:

  1. In-request cache (M‑FALCON-style microbatching). As discussed in the HSTU paper [5], when serving large candidate sets (e.g., >1K candidates), M‑FALCON applies microbatching over the candidates while reusing the KV cache for the shared user interaction history. This strategy does not change the model’s FLOP count requirements or accuracy, but it can materially improve performance by amortizing candidate processing and overlapping CPU and GPU work within a single request. In a DLRMv3 setting, this arrangement can reduce per-request latency without altering the underlying model.
  1. Across-request cache. For users who appear in multiple timestamps (as in the DLRMv3 streaming setup), the KV cache corresponding to their user interaction history can be reused across requests. This avoids recomputing the UIH encoding for each timestamp, reducing redundant FLOP by around 90% and improving both throughput and latency. Careful cache management and invalidation policies are needed to ensure correctness under streaming semantics.

Together, in-request KV microbatching (M‑FALCON style) and across-request KV caching offer complementary optimization levers: the former amortizes compute over large candidate sets and enables finer-grained CPU/GPU overlap within a request, while the latter reduces repeated UIH compute across consecutive requests from the same user. The reference implementation does not yet exploit these techniques, leaving them as explicit optimization opportunities for submitters targeting high-performance DLRMv3 deployments.

Dense optimization. The dense path in the reference implementation (HSTU encoder + top MLP) already uses Triton kernels and bf16, but there is still substantial room to improve. Further optimizations could include:

  • more efficient attention and feed-forward kernels tailored to DLRMv3’s sequence lengths and dimensions;
  • better utilization of vendor-specific libraries and exploration of lower-precision formats (e.g., FP8, INT8) under acceptable accuracy loss.
  • optimized multi-GPU scaling within a node via improved load balancing, overlapping of data transfers with compute, and more sophisticated pipeline parallelism.

Sparse optimization. The sparse forward stage currently uses TorchRec embeddings on the CPU, with bf16 and no quantization, to support 1TB size in a simple, portable way. This is a natural baseline, but production systems often employ a combination of GPU-based embedding lookup, distributed sharding, hierarchical caching (e.g., HBM + DRAM), and quantization/compression. Submitters can explore moving embedding tables to accelerators, applying different types of quantization, or introducing model-parallel embeddings, as long as DLRMv3 accuracy targets are met. Improving the embedding pipeline can reduce end-to-end latency and CPU bottlenecks for large-scale deployments.

Conclusion

DLRMv3 extends the MLPerf Inference recommendation suite into the regime of large-scale sequential models, matching the size, structure, and serving patterns of the state-of-the-art production systems. By combining an HSTU-based ranking architecture, a billion-item synthetic streaming dataset, and a reference implementation that cleanly separates sparse and dense computation, DLRMv3 exposes the challenges of modern recommender inference: long user histories, attention-heavy sequence modeling, massive embeddings, and tight tail-latency targets under streaming workloads.

Notably, DLRMv3 increases the model memory footprint by 20X (from 50GB to 1TB) and the per-candidate compute by 6500X (from 40 MFLOP to 260 GFLOP) compared to DLRMv2, reflecting the rapid evolution of recommendation models over just three years. This dramatic scaling in compute and memory enables richer representations and more expressive models, ultimately delivering greater capability and accuracy to the community. By benchmarking these advances, DLRMv3 highlights how performance optimization is driving the frontier of recommender systems.

The benchmark is intentionally designed to leave room for innovation across hardware and software stacks–from inference servers and kernel optimizations to embedding pipelines and KV-cache-aware scheduling, while providing clear accuracy and performance baselines. We hope DLRMv3 will serve as a foundation for the community to evaluate, compare, and advance systems for next-generation recommendation workloads.

Contributors

Lucy Liao (Meta), Yu He (Stanford), Ze Yang (Meta), Zhao Zhu (Meta), Chunxing Yin (Meta), Rafay Khurram (Meta), Yuanjun Yao (Meta), Han Li (Meta), Daisy Shi He (Meta), Shilin Ding (Meta)

References

[1] Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361.

[2] Scaling Vision Transformers. https://arxiv.org/abs/2106.04560

[3] Scaling Law for Recommendation Models: Towards General-purpose User Representations.  https://arxiv.org/abs/2111.11294 

[4] Understanding Scaling Laws for Recommendation Models. https://arxiv.org/abs/2208.08489 

[5] Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. https://arxiv.org/abs/2402.17152v2

[6] Recommender system is the single largest software engine on the planet — Jensen Huang. Q4’23 earning report. https://www.youtube.com/watch?v=txOv_pi-_R4&t=2020s 

[7] LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders. https://arxiv.org/abs/2505.04421 

[8] Towards Large-scale Generative Ranking. https://arxiv.org/abs/2505.04180 

[9] MTGR: Industrial-Scale Generative Recommendation Framework in Meituan. https://arxiv.org/abs/2505.18654 

[10] From Features to Transformers: Redefining Ranking for Scalable Impact. https://arxiv.org/abs/2502.03417 

[11] Hi-Gen: Generative Retrieval For Large-Scale Personalized E-commerce Search. https://arxiv.org/abs/2404.15675 

[12] HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling. https://arxiv.org/abs/2409.12740 

[13] Recommender Systems with Generative Retrieval. https://arxiv.org/abs/2305.05065 

[14] https://github.com/mlcommons/inference/blob/master/recommendation/dlrm_v2/pytorch/README.md 

[15] Deep Neural Networks for YouTube Recommendations, https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf 

[16] Deep Learning Based Recommender System: A Survey and New Perspectives, https://arxiv.org/abs/1707.07435 

[17] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations, https://research.google/pubs/sampling-bias-corrected-neural-modeling-for-large-corpus-item-recommendations/ 

[18] Recommending What Video to Watch Next: A Multitask Ranking System, https://daiwk.github.io/assets/youtube-multitask.pdf 

[19] Scalable realistic recommendation datasets through fractal expansions, https://arxiv.org/pdf/1901.08910 

The post DLRMv3: Generative recommendation benchmark in MLPerf Inference appeared first on MLCommons.

]]>
Call for Submission: Qwen3 VL MoE for MLPerf Inference v6.0 https://mlcommons.org/2026/02/vlm-inference-shopify/ Mon, 02 Feb 2026 14:53:37 +0000 https://mlcommons.org/?p=3630 Pushing the VLM Performance Frontier for E-Commerce Workloads

The post Call for Submission: Qwen3 VL MoE for MLPerf Inference v6.0 appeared first on MLCommons.

]]>
Introduction

The MLCommons Inference Working Group is excited to introduce a new Vision-Language Model (VLM) benchmark for the MLPerf Inference v6.0 round. As multimodal AI continues to reshape industries from retail to robotics, measuring the performance of these complex models in realistic production scenarios is more critical than ever.

For this round, we are debuting the Qwen3-VL-235B-A22B-Instruct model (released on Sept 23, 2025), paired with the Shopify Product Catalog dataset (released on Dec 12, 2025) to perform product classification and understanding, a common task in retail and commerce applications. This benchmark represents a significant leap forward in evaluating how AI systems handle the messy, high-stakes reality of hyperscale e-commerce where the infrastructure behind the scenes processes tens of millions of products daily.

The submission deadline is February 13, 2026. We invite all hardware vendors, cloud providers, and AI system experts to submit their results and help raise the bar for multimodal inference performance.

The Rise of Multimodal AI

The shift from unimodal (text-only) to multimodal AI is one of the most transformative trends in the industry. By integrating visual and textual data, businesses can automate complex workflows that were previously impossible, such as visual search, automated product tagging, and complex document understanding.

Market research underscores this urgency. The global multimodal AI market was valued at approximately USD 1.73 billion in 2024 and is projected to reach USD 10.89 billion by 2030, growing at a compound annual growth rate (CAGR) of 36.8% [1]. In the retail and e-commerce sectors specifically, adoption is accelerating even faster, with a projected CAGR of 34.6% through 2030 as merchants deploy personalized styling tools and augmented reality features [2]. This benchmark directly addresses this market demand by simulating a core “product understanding” workload that drives these commercial applications.

Model Selection: The Era of Qwen

This benchmark marks a milestone for the MLPerf Inference Benchmark Suite: it is the first time a model from the Qwen family is being introduced into the suite.

The open-weights landscape has evolved rapidly over the last 18 months, with the Qwen family emerging as a dominant force in the ecosystem. Recent data highlights this ascent:

  • Adoption & downloads: Data from Hugging Face confirms that Qwen has cemented its status as a tier-one open-weights model family globally. By late 2025, it achieved record-breaking cumulative download numbers, driven by widespread adoption among developers and researchers [3, 4, 5].
  • Performance: Independent benchmarks consistently rank Qwen models at the forefront of the industry. Qwen3-VL 235B A22B, in particular, delivers state-of-the-art results across a wide range of tasks, demonstrating high efficiency by achieving top-tier response quality with a significantly optimized parameter footprint [6, 7].

By selecting Qwen3-VL 235B A22B [8], we ensure MLPerf Inference remains at the bleeding edge, reflecting the models that developers and enterprises are actually deploying in 2026.

Qwen3-VL 235B A22B is a massive-scale mixture-of-experts (MoE) vision-language model designed to unify high-performance reasoning with efficient inference. It features 235 billion total parameters with 22 billion activated parameters per token, allowing it to deliver dense-model quality at a fraction of the computational cost. Built on the Qwen3 backbone [9], it natively supports a 256K token context window for interleaved text, image, and video inputs, and it introduces key architectural innovations like DeepStack (which fuses multi-level vision features) [10] and interleaved-MRoPE (for enhanced spatial-temporal modeling) [11]. This architecture enables state-of-the-art performance in complex tasks ranging from visual agentic workflows to long-context video understanding.

Case Study: Shopify Catalog

To ensure this benchmark reflects real-world production needs, we partnered with Shopify to curate the dataset and define the task. This workload mimics the “hierarchical taxonomy classification” task found in the “Shopify Catalog” product understanding layer, which processes 40 million products daily [12].

In production, the product understanding layer utilizes VLMs to perform a series of tasks that transform the unstructured data (e.g., non-canonical photos or descriptions provided by the merchants) of products into standardized metadata, including hierarchical taxonomy classification, attribute extraction, image understanding, title standardization, description analysis, and review summarization [12].

For the MLPerf Inference v6.0 round, we focus specifically on the hierarchical taxonomy classification task. The system must ingest a product’s title, description, and photo and select the correct category from a dynamic set of potential categories. This task perfectly captures the challenge of modern AI: maintaining high accuracy and throughput on noisy, real-world data at a massive scale.

Example Request and Response

An example request to a VLM inference endpoint might look like the following:

Python
[
  {
    'content': """Please analyze the product from the user prompt
and provide the following fields in a valid JSON object:
- category
- brand
- is_secondhand

You must choose only one, which is the most appropriate, correct, and specifc
category out of the list of possible product categories.

The description of the product sometimes contains various types of source code
(e.g., JavaScript, CSS, HTML, etc.), where useful product information is embedded
somewhere inside the source code. For this task, you should extract the useful
product information from the source code and leverage it, and discard the
programmatic parts of the source code.

Your response should only contain a valid JSON object and nothing more, e.g.,
you should not fence the JSON object inside a ```json code block.
The JSON object should match the followng JSON schema:
```json
{
  "additionalProperties": false,
  "description": "Json format for the expected responses from the VLM.",
  "properties": {
    "category": {
      "description": "The complete category of the product, e.g.,\n\"Clothing & Accessories > Clothing > Shirts > Polo Shirts\".\nEach categorical level is separated by \" > \".",
      "title": "Category",
      "type": "string"
    },
    "brand": {
      "description": "The brand of the product, e.g., \\"giorgio armani\\".",
      "title": "Brand",
      "type": "string"
    },
    "is_secondhand": {
      "description": "True if the product is second-hand, False otherwise.",
      "title": "Is Secondhand",
      "type": "boolean"
    }
  },
  "required": [
    "category",
    "brand",
    "is_secondhand"
  ],
  "title": "ProductMetadata",
  "type": "object"
}
```
""",
    'role': 'system'
  },
  {
    'content': [
      {
        'text': """The title of the product is the following:
```text
Frendorf | Fahrradindikator-Signalweste
```

The description of the product is the following:
```text
Die RADFAHR-INDIKATOR-SIGNALWESTE IST EIN NEUES PRODUKT DERZEIT NICHT IM HANDEL ERHÄLTLICH UND NICHT VERFÜGBARE . VERMEIDEN SIE UNNÖTIGE UND GEFÄHRLICHE RISIKEN FÜR IHR LEBEN! DIE SICHERHEITSWESTE IST DER Einfachste UND SICHERSTE WEG, UM DIE SICHTBARKEIT FÜR FAHRER ZU VERBESSERN. Geeignet für alle Radfahrer Für Kinder, die Fahrrad fahren Für Kinder auf dem Roller Für Jogger, die nachts laufen Für Kinder, die neben einer Straße gehen (die Weste passt ideal über einen Schulranzen) Diese Weste lässt sich hervorragend über einem Rucksack tragen. Eigenschaften: EINFACH FÜR AUTOFARER ZU ERKENNEN FÜR ERWACHSENE UND KINDER Die beiden verstellbaren Riemen machen sie für Kinder und Erwachsene geeignet, was bedeutet, dass eine Weste von mehreren Familienmitgliedern genutzt werden kann. FÜR VERSCHIEDENE KLEIDUNGEN Über einer Jacke oder einem dicken Mantel Über einem Rucksack oder Schulranzen SICHTBARKEIT = SICHERHEIT FERNBEDIENUNG Die Fernbedienung ist äußerst benutzerfreundlich und somit auch für kleine Kinder leicht zu bedienen: rechter Knopf zum Abbiegen nach rechts, linker Knopf zum Abbiegen nach links usw. Das Licht der Fernbedienung und das Blinken der Weste sind synchronisiert, sodass der Nutzer immer informiert ist, was hinter ihm geschieht. IN WENIGEN MINUTEN BEREIT ZUM EINSATZ Die Fernbedienung lässt sich dank der kleinen Kunststoffhalterungen ganz einfach am Lenker anbringen. Lieferumfang: 1 x RADFAHR-INDIKATOR-SIGNALWESTE 1 x Fernbedienung
```

The following are the possible product categories:
```json
['Sporting Goods > Fitness & General Exercise Equipment > Sport Safety Lights & Reflectors > Sport Safety Lights', 'Animals & Pet Supplies > Pet Supplies > Pet Collars & Harnesses > LED Collars', 'Sporting Goods > Outdoor Recreation > Cycling > Bicycle Accessories > Bicycle Computer Accessories > Bicycle Computer Cadence Sensors', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Tail Lights', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Light Bars', 'Business & Industrial > Work Safety Protective Gear > Work Safety Harnesses > Vest-Style Work Safety Harnesses', 'Sporting Goods > Fitness & General Exercise Equipment > Cardio > Cardio Machine Accessories & Parts > Exercise Bike Accessories & Parts > Exercise Bike Heart Rate Monitors', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Turn Signals']
```
""",
        'type': 'text'
      },
      {
        'image_url': {
          'url': 'data:image/JPEG;base64,<ommited base64 encoding of the following image>'
        },
        'type': 'image_url'
      }
    ],
    'role': 'user'
  }
]

Then, the response from the VLM inference endpoint would look like the following:

JSON
{
"category": "Sporting Goods > Fitness & General Exercise Equipment > Sport Safety Lights & Reflectors > Sport Safety Lights",
"brand": "Frendorf",
"is_secondhand": false
}

Performance Metrics

For the “offline” scenario, we measure and compare the overall request throughput (i.e., the number of completed requests per second) when sending all samples from (both the train and test splits of) the dataset to the VLM inference endpoint at least once. This arrangement closely mirrors Shopify’s major production requirements for batch processing: being able to process as many product samples as possible with as few GPU/accelerator resources as possible, beyond the current capacity of tens of millions of product samples daily.

For the “server” scenario, we set the 99th-percentile request latency constraint to be 12 seconds, and then we compare the maximum number of requests per second that we could send to the VLM inference endpoint before the 99th-percentile request latency constraint is violated. This scenario reflects Shopify’s online processing use case of handling demand surges before and during events or holidays (e.g., Black Friday).

Accuracy Metric

The response will be parsed by the following schema:

Python
class ProductMetadata(pydantic.BaseModel):
    """Json format for the expected responses from the VLM."""

    category: str
    """The complete category of the product, e.g.,
    "Clothing & Accessories > Clothing > Shirts > Polo Shirts".
    Each categorical level is separated by " > ".
    """

    brand: str
    """The brand of the product, e.g., "giorgio armani"."""

    is_secondhand: bool
    """True if the product is second-hand, False otherwise."""

We compute the hierarchical F1 score [13, 14] between the predicted categories and the ground-truth categories of all product samples in the dataset. Compared to exact matching, the hierarchical F1 score proportionally accounts for a partial, but not complete, match between a predicted category and the ground-truth product category, such as

Cameras & Optics > Cameras > Video Cameras > Drone Video Cameras

vs.

Cameras & Optics > Cameras > Video Cameras > Cinema Video Cameras

In the event where a response cannot be parsed by the given JSON schema, it will be treated as a complete mismatch between the predicted product category and the ground-truth product category. In this case, the number of category levels, which affects the hierarchical F1 calculation of the (mis-)prediction, will be taken as the same as that of the ground truth. For example, when encountering a JSON parsing error, if the ground truth is

Cameras & Optics > Cameras > Video Cameras > Drone Video Cameras

then the (mis-)prediction will be treated as

|NA| > |NA| > |NA| > |NA|

We set the accuracy target of the category hierarchical F1 score to be 0.7824. This is 99% of the mean category hierarchical F1 score across 10 runs on the original Qwen3-VL-235B-A22B-Instruct, which achieves 0.7903037.

Join the Benchmark

This is your opportunity to demonstrate the capabilities of your hardware and software stacks on one of the most relevant and rapidly growing AI workloads in the industry.

You can find the reference implementation along with the performance benchmarking code here. The reference implementation is based on vLLM, but its command-line interface and plugin registration enable you to benchmark arbitrary inference systems that expose endpoints via the Chat Completions OpenAI API. We look forward to seeing your submissions!

Acknowledge

We would like to thank the following individuals (sorted alphabetically) for their feedback and help that guided us in the journey of developing this benchmark:

  • Ashwin Nanjappa (NVIDIA)
  • Diego Castañeda (Wealthsimple; ex-Shopify)
  • Frank Han (Dell)
  • Gennady Pekhimenko (NVIDIA)
  • Harika Pothina (Red Hat)
  • Javier Moreno (Shopify)
  • Junhao Li (Ubicloud)
  • Lei Pan (Pinterest)
  • Mert Unsal (Mistral)
  • Michael Goin (Red Hat)
  • Mikhail Parakhin (Shopify)
  • Mingyuan Ma (NVIDIA)
  • Miro Hodak (AMD)
  • Naveen Miriyalu (Red Hat)
  • Qidong Su (NVIDIA)
  • Rachata Ausavarungnirun (MangoBoost)
  • Roger Wang (Inferact)
  • Shobhit Verma (NVIDIA)
  • Thomas Atta-Fosu (Intel)
  • Viraat Chandra (NVIDIA)
  • Yubo Gao (NVIDIA)
  • Zhanda Zhu (NVIDIA)
  • Zhihan Jiang (NVIDIA)

References

  1. Grand View Research. (2024). Multimodal AI Market Size And Share | Industry Report, 2030. https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report 
  2. Mordor Intelligence. (2025). Multimodal AI Market Size & Share Analysis – Industry Reports. https://www.mordorintelligence.com/industry-reports/multimodal-ai-market 
  3. Xinhua / China Daily. (Jan 12, 2026). Alibaba’s Qwen leads global open-source AI community with 700 million downloads. https://www.chinadailyhk.com/hk/article/626974 
  4. AI World. (2025). Chinese developers account for over 45% of top open-model public downloads. https://www.aiworld.eu/story/chinese-developers-account-for-over-45-of-top-open-model-public-downloads 
  5. Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, Anjney Midha. (2025). State of AI: An Empirical 100 Trillion Token Study with OpenRouter. https://openrouter.ai/state-of-ai 
  6. LLM-Stats. (2026). Qwen3 VL 235B A22B Instruct. https://llm-stats.com/models/qwen3-vl-235b-a22b-instruct 
  7. Artificial Analysis. (2026). Qwen3 VL 235B A22B (Reasoning) Intelligence, Performance & Price Analysis. https://artificialanalysis.ai/models/qwen3-vl-235b-a22b-reasoning  
  8. Qwen Team. (2025). Qwen3-VL Technical Report. https://arxiv.org/pdf/2511.21631 
  9. Qwen Team. (2025). Qwen3 Technical Report. https://arxiv.org/pdf/2505.09388 
  10. Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In Advances in Neural Information Processing Systems, volume 37, pp. 23464–23487, 2024.
  11. Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision-language models, 2025.
  12. Audrey-Anne Guindon. (2025). Leveraging multimodal LLMs for Shopify’s global catalogue: Recap of expo talk at ICLR 2025. https://shopify.engineering/leveraging-multimodal-llms 
  13. Silla, C. N., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1), 31-72.
  14. Kiritchenko, S., Matwin, S., Nock, R., & Famili, A. F. (2006, June). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 395-406). Springer, Berlin, Heidelberg.

The post Call for Submission: Qwen3 VL MoE for MLPerf Inference v6.0 appeared first on MLCommons.

]]>
MedPerf adds WebUI capabilities to make federated benchmarking more user-friendly https://mlcommons.org/2025/12/medperf-adds-webui-capabilities/ Tue, 09 Dec 2025 17:00:00 +0000 https://mlcommons.org/?p=3505 A new WebUI extends Medperf's capabilities to the browser, offering a more visual, accessible way to interact.

The post MedPerf adds WebUI capabilities to make federated benchmarking more user-friendly appeared first on MLCommons.

]]>
The Medical WG at MLCommons has been maintaining best practices for enabling real-world benchmarking and evaluation of AI/ML models in healthcare. MedPerf is an open-source orchestrator developed by the community. It has been supporting the most extensive AI clinical research study in glioblastoma [Federated Learning for Postoperative Segmentation of Treated glioblastoma (FL-PoST)]. During this time, we have been enhancing MedPerf’s capabilities, including data preparation pipelines and monitoring [MLCommons Medical WG Supports FeTS 2.0 Clinical Study with MedPerf and GaNDLF].   

Today, we are announcing that MedPerf has added WebUI capabilities, making it easier for benchmark creators to orchestrate. The WebUI extends Medperf’s capabilities to the browser, providing an intuitive interface built on its CLI. Running entirely on the local machine, the WebUI executes the same commands and functions as the CLI, offering a more visual, accessible way to interact with Medperf without changing how it works under the hood. Access is secured by a unique token generated at launch, ensuring that access remains restricted to the user who launched it. By combining the power of the CLI with the convenience of a browser, Medperf WebUI aims to deliver a user-friendly experience to all Medperf users, including non-technical experts (e.g. clinicians).

We tested the WebUI in our current clinical study for glioblastoma to facilitate the evaluation of AI models on 1568 brain scans across 7 data sites from North America and Europe. Through the WebUI the study’s principal investigator, Dr. Evan Calabrese, followed a step-by-step workflow to create a validation record accompanied by the data preparation script, the metrics script and the model under evaluation, which was pre-trained on 10208 brain scans across 53 data sites from 5 continents (North America, South America, Europe, Asia, and Australia) in the first phase of the study via federated learning using Intel’s OpenFL. Through the WebUI, the representative from each evaluation site followed a step-by-step workflow to prepare the data locally and initiate the evaluation. Evaluation results were monitored via the principal investigator’s dashboard on the WebUI.

We also created an online tutorial for new MedPerf users to learn how to use the new WebUI in easy steps which can be found here: Full Tutorial (WebUI) – MedPerf.

Figure 1. Benchmark User dashboard: showing active benchmarks
Figure 2. Data User dashboard: showing federated benchmark execution

Dr. Evan Calabrese, the lead principal investigator for the clinical study mentioned:

The MedPerf WebUI was beneficial for the validation phase of the Federated Learning for Postoperative Segmentation of Tumors (FL-PoST)/Federated Tumor Segmentation (FeTS) 2.0 study, which required international collaboration across ~10 different academic medical centers to validate our brain tumor segmentation model trained on over 10,000 individual MRI exams from >35 institutions. In particular, the familiar web-based interface enabled efficient collaboration across time zones without the need for live support, even for those with limited technical literacy. Using the MedPerf WebUI, we were able to work asynchronously across seven individual external sites without any significant support delays. Once we completed the validation phase, the results could be easily viewed or downloaded from the WebUI on our end. It’s clear to me that this WebUI will be the future of MedPerf, enabling broader collaboration without increased support burden.

Other feedback from clinical and academic contexts included:

“Click and click, step-by-step, all necessary information included in the webUI manual. Web UI was easy to follow, and intuitive to understand the current evaluation status!”

Jaeyoung Cho, CCIBonn.ai, University Hospital Bonn

“The MedPerf Web UI offers an intuitive and user-friendly interface that can be easily operated by both engineers and clinicians with minimal learning effort. As a standardized evaluation platform, it simplifies the benchmarking of models across datasets using multiple performance metrics, facilitating both clinical deployment and future research. I particularly appreciate its integration with compute cluster environments via Docker or Singularity, which makes installation and setup straightforward for any user.”

Rachit Saluja, PhD student, Cornell University & Weill Cornell Medicine

“The interface was easy to learn, and having the post-pipeline segmentations easily accessible through ITK-snap in the UI made processing the segmentations much quicker. It was very convenient to receive multiple labels to show the different components of the segmentations.

Erik Elgeskog, University of Gothenburg


A Call to Action

We invite the research and medical AI community to explore the new MedPerf WebUI and experience a more accessible way to benchmark and validate models. Your participation is vital to our growth; we encourage you to test the interface, collaborate with peers, and report both your technical hurdles and your success stories to help us refine the platform. Beyond testing, this is an invitation to make a tangible impact: join us in our ongoing AI-based clinical studies and help accelerate the adoption of trusted medical AI

About MedPerf

MedPerf is an open benchmarking platform that aims at evaluating AI on real-world medical data. MedPerf follows the principle of federated evaluation in which medical data never leaves the premises of data providers. Instead, AI algorithms are deployed within the data providers and evaluated against the benchmark. Results are then manually approved for sharing with the benchmark authority. This effort aims to establish global federated datasets and develop scientific benchmarks to reduce the risks of medical AI, such as bias, lack of generalizability, and potential misuse. We believe this two-pronged strategy will enable clinically impactful AI and improve healthcare efficacy.

Learn more at https://mlcommons.org/working-groups/data/medical/.  


About MLCommons 

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone. The organization produces industry-leading benchmarks, datasets, and best practices that span the full range of ML applications—from massive cloud training to resource-constrained edge devices. Its MLPerf benchmark suite has become the de facto standard for evaluating AI performance.

Learn more at www.mlcommons.org.

The post MedPerf adds WebUI capabilities to make federated benchmarking more user-friendly appeared first on MLCommons.

]]>
MLPerf Client v1.5 Advances AI PC Benchmarking with Windows ML Integration https://mlcommons.org/2025/11/mlperf-client-1-5-release/ Mon, 17 Nov 2025 16:00:00 +0000 https://mlcommons.org/?p=3488 MLCommons Launches MLPerf Client v1.5: Advancing AI PC Performance with Windows ML Integration and Expanded Cross-Platform Support.

The post MLPerf Client v1.5 Advances AI PC Benchmarking with Windows ML Integration appeared first on MLCommons.

]]>
MLCommons®, an open engineering consortium dedicated to improving machine learning performance and transparency, today announced the release of MLPerf® Client v1.5, the latest version of its industry-standard benchmark for evaluating AI performance on personal computers.

MLPerf Client measures how effectively PCs—from laptops and desktops to workstations—run AI workloads like large language models (LLMs) locally. The benchmark simulates real-world generative AI tasks, such as summarization, content creation, and code analysis, and reports metrics for both responsiveness and throughput.

New in MLPerf Client v1.5

Version 1.5 introduces a range of new features, refinements, and expanded support designed to keep pace with the rapid evolution of AI-enabled hardware and software:

  • Acceleration updates: New support for Windows ML and runtime updates from independent hardware vendors improve performance across many types of GPUs and NPUs.
  • Platform expansion: MLPerf Client v1.5 now supports Windows x64, Windows on Arm, macOS, Linux (CLI, experimental), and a new iPad app (GUI only)—bringing AI benchmarking to more platforms than ever before.
  • Power and energy measurement (experimental): Optional CLI tools integrate with MLPerf Power and SPEC PTDaemon, enabling users to characterize energy efficiency alongside raw performance for the first time.
  • GUI improvements: The graphical user interface moves out of beta, offering enhanced usability and easy results display and export.

These updates mark another major step toward a comprehensive, cross-platform benchmark for client AI computing.

MLPerf Client is the product of a collaboration among technology leaders, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These participants have contributed resources and expertise to the creation of this AI benchmark.

MLPerf Client v1.5 is freely available for download, and the source code is open for inspection and community contribution via the MLCommons GitHub repository. The iPad version is available via Apple’s App Store.

For downloads, please visit our release page. For documentation and additional details, visit mlcommons.org.

About MLCommons 

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone. The organization produces industry-leading benchmarks, datasets, and best practices that span the full range of ML applications—from massive cloud training to resource-constrained edge devices. Its MLPerf benchmark suite has become the de facto standard for evaluating AI performance.

Learn more at www.mlcommons.org.

Press Contact: [email protected]

The post MLPerf Client v1.5 Advances AI PC Benchmarking with Windows ML Integration appeared first on MLCommons.

]]>