KVCache Verify client CPU choice has minimal influence on results #334

@dslik

One of the goals of the MLPerf Storage benchmarks is that once the storage device is saturated (peak throughput/IOPS), adding compute resources to a test client does not significantly alter the reported metrics. In other words, our benchmarks should be storage benchmarks, not server benchmarks.

For each of the four KV Cache metrics reported in the results table, we need to verify that this property holds:

  1. ⚠️ Tokens per second - If locally computed (CPU-driven) tokens are counted together with KV-cached tokens, a faster client CPU would increase this metric. As a result, we should report only "Cached tokens per second". Since the KV cache workload is fixed across submitters for closed submissions, this allows for a fair comparison.

  2. ✅ Read bandwidth - Once the storage is saturated, adding additional client CPU resources will not further increase this number.

  3. ✅ Write bandwidth - Once the storage is saturated, adding additional client CPU resources will not further increase this number.

  4. ⚠️ P95 Read Latency - If local computation steps are included in this latency measurement, a faster client CPU would improve this metric. As a result, we should ensure that this metric includes only the latency of the storage I/O operations.
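
To make the two flagged metrics concrete, here is a minimal sketch of how they could be computed so that client compute is excluded. It is not the benchmark's actual code: `ReadRecord`, its fields, and both function names are hypothetical, and it assumes the harness can time the storage read separately from any local computation. The percentile uses the nearest-rank method, which is an assumption as well.

```python
from dataclasses import dataclass


@dataclass
class ReadRecord:
    """One KV cache read request (hypothetical record layout)."""
    cached_tokens: int      # tokens served from the storage-backed KV cache
    computed_tokens: int    # tokens recomputed locally on the client CPU
    io_seconds: float       # time spent in the storage read only
    compute_seconds: float  # time spent in client-side computation


def cached_tokens_per_second(records, wall_seconds):
    """Count only cache-served tokens; locally computed tokens are ignored."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(r.cached_tokens for r in records) / wall_seconds


def p95_io_latency(records):
    """Nearest-rank 95th percentile over storage I/O time only."""
    if not records:
        raise ValueError("no records")
    latencies = sorted(r.io_seconds for r in records)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(latencies) + 99) // 100
    return latencies[rank - 1]
```

With metrics defined this way, changing `computed_tokens` or `compute_seconds` in the records leaves both results unchanged, which is the property this issue asks us to verify. Note that `wall_seconds` is still the elapsed benchmark time, so this sketch alone does not show that elapsed time is CPU-independent.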
