One of the goals of the MLPerf Storage benchmarks is that once the storage device is saturated (peak throughput/IOPS), adding compute resources to a test client does not significantly alter the reported metrics. In other words, our benchmarks should be storage benchmarks, not server benchmarks.
For each of the four KV Cache metrics reported in the results table, we need to verify that this property holds:
- ⚠️ Tokens per second - If locally computed tokens (CPU-driven) are counted alongside KV-cached tokens, a faster client CPU would inflate this metric. We should therefore report only "Cached tokens per second". Because our KV cache workload is fixed across submitters for closed submissions, this allows a fair comparison.
- ✅ Read bandwidth - Once the storage is saturated, adding client CPU resources will not further increase this number.
- ✅ Write bandwidth - Once the storage is saturated, adding client CPU resources will not further increase this number.
- ⚠️ P95 Read Latency - If local computation steps are included in this latency measurement, a faster client CPU would improve this metric. We should therefore ensure that only the latency of the storage I/O operations is included in this metric.
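The two ⚠️ items above can be sketched in code. This is a hypothetical illustration, not the benchmark's actual implementation: the event schema (`(n_tokens, source)` pairs with `"cache"`/`"local"` labels) and the helper names are invented for clarity. The key points are that only cache-served tokens count toward the throughput metric, and that the latency timer wraps only the storage read, so client-side compute cannot improve either number.

```python
import statistics
import time

def cached_tokens_per_second(token_events, duration_s):
    """Tokens/s counting only KV-cached tokens.

    token_events: iterable of (n_tokens, source) pairs, where source is
    "cache" (served from the KV cache store) or "local" (recomputed on
    the client CPU). Local tokens are excluded so a faster client CPU
    cannot inflate the metric.
    """
    cached = sum(n for n, src in token_events if src == "cache")
    return cached / duration_s

def timed_read(fh, offset, size, latencies_s):
    """Read a KV cache block, timing only the storage I/O call.

    Any decode or compute on the returned bytes happens outside the
    timed region, so it cannot improve the reported latency.
    """
    fh.seek(offset)
    start = time.perf_counter()
    data = fh.read(size)  # storage I/O only
    latencies_s.append(time.perf_counter() - start)
    return data

def p95(latencies_s):
    """95th percentile: quantiles(n=20) yields 19 cut points; index 18 is P95."""
    return statistics.quantiles(latencies_s, n=20)[18]
```

For example, a run that served 300 cached tokens and recomputed 50 locally over 2 seconds would report 150 cached tokens/s, not 175.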