ValueError: 'parquet' is not a valid FormatType when using mlpstorage training datagen #336

Describe the bug
When using mlpstorage training datagen to generate data for the dlrm workload, the process fails with a ValueError: 'parquet' is not a valid FormatType. The error originates from the DLIO benchmark's LoadConfig function, where it attempts to validate the dataset.format configuration.

To Reproduce
The issue was reproduced with the following command:

```bash
mlpstorage training datagen --hosts=10.1.2.46 --model=dlrm --exec-type=mpi \
  --param dataset.num_files_train=369 --num-processes=1 --file \
  --results-dir=/mnt/data
```

This command internally calls mpirun to execute dlio_benchmark with the dlrm_datagen workload and the override ++workload.dataset.num_files_train=369.

Expected behavior
The data generation process should complete successfully without any ValueError.

Actual behavior
The execution fails with the following stack trace:

```text
Error executing job with overrides: ['workload=dlrm_datagen', '++workload.dataset.num_files_train=369']
Traceback (most recent call last):
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/main.py", line 465, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/main.py", line 71, in __init__
    LoadConfig(self.args, cfg)
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/utils/config.py", line 935, in LoadConfig
    args.format = FormatType(config['dataset']['format'])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/enum.py", line 757, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/enum.py", line 1171, in __new__
    raise ve_exc
ValueError: 'parquet' is not a valid FormatType
```

Environment

OS: Ubuntu 24.04

Python: 3.12

Virtual Environment: MLPstorageV3

mlc-storage repository: latest version

dlio_benchmark version: installed via pip within the virtual environment
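Since the exact dlio_benchmark version is not pinned above, a small check like the following could be run inside the MLPstorageV3 virtual environment to record it. This is only a sketch: the helper name `installed_version` is hypothetical, and the distribution name `dlio_benchmark` is an assumption about how the package is registered with pip.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the package is registered under the name "dlio_benchmark".
    print("dlio_benchmark:", installed_version("dlio_benchmark"))
```

Running this with the virtual environment's Python interpreter would show which release pip actually installed, which may help when comparing against the version the workload configuration was written for.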

Additional context
The issue seems to be that the dlrm_datagen workload configuration specifies a dataset format of 'parquet'. However, the FormatType enum in dlio_benchmark/utils/config.py does not currently include parquet as a valid option. According to the [DLIO documentation](https://dlio-benchmark.readthedocs.io/), the supported file formats are tfrecord, hdf5, npz, csv, jpg, and jpeg.
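To illustrate the failure mode, here is a minimal sketch using a stand-in enum. The `FormatType` class below is not DLIO's actual definition; its members are assumed from the formats listed in the documentation cited above, and `try_format` is a hypothetical helper. It shows how constructing an enum from an unlisted value raises the same kind of ValueError seen in the stack trace.

```python
from enum import Enum
from typing import Optional


class FormatType(Enum):
    # Stand-in enum: members assumed from the documented format list,
    # not copied from the dlio_benchmark source.
    TFRECORD = "tfrecord"
    HDF5 = "hdf5"
    NPZ = "npz"
    CSV = "csv"
    JPG = "jpg"
    JPEG = "jpeg"


def try_format(value: str) -> Optional[FormatType]:
    """Return the matching enum member, or None if the value is not listed."""
    try:
        return FormatType(value)
    except ValueError:
        return None


if __name__ == "__main__":
    for candidate in ("npz", "parquet"):
        print(candidate, "->", try_format(candidate))
```

If the installed dlio_benchmark defines its enum without a parquet member, any workload configuration that sets dataset.format to parquet would fail at this lookup regardless of the other parameters passed on the command line.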

Could you please advise on the following:

Is parquet a supported format for the dlrm workload?

If parquet is not supported, what is the recommended format to use for this workload?

Are there any additional dependencies or configuration changes required to enable parquet support?

Thank you for your help.
