Describe the bug
When using mlpstorage training datagen to generate data for the dlrm workload, the process fails with a ValueError: 'parquet' is not a valid FormatType. The error originates from the DLIO benchmark's LoadConfig function, where it attempts to validate the dataset.format configuration.
To Reproduce
The issue was reproduced with the following command:
```bash
mlpstorage training datagen --hosts=10.1.2.46 --model=dlrm --exec-type=mpi \
    --param dataset.num_files_train=369 --num-processes=1 --file \
    --results-dir=/mnt/data
```
This command internally calls mpirun to execute dlio_benchmark with the dlrm_datagen workload and the override ++workload.dataset.num_files_train=369.
Expected behavior
The data generation process should complete successfully without any ValueError.
Actual behavior
The execution fails with the following stack trace:
```text
Error executing job with overrides: ['workload=dlrm_datagen', '++workload.dataset.num_files_train=369']
Traceback (most recent call last):
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/main.py", line 465, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/main.py", line 71, in __init__
    LoadConfig(self.args, cfg)
  File "/root/.venvs/MLPstorageV3/lib/python3.12/site-packages/dlio_benchmark/utils/config.py", line 935, in LoadConfig
    args.format = FormatType(config['dataset']['format'])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/enum.py", line 757, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/enum.py", line 1171, in __new__
    raise ve_exc
ValueError: 'parquet' is not a valid FormatType
```
Environment
OS: Ubuntu 24.04
Python: 3.12
Virtual Environment: MLPstorageV3
mlc-storage repository: latest version
dlio_benchmark version: installed via pip within the virtual environment
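The exact dlio_benchmark version was not recorded above. A minimal sketch for reading it from inside the virtual environment, assuming the pip distribution is named `dlio_benchmark` (the helper `installed_version` is hypothetical and only wraps the standard library):

```python
from importlib import metadata


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # Assumption: "dlio_benchmark" is the name the package was installed under.
    print("dlio_benchmark:", installed_version("dlio_benchmark"))
```

Running this with the MLPstorageV3 interpreter should print the version string that pip resolved, which would help narrow down whether the installed release predates any parquet support.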
Additional context
The issue seems to be that the dlrm_datagen workload configuration specifies a dataset format of 'parquet'. However, the FormatType enum in dlio_benchmark/utils/config.py does not currently include parquet as a valid option. According to the DLIO documentation, the supported file formats are tfrecord, hdf5, npz, csv, jpg, and jpeg.
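The failure mode can be illustrated without DLIO installed. The sketch below uses a stand-in `FormatType` enum whose members are only the formats listed above; it is not DLIO's actual definition, and `parse_format` is a hypothetical helper added for illustration. Looking up a value that is not a member raises the same kind of `ValueError` seen in the traceback.

```python
from enum import Enum


class FormatType(Enum):
    # Stand-in members based on the formats listed above; NOT DLIO's real enum.
    TFRECORD = "tfrecord"
    HDF5 = "hdf5"
    NPZ = "npz"
    CSV = "csv"
    JPG = "jpg"
    JPEG = "jpeg"


def parse_format(value: str) -> FormatType:
    """Hypothetical helper: look up a format by value, listing valid choices on failure."""
    try:
        return FormatType(value)
    except ValueError as exc:
        valid = ", ".join(member.value for member in FormatType)
        raise ValueError(f"{exc}; expected one of: {valid}") from exc


if __name__ == "__main__":
    print(parse_format("npz"))
    try:
        parse_format("parquet")
    except ValueError as err:
        print(err)
```

If the installed DLIO enum behaves like this stand-in, the error would come from the enum lookup itself rather than from any parquet reader, so it would occur before any data is generated.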
Could you please advise on the following:
Is parquet a supported format for the dlrm workload?
If parquet is not supported, what is the recommended format to use for this workload?
Are there any additional dependencies or configuration changes required to enable parquet support?
Thank you for your help.