[AMD] feat: Add mi355x distributed inference test CI workflow#409
Conversation
Summary of Changes (Gemini Code Assist): This pull request establishes a new continuous integration workflow to validate distributed inference performance on AMD MI355X hardware. It specifically incorporates SGLang disaggregation benchmarks, ensuring that the system can efficiently handle large language models with varying input/output sequence lengths and concurrency levels, both with and without speculative decoding. This enhancement improves the reliability and performance monitoring of distributed inference setups.
Code Review
This pull request introduces a new CI workflow for distributed inference testing on mi355x. It adds a new benchmark configuration, a script to run it using SLURM, and extends the runner script to handle multi-node jobs. My review focuses on improving the robustness and maintainability of the new shell scripts and configuration files. I've identified some critical issues with how SLURM job IDs are handled, which could lead to failures in a multi-user environment. I've also suggested improvements for code clarity, dependency management, and script safety.
```sh
bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
```
`JOB_ID=$(squeue -u $USER --noheader --format='%i')` is not a robust way to get the job ID. It will retrieve all job IDs for the current user, which can lead to incorrect behavior if the user has other jobs running. If submit_disagg.sh submits multiple jobs, this will capture all of them, likely breaking subsequent commands that expect a single ID (e.g., LOG_FILE creation).
The job submission script (submit_disagg.sh, called from the benchmark script) should be modified to return the SLURM job ID(s) it creates, and this script should capture that output. For example, if sbatch is used, its output can be captured and printed.
Example of what could be done in the submission script:

```sh
# inside submit_disagg.sh
JOB_ID=$(sbatch my_job.slurm | awk '{print $4}')
echo $JOB_ID
```

Then in `launch_mi355x-amd.sh`:

```sh
JOB_ID=$(bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh")
```

```sh
# git clone --branch cam/sa-251219 https://github.com/cquil11/sglang_disagg.git

# Switch to origin repo url for supporting wide ep configs
git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git
```
The script clones a repository from a personal GitHub account (billishyahao) and a specific branch (sa-260107). This introduces a significant risk to the CI/CD pipeline's stability and security. Personal branches can be force-pushed, rebased, or deleted at any time, which would break this script. It's strongly recommended to use a versioned release, a tag, or a commit hash from an organization-controlled repository to ensure reproducibility and stability.
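One way to make such a clone reproducible, sketched here against a throwaway local repository so it runs anywhere: record a validated commit SHA and detach onto it after cloning, so later force-pushes or deletions of the branch cannot change what CI executes. The paths and the two-commit "upstream" below are illustrative stand-ins, not the PR's actual scripts.

```shell
set -euo pipefail
tmp=$(mktemp -d)

# Stand-in "upstream" with two commits; in CI this would be the
# sglang_disagg fork, and PINNED_SHA a reviewed, validated commit.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "validated state"
PINNED_SHA=$(git -C "$tmp/upstream" rev-parse HEAD)
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "later, possibly force-pushed work"

# The pattern: clone, then pin to the immutable commit, not a branch tip.
git clone -q "$tmp/upstream" "$tmp/sglang_disagg"
git -C "$tmp/sglang_disagg" checkout -q --detach "$PINNED_SHA"

echo "checked out: $(git -C "$tmp/sglang_disagg" rev-parse HEAD)"
```

The clone stays at the pinned commit even though the upstream branch has moved on, which is the property the review is asking for.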
```sh
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
JOB_ID=$(squeue -u $USER -h -o %A | head -n1)
```
JOB_ID=$(squeue -u $USER -h -o %A | head -n1) is not a robust way to get the job ID. It relies on the job being the most recent one submitted by the user, which can be fragile in a shared environment. salloc prints the job ID to standard output when an allocation is granted. You should capture this output directly.
```diff
-salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
-JOB_ID=$(squeue -u $USER -h -o %A | head -n1)
+JOB_ID=$(salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell)
```
```yaml
seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

  - isl: 8192
    osl: 1024
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

  - isl: 1024
    osl: 8192
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"
```
The search-space configuration is duplicated three times for different isl/osl combinations. This makes the file very long and hard to maintain. You can use YAML anchors and aliases to define the search-space once and reuse it. This will significantly reduce the file size and improve readability.
For example:

```yaml
# At the top level of your YAML, define an anchor
x-dsr1-search-space: &dsr1_search_space
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
  # ... rest of the search space items

dsr1-fp8-mi355x-sglang-disagg:
  # ... other properties
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space: *dsr1_search_space # Reuse with an alias
    - isl: 8192
      osl: 1024
      search-space: *dsr1_search_space # Reuse with an alias
    - isl: 1024
      osl: 8192
      search-space: *dsr1_search_space # Reuse with an alias
```

```sh
check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
    PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
    DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH
```
The environment variable SGL_SLURM_JOBS_PATH is listed twice in the call to check_env_vars. While this is harmless, it's redundant and should be removed for better code clarity.
```diff
 check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
     PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
     DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
-    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH
+    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO
```
```sh
export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true
```
The script uses sudo rm -rf ... to clean up log directories (here and on line 126). Using sudo within a CI script can be a security risk and may indicate underlying permissions issues. It would be better to investigate why sudo is needed and fix the permissions of the files and directories being created, so that the runner user can remove them without elevated privileges. If sudo is unavoidable, please add a comment explaining why.
```sh
for result_file in $(find $LOGS_DIR -type f); do
    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
    file_name=$(basename $result_file)
    if [ -f $result_file ]; then
        # Copy the result file to workspace with a unique name
        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
        cp $result_file $WORKSPACE_RESULT_FILE
    fi
done
```
The for result_file in $(find ...) loop is not safe for filenames that contain spaces or other special characters. While it might not be an issue with the current file naming scheme, it's a good practice to write robust shell scripts. Consider using a while read loop with find -print0 to handle all possible filenames correctly. This also allows for safe quoting of variables and removal of a redundant file check.
```diff
-for result_file in $(find $LOGS_DIR -type f); do
-    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
-    file_name=$(basename $result_file)
-    if [ -f $result_file ]; then
-        # Copy the result file to workspace with a unique name
-        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
-        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
-        cp $result_file $WORKSPACE_RESULT_FILE
-    fi
-done
+find "$LOGS_DIR" -type f -print0 | while IFS= read -r -d '' result_file; do
+    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
+    file_name=$(basename "$result_file")
+    # Copy the result file to workspace with a unique name
+    WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
+    echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
+    cp "$result_file" "$WORKSPACE_RESULT_FILE"
+done
```
```sh
bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
```
**Missing head filter causes multi-job `JOB_ID` capture failure** (High Severity)
In the multinode code path, JOB_ID is assigned from squeue without | head -n1, unlike the non-multinode path (line 150) and all other launch scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, etc.). If multiple jobs exist for the user, JOB_ID will contain multiple IDs separated by newlines, causing LOG_FILE to have an invalid path with embedded newlines and breaking subsequent operations like the grep -q "$JOB_ID" check and scancel_sync $JOB_ID call.
🔬 Verification Test
Why verification test was not possible: This bug requires a SLURM cluster environment with the ability to submit jobs and query the queue. The issue is a comparison-based pattern bug that can be verified by code inspection - the multinode path at line 49 uses squeue -u $USER --noheader --format='%i' without | head -n1, while the non-multinode path at line 150 in the same file correctly uses | head -n1, and this pattern is consistent across all other runner scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h200-nb.sh, launch_h200-nv.sh).
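Although the full `squeue` path can't be exercised outside SLURM, the failure mode itself is easy to reproduce: below, a hard-coded two-line string stands in for `squeue` output when the user has two jobs queued, and the `LOG_FILE` naming is illustrative rather than the script's exact template.

```shell
# Stand-in for `squeue -u $USER --noheader --format='%i'` with two jobs queued.
SQUEUE_OUT=$'12345\n12346'

JOB_ID="$SQUEUE_OUT"               # what the multinode path captures
LOG_FILE="logs/job_${JOB_ID}.log"  # the path now embeds a newline

FIRST_ID=$(printf '%s\n' "$SQUEUE_OUT" | head -n1)  # the head -n1 fix

printf 'broken path spans %s lines\n' "$(printf '%s\n' "$LOG_FILE" | wc -l)"
printf 'first id: %s\n' "$FIRST_ID"
```

Any later `grep -q "$JOB_ID"` or `scancel $JOB_ID` would receive the multi-line value, while `head -n1` recovers a single usable ID.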
```sh
export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true
```
**Silent git clone failure leads to stale code execution** (Medium Severity)
The launcher only removes $SGL_SLURM_JOBS_PATH/logs, but the benchmark script clones the entire sglang_disagg directory. On subsequent runs, the git clone fails silently (there is no set -e in the benchmark script) because the directory already exists, yet cd "$SGL_SLURM_JOBS_PATH" succeeds and the script proceeds with potentially stale code. This could cause benchmarks to run with outdated versions of submit_disagg.sh without any error or warning, leading to incorrect results. The cleanup at lines 44 and 126 needs to remove the entire sglang_disagg directory, not just the logs subdirectory.
🔬 Verification Test
Why verification test was not possible: This bug requires a SLURM cluster environment and executing the workflow twice on the same runner to observe the git clone failure and stale code execution. The issue can be verified by code inspection: the cleanup removes sglang_disagg/logs while git clone creates sglang_disagg/, leaving the repository directory intact between runs. Without set -e, the git clone failure on the second run is silent, and the existing directory allows cd "$SGL_SLURM_JOBS_PATH" to succeed with old code.
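The silent-failure sequence described above can be reproduced locally with a throwaway repo (all paths here are illustrative stand-ins for the CI runner's directories):

```shell
d=$(mktemp -d)
git init -q "$d/src"
git -C "$d/src" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m init

# First run: clone succeeds.
git clone -q "$d/src" "$d/sglang_disagg"

# Second run (no `set -e`): clone fails because the directory already
# exists, but execution continues and cd lands in the stale checkout.
CLONE_STATUS=0
git clone -q "$d/src" "$d/sglang_disagg" 2>/dev/null || CLONE_STATUS=$?

cd "$d/sglang_disagg" && echo "clone status=$CLONE_STATUS cwd=$(basename "$PWD")"
```

The second clone exits non-zero, yet nothing stops the script from running whatever code the first clone left behind, which is exactly the stale-code hazard flagged here.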
This patch adds the mi355x distributed inference test CI workflow and integrates the recipes from https://github.com/billishyahao/sglang_disagg into the inferencemax GitHub Action.
Note
Adds distributed prefill/decode disaggregation benchmarking for MI355x FP8 DeepSeek-R1 and integrates it into CI.
- Adds `dsr1-fp8-mi355x-sglang-disagg` config in `.github/configs/amd-master.yaml` with multi-node settings (1P/2D), MTP and non-MTP variants across `isl`/`osl` combos, and explicit `tp`/`ep`/`dp-attn` plus node/worker counts
- Adds `benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh`, which clones `sglang_disagg`, sets env, and submits disagg jobs
- Extends `runners/launch_mi355x-amd.sh` to support `IS_MULTINODE`: adds `scancel_sync`, sets SLURM env, invokes the framework-specific launcher, tails logs, collects result JSONs, and cancels/cleans up; retains the single-node path
- Updates `perf-changelog.yaml` with the new MI355x disagg config entry

Written by Cursor Bugbot for commit b0b9414.