
[AMD] feat: Add mi355x distributed inference test CI workflow#409

Merged
cquil11 merged 7 commits into main from feat/mi355x-disagg-2 on Jan 8, 2026
Conversation


@cquil11 cquil11 commented Jan 8, 2026

This patch adds an mi355x distributed inference test CI workflow and integrates the recipes from https://github.com/billishyahao/sglang_disagg into the inferencemax GitHub Actions workflow.

Co-authored-by: billishyahao [email protected]
Co-authored-by: ichbinblau [email protected]
Co-authored-by: Duyi-Wang [email protected]
Co-authored-by: inkcherry [email protected]


Note

Adds distributed prefill/decode disaggregation benchmarking for MI355x FP8 DeepSeek-R1 and integrates it into CI.

  • New dsr1-fp8-mi355x-sglang-disagg config in .github/configs/amd-master.yaml with multi-node settings (1P/2D), MTP and non-MTP variants across isl/osl combos, and explicit tp/ep/dp-attn plus node/worker counts
  • New SLURM launcher benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh that clones sglang_disagg, sets env, and submits disagg jobs
  • Update runners/launch_mi355x-amd.sh to support IS_MULTINODE: add scancel_sync, set SLURM env, invoke framework-specific launcher, tail logs, collect result JSONs, cancel/cleanup; retain single-node path
  • Update perf-changelog.yaml with the new MI355x disagg config entry
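For context, a synchronous cancel helper of the kind the runner update describes might look like the sketch below. scancel_sync here is a guess at the shape, not the PR's implementation, and the SCANCEL/SQUEUE overrides exist only so the sketch can be exercised without a SLURM cluster.

```shell
# Hypothetical sketch of a scancel_sync helper: cancel a SLURM job and
# poll the queue until the job is gone or a timeout expires.
# SCANCEL/SQUEUE default to the real SLURM commands but can be stubbed
# out for testing; names are illustrative, not from launch_mi355x-amd.sh.
scancel_sync() {
    local job_id=$1
    local timeout=${2:-120}   # seconds to wait before giving up
    local waited=0
    ${SCANCEL:-scancel} "$job_id"
    # `squeue -h -j <id>` prints nothing once the job has left the queue
    while ${SQUEUE:-squeue} -h -j "$job_id" 2>/dev/null | grep -q .; do
        sleep 1
        waited=$((waited + 1))
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for job $job_id to exit" >&2
            return 1
        fi
    done
}
```

The polling loop matters because scancel returns before the job actually leaves the queue, so an immediate follow-up submission could still see the old job.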

Written by Cursor Bugbot for commit b0b9414. This will update automatically on new commits.

@gemini-code-assist

Summary of Changes

Hello @cquil11, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a new continuous integration workflow to validate distributed inference performance on AMD MI355X hardware. It specifically incorporates SGLang disaggregation benchmarks, ensuring that the system can efficiently handle large language models with varying input/output sequence lengths and concurrency levels, both with and without speculative decoding. This enhancement improves the reliability and performance monitoring of distributed inference setups.

Highlights

  • New CI Workflow: Introduces a new Continuous Integration workflow specifically for mi355x distributed inference tests.
  • SGLang Disaggregation Integration: Integrates sglang_disagg recipes into the inferencemax GitHub action, enabling distributed inference benchmarking.
  • New Benchmark Configuration: Adds a new configuration, dsr1-fp8-mi355x-sglang-disagg, to amd-master.yaml, defining various sequence length and concurrency settings for both speculative (MTP) and non-speculative decoding.
  • Multinode Benchmark Script: Includes a new Slurm benchmark script (dsr1_fp8_mi355x_sglang-disagg_slurm.sh) designed for launching these distributed tests.
  • Runner Script Update: Modifies launch_mi355x-amd.sh to support multinode SGLang disaggregation benchmarks, incorporating robust job management, log monitoring, and result collection.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new CI workflow for distributed inference testing on mi355x. It adds a new benchmark configuration, a script to run it using SLURM, and extends the runner script to handle multi-node jobs. My review focuses on improving the robustness and maintainability of the new shell scripts and configuration files. I've identified some critical issues with how SLURM job IDs are handled, which could lead to failures in a multi-user environment. I've also suggested improvements for code clarity, dependency management, and script safety.

bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')

critical

JOB_ID=$(squeue -u $USER --noheader --format='%i') is not a robust way to get the job ID. It will retrieve all job IDs for the current user, which can lead to incorrect behavior if the user has other jobs running. If submit_disagg.sh submits multiple jobs, this will capture all of them, likely breaking subsequent commands that expect a single ID (e.g., LOG_FILE creation).

The job submission script (submit_disagg.sh, called from the benchmark script) should be modified to return the SLURM job ID(s) it creates, and this script should capture that output. For example, if sbatch is used, its output can be captured and printed.

Example of what could be done in the submission script:

# inside submit_disagg.sh
JOB_ID=$(sbatch my_job.slurm | awk '{print $4}')
echo $JOB_ID

Then in launch_mi355x-amd.sh:

JOB_ID=$(bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh")

# git clone --branch cam/sa-251219 https://github.com/cquil11/sglang_disagg.git

# Switch to origin repo url for supporting wide ep configs
git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git

high

The script clones a repository from a personal GitHub account (billishyahao) and a specific branch (sa-260107). This introduces a significant risk to the CI/CD pipeline's stability and security. Personal branches can be force-pushed, rebased, or deleted at any time, which would break this script. It's strongly recommended to use a versioned release, a tag, or a commit hash from an organization-controlled repository to ensure reproducibility and stability.
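For illustration, a pinned checkout could look like the sketch below; clone_pinned, the GIT override (used here only for dry runs), and the placeholder SHA are all hypothetical, not code from this PR.

```shell
# Sketch: clone a repo and pin it to an exact commit so CI runs are
# reproducible even if the branch is force-pushed or deleted.
# The commit argument is a placeholder; GIT can be overridden for dry runs.
clone_pinned() {
    local url=$1 commit=$2 dest=$3
    ${GIT:-git} clone "$url" "$dest" || return 1
    # --detach checks out the exact commit regardless of branch state
    ${GIT:-git} -C "$dest" checkout --detach "$commit"
}

# Illustrative invocation (placeholder SHA):
# clone_pinned https://github.com/billishyahao/sglang_disagg.git <40-char-sha> sglang_disagg
```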

Comment on lines +149 to +150
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
JOB_ID=$(squeue -u $USER -h -o %A | head -n1)

high

JOB_ID=$(squeue -u $USER -h -o %A | head -n1) is not a robust way to get the job ID. It relies on the job being the most recent one submitted by the user, which can be fragile in a shared environment. salloc prints the job ID to standard output when an allocation is granted. You should capture this output directly.

Suggested change
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
JOB_ID=$(squeue -u $USER -h -o %A | head -n1)
JOB_ID=$(salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell)

Comment on lines +192 to +559
seq-len-configs:
- isl: 1024
  osl: 1024
  search-space:
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "mtp"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "mtp"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # non-MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "none"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "none"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "none"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

- isl: 8192
  osl: 1024
  search-space:
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "mtp"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "mtp"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # non-MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "none"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "none"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "none"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

- isl: 1024
  osl: 8192
  search-space:
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "mtp"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "mtp"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=2"

  # non-MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "none"
    conc-list: [ 2048 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 1
      tp: 1
      ep: 16
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
  - spec-decoding: "none"
    conc-list: [ 256, 512, 1024 ]
    prefill:
      num-worker: 1
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 1
      ep: 8
      dp-attn: true
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

  # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
  - spec-decoding: "none"
    conc-list: [ 32, 64, 128 ]
    prefill:
      num-worker: 1
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "PREFILL_NODES=1"
    decode:
      num-worker: 2
      tp: 8
      ep: 8
      dp-attn: false
      additional-settings:
      - "DECODE_NODES=2"
      - "DECODE_MTP_SIZE=0"

medium

The search-space configuration is duplicated three times for different isl/osl combinations. This makes the file very long and hard to maintain. You can use YAML anchors and aliases to define the search-space once and reuse it. This will significantly reduce the file size and improve readability.

For example:

# At the top level of your YAML, define an anchor
x-dsr1-search-space: &dsr1_search_space
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
    # ... rest of the search space items

dsr1-fp8-mi355x-sglang-disagg:
  # ... other properties
  seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space: *dsr1_search_space # Reuse with an alias
  - isl: 8192
    osl: 1024
    search-space: *dsr1_search_space # Reuse with an alias
  - isl: 1024
    osl: 8192
    search-space: *dsr1_search_space # Reuse with an alias

Comment on lines +7 to +10
check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH

medium

The environment variable SGL_SLURM_JOBS_PATH is listed twice in the call to check_env_vars. While this is harmless, it's redundant and should be removed for better code clarity.

Suggested change
check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH
check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO

export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true

medium

The script uses sudo rm -rf ... to clean up log directories (here and on line 126). Using sudo within a CI script can be a security risk and may indicate underlying permissions issues. It would be better to investigate why sudo is needed and fix the permissions of the files and directories being created, so that the runner user can remove them without elevated privileges. If sudo is unavoidable, please add a comment explaining why.
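For illustration, a sudo-free cleanup could look like the sketch below, assuming the runner user owns the log files and they merely lack write permission (cleanup_logs is a hypothetical name; if the files are actually written as root from inside a container, running the container with a matching --user is the cleaner fix).

```shell
# Sketch: remove a log directory without sudo. Assumes the files are
# owned by the runner user but may have restrictive modes.
cleanup_logs() {
    local dir=$1
    [ -d "$dir" ] || return 0
    # u+rwX: make everything writable and directories traversable for
    # the owner first, so rm -rf cannot fail on read-only entries
    chmod -R u+rwX "$dir" 2>/dev/null || true
    rm -rf "$dir"
}
```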

Comment on lines +108 to +117
for result_file in $(find $LOGS_DIR -type f); do
    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
    file_name=$(basename $result_file)
    if [ -f $result_file ]; then
        # Copy the result file to workspace with a unique name
        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
        cp $result_file $WORKSPACE_RESULT_FILE
    fi
done

medium

The for result_file in $(find ...) loop is not safe for filenames that contain spaces or other special characters. While it might not be an issue with the current file naming scheme, it's a good practice to write robust shell scripts. Consider using a while read loop with find -print0 to handle all possible filenames correctly. This also allows for safe quoting of variables and removal of a redundant file check.

Suggested change
for result_file in $(find $LOGS_DIR -type f); do
    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
    file_name=$(basename $result_file)
    if [ -f $result_file ]; then
        # Copy the result file to workspace with a unique name
        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
        cp $result_file $WORKSPACE_RESULT_FILE
    fi
done
find "$LOGS_DIR" -type f -print0 | while IFS= read -r -d '' result_file; do
    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
    file_name=$(basename "$result_file")
    # Copy the result file to workspace with a unique name
    WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
    echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
    cp "$result_file" "$WORKSPACE_RESULT_FILE"
done

@cquil11 cquil11 merged commit e264c0e into main Jan 8, 2026
11 checks passed
@cquil11 cquil11 deleted the feat/mi355x-disagg-2 branch January 8, 2026 21:32
bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')

Missing head filter causes multi-job JOB_ID capture failure

High Severity

In the multinode code path, JOB_ID is assigned from squeue without | head -n1, unlike the non-multinode path (line 150) and all other launch scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, etc.). If multiple jobs exist for the user, JOB_ID will contain multiple IDs separated by newlines, causing LOG_FILE to have an invalid path with embedded newlines and breaking subsequent operations like the grep -q "$JOB_ID" check and scancel_sync $JOB_ID call.

🔬 Verification Test

Why verification test was not possible: This bug requires a SLURM cluster environment with the ability to submit jobs and query the queue. The issue is a comparison-based pattern bug that can be verified by code inspection - the multinode path at line 49 uses squeue -u $USER --noheader --format='%i' without | head -n1, while the non-multinode path at line 150 in the same file correctly uses | head -n1, and this pattern is consistent across all other runner scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h200-nb.sh, launch_h200-nv.sh).
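Beyond adding | head -n1, the fragile squeue lookup can be avoided entirely by parsing the ID that sbatch itself prints on success. A minimal sketch (parse_sbatch_job_id is a hypothetical helper, not code from the PR):

```shell
# Sketch: extract the job ID from sbatch's stdout instead of guessing
# via squeue. On success sbatch prints "Submitted batch job <id>".
parse_sbatch_job_id() {
    awk '/Submitted batch job/ {print $4; exit}'
}

# Illustrative usage:
# JOB_ID=$(sbatch my_job.slurm | parse_sbatch_job_id)
```

Because the ID comes from the submission itself, this stays correct even when the user has other jobs in the queue.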


export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true

Silent git clone failure leads to stale code execution

Medium Severity

The launcher only removes $SGL_SLURM_JOBS_PATH/logs, but the benchmark script clones the entire sglang_disagg directory. On subsequent runs, the git clone fails silently (there is no set -e in the benchmark script) because the directory already exists, yet cd "$SGL_SLURM_JOBS_PATH" succeeds and the script proceeds with potentially stale code. This could cause benchmarks to run with outdated versions of submit_disagg.sh without any error or warning, leading to incorrect results. The cleanup at lines 44 and 126 needs to remove the entire sglang_disagg directory, not just the logs subdirectory.

🔬 Verification Test

Why verification test was not possible: This bug requires a SLURM cluster environment and executing the workflow twice on the same runner to observe the git clone failure and stale code execution. The issue can be verified by code inspection: the cleanup removes sglang_disagg/logs while git clone creates sglang_disagg/, leaving the repository directory intact between runs. Without set -e, the git clone failure on the second run is silent, and the existing directory allows cd "$SGL_SLURM_JOBS_PATH" to succeed with old code.
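A minimal guard along the lines the comment suggests (fresh_clone and the GIT override are illustrative, not code from the PR): delete the whole checkout before cloning, and make a clone failure fatal even without set -e.

```shell
# Sketch: guarantee a fresh checkout on every run. Removing the whole
# directory (not just logs/) means a stale clone can never be silently
# reused; the explicit `|| return 1` surfaces a clone failure even when
# the calling script does not use `set -e`.
fresh_clone() {
    local url=$1 branch=$2 dest=$3
    rm -rf "$dest"
    ${GIT:-git} clone --branch "$branch" "$url" "$dest" || return 1
}
```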

