[AMD] feat: Add mi355x distributed inference test CI workflow#409
Conversation
Summary of Changes (Gemini Code Assist): This pull request establishes a new continuous integration workflow to validate distributed inference performance on AMD MI355X hardware. It specifically incorporates SGLang disaggregation benchmarks, ensuring that the system can efficiently handle large language models with varying input/output sequence lengths and concurrency levels, both with and without speculative decoding. This enhancement improves the reliability and performance monitoring of distributed inference setups.
Code Review
This pull request introduces a new CI workflow for distributed inference testing on mi355x. It adds a new benchmark configuration, a script to run it using SLURM, and extends the runner script to handle multi-node jobs. My review focuses on improving the robustness and maintainability of the new shell scripts and configuration files. I've identified some critical issues with how SLURM job IDs are handled, which could lead to failures in a multi-user environment. I've also suggested improvements for code clarity, dependency management, and script safety.
```sh
bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
```
`JOB_ID=$(squeue -u $USER --noheader --format='%i')` is not a robust way to get the job ID. It will retrieve all job IDs for the current user, which can lead to incorrect behavior if the user has other jobs running. If submit_disagg.sh submits multiple jobs, this will capture all of them, likely breaking subsequent commands that expect a single ID (e.g., LOG_FILE creation).
The job submission script (submit_disagg.sh, called from the benchmark script) should be modified to return the SLURM job ID(s) it creates, and this script should capture that output. For example, if sbatch is used, its output can be captured and printed.
Example of what could be done in the submission script:

```sh
# inside submit_disagg.sh
JOB_ID=$(sbatch my_job.slurm | awk '{print $4}')
echo $JOB_ID
```

Then in `launch_mi355x-amd.sh`:

```sh
JOB_ID=$(bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh")
```

```sh
# git clone --branch cam/sa-251219 https://github.com/cquil11/sglang_disagg.git

# Switch to origin repo url for supporting wide ep configs
git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git
```
The script clones a repository from a personal GitHub account (billishyahao) and a specific branch (sa-260107). This introduces a significant risk to the CI/CD pipeline's stability and security. Personal branches can be force-pushed, rebased, or deleted at any time, which would break this script. It's strongly recommended to use a versioned release, a tag, or a commit hash from an organization-controlled repository to ensure reproducibility and stability.
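One way to make such a clone reproducible, sketched here against a throwaway local repository so it runs anywhere: record a validated commit SHA and detach onto it after cloning, so later force-pushes or deletions of the branch cannot change what CI executes. The paths and the two-commit "upstream" below are illustrative stand-ins, not the PR's actual scripts.

```shell
set -euo pipefail
tmp=$(mktemp -d)

# Stand-in "upstream" with two commits; in CI this would be the
# sglang_disagg fork, and PINNED_SHA a reviewed, validated commit.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "validated state"
PINNED_SHA=$(git -C "$tmp/upstream" rev-parse HEAD)
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "later, possibly force-pushed work"

# The pattern: clone, then pin to the immutable commit, not a branch tip.
git clone -q "$tmp/upstream" "$tmp/sglang_disagg"
git -C "$tmp/sglang_disagg" checkout -q --detach "$PINNED_SHA"

echo "checked out: $(git -C "$tmp/sglang_disagg" rev-parse HEAD)"
```

The clone stays at the pinned commit even though the upstream branch has moved on, which is the property the review is asking for.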
```sh
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
JOB_ID=$(squeue -u $USER -h -o %A | head -n1)
```
JOB_ID=$(squeue -u $USER -h -o %A | head -n1) is not a robust way to get the job ID. It relies on the job being the most recent one submitted by the user, which can be fragile in a shared environment. salloc prints the job ID to standard output when an allocation is granted. You should capture this output directly.
```diff
-salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell
-JOB_ID=$(squeue -u $USER -h -o %A | head -n1)
+JOB_ID=$(salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell)
```
```yaml
seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

  - isl: 8192
    osl: 1024
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

  - isl: 1024
    osl: 8192
    search-space:
      # MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "mtp"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "mtp"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "mtp"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=2"

      # non-MTP configurations
      # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
      - spec-decoding: "none"
        conc-list: [ 2048 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 1
          ep: 16
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Middle of curve" (1 prefill worker at DEP8 and 2 decode workers each at DEP8)
      - spec-decoding: "none"
        conc-list: [ 256, 512, 1024 ]
        prefill:
          num-worker: 1
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 1
          ep: 8
          dp-attn: true
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"

      # "Bottom of curve" (1 prefill worker at TEP8 and 2 decode workers at TEP8)
      - spec-decoding: "none"
        conc-list: [ 32, 64, 128 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "PREFILL_NODES=1"
        decode:
          num-worker: 2
          tp: 8
          ep: 8
          dp-attn: false
          additional-settings:
            - "DECODE_NODES=2"
            - "DECODE_MTP_SIZE=0"
```
The search-space configuration is duplicated three times for different isl/osl combinations. This makes the file very long and hard to maintain. You can use YAML anchors and aliases to define the search-space once and reuse it. This will significantly reduce the file size and improve readability.
For example:

```yaml
# At the top level of your YAML, define an anchor
x-dsr1-search-space: &dsr1_search_space
  # MTP configurations
  # "Top of curve" (1 prefill worker at DEP8 and 1 decode worker at DEP16)
  - spec-decoding: "mtp"
    conc-list: [ 2048 ]
  # ... rest of the search space items

dsr1-fp8-mi355x-sglang-disagg:
  # ... other properties
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space: *dsr1_search_space # Reuse with an alias
    - isl: 8192
      osl: 1024
      search-space: *dsr1_search_space # Reuse with an alias
    - isl: 1024
      osl: 8192
      search-space: *dsr1_search_space # Reuse with an alias
```

```sh
check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
    PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
    DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH
```
The environment variable SGL_SLURM_JOBS_PATH is listed twice in the call to check_env_vars. While this is harmless, it's redundant and should be removed for better code clarity.
```diff
 check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
     PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
     DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
-    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO SGL_SLURM_JOBS_PATH
+    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH RANDOM_RANGE_RATIO
```
```sh
export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true
```
The script uses sudo rm -rf ... to clean up log directories (here and on line 126). Using sudo within a CI script can be a security risk and may indicate underlying permissions issues. It would be better to investigate why sudo is needed and fix the permissions of the files and directories being created, so that the runner user can remove them without elevated privileges. If sudo is unavoidable, please add a comment explaining why.
```sh
for result_file in $(find $LOGS_DIR -type f); do
    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
    file_name=$(basename $result_file)
    if [ -f $result_file ]; then
        # Copy the result file to workspace with a unique name
        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
        cp $result_file $WORKSPACE_RESULT_FILE
    fi
done
```
The for result_file in $(find ...) loop is not safe for filenames that contain spaces or other special characters. While it might not be an issue with the current file naming scheme, it's a good practice to write robust shell scripts. Consider using a while read loop with find -print0 to handle all possible filenames correctly. This also allows for safe quoting of variables and removal of a redundant file check.
```diff
-for result_file in $(find $LOGS_DIR -type f); do
-    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
-    file_name=$(basename $result_file)
-    if [ -f $result_file ]; then
-        # Copy the result file to workspace with a unique name
-        WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
-        echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
-        cp $result_file $WORKSPACE_RESULT_FILE
-    fi
-done
+find "$LOGS_DIR" -type f -print0 | while IFS= read -r -d '' result_file; do
+    # result_file should directly be isl_ISL_osl_OSL_concurrency_CONC_req_rate_R_gpus_N_ctx_M_gen_N.json
+    file_name=$(basename "$result_file")
+    # Copy the result file to workspace with a unique name
+    WORKSPACE_RESULT_FILE="$GITHUB_WORKSPACE/${RESULT_FILENAME}_${file_name}"
+    echo "Found result file ${result_file}. Copying it to ${WORKSPACE_RESULT_FILE}"
+    cp "$result_file" "$WORKSPACE_RESULT_FILE"
+done
```
```sh
bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
```
**Missing head filter causes multi-job `JOB_ID` capture failure** (High Severity)
In the multinode code path, JOB_ID is assigned from squeue without | head -n1, unlike the non-multinode path (line 150) and all other launch scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, etc.). If multiple jobs exist for the user, JOB_ID will contain multiple IDs separated by newlines, causing LOG_FILE to have an invalid path with embedded newlines and breaking subsequent operations like the grep -q "$JOB_ID" check and scancel_sync $JOB_ID call.
🔬 Verification Test
Why verification test was not possible: This bug requires a SLURM cluster environment with the ability to submit jobs and query the queue. The issue is a comparison-based pattern bug that can be verified by code inspection - the multinode path at line 49 uses squeue -u $USER --noheader --format='%i' without | head -n1, while the non-multinode path at line 150 in the same file correctly uses | head -n1, and this pattern is consistent across all other runner scripts (launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h200-nb.sh, launch_h200-nv.sh).
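Although the full `squeue` path can't be exercised outside SLURM, the failure mode itself is easy to reproduce: below, a hard-coded two-line string stands in for `squeue` output when the user has two jobs queued, and the `LOG_FILE` naming is illustrative rather than the script's exact template.

```shell
# Stand-in for `squeue -u $USER --noheader --format='%i'` with two jobs queued.
SQUEUE_OUT=$'12345\n12346'

JOB_ID="$SQUEUE_OUT"               # what the multinode path captures
LOG_FILE="logs/job_${JOB_ID}.log"  # the path now embeds a newline

FIRST_ID=$(printf '%s\n' "$SQUEUE_OUT" | head -n1)  # the head -n1 fix

printf 'broken path spans %s lines\n' "$(printf '%s\n' "$LOG_FILE" | wc -l)"
printf 'first id: %s\n' "$FIRST_ID"
```

Any later `grep -q "$JOB_ID"` or `scancel $JOB_ID` would receive the multi-line value, while `head -n1` recovers a single usable ID.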
```sh
export ISL="$ISL"
export OSL="$OSL"

sudo rm -rf "$SGL_SLURM_JOBS_PATH/logs" 2>/dev/null || true
```
**Silent git clone failure leads to stale code execution** (Medium Severity)
The launcher only removes $SGL_SLURM_JOBS_PATH/logs, but the benchmark script clones the entire sglang_disagg directory. On subsequent runs, the git clone fails silently (there is no set -e in the benchmark script) because the directory already exists, yet cd "$SGL_SLURM_JOBS_PATH" succeeds and the script proceeds with potentially stale code. This could cause benchmarks to run with outdated versions of submit_disagg.sh without any error or warning, leading to incorrect results. The cleanup at lines 44 and 126 needs to remove the entire sglang_disagg directory, not just the logs subdirectory.
🔬 Verification Test
Why verification test was not possible: This bug requires a SLURM cluster environment and executing the workflow twice on the same runner to observe the git clone failure and stale code execution. The issue can be verified by code inspection: the cleanup removes sglang_disagg/logs while git clone creates sglang_disagg/, leaving the repository directory intact between runs. Without set -e, the git clone failure on the second run is silent, and the existing directory allows cd "$SGL_SLURM_JOBS_PATH" to succeed with old code.
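The silent-failure sequence described above can be reproduced locally with a throwaway repo (all paths here are illustrative stand-ins for the CI runner's directories):

```shell
d=$(mktemp -d)
git init -q "$d/src"
git -C "$d/src" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m init

# First run: clone succeeds.
git clone -q "$d/src" "$d/sglang_disagg"

# Second run (no `set -e`): clone fails because the directory already
# exists, but execution continues and cd lands in the stale checkout.
CLONE_STATUS=0
git clone -q "$d/src" "$d/sglang_disagg" 2>/dev/null || CLONE_STATUS=$?

cd "$d/sglang_disagg" && echo "clone status=$CLONE_STATUS cwd=$(basename "$PWD")"
```

The second clone exits non-zero, yet nothing stops the script from running whatever code the first clone left behind, which is exactly the stale-code hazard flagged here.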
This patch adds the mi355x distributed inference test CI workflow and integrates the recipes from https://github.com/billishyahao/sglang_disagg into the inferencemax GitHub Action.
Note
Adds distributed prefill/decode disaggregation benchmarking for MI355x FP8 DeepSeek-R1 and integrates it into CI.
- Adds `dsr1-fp8-mi355x-sglang-disagg` config in `.github/configs/amd-master.yaml` with multi-node settings (1P/2D), MTP and non-MTP variants across `isl`/`osl` combos, and explicit `tp`/`ep`/`dp-attn` plus node/worker counts
- Adds `benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh`, which clones `sglang_disagg`, sets env, and submits disagg jobs
- Extends `runners/launch_mi355x-amd.sh` to support `IS_MULTINODE`: adds `scancel_sync`, sets SLURM env, invokes the framework-specific launcher, tails logs, collects result JSONs, and cancels/cleans up; retains the single-node path
- Updates `perf-changelog.yaml` with the new MI355x disagg config entry

Written by Cursor Bugbot for commit b0b9414.