[AMD] Add mi355x distributed inference test CI workflow by billishyahao · Pull Request #348 · SemiAnalysisAI/InferenceX

billishyahao · 2025-12-19T05:15:53Z

This patch adds an MI355X distributed inference test CI workflow and integrates the recipes from https://github.com/billishyahao/sglang_disagg into the InferenceMAX GitHub Actions workflow.

Co-authored-by: billishyahao [email protected]
Co-authored-by: ichbinblau [email protected]
Co-authored-by: Duyi-Wang [email protected]
Co-authored-by: inkcherry [email protected]

Note

Introduces multi-node, disaggregated SGLang benchmarking for MI355x FP8 DeepSeek-R1.

Adds dsr1-fp8-mi355x-sglang-disagg to amd-master.yaml with PD disaggregation (1P2D), both with and without speculative decoding, across ISL/OSL variants (1k/1k, 8k/1k, 1k/8k), with detailed prefill/decode worker, TP/EP, and DP-attn settings.
Adds benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh, which clones sglang_disagg, sets EP/DP flags, and submits jobs via submit_disagg.sh.
Enhances runners/launch_mi355x-amd.sh with a multinode path: SLURM env setup, job submission for sglang-disagg, log tailing, result collection to the workspace, and synchronized job cancellation. Single-node behavior is retained.
Updates perf-changelog.yaml with the new MI355x disagg config and related notes.

Written by Cursor Bugbot for commit b4f3940. This will update automatically on new commits.

chatgpt-codex-connector · 2025-12-19T05:15:57Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

billishyahao · 2025-12-20T16:23:37Z

Hi Semi Analysis team, I just fixed the merge conflict. Feel free to review it. Thanks!

cquil11 · 2025-12-21T20:57:36Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2025-12-21T20:57:45Z

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415702216
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 946dab8
Approval: not required (trusted collaborator).

cquil11 · 2025-12-21T21:11:38Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2025-12-21T21:11:48Z

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415873469
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 1b0fea2
Approval: not required (trusted collaborator).

cquil11 · 2025-12-21T23:24:43Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2025-12-21T23:24:53Z

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417380873
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 443ca85
Approval: not required (trusted collaborator).

cquil11 · 2025-12-21T23:45:23Z

@billishyahao as you can see from the comments, I am validating this right now.
On behalf of all of us on the InferenceMAX team, I wanna first thank y'all for the hard work that went into getting this to work. This is important!

There are some things that need fixing. I have a fork going off of https://github.com/billishyahao/sglang_disagg
Will update this thread shortly with issues

cquil11 · 2025-12-22T00:03:17Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2025-12-22T00:03:29Z

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417798440
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 443ca85
Approval: not required (trusted collaborator).

billishyahao · 2026-01-06T15:43:29Z

All validation passed: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20741958744. I will switch back to the main branch of the recipe and kick off a final round of validation.

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh

runners/launch_mi355x-amd.sh

billishyahao · 2026-01-06T16:13:35Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2026-01-06T16:13:45Z

@billishyahao Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20754354192
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: e5b36bf
Approval: not required (trusted collaborator).

cursor · 2026-01-06T16:18:03Z

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh

+# Switch to origin repo url for supporting wide ep configs
+git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git
+
+cd "$SGL_SLURM_JOBS_PATH" || exit 1


Git clone silently fails if directory already exists

The comment states "Always clone and setup sglang_disagg" but the git clone command on line 16 will fail silently if the sglang_disagg directory already exists (from a previous run on a self-hosted runner). Since there's no set -e, the script continues and cd "$SGL_SLURM_JOBS_PATH" succeeds using the stale existing directory. The cleanup in runners/launch_mi355x-amd.sh only removes the logs subdirectory, not the entire clone. This could cause the CI to run outdated code from a previous run, leading to hard-to-debug issues where the intended changes aren't being tested.
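One possible way to avoid the stale-checkout problem described above is to delete any existing directory before cloning and to stop when the clone fails. This is only a sketch: the helper names remove_stale_checkout and fresh_clone are hypothetical and not part of this PR, and the clone target is assumed to be a directory named sglang_disagg in the current working directory.

```shell
#!/usr/bin/env bash

# Hypothetical helper: delete a checkout left behind by an earlier run.
remove_stale_checkout() {
    local target="$1"
    # Refuse an empty argument so rm -rf never runs on an unintended path.
    [ -n "$target" ] || return 1
    rm -rf -- "$target"
}

# Hypothetical helper: clone <url> at <branch> into a clean <target> directory.
# Returns non-zero if the cleanup or the clone fails, so the caller can stop.
fresh_clone() {
    local branch="$1" url="$2" target="$3"
    remove_stale_checkout "$target" || return 1
    git clone --branch "$branch" "$url" "$target" || {
        echo "git clone of $url failed" >&2
        return 1
    }
}

# Example call, using the branch and URL from the diff above:
#   fresh_clone sa-260107 https://github.com/billishyahao/sglang_disagg.git sglang_disagg || exit 1
```

Checking the return value explicitly keeps the script from continuing into a directory that was not freshly cloned, even without set -e.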

cquil11 · 2026-01-06T19:42:53Z

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

github-actions · 2026-01-06T19:43:04Z

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20759988001
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 94181a5
Approval: not required (trusted collaborator).

cursor · 2026-01-06T19:50:35Z

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh

+check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
+    PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
+    DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
+    PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH # SGL_SLURM_JOBS_PATH FIXME


Missing validation for MTP size when speculative decoding enabled

Medium Severity

The SPEC_DECODING environment variable is validated via check_env_vars but never used in the script. Meanwhile, DECODE_MTP_SIZE (which controls MTP/speculative decoding behavior in the underlying submit_disagg.sh script) is not validated. Other scripts like dsr1_fp4_gb200_dynamo-trt_slurm.sh follow a pattern of conditionally checking DECODE_MTP_SIZE when SPEC_DECODING equals "mtp". This validation gap could allow misconfigured benchmark runs if a future YAML config sets spec-decoding: "mtp" without properly including DECODE_MTP_SIZE in additional-settings.
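The conditional check described in this comment could look roughly like the following. It is a sketch under assumptions: require_mtp_size is a hypothetical helper written for illustration, not the repository's check_env_vars, and it only covers the case named above (SPEC_DECODING equal to "mtp" with DECODE_MTP_SIZE missing).

```shell
#!/usr/bin/env bash

# Hypothetical helper: fail when speculative decoding is set to "mtp"
# but DECODE_MTP_SIZE is unset or empty. Any other SPEC_DECODING value passes.
require_mtp_size() {
    if [ "${SPEC_DECODING:-}" = "mtp" ] && [ -z "${DECODE_MTP_SIZE:-}" ]; then
        echo "DECODE_MTP_SIZE must be set when SPEC_DECODING=mtp" >&2
        return 1
    fi
    return 0
}

# Example call near the other environment checks:
#   require_mtp_size || exit 1
```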

cursor · 2026-01-06T19:50:35Z

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh

+# Switch to origin repo url for supporting wide ep configs
+git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git
+
+cd "$SGL_SLURM_JOBS_PATH" || exit 1


Missing cleanup causes git clone failure with stale code

High Severity

The script runs git clone to create the sglang_disagg directory but neither the benchmark script nor the runner removes this directory before cloning. On persistent self-hosted runners, if the directory exists from a previous run, git clone fails silently (no set -e), but cd "$SGL_SLURM_JOBS_PATH" succeeds because the directory exists. The script then continues using stale code from the previous run, potentially causing incorrect benchmark results or failures that are difficult to debug. The runner only cleans up $SGL_SLURM_JOBS_PATH/logs, not the parent directory.

cursor · 2026-01-06T19:50:35Z

runners/launch_mi355x-amd.sh

+    bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"
+
+    # Wait for job to complete
+    JOB_ID=$(squeue -u $USER --noheader --format='%i')


Multinode job ID retrieval may capture multiple jobs

Medium Severity

The multinode section retrieves JOB_ID using squeue -u $USER --noheader --format='%i' without head -n1, unlike all other runner scripts and the non-multinode section (line 150) which use head -n1 to ensure only one job ID is captured. If multiple jobs exist for the user, JOB_ID would contain multiple lines, causing LOG_FILE path construction to be invalid, grep pattern matching to behave unexpectedly, and scancel_sync to only process the first job ID.
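A minimal sketch of the head -n1 pattern mentioned above, wrapped in a hypothetical helper named first_job_id so the behavior can be checked without a Slurm cluster. It assumes the newest job is listed first, which the review comment does not confirm.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep only the first non-empty line of a
# squeue-style listing passed as a single string argument.
first_job_id() {
    printf '%s\n' "$1" | sed '/^[[:space:]]*$/d' | head -n1
}

# Example call, reusing the squeue invocation from the diff above:
#   JOB_ID=$(first_job_id "$(squeue -u "$USER" --noheader --format='%i')")
```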

cquil11

lgtm

cursor · 2026-01-07T17:11:54Z

runners/launch_mi355x-amd.sh

+
+    # Wait for job to complete
+    JOB_ID=$(squeue -u $USER --noheader --format='%i')
+    LOG_FILE="$SGL_SLURM_JOBS_PATH/slurm_job-${JOB_ID}.out"


Multiple job IDs break log file path and job monitoring

High Severity

The squeue command at line 49 captures ALL job IDs for the user, not just the newly submitted job. If multiple jobs exist (from disaggregated workload or leftover jobs), JOB_ID will contain multiple newline-separated values. This breaks LOG_FILE path construction (embedded newlines), the grep checks at lines 57/69, and scancel_sync which expects a single ID. Compare with line 150 in the non-multinode path which correctly uses | head -n1 to get only the first job ID.
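As a complement to taking only the first ID, the script could refuse to build the log path when JOB_ID is not a single value. The sketch below assumes plain numeric job IDs (array-style IDs such as 123_4 would be rejected); is_single_job_id and job_log_file are hypothetical names, and the file name pattern is copied from the LOG_FILE line in the diff above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the argument is one plain numeric job ID.
is_single_job_id() {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;   # empty, multi-line, or non-numeric
        *) return 0 ;;
    esac
}

# Hypothetical helper: print the log path for a job, or fail on a bad ID.
job_log_file() {
    local jobs_path="$1" job_id="$2"
    is_single_job_id "$job_id" || {
        echo "expected exactly one numeric job ID, got: '$job_id'" >&2
        return 1
    }
    printf '%s/slurm_job-%s.out\n' "$jobs_path" "$job_id"
}

# Example call:
#   LOG_FILE=$(job_log_file "$SGL_SLURM_JOBS_PATH" "$JOB_ID") || exit 1
```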

Additional Locations (1)

runners/launch_mi355x-amd.sh#L121-L122

* Revert "Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)" (#400) [skip-sweep]" This reverts commit a075f2e.
* add random range ratio that is appropriate (#402)
* comment out 1k8k 8k1k
* change recipe back to upstream
* revert comment out 1k8k 8k1k
* Add empty line at the beginning of the script

---------

Co-authored-by: billishyahao <[email protected]>

billishyahao requested a review from a team as a code owner December 19, 2025 05:15

github-project-automation bot added this to InferenceMAX Board Dec 19, 2025

billishyahao force-pushed the billhe/up_di branch from 1d86adb to bdc0f2f Compare December 20, 2025 16:22

SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025

cquil11 and others added 4 commits January 6, 2026 09:55

update perf changelog

da5d029

add scancel sync

2b750e9

mute the scancel sync

1e15e04

change timeout

2091238

cursor bot reviewed Jan 6, 2026

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh (outdated)

benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh

runners/launch_mi355x-amd.sh

billishyahao and others added 2 commits January 6, 2026 16:04

switch to sa-260107 of recipe

aeee56a

Merge branch 'main' into billhe/up_di

e5b36bf

cquil11 moved this to In Progress in InferenceMAX Board Jan 6, 2026

cursor bot reviewed Jan 6, 2026

fix bug where slurm cleanup was checking exit code rather than whether output exist

94181a5

cursor bot reviewed Jan 6, 2026

cquil11 added 2 commits January 7, 2026 11:06

update perf changelog pt 2

b4f3940

Merge branch 'main' into billhe/up_di

ccc0302

cquil11 approved these changes Jan 7, 2026

cquil11 added the sweep-enabled label Jan 7, 2026

cquil11 merged commit c4bbfb4 into SemiAnalysisAI:main Jan 7, 2026
16 of 36 checks passed

github-project-automation bot moved this from In Progress to Done in InferenceMAX Board Jan 7, 2026

cursor bot reviewed Jan 7, 2026

cquil11 mentioned this pull request Jan 7, 2026

fix: add appropriate random range ratio to Mi355X disagg #399

Closed

cquil11 added a commit that referenced this pull request Jan 7, 2026

Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)"

5dd6a9b

This reverts commit c4bbfb4.

cquil11 mentioned this pull request Jan 7, 2026

Revert "[AMD] Add mi355x distributed inference test CI workflow" #400

Merged

cquil11 added a commit that referenced this pull request Jan 7, 2026

Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)" (#400) [skip-sweep]

a075f2e

This reverts commit c4bbfb4.

cquil11 added a commit that referenced this pull request Jan 7, 2026

Revert "Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)" (#400) [skip-sweep]"

94e26f4

This reverts commit a075f2e.

cquil11 mentioned this pull request Jan 7, 2026

add random range ratio that is appropriate #402

Merged
