Skip to content

[AMD] Add mi355x distributed inference test CI workflow#348

Merged
cquil11 merged 49 commits intoSemiAnalysisAI:mainfrom
billishyahao:billhe/up_di
Jan 7, 2026
Merged

[AMD] Add mi355x distributed inference test CI workflow#348
cquil11 merged 49 commits intoSemiAnalysisAI:mainfrom
billishyahao:billhe/up_di

Conversation

@billishyahao
Copy link
Collaborator

@billishyahao billishyahao commented Dec 19, 2025

This patch is to add mi355x distributed inference test CI workflow and integrate recipes https://github.com/billishyahao/sglang_disagg into inferencemax github action

Co-authored-by: billishyahao [email protected]
Co-authored-by: ichbinblau [email protected]
Co-authored-by: Duyi-Wang [email protected]
Co-authored-by: inkcherry [email protected]


Note

Introduces multi-node, disaggregated SGLang benchmarking for MI355x FP8 DeepSeek-R1.

  • Adds dsr1-fp8-mi355x-sglang-disagg to amd-master.yaml with PD disaggregation (1P2D), both with and without speculative decoding, across isl/osl variants (1k/1k, 8k/1k, 1k/8k), and detailed prefill/decode worker, TP/EP, DP-attn settings
  • New benchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.sh that clones sglang_disagg, sets EP/DP flags, and submits jobs via submit_disagg.sh
  • Enhances runners/launch_mi355x-amd.sh with a multinode path: SLURM env setup, job submission for sglang-disagg, log tailing, result collection to workspace, and synchronized job cancellation; retains single-node behavior
  • Updates perf-changelog.yaml with the new MI355x disagg config and related notes

Written by Cursor Bugbot for commit b4f3940. This will update automatically on new commits. Configure here.

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@billishyahao
Copy link
Collaborator Author

Hi Semi Analysis team, I just fixed the merge conflict. Feel free to review it. Thanks!

@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@cquil11
Copy link
Collaborator

cquil11 commented Dec 21, 2025

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@github-actions
Copy link
Contributor

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415702216
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 946dab8
Approval: not required (trusted collaborator).

@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Dec 21, 2025
@cquil11
Copy link
Collaborator

cquil11 commented Dec 21, 2025

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@github-actions
Copy link
Contributor

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415873469
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 1b0fea2
Approval: not required (trusted collaborator).

@cquil11
Copy link
Collaborator

cquil11 commented Dec 21, 2025

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@github-actions
Copy link
Contributor

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417380873
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 443ca85
Approval: not required (trusted collaborator).

@cquil11
Copy link
Collaborator

cquil11 commented Dec 21, 2025

@billishyahao as you can by comments, I am validating right now
On behalf of all of us on the InferenceMAX team, I wanna first thank y'all for the hard work that went into getting this to work. This is important!

There are some things that need fixing. I have a fork going off of https://github.com/billishyahao/sglang_disagg
Will update this thread shortly with issues

@cquil11
Copy link
Collaborator

cquil11 commented Dec 22, 2025

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@github-actions
Copy link
Contributor

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417798440
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 443ca85
Approval: not required (trusted collaborator).

@billishyahao
Copy link
Collaborator Author

Got all validation passed https://github.com/InferenceMAX/InferenceMAX/actions/runs/20741958744 I will switch back to main branch of recipe and kick off a final round validation

@cquil11 cquil11 moved this to In Progress in InferenceMAX Board Jan 6, 2026
@billishyahao
Copy link
Collaborator Author

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

@billishyahao Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20754354192
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: e5b36bf
Approval: not required (trusted collaborator).

# Switch to origin repo url for supporting wide ep configs
git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git

cd "$SGL_SLURM_JOBS_PATH" || exit 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git clone silently fails if directory already exists

The comment states "Always clone and setup sglang_disagg" but the git clone command on line 16 will fail silently if the sglang_disagg directory already exists (from a previous run on a self-hosted runner). Since there's no set -e, the script continues and cd "$SGL_SLURM_JOBS_PATH" succeeds using the stale existing directory. The cleanup in runners/launch_mi355x-amd.sh only removes the logs subdirectory, not the entire clone. This could cause the CI to run outdated code from a previous run, leading to hard-to-debug issues where the intended changes aren't being tested.

Fix in Cursor Fix in Web

@cquil11
Copy link
Collaborator

cquil11 commented Jan 6, 2026

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

@cquil11 Kicking off a sweep.

Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20759988001
Command: test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml
Pinned ref: 94181a5
Approval: not required (trusted collaborator).

check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \
PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \
DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \
PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH # SGL_SLURM_JOBS_PATH FIXME
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validation for MTP size when speculative decoding enabled

Medium Severity

The SPEC_DECODING environment variable is validated via check_env_vars but never used in the script. Meanwhile, DECODE_MTP_SIZE (which controls MTP/speculative decoding behavior in the underlying submit_disagg.sh script) is not validated. Other scripts like dsr1_fp4_gb200_dynamo-trt_slurm.sh follow a pattern of conditionally checking DECODE_MTP_SIZE when SPEC_DECODING equals "mtp". This validation gap could allow misconfigured benchmark runs if a future YAML config sets spec-decoding: "mtp" without properly including DECODE_MTP_SIZE in additional-settings.

Fix in Cursor Fix in Web

# Switch to origin repo url for supporting wide ep configs
git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git

cd "$SGL_SLURM_JOBS_PATH" || exit 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing cleanup causes git clone failure with stale code

High Severity

The script runs git clone to create the sglang_disagg directory but neither the benchmark script nor the runner removes this directory before cloning. On persistent self-hosted runners, if the directory exists from a previous run, git clone fails silently (no set -e), but cd "$SGL_SLURM_JOBS_PATH" succeeds because the directory exists. The script then continues using stale code from the previous run, potentially causing incorrect benchmark results or failures that are difficult to debug. The runner only cleans up $SGL_SLURM_JOBS_PATH/logs, not the parent directory.

Fix in Cursor Fix in Web

bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh"

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multinode job ID retrieval may capture multiple jobs

Medium Severity

The multinode section retrieves JOB_ID using squeue -u $USER --noheader --format='%i' without head -n1, unlike all other runner scripts and the non-multinode section (line 150) which use head -n1 to ensure only one job ID is captured. If multiple jobs exist for the user, JOB_ID would contain multiple lines, causing LOG_FILE path construction to be invalid, grep pattern matching to behave unexpectedly, and scancel_sync to only process the first job ID.

Fix in Cursor Fix in Web

Copy link
Collaborator

@cquil11 cquil11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cquil11 cquil11 merged commit c4bbfb4 into SemiAnalysisAI:main Jan 7, 2026
16 of 36 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in InferenceMAX Board Jan 7, 2026

# Wait for job to complete
JOB_ID=$(squeue -u $USER --noheader --format='%i')
LOG_FILE="$SGL_SLURM_JOBS_PATH/slurm_job-${JOB_ID}.out"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple job IDs break log file path and job monitoring

High Severity

The squeue command at line 49 captures ALL job IDs for the user, not just the newly submitted job. If multiple jobs exist (from disaggregated workload or leftover jobs), JOB_ID will contain multiple newline-separated values. This breaks LOG_FILE path construction (embedded newlines), the grep checks at lines 57/69, and scancel_sync which expects a single ID. Compare with line 150 in the non-multinode path which correctly uses | head -n1 to get only the first job ID.

Additional Locations (1)

Fix in Cursor Fix in Web

cquil11 added a commit that referenced this pull request Jan 7, 2026
cquil11 added a commit that referenced this pull request Jan 7, 2026
cquil11 added a commit that referenced this pull request Jan 7, 2026
cquil11 added a commit that referenced this pull request Jan 8, 2026
* Revert "Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)" (#400) [skip-sweep]"

This reverts commit a075f2e.

* add random range ratio that is appropriate (#402)

* comment out 1k8k 8k1k

* change recipe back to upstream

* revert comment out 1k8k 8k1k

* Add empty line at the beginning of the script

---------

Co-authored-by: billishyahao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

4 participants