[AMD] Add mi355x distributed inference test CI workflow#348
[AMD] Add mi355x distributed inference test CI workflow#348cquil11 merged 49 commits intoSemiAnalysisAI:mainfrom
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
1d86adb to
bdc0f2f
Compare
|
Hi Semi Analysis team, I just fixed the merge conflict. Feel free to review it. Thanks! |
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415702216 |
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20415873469 |
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417380873 |
|
@billishyahao as you can by comments, I am validating right now There are some things that need fixing. I have a fork going off of https://github.com/billishyahao/sglang_disagg |
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20417798440 |
|
Got all validation passed https://github.com/InferenceMAX/InferenceMAX/actions/runs/20741958744 I will switch back to main branch of recipe and kick off a final round validation |
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@billishyahao Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20754354192 |
| # Switch to origin repo url for supporting wide ep configs | ||
| git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git | ||
|
|
||
| cd "$SGL_SLURM_JOBS_PATH" || exit 1 |
There was a problem hiding this comment.
Git clone silently fails if directory already exists
The comment states "Always clone and setup sglang_disagg" but the git clone command on line 16 will fail silently if the sglang_disagg directory already exists (from a previous run on a self-hosted runner). Since there's no set -e, the script continues and cd "$SGL_SLURM_JOBS_PATH" succeeds using the stale existing directory. The cleanup in runners/launch_mi355x-amd.sh only removes the logs subdirectory, not the entire clone. This could cause the CI to run outdated code from a previous run, leading to hard-to-debug issues where the intended changes aren't being tested.
|
/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-disagg --runner-config .github/configs/runners.yaml --config-files .github/configs/amd-master.yaml |
|
@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20759988001 |
| check_env_vars CONC_LIST ISL OSL IMAGE SPEC_DECODING MODEL_PATH \ | ||
| PREFILL_NUM_WORKERS PREFILL_TP PREFILL_EP PREFILL_DP_ATTN \ | ||
| DECODE_NUM_WORKERS DECODE_TP DECODE_EP DECODE_DP_ATTN \ | ||
| PREFILL_NODES DECODE_NODES SGL_SLURM_JOBS_PATH # SGL_SLURM_JOBS_PATH FIXME |
There was a problem hiding this comment.
Missing validation for MTP size when speculative decoding enabled
Medium Severity
The SPEC_DECODING environment variable is validated via check_env_vars but never used in the script. Meanwhile, DECODE_MTP_SIZE (which controls MTP/speculative decoding behavior in the underlying submit_disagg.sh script) is not validated. Other scripts like dsr1_fp4_gb200_dynamo-trt_slurm.sh follow a pattern of conditionally checking DECODE_MTP_SIZE when SPEC_DECODING equals "mtp". This validation gap could allow misconfigured benchmark runs if a future YAML config sets spec-decoding: "mtp" without properly including DECODE_MTP_SIZE in additional-settings.
| # Switch to origin repo url for supporting wide ep configs | ||
| git clone --branch sa-260107 https://github.com/billishyahao/sglang_disagg.git | ||
|
|
||
| cd "$SGL_SLURM_JOBS_PATH" || exit 1 |
There was a problem hiding this comment.
Missing cleanup causes git clone failure with stale code
High Severity
The script runs git clone to create the sglang_disagg directory but neither the benchmark script nor the runner removes this directory before cloning. On persistent self-hosted runners, if the directory exists from a previous run, git clone fails silently (no set -e), but cd "$SGL_SLURM_JOBS_PATH" succeeds because the directory exists. The script then continues using stale code from the previous run, potentially causing incorrect benchmark results or failures that are difficult to debug. The runner only cleans up $SGL_SLURM_JOBS_PATH/logs, not the parent directory.
| bash benchmarks/"${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}_slurm.sh" | ||
|
|
||
| # Wait for job to complete | ||
| JOB_ID=$(squeue -u $USER --noheader --format='%i') |
There was a problem hiding this comment.
Multinode job ID retrieval may capture multiple jobs
Medium Severity
The multinode section retrieves JOB_ID using squeue -u $USER --noheader --format='%i' without head -n1, unlike all other runner scripts and the non-multinode section (line 150) which use head -n1 to ensure only one job ID is captured. If multiple jobs exist for the user, JOB_ID would contain multiple lines, causing LOG_FILE path construction to be invalid, grep pattern matching to behave unexpectedly, and scancel_sync to only process the first job ID.
|
|
||
| # Wait for job to complete | ||
| JOB_ID=$(squeue -u $USER --noheader --format='%i') | ||
| LOG_FILE="$SGL_SLURM_JOBS_PATH/slurm_job-${JOB_ID}.out" |
There was a problem hiding this comment.
Multiple job IDs break log file path and job monitoring
High Severity
The squeue command at line 49 captures ALL job IDs for the user, not just the newly submitted job. If multiple jobs exist (from disaggregated workload or leftover jobs), JOB_ID will contain multiple newline-separated values. This breaks LOG_FILE path construction (embedded newlines), the grep checks at lines 57/69, and scancel_sync which expects a single ID. Compare with line 150 in the non-multinode path which correctly uses | head -n1 to get only the first job ID.
Additional Locations (1)
* Revert "Revert "[AMD] Add mi355x distributed inference test CI workflow (#348)" (#400) [skip-sweep]" This reverts commit a075f2e. * add random range ratio that is appropriate (#402) * comment out 1k8k 8k1k * change recipe back to upstream * revert comment out 1k8k 8k1k * Add empty line at the beginning of the script --------- Co-authored-by: billishyahao <[email protected]>
This patch is to add mi355x distributed inference test CI workflow and integrate recipes https://github.com/billishyahao/sglang_disagg into inferencemax github action
Note
Introduces multi-node, disaggregated SGLang benchmarking for MI355x FP8 DeepSeek-R1.
dsr1-fp8-mi355x-sglang-disaggtoamd-master.yamlwith PD disaggregation (1P2D), both with and without speculative decoding, acrossisl/oslvariants (1k/1k, 8k/1k, 1k/8k), and detailed prefill/decode worker, TP/EP, DP-attn settingsbenchmarks/dsr1_fp8_mi355x_sglang-disagg_slurm.shthat clonessglang_disagg, sets EP/DP flags, and submits jobs viasubmit_disagg.shrunners/launch_mi355x-amd.shwith a multinode path: SLURM env setup, job submission forsglang-disagg, log tailing, result collection to workspace, and synchronized job cancellation; retains single-node behaviorperf-changelog.yamlwith the new MI355x disagg config and related notesWritten by Cursor Bugbot for commit b4f3940. This will update automatically on new commits. Configure here.