DAOS-18236 test: overhaul daos_racer usage #17980

Open
daltonbohning wants to merge 1 commit into master from dbohning/daos-18236-overhaul

Conversation

@daltonbohning
Contributor

@daltonbohning daltonbohning commented Apr 10, 2026

  • Rename clush_timeout to daos_racer_timeout for clarity
  • Remove debug logging from daos_racer/parallel.py
  • Use ppn in daos_racer/parallel.py
  • Remove hardcoded envs and openmpi load from daos_racer_utils.py
    because this is done by the job manager
  • Adjust Orterun.assign_processes to accept ppn

Test-tag: daos_racer
Skip-unit-tests: true
Skip-fault-injection-test: true
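
As a rough illustration of the ppn change listed above, the sketch below shows one way a job manager could accept either a total process count or a processes-per-node value. The `JobLayout` class and this standalone `assign_processes` function are hypothetical stand-ins with an assumed signature; they are not the actual `Orterun.assign_processes` code from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class JobLayout:
    """Hypothetical container for a resolved process layout."""
    hosts: Tuple[str, ...]
    total_processes: int
    ppn: Optional[int]


def assign_processes(hosts: Sequence[str],
                     processes: Optional[int] = None,
                     ppn: Optional[int] = None) -> JobLayout:
    """Resolve a layout from a total count or a per-node count.

    Assumption: the name mirrors the method mentioned in the PR, but the
    signature and behavior here are illustrative only.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if processes is not None and ppn is not None:
        raise ValueError("specify either processes or ppn, not both")
    if ppn is not None:
        if ppn < 1:
            raise ValueError("ppn must be a positive integer")
        # Per-node mode: the total is derived from the host count.
        return JobLayout(tuple(hosts), ppn * len(hosts), ppn)
    # Total mode: default to one process per host when nothing is given.
    total = processes if processes is not None else len(hosts)
    return JobLayout(tuple(hosts), total, None)
```

Deriving the total from ppn keeps the test configuration independent of how many client hosts a given cluster provides, which is presumably the appeal of a ppn-based setting.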

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@daltonbohning daltonbohning self-assigned this Apr 10, 2026
@github-actions

github-actions Bot commented Apr 10, 2026

Ticket title is 'daos_racer/parallel.py:DaosRacerParallelTest.test_daos_racer_parallel - Failed to initialize step=2, rc=-1025'
Status is 'In Progress'
Labels: '2.6.4-aurora.p1,2.8.0tb1,ci_master_provider,ci_master_weekly,daily_test,testp1,weekly_test'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18236

@github-actions github-actions Bot added the priority Ticket has high priority (automatically managed) label Apr 10, 2026
@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch 2 times, most recently from 617a0e6 to c467c98 Compare April 13, 2026 16:44
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/3/execution/node/735/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from c467c98 to dd01ab8 Compare April 14, 2026 15:40
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/4/execution/node/735/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from dd01ab8 to 5ff7666 Compare April 15, 2026 20:39
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/5/execution/node/735/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch 3 times, most recently from ce16e52 to e92f061 Compare April 20, 2026 23:20
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/8/execution/node/721/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from e92f061 to 1d185e1 Compare April 21, 2026 15:02
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/9/execution/node/680/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from 1d185e1 to 6f20294 Compare April 21, 2026 15:44
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/10/execution/node/681/log

@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from 6f20294 to 76d421b Compare April 21, 2026 16:40
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/11/execution/node/681/log

- Rename clush_timeout to daos_racer_timeout for clarity
- Remove debug logging from daos_racer/parallel.py
- Use ppn in daos_racer/parallel.py
- Remove hardcoded envs and openmpi load from daos_racer_utils.py
  because this is done by the job manager
- Adjust Orterun.assign_processes to accept ppn

Test-tag: daos_racer OSAOnlineExtend OSAOnlineParallelTest OSAOnlineReintegration SoakSmoke test_daos_management
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <[email protected]>
@daltonbohning daltonbohning force-pushed the dbohning/daos-18236-overhaul branch from 76d421b to b6fa1ab Compare April 22, 2026 13:19
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17980/12/execution/node/764/log

@daltonbohning daltonbohning marked this pull request as ready for review April 22, 2026 20:02
@daltonbohning daltonbohning requested review from a team as code owners April 22, 2026 20:02
@daltonbohning
Contributor Author

The log shows that daos_racer is now actually running under MPI, which it was not before. Checking the command output for "No MPI found" helps catch cases where MPI is not detected.
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17980/12/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/daos_racer/parallel.py/job.log

2026-04-22 13:52:08,632 process          L0416 DEBUG| [stdout] [1,2]<stdout>:Using compatible version
2026-04-22 13:52:08,632 process          L0416 DEBUG| [stdout] [1,3]<stdout>:Using compatible version
2026-04-22 13:52:08,632 process          L0416 DEBUG| [stdout] [1,1]<stdout>:Using compatible version
2026-04-22 13:52:08,632 process          L0416 DEBUG| [stdout] [1,0]<stdout>:Using compatible version
2026-04-22 13:52:08,701 process          L0416 DEBUG| [stdout] [1,0]<stdout>:racer start with 4 threads duration 600 secs
2026-04-22 13:52:08,701 process          L0416 DEBUG| [stdout] [1,0]<stdout>:	pool size     : SCM: 2048 MB, NVMe: 8192 MB
...
Checking the command output for any bad keywords: (<stderr>|No MPI found)
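
The keyword check in the last log line can be sketched as a regular-expression search over the captured command output. This is a minimal sketch: `find_bad_keywords` is a hypothetical name, and the pattern is copied from the log line above rather than from the test code itself.

```python
import re

# Pattern taken verbatim from the log line; everything else is illustrative.
BAD_KEYWORDS = re.compile(r"(<stderr>|No MPI found)")


def find_bad_keywords(output: str) -> list:
    """Return every line of the output that contains a bad keyword."""
    return [line for line in output.splitlines() if BAD_KEYWORDS.search(line)]


# Lines tagged <stdout> do not match, so a healthy MPI run yields no hits.
good = "[1,0]<stdout>:racer start with 4 threads duration 600 secs"
bad = "No MPI found"
print(find_bad_keywords(good + "\n" + bad))
```

Returning the offending lines, rather than a bare boolean, makes it easier to report which keyword triggered a failure.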


Labels

priority Ticket has high priority (automatically managed)

4 participants