avoid test failures for netCDF with iimpi toolchain, by setting $I_MPI_HYDRA_BOOTSTRAP to ssh #24735

Merged
boegel merged 1 commit into easybuilders:develop from
Flamefire:netcdf-test
Apr 8, 2026

Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Dec 2, 2025

Copied from cb01b35

This happens when running EasyBuild inside a SLURM job: the job environment causes I_MPI_HYDRA_BOOTSTRAP and HYDRA_BOOTSTRAP to be set, and mpirun additionally auto-detects SLURM when SLURM_JOBID is set.
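
The idea in the PR title can be sketched in Python, the language EasyBuild is written in. This is only an illustration under stated assumptions, not the actual change in this PR: the helper name `env_with_ssh_bootstrap` is hypothetical, and the `slurm` value and job ID in the simulated job environment are made up. Only the variable names and the target value `ssh` come from this page.

```python
import os
import subprocess
import sys


def env_with_ssh_bootstrap(base_env=None):
    """Return a copy of the environment with the Hydra bootstrap forced to ssh.

    Hypothetical helper for illustration; not part of EasyBuild.
    """
    env = dict(os.environ if base_env is None else base_env)
    # From the PR title: set $I_MPI_HYDRA_BOOTSTRAP to ssh.
    env["I_MPI_HYDRA_BOOTSTRAP"] = "ssh"
    return env


if __name__ == "__main__":
    # Simulated SLURM job environment (assumed values, for demonstration only).
    job_env = dict(os.environ)
    job_env.update({"SLURM_JOBID": "12345", "I_MPI_HYDRA_BOOTSTRAP": "slurm"})

    # The child process stands in for a test command started by the build.
    child = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('I_MPI_HYDRA_BOOTSTRAP'))"],
        env=env_with_ssh_bootstrap(job_env),
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    print(child.stdout.strip())
```

The override is applied to a copy of the environment, so only the launched command sees it and the surrounding job environment stays untouched.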

@github-actions github-actions Bot added the 2022a, 2022b, 2023a, 2023b, 2025a (issues & PRs related to 2025a common toolchains), 2025b (issues & PRs related to 2025b common toolchains), and change labels Dec 2, 2025
@jfgrimm
Member

jfgrimm commented Dec 3, 2025

Test report by @jfgrimm
FAILED
Build succeeded for 5 out of 9 (total: 1 hour 23 mins 49 secs) (6 easyconfigs in total)
node057.viking2.yor.alces.network - Linux Rocky Linux 8.10, x86_64, AMD EPYC 7643 48-Core Processor, Python 3.6.8
See https://gist.github.com/jfgrimm/73fb5a47f33671d5f4edcdd2c5730049 for a full test report.

edit: only one failed (locks):

SUCCESS netCDF-4.9.0-iimpi-2022a.eb
SUCCESS netCDF-4.9.0-iimpi-2022b.eb
SUCCESS netCDF-4.9.2-iimpi-2023a.eb
SUCCESS netCDF-4.9.3-iimpi-2025a.eb
SUCCESS netCDF-4.9.3-iimpi-2025b.eb
[...]
FAIL (build issue) netCDF-4.9.2-iimpi-2023b.eb

@Flamefire
Contributor Author

@jfgrimm Existing locks

@jfgrimm
Member

jfgrimm commented Dec 3, 2025

Test report by @jfgrimm
SUCCESS
Build succeeded for 3 out of 3 (total: 11 mins 58 secs) (1 easyconfigs in total)
node057.viking2.yor.alces.network - Linux Rocky Linux 8.10, x86_64, AMD EPYC 7643 48-Core Processor, Python 3.6.8
See https://gist.github.com/jfgrimm/b521c235845cff85cf17f3f204335231 for a full test report.

@jfgrimm jfgrimm added this to the next release (5.2.0?) milestone Dec 3, 2025
@jfgrimm
Member

jfgrimm commented Dec 3, 2025

@boegelbot: please test @ jsc-zen3
CORE_CNT=16
EB_ARGS="--installpath=/tmp/pr24735"

@boegelbot
Collaborator

@jfgrimm: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24735 EB_ARGS="--installpath=/tmp/pr24735" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24735 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8998

Test results coming soon (I hope)...

Details

- notification for comment with ID 3607809364 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 5 out of 6 (total: 1 hour 35 mins 46 secs) (6 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
See https://gist.github.com/boegelbot/31d99808b646a12f5e95f1f78d6dc9ab for a full test report.

@Flamefire
Contributor Author

@jfgrimm One test failure which looks weird:

malloc(): unaligned fastbin chunk detected

Could be a temporary OS issue

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 21 out of 28 (total: 3 hours 28 mins 5 secs) (6 easyconfigs in total)
c9 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/17c2f71f1e8c80a4147f4031c1ae0dbe for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 6 (total: 7 hours 35 mins 55 secs) (6 easyconfigs in total)
i7012 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.9.21
See https://gist.github.com/Flamefire/99d58de291a6f14b3e0c0a2c74e40b5a for a full test report.

@Flamefire
Contributor Author

Flamefire commented Dec 11, 2025

I'm still running into timeouts with:

161 - nc_test4_run_par_test (Timeout)
188 - h5_test_run_par_tests (Timeout)

This seems to be an issue with HDF5; see #15959 (comment).

It also only happens on our ROME cluster. I'm running out of ideas.
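
When triaging logs like this one, the timed-out tests can be pulled out of the summary lines with a few lines of Python. This is a rough sketch that assumes each failing test is reported as `<number> - <name> (<reason>)`, as in the two lines quoted in this comment; the function name `timed_out_tests` is hypothetical.

```python
import re

# Matches lines such as "161 - nc_test4_run_par_test (Timeout)".
FAILED_TEST = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+?)\)\s*$")


def timed_out_tests(log_text):
    """Return (number, name) pairs for tests whose failure reason is Timeout."""
    result = []
    for line in log_text.splitlines():
        match = FAILED_TEST.match(line)
        if match and match.group(3) == "Timeout":
            result.append((int(match.group(1)), match.group(2)))
    return result


if __name__ == "__main__":
    sample = (
        "161 - nc_test4_run_par_test (Timeout)\n"
        "188 - h5_test_run_par_tests (Timeout)\n"
    )
    for number, name in timed_out_tests(sample):
        print(number, name)
```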

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 8 out of 8 (total: 1 hour 5 mins 52 secs) (6 easyconfigs in total)
n1643.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/7cd6f5cfe5014940974f816506d16555 for a full test report.

@Flamefire
Contributor Author

I can reproduce the hang with this simple MPI program:

#include <stdio.h>
#include <mpi.h>

/* Log to stderr with the rank as prefix (rank is -1 until MPI_Comm_rank has run). */
#define PRINT(s) fprintf(stderr, "[%d] %s", rank, s)

int main(int argc, char **argv) {
  int rank = -1, res;
  MPI_File fh;

  PRINT("Init...\n");
  MPI_Init(&argc, &argv);
  PRINT("Rank...\n");
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Collective open: all ranks in MPI_COMM_WORLD open the same file. */
  PRINT("Create...\n");
  res = MPI_File_open(MPI_COMM_WORLD, "test_file.h5", MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
  if(res != MPI_SUCCESS) { PRINT("ERROR\n"); return 1; }

  /* Collective close of the file handle. */
  PRINT("Closing...\n");
  res = MPI_File_close(&fh);
  if(res != MPI_SUCCESS) { PRINT("ERROR\n"); return 1; }

  PRINT("SUCCESS\n");

  MPI_Finalize();
  return 0;
}

But it only hangs with impi/2022a and impi/2022b.
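
A hang like this is easier to script around when the reproducer is run under a timeout. The sketch below is only an illustration in Python: the function name `run_with_timeout`, the time limits, and the commented-out `mpirun -n 2 ./mpi_file_test` command line are assumptions, not taken from this PR. Stand-in Python commands are used so that it runs without MPI installed.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'ok', 'failed', or 'hang' if it exceeds timeout_s."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before re-raising.
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in commands: one finishes immediately, one sleeps past the limit.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(quick, 10))
    print(run_with_timeout(stuck, 1))
    # Assumed usage for a compiled reproducer (binary name is hypothetical):
    # run_with_timeout(["mpirun", "-n", "2", "./mpi_file_test"], 60)
```

Note that only the direct child is killed on timeout; with a real MPI launcher, the ranks it started may need separate cleanup.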

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (total: 2 hours 41 mins 34 secs) (6 easyconfigs in total)
i7031 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.9.21
See https://gist.github.com/Flamefire/a223c138b512869a901307e653fd0559 for a full test report.

@Flamefire
Contributor Author

Even with a generic "Tuning" file set, another test times out, although most tests now succeed. It hangs in nc_put_vara_float after calling another MPI-IO collective operation.
I guess Intel MPI really shouldn't be used on AMD CPUs.

IMO this PR can be merged, as it improves things and has no downsides that I can see.

@boegel
Member

boegel commented Dec 21, 2025

@boegelbot: please test @ jsc-zen3
CORE_CNT=16
EB_ARGS="--installpath=/tmp/pr24735"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24735 EB_ARGS="--installpath=/tmp/pr24735" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24735 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9267

Test results coming soon (I hope)...

Details

- notification for comment with ID 3678577760 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 6 out of 6 (total: 1 hour 46 mins 59 secs) (6 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.23
See https://gist.github.com/boegelbot/dda760fcdb31f46b45bacf3c90ab78f4 for a full test report.

@boegel boegel changed the title from "Avoid test failures in Intel netCDF" to "avoid test failures for netCDF with iimpi toolchain, by setting $I_MPI_HYDRA_BOOTSTRAP to ssh" Dec 29, 2025
@boegel boegel added bug fix and removed change labels Dec 29, 2025
@akesandgren
Contributor

@Flamefire conflict resolution needed

@Flamefire
Contributor Author

Thanks, rebased

Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Apr 8, 2026

@boegelbot: please test @ jsc-zen3
CORE_CNT=16
EB_ARGS="--installpath=/tmp/pr24735"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24735 EB_ARGS="--installpath=/tmp/pr24735" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24735 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10184

Test results coming soon (I hope)...

Details

- notification for comment with ID 4208701324 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 6 out of 6 (total: 1 hour 45 mins 11 secs) (6 easyconfigs in total)
jsczen3c3.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.25
See https://gist.github.com/boegelbot/ed8742b657d2075de4a5214fca6e056b for a full test report.

@boegel
Member

boegel commented Apr 8, 2026

Going in, thanks @Flamefire!

@boegel boegel merged commit 3eaa68b into easybuilders:develop Apr 8, 2026
6 checks passed
@Flamefire Flamefire deleted the netcdf-test branch April 9, 2026 06:31

Labels

2022a, 2022b, 2023a, 2023b, 2025a (issues & PRs related to 2025a common toolchains), 2025b (issues & PRs related to 2025b common toolchains), bug fix, change

5 participants