add patch file required for correct CUDA-aware OpenMPI v1.7.3 build #631

Merged
boegel merged 2 commits into easybuilders:develop from boegel:OpenMPI-CUDA-fix
Jun 3, 2015

boegel commented Dec 22, 2013

required for building GROMACS with goolfc/2.6.10 on a GPU system

It all looks good; I really appreciate having the above references embedded here.

boegel commented Dec 22, 2013

Although http://permalink.gmane.org/gmane.comp.clustering.open-mpi.user/20404 suggests otherwise, this patch combined with a rebuild of the whole stack from scratch does not resolve the problem.
The GROMACS tests keep failing with errors like:

Abnormal return value for 'mpirun -np 4 -wdir /home-2/khoste/.local/easybuild/build/GROMACS/4.6.5/goolfc-2.6.10-hybrid.3/regressiontests-4.6.5/simple/angles1 mdrun_mpi    -notunepme -table ../table -tablep ../tablep >mdrun.out 2>&1' was 127
No mdrun output files.
FAILED. Check mdrun.out, md.log files in angles1
$ cat /home-2/khoste/.local/easybuild/build/GROMACS/4.6.5/goolfc-2.6.10-hybrid.3/regressiontests-4.6.5/simple/angles1/mdrun.out 
mdrun_mpi: symbol lookup error: /home-2/khoste/.local/easybuild/software/OpenMPI/1.7.3-gcccuda-2.6.10/lib/openmpi/mca_pml_ob1.so: undefined symbol: progress_one_cuda_htod_event
mdrun_mpi: symbol lookup error: /home-2/khoste/.local/easybuild/software/OpenMPI/1.7.3-gcccuda-2.6.10/lib/openmpi/mca_pml_ob1.so: undefined symbol: progress_one_cuda_htod_event
mdrun_mpi: symbol lookup error: /home-2/khoste/.local/easybuild/software/OpenMPI/1.7.3-gcccuda-2.6.10/lib/openmpi/mca_pml_ob1.so: undefined symbol: progress_one_cuda_htod_event
mdrun_mpi: symbol lookup error: /home-2/khoste/.local/easybuild/software/OpenMPI/1.7.3-gcccuda-2.6.10/lib/openmpi/mca_pml_ob1.so: undefined symbol: progress_one_cuda_htod_event
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 23929 on
node sb001 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--------------------------------------------------------------------------
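When a log like the one above is long, it can help to reduce it to the distinct missing symbols and the libraries that need them. The following is a minimal sketch only, assuming the messages follow the `<program>: symbol lookup error: <library>: undefined symbol: <symbol>` shape seen in mdrun.out; the function name `summarize_lookup_errors` is hypothetical and not part of EasyBuild, GROMACS, or OpenMPI.

```python
import re
from collections import Counter

# Assumed message shape, taken from the mdrun.out lines quoted above:
#   <program>: symbol lookup error: <library>: undefined symbol: <symbol>
LOOKUP_ERROR = re.compile(
    r"^(?P<program>\S+): symbol lookup error: (?P<library>\S+): "
    r"undefined symbol: (?P<symbol>\S+)\s*$"
)


def summarize_lookup_errors(text):
    """Count how often each (library, symbol) pair is reported as undefined."""
    counts = Counter()
    for line in text.splitlines():
        match = LOOKUP_ERROR.match(line)
        if match:
            counts[(match.group("library"), match.group("symbol"))] += 1
    return counts
```

Calling `summarize_lookup_errors(open("mdrun.out").read())` on the output above should yield a single entry for `mca_pml_ob1.so` and `progress_one_cuda_htod_event`, counted once per MPI rank that printed the error.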

boegel commented Dec 22, 2013

@ajdecon: You mentioned you have a contact in NVIDIA who might know more about this. Can you check with him what the deal is, and report back in here?

boegel commented Jan 13, 2014

@ajdecon: Any update on this?

ajdecon commented Jan 13, 2014

Emailed the relevant person on the NV side but they just got back from
vacation this week. Will send them a reminder.

boegel modified the milestones: v1.11, v1.10 Feb 15, 2014
boegel modified the milestones: v1.12, v1.11 Mar 14, 2014

Test report by @fgeorgatos
SUCCESS
Build succeeded for 1 out of 1
Linux debian 6.0.10, Intel(R) Xeon(R) CPU L5640 @ 2.27GHz, Python 2.6.6
See https://gist.github.com/b39834cda3f7c5e4214b for a full test report.

boegel commented Jun 3, 2015

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
Linux SL 6.6, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, Python 2.6.6
See https://gist.github.com/6865a6384ff9ab7e2e6b for a full test report.

@hpcugentbot

Refer to this link for build results (access rights to CI server needed):
https://jenkins1.ugent.be/job/easybuild-easyconfigs-pr-builder/3309/
Easyconfigs unit test suite PASSed (see https://jenkins1.ugent.be/job/easybuild-easyconfigs-pr-builder/3309/console for more details).

This pull request is now ready for review/testing.

Please try and find someone who can tackle this; contact @boegel if you're not sure what to do.

boegel commented Jun 3, 2015

Going in (finally!)

Thanks for the feedback everyone!

boegel added a commit that referenced this pull request Jun 3, 2015
add patch file required for correct CUDA-aware OpenMPI v1.7.3 build
boegel merged commit 87fcdd4 into easybuilders:develop Jun 3, 2015
boegel deleted the OpenMPI-CUDA-fix branch June 3, 2015 13:41