{numlib}[foss/2020a,intel/2020a] SuperLU_DIST v6.4.0#11693

Merged
boegel merged 7 commits into easybuilders:develop from Darkless012:20201112120951_new_pr_SuperLU_DIST640
Dec 16, 2020

@Darkless012
Contributor

(created using eb --new-pr)

@Micket added the new label on Nov 12, 2020
@Micket added this to the 4.3.2 milestone on Nov 12, 2020
@Micket
Contributor

Micket commented Nov 12, 2020

Test report by @Micket
FAILED
Build succeeded for 3 out of 4 (2 easyconfigs in total)
vera-c1 - Linux centos linux 7.8.2003, x86_64, Intel Xeon Processor (Skylake), Python 2.7.5
See https://gist.github.com/084d97ddf1c94dd1e1a9dade1392fff2 for a full test report.

@Darkless012
Contributor Author

I also got those errors at random; I'm not sure what causes them.
(extracted from the report above)

      Start  1: pdtest_1x1_1_2_8_20_SP
 1/24 Test  #1: pdtest_1x1_1_2_8_20_SP ...........***Failed    1.69 sec
Time to read and distribute matrix 0.00
[vera-c1:3389 :0:3418] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4d)
==== backtrace (tid:   3418) ====
 0 0x00000000000214ae ucs_debug_print_backtrace()  /local/EB/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000034a552 mkl_blas_avx512_xdgemv()  ???:0
 2 0x000000000022bb4e mkl_blas_xdgemv()  ???:0
 3 0x0000000000250b6a mkl_blas_dgemv()  ???:0
 4 0x00000000002fba86 mkl_blas_dgemm()  ???:0
 5 0x000000000019251f DGEMM()  ???:0
 6 0x0000000000476fea dlsum_bmod_inv()  ???:0
 7 0x0000000000476ee1 dlsum_bmod_inv()  ???:0
 8 0x00000000000e12c2 _INTERNALfadf56ac::__kmp_invoke_task()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:1782
 9 0x00000000000ecdcc _INTERNALfadf56ac::__kmp_execute_tasks_template<kmp_flag_64<false, true> >()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3185
10 0x00000000000ecdcc __kmp_execute_tasks_64<false, true>()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3284
11 0x000000000006bb01 kmp_flag_64<false, true>::execute_tasks()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:964
12 0x000000000006bb01 _INTERNAL51694e09::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true>()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:374
13 0x000000000006e23d kmp_flag_64<false, true>::wait()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:971
14 0x000000000007526b __kmp_fork_barrier()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_barrier.cpp:2369
15 0x00000000000b1170 __kmp_launch_thread()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:6080
16 0x000000000012d19c _INTERNAL27dd4e00::__kmp_launch_worker()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:593
17 0x0000000000007ea5 start_thread()  pthread_create.c:0
18 0x00000000000fe8dd __clone()  ???:0
=================================

@boegel
Member

boegel commented Nov 13, 2020

Could be a bug in Intel MKL triggered by the SuperLU tests...

# Some tests run longer than the default 1500s timeout, even on a fairly big machine (36 cores).
# Increasing the timeout to 3000s resolves the timeout error.
# Beware that the tests run for ~2 hours.
pretestopts = 'export ARGS="$ARGS --timeout 3000" && '
Member

I'd say this is too much to run during the installation...

Is there any way to run only a subset of the tests?
If not, we should leave this commented out. Imagine doing the installation on just 3 cores...

Contributor Author

Would a patch file that modifies the CTest configuration be an option?

Member

Maybe we can use the --tests-regex option to pick a couple of tests, or use --exclude-regex to exclude the very long test(s)?
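
A minimal sketch of how that suggestion could look in the easyconfig, assuming the tests are run through `make test` so that CTest picks up extra options via `ARGS` (as the `pretestopts` line under review already does). The regular expression is hypothetical: it is modeled on the single test name visible in the report above (`pdtest_1x1_1_2_8_20_SP`) and would need to be checked against the actual test list, for example with `ctest -N` in the build directory.

```python
# Hypothetical easyconfig fragment (easyconfigs use Python syntax).
# Assumption: only tests whose name starts with 'pdtest_1x1_' are selected;
# the pattern is a guess based on 'pdtest_1x1_1_2_8_20_SP' from the test report.
local_test_regex = 'pdtest_1x1_'

# --tests-regex (short form: -R) makes CTest run only tests whose name matches.
# --exclude-regex (short form: -E) could be used instead to skip the slow tests.
pretestopts = 'export ARGS="$ARGS --tests-regex %s" && ' % local_test_regex
```

Selecting a small set by name keeps the installation time bounded regardless of core count, whereas excluding only the slowest tests would still run most of the suite.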

@boegel changed the title from "{numlib}[foss/2020a] SuperLU_DIST v6.4.0" to "{numlib}[foss/2020a,intel/2002a] SuperLU_DIST v5.3.0 + v6.4.0" on Nov 13, 2020
Comment thread easybuild/easyconfigs/s/SuperLU_DIST/SuperLU_DIST-5.3.0-intel-2020a.eb Outdated
@Darkless012 changed the title from "{numlib}[foss/2020a,intel/2002a] SuperLU_DIST v5.3.0 + v6.4.0" to "{numlib}[foss/2020a] SuperLU_DIST v6.4.0" on Nov 25, 2020
@Darkless012
Contributor Author

This PR now contains only the foss/2020a version, which passed the tests. Please retest (it should be possible to merge).
The intel/2020a version will be split off into another PR.

@boegel
Member

boegel commented Nov 26, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11693 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11693 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9796

Test results coming soon (I hope)...

Details

- notification for comment with ID 734201235 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
generoso-c1-s-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/a94e93199acb01dc2396a1cf8f0a9d0c for a full test report.

@Darkless012 changed the title from "{numlib}[foss/2020a] SuperLU_DIST v6.4.0" to "{numlib}[foss/2020a,intel/2020a] SuperLU_DIST v6.4.0" on Nov 27, 2020
@Darkless012
Contributor Author

Please rerun the tests. Only a minimal set of tests is included now.

@boegel modified the milestones: 4.3.2 (next release), 4.4.0 on Dec 8, 2020
@boegel
Member

boegel commented Dec 16, 2020

@boegelbot please test @ generoso

@boegel
Member

boegel commented Dec 16, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3501.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/9f0cce6389452000769bb74e8cdcb5fb for a full test report.

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11693 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11693 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12301

Test results coming soon (I hope)...

Details

- notification for comment with ID 745823841 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Dec 16, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node2676.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/70123989f5ab8c43b76b5b379b51aafc for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
generoso-c1-s-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/4cc4802850a917bcfe2dc00df88f2cd6 for a full test report.

@boegel
Member

boegel commented Dec 16, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3100.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/699d818b3741129094afedfcf5bbc730 for a full test report.

@boegel
Member

boegel commented Dec 16, 2020

Going in, thanks @Darkless012!

@boegel merged commit c1831cc into easybuilders:develop on Dec 16, 2020

4 participants