fix tests on POWER9 for BLIS 0.9.0 + fix auto-detect for POWER10 for BLIS (AMD) v2.0 + v3.0#15889

Merged
boegel merged 2 commits into easybuilders:develop from Flamefire:20220721160647_new_pr_BLIS090
Aug 7, 2022

Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Jul 21, 2022

(created using eb --new-pr)

  • The tests segfault on POWER9; see flame/blis#621 ("Tests fail on POWER machines").
  • BLIS 2.0 - 3.0 don't support POWER10, so the patch fails during the build because the symbol BLIS_ARCH_POWER10 is missing. 3.0.1 has POWER10 configs.

Followup to #15826
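
The version boundary described above can be written as a small check. This is only an illustrative sketch and not the change made in this PR: the helper name `has_power10_config` is hypothetical, and the 3.0.1 cutoff is taken from the description above.

```python
def has_power10_config(version):
    """Return True if this BLIS (AMD) version is expected to ship POWER10 configs.

    Assumption (from the PR description): 3.0.1 is the first version that
    has POWER10 configs; earlier versions lack BLIS_ARCH_POWER10.
    """
    # Compare numeric components as a tuple, e.g. "3.0.1" -> (3, 0, 1).
    parts = tuple(int(p) for p in version.split("."))
    return parts >= (3, 0, 1)


for v in ("2.0", "2.2", "3.0", "3.0.1"):
    print(v, has_power10_config(v))
```

A check like this could decide whether a POWER10 auto-detection patch is applicable to a given version; the function only encodes the cutoff, not how the patch is applied.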

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml14 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/2fb179db59bfa6064b50cd48458d5304 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/f20329c099de9a2046dc6d7f6155efd9 for a full test report.

@Flamefire
Contributor Author

Flamefire commented Jul 25, 2022

Test for 2.2 worked on PPC. 3.0 is stuck on testing.

An alternative patch for 0.9.0 would be flame/blis@ae10d94, which fixes the affected kernels but is (very) large. I'm not sure that is worth it.

For 3.0 and above the test is stuck with the last output being:

% --- gemmt ---
% 
% gemmt m k                    -1 -1
% gemmt operand params         ?nn
% 

% blis_<dt><op>_<params>_<stor>      m     k   gflops   resid      result
blis_sgemmt_lnn_rrr                100   100    12.98   0.00e+00   PASS
blis_sgemmt_unn_rrr                100   100    11.46   0.00e+00   PASS

% blis_<dt><op>_<params>_<stor>      m     k   gflops   resid      result

Since even the 2022a toolchain sticks with version 0.9.0 of the non-AMD fork, I don't really want to spend more time trying to fix that. I hope AMD eventually pulls enough from the flame repo that this works again.
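
When a test run hangs like this, the last completed result row shows where it stopped. The sketch below is a hypothetical helper (`last_completed_test` is not part of BLIS or EasyBuild). It assumes only the log layout shown above: header and comment rows start with `%`, and result rows start with the test name.

```python
def last_completed_test(log):
    """Return (test name, result) for the last result row in the log, or None."""
    last = None
    for line in log.splitlines():
        fields = line.split()
        # Header/comment rows begin with '%'; result rows begin with 'blis_'.
        if fields and fields[0].startswith("blis_"):
            last = (fields[0], fields[-1])
    return last


sample = """\
% blis_<dt><op>_<params>_<stor>      m     k   gflops   resid      result
blis_sgemmt_lnn_rrr                100   100    12.98   0.00e+00   PASS
blis_sgemmt_unn_rrr                100   100    11.46   0.00e+00   PASS

% blis_<dt><op>_<params>_<stor>      m     k   gflops   resid      result
"""
print(last_completed_test(sample))
```

For the output quoted above, the last completed row is `blis_sgemmt_unn_rrr`, so the hang occurs in whichever gemmt case the testsuite runs next.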

@boegel boegel added the bug fix label Aug 6, 2022
@boegel boegel added this to the next release (4.6.1?) milestone Aug 6, 2022
@boegel boegel changed the title from "BLIS 0.9.0: Fix build/tests on POWER9" to "fix tests on POWER9 for BLIS 0.9.0 + fix auto-detect for POWER10 for BLIS (AMD) v2.0 + v3.0" Aug 7, 2022
@boegel
Member

boegel commented Aug 7, 2022

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=15889 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15889 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8945

Test results coming soon (I hope)...

Details

- notification for comment with ID 1207386466 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Aug 7, 2022

Test report by @boegel
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
node3121.skitty.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/4199448c379df2e2a43bb0536ea9f8c6 for a full test report.

edit: tests fail for AMD fork of BLIS (v3.0) on Intel Skylake, but that's not a blocker for this PR (see also #14030)

@boegelbot
Copy link
Copy Markdown
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/dcd2998a6a2d8f20ca22bbf5e4fd3c52 for a full test report.

@boegel
Member

boegel commented Aug 7, 2022

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3522.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/612dfc3b9e490425956d99045c60239a for a full test report.

@boegel
Member

boegel commented Aug 7, 2022

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
easybuild2.novalocal - Linux CentOS Stream 8, POWER, IBM pSeries (emulated by qemu) (power9le), Python 3.6.8
See https://gist.github.com/5d6ea2feb3c5b38a988a957d720dc6af for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Aug 7, 2022

Going in, thanks @Flamefire!

@boegel boegel merged commit 42fd0bb into easybuilders:develop Aug 7, 2022
@boegel
Member

boegel commented Aug 7, 2022

Test report by @boegel
SUCCESS
Build succeeded for 36 out of 36 (2 easyconfigs in total)
fair-mastodon-c6g-2xlarge-0001 - Linux Rocky Linux 8.5, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/be6883eb59670cfe6c440455fb75fc6c for a full test report.

@Flamefire Flamefire deleted the 20220721160647_new_pr_BLIS090 branch August 15, 2022 08:33