exclude (flaky) fault_tolerance_test and fix non-x86 build for TensorFlow 2.7.1 #15882

Merged
boegel merged 3 commits into easybuilders:develop from Flamefire:20220720103719_new_pr_TensorFlow271
Sep 11, 2022
@Flamefire
Contributor

Flamefire commented Jul 20, 2022

(created using eb --new-pr)

This test fails for me on an AMD EPYC system with 8 A100 GPUs. Both the CUDA and non-CUDA ECs fail. TF 2.6.0 is fine.

See tensorflow/tensorflow#56717 for the upstream issue, and tensorflow/tensorflow@c08fda5, which disables the test upstream with the commit message:

"Disable flaky fault_tolerance_test on mac."

Hence I think this is safe to disable.
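
For readers unfamiliar with how a single test is usually left out of a Bazel test run, here is a minimal sketch of the general technique: Bazel accepts negative target patterns (a label prefixed with `-`) after a `--` separator. The helper `build_test_targets` is hypothetical, and the test label used below is an assumption for illustration; this is not necessarily how the easyconfigs in this PR express the exclusion.

```python
def build_test_targets(base_targets, excluded_tests):
    """Return Bazel target patterns: the base patterns followed by negative patterns.

    Hypothetical helper for illustration only; not part of EasyBuild or TensorFlow.
    """
    targets = list(base_targets)
    for label in excluded_tests:
        # A leading '-' tells Bazel to subtract this target from the set selected so far
        targets.append('-' + label)
    return targets


# Assumed values for illustration; the real target set and test label may differ
base = ['//tensorflow/python/...']
excluded = ['//tensorflow/python/distribute/coordinator:fault_tolerance_test']

targets = build_test_targets(base, excluded)

# '--' ends option parsing so that Bazel does not read '-//...' as a command-line flag
print(' '.join(['bazel', 'test', '--'] + targets))
```

Order matters here: a negative pattern only removes targets matched by the patterns that precede it, so exclusions are appended after the base patterns.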

Edit: Added another fix: TensorFlow uses a binary Python package that is only available for x86 and cannot be built from source because of a cyclic dependency (tensorflow/tensorflow#56636).
Also, the libclang Python package is essentially a binary package and hence may be missing on some architectures (such as PPC), so it is removed too, since it is not actually required (yet): tensorflow/tensorflow@c211472

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8008 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/ce65fa04f5441106b533b5314d7a95eb for a full test report.

@Flamefire changed the title from "TensorFlow: Exclude (flaky) fault_tolerance_test" to "TensorFlow: Exclude (flaky) fault_tolerance_test and fix non-x86 build" on Jul 25, 2022
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/762339f9cca2ed1fd13f59c6c9235dde for a full test report.

@boegel changed the title from "TensorFlow: Exclude (flaky) fault_tolerance_test and fix non-x86 build" to "exclude (flaky) fault_tolerance_test and fix non-x86 build for TensorFlow 2.7.1" on Aug 3, 2022
@boegel added the bug fix label on Aug 3, 2022
@boegel added this to the next release (4.6.1?) milestone on Aug 3, 2022
@branfosj mentioned this pull request on Aug 9, 2022
Comment thread on easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.7.1-foss-2021b.eb (outdated)
@Flamefire force-pushed the 20220720103719_new_pr_TensorFlow271 branch from c21f919 to 0b6c059 on September 5, 2022 14:20
@boegel
Member

boegel commented Sep 11, 2022

Test report by @boegel
FAILED
Build succeeded for 3 out of 5 (2 easyconfigs in total)
fair-mastodon-c6g-2xlarge-0001 - Linux Rocky Linux 8.5, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/0f433c570e879e352ed0e088075b4853 for a full test report.

edit: this test was mostly out of curiosity, since the PR title mentioned non-x86; it's not a blocker for this PR (since TensorFlow doesn't even seem to build on aarch64...)

@boegel
Member

boegel commented Sep 11, 2022

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3307.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/3c7a276f15326617902afa333888dad1 for a full test report.

Member

boegel left a comment

lgtm

@boegel dismissed akesandgren’s stale review on September 11, 2022 20:16

requested changes made

@boegel
Member

boegel commented Sep 11, 2022

Going in, thanks @Flamefire!

@boegel merged commit dee3561 into easybuilders:develop on Sep 11, 2022
@Flamefire deleted the 20220720103719_new_pr_TensorFlow271 branch on September 12, 2022 08:13