{devel} Pytorch/0.3.1 (intel and foss 2018a)#6152

Merged
boegel merged 24 commits into easybuilders:develop from wpoely86:pytorch-vub
Jun 11, 2018

@wpoely86
Member

@wpoely86 wpoely86 commented Apr 13, 2018

This is based on @zao work in #5530 but with the regular intel and foss toolchains (using CUDA as a versionsuffix).

This needs easybuilders/easybuild-easyblocks#1398 to let it use all the correct dependencies.

@wpoely86
Member Author

Test report by @wpoely86
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in this PR)
nic96 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/4dbbb48ebc886a1784d3e42217aee659 for a full test report.

@wpoely86 wpoely86 requested a review from boegel April 13, 2018 12:11
@hajgato
Collaborator

hajgato commented Apr 30, 2018

@wpoely86 The PyTorch tests are very poorly written: they do not take into account which deps were available at installation time. If you want all the tests to pass:

  • GPU should be in Default mode, not in EXCLUSIVE_PROCESS mode (as set by Torque)
  • You need magma
  • You need CUDA-aware OpenMPI (so it will not work with foss, only goolfc or similar)

@wpoely86
Member Author

OK, thanks @hajgato
Magma is merged in #6154 so please add it.

For the tests, can you add some comments in the easyconfig about what you just said?

@hajgato
Collaborator

hajgato commented May 1, 2018

@wpoely86 Still working on it, will make a PR for your PR.

@hajgato
Collaborator

hajgato commented May 2, 2018

@wpoely86: I got as far as I could: it seems that MPI does not work at all with PyTorch, and I do not know what the problem is:

RuntimeError: refcounted file mapping not supported on your system at /tmp/hajgato/build/PyTorch/0.3.1/intel-2018a-Python-3.6.4/pytorch-0.3.1/torch/lib/TH/THAllocator.c:525

(This happens for both foss and intel, so the problem does not seem to be the CUDA-aware MPI, as intel does not have CUDA at all.)

I was also wrong about EXCLUSIVE_PROCESS.
(magma is still needed if you use CUDA)

With the Intel compiler, one test (test_autograd.py) hangs; when I attach with strace, it is in an infinite loop of:

sched_yield()                           = 0

I am not able to figure out the cause of this.
What I have so far:
hajgato@24c0316

@wpoely86
Member Author

wpoely86 commented May 2, 2018

That's more than good enough for me. Can I just pull your branch into mine, or are you still working on it?

Balazs Hajgato and others added 4 commits May 2, 2018 16:23
* origin/develop: (211 commits)
  fix typo in Nipype & NiBabel causing failure in test suite due to triggering typo detection
  adding easyconfigs: SLEPc-3.8.3-foss-2017b.eb
  adding easyconfigs: TensorFlow-1.8.0-foss-2018a-Python-3.6.4.eb, wheel-0.31.0-foss-2018a-Python-3.6.4.eb, Bazel-0.12.0-GCCcore-6.4.0.eb and patches: TensorFlow-1.8.0_remove-msse-hardcoding.patch
  make suggested change
  Fix checksum
  Fix reviewers comments
  Boost 1.63 no longer needed
  fix source spec for networkx 2.1 extension in scikit-image easyconfig, .tar.gz is not longer available on PyPI?
  fix installation of Libint 2.4.2 by building with -std=c++11
  adding easyconfigs: Pillow-5.0.0-intel-2018a-Python-2.7.14.eb
  add binutils dep
  {bio}[GCCcore 6.4.0] MCL 14.137 /w Perl 5.26.1 (REVIEW)
  adding easyconfigs: Pandoc-2.1.3.eb
  fix checksum for foreign extension in R 3.4.3 and R 3.4.4 easyconfigs
  add future extension in Nipype bundle + enhance sanity check
  add pkg-config build dep to most recent libdrm easyconfigs
  libdrm: add `pkg-config` as build dependency
  adding easyconfigs: Nipype-1.0.2-intel-2018a-Python-3.6.4.eb, NiBabel-2.2.1-intel-2018a-Python-3.6.4.eb
  adding easyconfigs: EasyBuild-3.6.0.eb
  bump version to v3.6.1
  ...
@wpoely86
Member Author

wpoely86 commented May 2, 2018

@boegel the suffix thing that boegelbot is complaining about: I don't think it's an issue, as it's two different toolchains?

@wpoely86
Member Author

wpoely86 commented May 2, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic166 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/28f1d425673407a464a116cacc129fa9 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 3, 2018

Test report by @wpoely86
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in this PR)
nic96 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/da4ac16a33d45df781609197573c0da2 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 3, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic166 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/b3ef53c67b304ee5b051ba81bf0aae56 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 3, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic167 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/a21f56917b3a5a288d21660f53e897f9 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 7, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic167 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/f2613916ce329e3449685e2b323b1bc3 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 8, 2018

Test report by @wpoely86
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
nic167 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/ff59fa1fadf154a26e7a8306903d4fc3 for a full test report.

@wpoely86
Member Author

wpoely86 commented May 8, 2018

Test report by @wpoely86
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in this PR)
nic167 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/e29fa73477e30106a85ddbc4266fdf2d for a full test report.

@boegel
Member

boegel commented May 12, 2018

@wpoely86 I enhanced the tests to allow for the -CUDA-* specific variants, see wpoely86#40

allow dependency variants that are specific to a particular CUDA version
@wpoely86
Member Author

wpoely86 commented Jun 6, 2018

@boegel merging time?

@easybuilders easybuilders deleted a comment from boegelbot Jun 6, 2018
@boegel
Member

boegel commented Jun 6, 2018

@wpoely86 Please submit a new test report on top of the latest updates.

@easybuilders easybuilders deleted a comment from boegelbot Jun 6, 2018
@easybuilders easybuilders deleted a comment from boegelbot Jun 6, 2018

dependencies = [
('Python', '3.6.4'),
('libyaml', '0.1.7', '', True),
Member

@wpoely86 Why is this hardcoded to dummy? We already have libyaml-0.1.7-GCCcore-6.4.0.eb that can resolve this dep.


# you can choose here: either give a list of CUDA cc versions or tell it All
# by default it does autodetect of the GPU on the local machine
prebuildopts += ' TORCH_CUDA_ARCH_LIST="3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.1 7.0"'
Member

@wpoely86 prebuildopts should be moved down, below dependencies

sources = [
'v%(version)s.tar.gz', # PyTorch
{
'filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz',
Member

@wpoely86 Can we use download_filename here, and rename the downloaded file using filename to a more meaningful filename like gloo-<datestamp-of-commit>.tar.gz (same below)?
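
The renaming suggested above could look roughly like the sketch below. It is written as plain Python so it can be run on its own; in an actual easyconfig the `%(version)s` template would be used instead of `%` formatting. The `download_filename` key is the one that appears in the merged easyconfig later in this thread; the datestamp value is a placeholder, since the commit date is not given here.

```python
# Sketch of an easyconfig-style `sources` list, expressed as plain Python data.
version = '0.3.1'

# Commit hash taken from the easyconfig under review.
gloo_commit = 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353'
# Placeholder only: the real commit date is not stated in this thread.
gloo_datestamp = 'YYYYMMDD'

sources = [
    'v%s.tar.gz' % version,  # PyTorch itself
    {
        # name of the tarball as served by the upstream repository
        'download_filename': '%s.tar.gz' % gloo_commit,
        # more meaningful name under which the tarball is stored locally
        'filename': 'gloo-%s.tar.gz' % gloo_datestamp,
    },
]
```

The same pattern would be repeated for each submodule tarball.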

]

# PyTorch pulls in a bunch of submodules which don't have releases.
# We download the submodule revisions from their repos.
Member

@wpoely86 This doesn't belong here, but with sources?


# PyTorch pulls in a bunch of submodules which don't have releases.
# We download the submodule revisions from their repos.
options = {'modulename': 'torch'}
Member

@wpoely86 This should be moved down, below sanity_check_paths

sources = [
'v%(version)s.tar.gz', # PyTorch
{
'filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz',
Member

@wpoely86 Please use download_filename & rename

]

# PyTorch pulls in a bunch of submodules which don't have releases.
# We download the submodule revisions from their repos.
Member

@wpoely86 comment in wrong place


# PyTorch pulls in a bunch of submodules which don't have releases.
# We download the submodule revisions from their repos.
options = {'modulename': 'torch'}
Member

@wpoely86 please move down

toolchain = {'name': 'foss', 'version': '2018a'}

source_urls = [
'https://github.com/pytorch/vision/archive',
Member

@wpoely86 single line?

toolchain = {'name': 'intel', 'version': '2018a'}

source_urls = [
'https://github.com/pytorch/vision/archive',
Member

@wpoely86 single line?

@wpoely86
Member Author

wpoely86 commented Jun 6, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic167 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/636407f46cc272d57d33fce9e3f49d49 for a full test report.

wpoely86 added 3 commits June 7, 2018 13:51
…nto pytorch-vub

* 'pytorch-vub' of github:wpoely86/easybuild-easyconfigs:
  take into account some there may be *only* variants based on different CUDA versions
  make sure dep_vars stays a dict in check_dep_vars helper function
  allow dependency variants that are specific to a particular CUDA version
  Fix icc AVX detection
  checksums injected
  add skip mpi patch
  trying to fix tests
  1
@wpoely86
Member Author

wpoely86 commented Jun 7, 2018

Test report by @wpoely86
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in this PR)
nic151 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/657d8d234f26be1599a568b63594a790 for a full test report.

@wpoely86
Member Author

wpoely86 commented Jun 7, 2018

Test report by @wpoely86
FAILED
Build succeeded for 7 out of 8 (8 easyconfigs in this PR)
nic166 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/c0dd1ece085466daf4e60649950b7838 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Jun 7, 2018
@boegel
Member

boegel commented Jun 7, 2018

@wpoely86 what's up with those failing tests?

@wpoely86
Member Author

wpoely86 commented Jun 8, 2018

It's the multiprocess test that fails if you use CUDA. The first process claims all the GPUs and the second process fails immediately as it cannot access any GPUs...

I can try to make the GPUs shared (by default it's process exclusive). Or drop the tests.
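
For reference, a minimal sketch of how one might check the GPU compute mode before running the tests. It assumes `nvidia-smi` is on the PATH and that its `compute_mode` query field reports `Default` for shared GPUs; the helper names are made up for illustration.

```python
import subprocess


def parse_compute_modes(nvidia_smi_output):
    """Return the compute mode reported for each GPU (one per output line)."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def all_gpus_shared(modes):
    """True only if at least one GPU was reported and all are in Default mode."""
    return bool(modes) and all(mode == 'Default' for mode in modes)


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every GPU (needs an NVIDIA driver)."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=compute_mode', '--format=csv,noheader'],
        universal_newlines=True,
    )
    return parse_compute_modes(out)
```

On a GPU node, `all_gpus_shared(query_compute_modes())` would indicate whether a second process can be expected to access the GPUs.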

@wpoely86
Member Author

wpoely86 commented Jun 8, 2018

Test report by @wpoely86
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in this PR)
nic166 - Linux centos linux 7.4.1708, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Python 2.7.5
See https://gist.github.com/fe240b9456a23f64d3027b23c2698128 for a full test report.

@wpoely86
Member Author

wpoely86 commented Jun 8, 2018

OK, putting the GPU mode to shared fixed it. @boegel good to go?

@boegel boegel added this to the 3.6.2 milestone Jun 11, 2018
@boegel
Member

boegel commented Jun 11, 2018

Going in, thanks @wpoely86!

@boegel boegel merged commit 5faa25f into easybuilders:develop Jun 11, 2018
@easybuilders easybuilders deleted a comment from boegelbot Jun 11, 2018
@wpoely86 wpoely86 deleted the pytorch-vub branch June 11, 2018 14:37
sources = [
'v%(version)s.tar.gz', # PyTorch
{
'download_filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz',
Member

@wpoely86 @smoors @hajgato @zao How was this determined? Latest & greatest commit in https://github.com/facebookincubator/gloo ?

Same questions below...

Member Author

If I remember correctly they come from the .gitmodules file.

Member

That helps, see https://github.com/pytorch/pytorch/blob/master/.gitmodules, but it doesn't explain the old googletest commit...

Member Author

you need to look at https://github.com/pytorch/pytorch/blob/2b4748011b5881583567bb166801ca6625f2fdda/.gitmodules

No clue why the old googletest. There was probably a good reason. If I remember correctly, I cloned the full repo (with all submodules) and took all those commits.

'extract_cmd': extract_cmd_pattern % (pytorchdir, 'torch/lib/gloo'),
},
{
'download_filename': 'ec44c6c1675c25b9827aacd08c02433cccde7780.tar.gz',
Member

@wpoely86 @smoors @hajgato @zao This commit points to an old version of googletest, any specific reason for that?

Contributor

They are exactly the hashes that the submodule indicated when pulling the 0.3.0 tag. I have no idea why they use the version they do.
My workflow was to do a recursive init of submodules and mechanically copy the hashes and paths into my recipe.
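
A rough sketch of how that mechanical copying could be scripted, assuming the usual `git submodule status --recursive` line format (one status character, the commit hash, the path, then an optional description). The function name is made up for illustration, and the googletest path in the sample is a placeholder since the real path is not given in this thread.

```python
def parse_submodule_status(text):
    """Parse `git submodule status` output into (commit, path) pairs.

    Each line starts with a status flag (' ', '-', '+' or 'U'), followed by
    the commit hash and the submodule path. Paths containing spaces are not
    handled by this sketch.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Drop the status flag; a hex hash never starts with these characters.
        fields = line.lstrip(' -+U').split()
        pairs.append((fields[0], fields[1]))
    return pairs


# Sample input; the second path is a placeholder, not the real submodule path.
sample = (
    " cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353 torch/lib/gloo (heads/master)\n"
    "-ec44c6c1675c25b9827aacd08c02433cccde7780 placeholder/path/googletest\n"
)

for commit, path in parse_submodule_status(sample):
    print('%s.tar.gz -> %s' % (commit, path))
```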

Member

It looks like they re-organized things a bit, now you need to look at https://github.com/pytorch/pytorch/tree/v0.4.1/third_party...

5 participants