{devel} Pytorch/0.3.1 (intel and foss 2018a) #6152
boegel merged 24 commits into easybuilders:develop from
|
Test report by @wpoely86 |
|
|
@wpoely86 Still working on it, will make a PR for your PR. |
|
@wpoely86: I got as far as I can get: it seems that MPI does not work at all with (for both foss and intel, so it seems that the problem is not the CUDA-aware MPI (as intel does not have CUDA at all) I was also wrong with the Using Intel compiler, one test hangs ( I am not able to figure out the cause of this. |
|
That's more than good enough for me. Can I just pull your branch into mine, or are you still working on it? |
* origin/develop: (211 commits)
fix typo in Nipype & NiBabel causing failure in test suite due to triggering typo detection
adding easyconfigs: SLEPc-3.8.3-foss-2017b.eb
adding easyconfigs: TensorFlow-1.8.0-foss-2018a-Python-3.6.4.eb, wheel-0.31.0-foss-2018a-Python-3.6.4.eb, Bazel-0.12.0-GCCcore-6.4.0.eb and patches: TensorFlow-1.8.0_remove-msse-hardcoding.patch
make suggested change
Fix checksum
Fix reviewers comments
Boost 1.63 no longer needed
fix source spec for networkx 2.1 extension in scikit-image easyconfig, .tar.gz is not longer available on PyPI?
fix installation of Libint 2.4.2 by building with -std=c++11
adding easyconfigs: Pillow-5.0.0-intel-2018a-Python-2.7.14.eb
add binutils dep
{bio}[GCCcore 6.4.0] MCL 14.137 /w Perl 5.26.1 (REVIEW)
adding easyconfigs: Pandoc-2.1.3.eb
fix checksum for foreign extension in R 3.4.3 and R 3.4.4 easyconfigs
add future extension in Nipype bundle + enhance sanity check
add pkg-config build dep to most recent libdrm easyconfigs
libdrm: add `pkg-config` as build dependency
adding easyconfigs: Nipype-1.0.2-intel-2018a-Python-3.6.4.eb, NiBabel-2.2.1-intel-2018a-Python-3.6.4.eb
adding easyconfigs: EasyBuild-3.6.0.eb
bump version to v3.6.1
...
|
@boegel this suffix thing that boegelbot is complaining about: I don't think it's an issue, as it's two different toolchains? |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
@wpoely86 I enhanced the tests to allow for the |
allow dependency variants that are specific to a particular CUDA version
|
@boegel merging time? |
|
@wpoely86 Please submit a new test report on top of the latest updates. |
|
|
||
| dependencies = [ | ||
| ('Python', '3.6.4'), | ||
| ('libyaml', '0.1.7', '', True), |
@wpoely86 Why is this hardcoded to dummy? We already have libyaml-0.1.7-GCCcore-6.4.0.eb that can resolve this dep.
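For illustration, the two ways of specifying the libyaml dependency might look like this in easyconfig (Python) syntax. The second form is a sketch that assumes EasyBuild resolves the dependency through the GCCcore 6.4.0 subtoolchain of foss/intel 2018a; the variable names are hypothetical and only exist to show both variants side by side.

```python
# Sketch only: easyconfig files use Python syntax, so both variants are plain lists.

# Form used in the PR: a 4th element of True pins the dependency
# to the dummy toolchain.
dependencies_dummy = [
    ('Python', '3.6.4'),
    ('libyaml', '0.1.7', '', True),
]

# Suggested form (assumption): no toolchain override, so that the existing
# libyaml-0.1.7-GCCcore-6.4.0.eb can be picked up as a subtoolchain dependency.
dependencies_subtoolchain = [
    ('Python', '3.6.4'),
    ('libyaml', '0.1.7'),
]
```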
|
|
||
| # you can choose here: either give a list of CUDA cc versions or tell it All | ||
| # by default it autodetects the GPU on the local machine | ||
| prebuildopts += ' TORCH_CUDA_ARCH_LIST="3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.1 7.0"' |
@wpoely86 prebuildopts should be moved down, below dependencies
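As a rough sketch, the architecture list could also be assembled from a Python list, which makes it easier to trim for a given site. The name `local_cuda_archs` is hypothetical, and `prebuildopts` starts out empty here only to keep the snippet self-contained; in the easyconfig it may already hold other options.

```python
# Sketch: build the TORCH_CUDA_ARCH_LIST option from a list of CUDA compute capabilities.
# 'local_cuda_archs' is a hypothetical helper variable, not an easyconfig parameter.
local_cuda_archs = ['3.0', '3.2', '3.5', '3.7', '5.0', '5.2', '5.3', '6.0', '6.1', '7.0']

prebuildopts = ''  # assumed empty here; earlier options may already be set in practice
prebuildopts += ' TORCH_CUDA_ARCH_LIST="%s"' % ' '.join(local_cuda_archs)
```

Dropping entries from the list restricts the build to the GPUs actually present.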
| sources = [ | ||
| 'v%(version)s.tar.gz', # PyTorch | ||
| { | ||
| 'filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz', |
@wpoely86 Can we use download_filename here, and rename the downloaded file using filename to a more meaningful filename like gloo-<datestamp-of-commit>.tar.gz (same below)?
| ] | ||
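A minimal sketch of what the suggested source entry might look like follows. The values of `pytorchdir` and `extract_cmd_pattern` are assumptions (only their names appear in the diff), `version` is defined just to make the snippet self-contained, and the `YYYYMMDD` part of the filename is a placeholder for the commit date, which is not given here.

```python
# Sketch of a source entry that downloads a commit tarball and stores it
# under a more meaningful name. Values marked "assumed" are illustrative.
version = '0.3.1'
pytorchdir = 'pytorch-%s' % version  # assumed name of the unpacked PyTorch directory
# assumed pattern: unpack a submodule tarball into its directory in the PyTorch tree;
# the trailing %s is left for the path of the tarball itself
extract_cmd_pattern = 'tar -C %s/%s --strip-components=1 -xf %%s'

sources = [
    'v%(version)s.tar.gz',  # PyTorch
    {
        # name of the file as it is offered for download (the commit hash)
        'download_filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz',
        # name under which the file is stored locally (placeholder datestamp)
        'filename': 'gloo-YYYYMMDD.tar.gz',
        'extract_cmd': extract_cmd_pattern % (pytorchdir, 'torch/lib/gloo'),
    },
]
```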
|
|
||
| # PyTorch pulls in a bunch of submodules which don't have releases. | ||
| # We download the submodule revisions from their repos. |
@wpoely86 This doesn't belong here, but with sources?
|
|
||
| # PyTorch pulls in a bunch of submodules which don't have releases. | ||
| # We download the submodule revisions from their repos. | ||
| options = {'modulename': 'torch'} |
@wpoely86 This should be moved down, below sanity_check_paths
| toolchain = {'name': 'foss', 'version': '2018a'} | ||
|
|
||
| source_urls = [ | ||
| 'https://github.com/pytorch/vision/archive', |
| toolchain = {'name': 'intel', 'version': '2018a'} | ||
|
|
||
| source_urls = [ | ||
| 'https://github.com/pytorch/vision/archive', |
|
Test report by @wpoely86 |
…nto pytorch-vub
* 'pytorch-vub' of github:wpoely86/easybuild-easyconfigs:
take into account some there may be *only* variants based on different CUDA versions
make sure dep_vars stays a dict in check_dep_vars helper function
allow dependency variants that are specific to a particular CUDA version
Fix icc AVX detection
checksums injected
add skip mpi patch
trying to fix tests 1
|
Test report by @wpoely86 |
|
Test report by @wpoely86 |
|
@wpoely86 what's up with those failing tests? |
|
It's the multiprocess test that fails if you use CUDA. The first process claims all the GPUs and the second process fails immediately as it cannot access any GPUs... I can try to make the GPUs shared (by default it's process exclusive). Or drop the tests. |
|
Test report by @wpoely86 |
|
OK, setting the GPU mode to shared fixed it. @boegel good to go? |
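The thread does not say how the GPU mode was changed. One common way, assumed here, is `nvidia-smi -c DEFAULT`, where `DEFAULT` is the shared compute mode and `EXCLUSIVE_PROCESS` is the process-exclusive one. The helper names below are hypothetical, and running the command typically needs administrator rights on the node.

```python
import subprocess


def compute_mode_command(mode='DEFAULT', gpu_index=None):
    """Build the nvidia-smi command line that sets the GPU compute mode.

    DEFAULT lets several processes share a GPU; EXCLUSIVE_PROCESS does not.
    """
    cmd = ['nvidia-smi']
    if gpu_index is not None:
        # restrict the change to a single GPU
        cmd += ['-i', str(gpu_index)]
    cmd += ['-c', mode]
    return cmd


def set_shared_mode(gpu_index=None):
    """Run the command (assumes an NVIDIA driver and sufficient privileges)."""
    subprocess.check_call(compute_mode_command('DEFAULT', gpu_index))
```

With the GPUs in shared mode, the second test process can open a CUDA context on a GPU already in use by the first one.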
|
Going in, thanks @wpoely86! |
| sources = [ | ||
| 'v%(version)s.tar.gz', # PyTorch | ||
| { | ||
| 'download_filename': 'cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353.tar.gz', |
@wpoely86 @smoors @hajgato @zao How was this determined? Latest & greatest commit in https://github.com/facebookincubator/gloo ?
Same questions below...
If I remember correctly, they come from the .gitmodules file.
That helps, see https://github.com/pytorch/pytorch/blob/master/.gitmodules, but it doesn't explain the old googletest commit...
You need to look at https://github.com/pytorch/pytorch/blob/2b4748011b5881583567bb166801ca6625f2fdda/.gitmodules
No clue why the old googletest. There was probably a good reason. If I remember correctly, I cloned the full repo (with all submodules) and took all those commits.
| 'extract_cmd': extract_cmd_pattern % (pytorchdir, 'torch/lib/gloo'), | ||
| }, | ||
| { | ||
| 'download_filename': 'ec44c6c1675c25b9827aacd08c02433cccde7780.tar.gz', |
They are exactly the hashes that the submodule indicated when pulling the 0.3.0 tag. I have no idea why they use the version they do.
My workflow was to do a recursive init of submodules and mechanically copy the hashes and paths into my recipe.
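The mechanical copying step described above could be sketched roughly as follows, assuming the hashes are read from the output of `git submodule status --recursive` after a recursive submodule init. The function name is hypothetical, and the `(heads/master)` suffix in the sample line is made up for illustration; only the hash and path are taken from the recipe.

```python
def parse_submodule_status(output):
    """Return (commit, path) pairs from `git submodule status --recursive` output."""
    pairs = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # the first character is a status flag (' ', '-', '+' or 'U'),
        # followed by the commit hash, the path and an optional description
        fields = line[1:].split()
        pairs.append((fields[0], fields[1]))
    return pairs


# sample line; the describe suffix in parentheses is illustrative only
sample = " cb002e4eb8d167c2c60fc3bdaae4e1844e0f9353 torch/lib/gloo (heads/master)\n"
for commit, path in parse_submodule_status(sample):
    print("%s.tar.gz -> %s" % (commit, path))
```

Paths containing spaces are not handled by this sketch.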
It looks like they re-organized things a bit, now you need to look at https://github.com/pytorch/pytorch/tree/v0.4.1/third_party...
This is based on @zao's work in #5530 but with the regular intel and foss toolchains (using CUDA as a versionsuffix).
This needs easybuilders/easybuild-easyblocks#1398 to let it use all the correct dependencies.