fix patch for torchtext to use system libraries, to fix linking of RE2 in both PyTorch-bundle and torchtext #23823

Merged
boegel merged 8 commits into easybuilders:develop from Flamefire:20250909082151_new_pr_PyTorch-bundle1121
Sep 23, 2025

@Flamefire
Contributor

@Flamefire Flamefire commented Sep 9, 2025

(created using eb --new-pr)

torchtext links against our RE2 library which links against Abseil.

Since #22805 they are static libraries so dependents of RE2 need to link against them too.
Our patch using system libraries did:

find_package(re2)
find_library(SENTENCEPIECE_LIBRARY sentencepiece PATHS $ENV{EBROOTSENTENCEPIECE}/lib64)
find_library(SENTENCEPIECE_TRAIN_LIBRARY sentencepiece_train PATHS $ENV{EBROOTSENTENCEPIECE}/lib64)

However, torchtext still links against the plain library names re2 and sentencepiece, which results in the linker flags -lre2 -lsentencepiece, i.e. the above statements had no effect.

The fix is to link against the target re2::re2, which can easily be done by introducing an ALIAS. The same (but in the other direction) is done in the RE2 sources, which used to be included via add_subdirectory.
This target has the correct dependencies set.
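
The ALIAS approach might look roughly like the following. This is a minimal sketch, not the actual patch: it assumes the RE2 installation ships a CMake package config defining the imported target re2::re2, and that torchtext's CMake code links against the plain name re2. Aliasing a non-global imported target requires CMake 3.18 or newer.

```cmake
# Sketch only; assumes CMake >= 3.18 (ALIAS of a non-GLOBAL imported target).

# Locate the system RE2 through its CMake package config. This is assumed to
# define the imported target re2::re2, which carries RE2's own link dependencies.
find_package(re2 REQUIRED)

# torchtext refers to the plain name "re2", the target name that existed when
# RE2 was pulled in via add_subdirectory. Without a target of that name, CMake
# treats "re2" as a bare library name and emits -lre2.
if(NOT TARGET re2)
  add_library(re2 ALIAS re2::re2)
endif()
```

With the alias in place, an existing target_link_libraries call that names re2 resolves to the imported target, so its transitive dependencies end up on the link line instead of just -lre2.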

Similarly, sentencepiece is missing a target, so torchtext links against the library directly (without targets). The sentencepiece installation provides no target and turns out to require C++17 for std::string_view, so add a target for it and set the C++ standard.
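
A target for sentencepiece could be sketched as below. This is an illustration under assumptions, not the patch itself: the include directory under $EBROOTSENTENCEPIECE is assumed, and the C++17 requirement is expressed through INTERFACE_COMPILE_FEATURES, whereas the actual patch may set the standard differently.

```cmake
# Locate the installed library; EBROOTSENTENCEPIECE is set by the EasyBuild module.
find_library(SENTENCEPIECE_LIBRARY sentencepiece
             PATHS $ENV{EBROOTSENTENCEPIECE}/lib64)
if(NOT SENTENCEPIECE_LIBRARY)
  message(FATAL_ERROR "sentencepiece library not found")
endif()

# Wrap the library in an imported target carrying the name torchtext links
# against, so that usage requirements propagate to dependents.
add_library(sentencepiece UNKNOWN IMPORTED)
set_target_properties(sentencepiece PROPERTIES
  IMPORTED_LOCATION "${SENTENCEPIECE_LIBRARY}"
  # Assumed header location; adjust to the actual install layout.
  INTERFACE_INCLUDE_DIRECTORIES "$ENV{EBROOTSENTENCEPIECE}/include"
  # Dependents need at least C++17 because the headers use std::string_view.
  INTERFACE_COMPILE_FEATURES cxx_std_17)
```

The sentencepiece_train library could be wrapped in the same way.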

Fixes #23762

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 15 out of 18 (7 easyconfigs in total)
n1518 - Linux RHEL 8.9 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.18
See https://gist.github.com/Flamefire/ad504efefe3edaa1c09719c3778e931c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 8 (7 easyconfigs in total)
i8024 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/4823398bd8a9baa3ccc833c85b178115 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 7 (7 easyconfigs in total)
n1136 - Linux RHEL 8.9 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.18
See https://gist.github.com/Flamefire/ea2919c67baa588e77dc14055780ec5f for a full test report.

@boegel boegel added bug fix and removed change labels Sep 10, 2025
@boegel boegel added this to the next release (5.1.2) milestone Sep 10, 2025
@boegel
Member

boegel commented Sep 10, 2025

@Flamefire What's up with the failing test reports?

@boegel
Member

boegel commented Sep 10, 2025

@Flamefire Can you clarify why these changes are needed? Seems linked to the change made in:

@Flamefire Flamefire force-pushed the 20250909082151_new_pr_PyTorch-bundle1121 branch from 0380f0b to 61ea463 Compare September 10, 2025 07:53
@Flamefire Flamefire force-pushed the 20250909082151_new_pr_PyTorch-bundle1121 branch from 52431c2 to 2aa20cb Compare September 10, 2025 07:55
@verdurin
Member

@Flamefire Can you clarify why these changes are needed? Seems linked to the change made in:

I think it's partly motivated by the linking problems I have been seeing with torchtext.
Cf. #23762

@Flamefire
Contributor Author

Ah, thanks for the issue link. I added it and a description to the PR description above.

Failures are partially fixed by the recent patch update (wrong C++ version used for Sentencepiece) while others are accuracy issues in the tests.

@boegel
Member

boegel commented Sep 10, 2025

@Flamefire Can you clarify why these changes are needed? Seems linked to the change made in:

I think it's partly motivated by the linking problems I have been seeing with torchtext. Cf. #23762

Can you submit a test report for PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb to confirm the fix from your end?

@verdurin
Member

Have a build running; those nodes aren't set up yet to send test reports. Will comment here anyway.

@verdurin
Member

Started another build on a node with test report upload enabled.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
n1607 - Linux RHEL 8.9 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.18
See https://gist.github.com/Flamefire/584cf1eed14b1865508e7f3b05041ea1 for a full test report.

@verdurin
Member

The first build failed with what I think is an unrelated error which I've raised elsewhere, relating to tcmalloc:

== FAILED: Installation ended unsuccessfully: Sanity check failed: extensions sanity check failed for 1 extensions: torchtext
failing sanity check for 'torchtext' extension: command "/apps/eb/el9/2023a/aarch64/software/Python/3.11.3-GCCcore-12.3.0/bin/python -c "import torchtext"" failed; output:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/apps/eb/el9/2023a/aarch64/software/PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torchtext/__init__.py", line 6, in <module>
    from torchtext import _extension  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/eb/el9/2023a/aarch64/software/PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torchtext/_extension.py", line 64, in <module>
    _init_extension()
  File "/apps/eb/el9/2023a/aarch64/software/PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torchtext/_extension.py", line 58, in _init_extension
    _load_lib("libtorchtext")
  File "/apps/eb/el9/2023a/aarch64/software/PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torchtext/_extension.py", line 50, in _load_lib
    torch.ops.load_library(path)
  File "/apps/eb/el9/2023a/aarch64/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/_ops.py", line 852, in load_library
    ctypes.CDLL(path)
  File "/apps/eb/el9/2023a/aarch64/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /apps/eb/el9/2023a/aarch64/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block,  (took 3 hours 26 mins 51 secs)

@boegel
Member

boegel commented Sep 10, 2025

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16
EB_ARGS="PyTorch-bundle-1.13.1-foss-2022a-CUDA-11.7.0.eb torchtext-0.14.1-foss-2022a-PyTorch-1.12.0.eb"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23823 EB_ARGS="PyTorch-bundle-1.13.1-foss-2022a-CUDA-11.7.0.eb torchtext-0.14.1-foss-2022a-PyTorch-1.12.0.eb" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23823 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7922

Test results coming soon (I hope)...

Details

- notification for comment with ID 3275073540 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 575.57.08, Python 3.9.21
See https://gist.github.com/boegelbot/302251fcf6b8788e440745b362017392 for a full test report.

@verdurin
Member

Test report by @verdurin
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
cdtgpu02.cloud.in.bmrc.ox.ac.uk - Linux Rocky Linux 9.6 (Blue Onyx), x86_64, Intel(R) Xeon(R) Platinum 8352M CPU @ 2.30GHz, 1 x NVIDIA NVIDIA A100 80GB PCIe, 535.261.03, Python 3.9.21
See https://gist.github.com/verdurin/151aa4b126f21ab4763667ffd8e6f9e0 for a full test report.

@Flamefire
Contributor Author

Can you submit a test report for PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb to confirm the fix from your end?

Still failing #23823 (comment)

Although I can't see why in all the warnings. Manually rebuilding now

@Flamefire
Contributor Author

Flamefire commented Sep 10, 2025

Test report by @Flamefire
FAILED
Build succeeded for 7 out of 8 (7 easyconfigs in total)
c144 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/0b44cd7e80d96b30bdbcfcaacd216512 for a full test report.

Failure due to H100 not supporting CC 8.6 while CUDA 11.7 doesn't support CC 9.0

@boegel
Member

boegel commented Sep 10, 2025

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
node3903.accelgor.os - Linux RHEL 9.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 570.133.20, Python 3.9.18
See https://gist.github.com/boegel/5a678c6044559979bf2a356d054bad52 for a full test report.

@Flamefire
Contributor Author

@boegel My error seems to be related to CUDA capabilities. Looks like we need to set them for the torch-extensions.

Yours: "No space left on device"

@verdurin
Member

I think my failure is because I am using MIG slices on this node - will go back to non-MIG mode and re-try.

@Flamefire any ideas for that tcmalloc fix in the sanity check? I'm already using a hook for LD_PRELOAD in the test.

@Flamefire
Contributor Author

I found that we need to pass CUDA compute capabilities or it will use (unsuitable) defaults. Testing a fix right now.

@Flamefire any ideas for that tcmalloc fix in the sanity check? I'm already using a hook for LD_PRELOAD in the test.

It seems I'm not seeing this on my side, so not sure what even causes this. What is the LD_PRELOAD value/lib you add?

@verdurin
Member

Test report by @verdurin
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cdtgpu02.cloud.in.bmrc.ox.ac.uk - Linux Rocky Linux 9.6 (Blue Onyx), x86_64, Intel(R) Xeon(R) Platinum 8352M CPU @ 2.30GHz, 1 x NVIDIA NVIDIA A100 80GB PCIe, 535.261.03, Python 3.9.21
See https://gist.github.com/verdurin/9fcd070468ddcd819d4e134c225020a7 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
c144 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/72af35a0a60f77be7c55366c3ca5c05d for a full test report.

@Flamefire
Contributor Author

Flamefire commented Sep 11, 2025

Test report by @Flamefire
FAILED
Build succeeded for 5 out of 7 (7 easyconfigs in total)
login1.alpha.hpc.tu-dresden.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/36e04b506355ac0d29bee665e3c4db03 for a full test report.

Failures in tests, likely existed before so I'd ignore them

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
n1597 - Linux RHEL 8.9 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.18
See https://gist.github.com/Flamefire/e988293952ecdc51b14356ba9a7050e5 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
i8039 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/116e6e594498bbe1331d35ab38dee073 for a full test report.

@boegel boegel changed the title from "Fix linking of RE2 in PyTorch-bundle" to "fix patch for torchtext to use system libraries, to fix linking of RE2 in both PyTorch-bundle and torchtext" Sep 23, 2025
Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Sep 23, 2025

Going in, thanks @Flamefire!

@boegel boegel merged commit c10740a into easybuilders:develop Sep 23, 2025
8 checks passed
@Flamefire Flamefire deleted the 20250909082151_new_pr_PyTorch-bundle1121 branch September 23, 2025 10:40

Development

Successfully merging this pull request may close these issues.

Linking error for the test step of torchtext on EL9

5 participants