add patch to fix hardcoded num_cores in DMCfun extension included with R 4.0.x #12583

Merged

boegel merged 6 commits into easybuilders:develop from hajgato:R-dmcfun on May 5, 2021

Conversation

@hajgato (Collaborator) commented Apr 13, 2021

Fixes #12569.
This only covers PBS, so other queue systems might need a different solution.

@hajgato changed the title from "Fix DMCfun hard-coded num_cores" to "{lang} R/4.0.x Fix DMCfun hard-coded num_cores" Apr 13, 2021
@Micket Micket added the bug fix label Apr 14, 2021
@verdurin (Member) commented:

I think it would be good to see testing of this on different schedulers before merging.

num_cores <- 2L
} else {
- num_cores <- parallel::detectCores() / 2
+ num_cores <- as.numeric(Sys.getenv("PBS_NP", parallel::detectCores() / 2))
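
For readers unfamiliar with the R idiom in the added line: `Sys.getenv(x, unset)` returns the value of environment variable `x` if it is set, and the `unset` default otherwise, so the patched line prefers the scheduler-provided core count over auto-detection. A minimal standalone sketch of this fallback pattern (the function name is illustrative, not part of the patch):

```r
# Illustrative sketch of the env-var fallback used by the patch.
# Sys.getenv(name, unset) yields the variable's value when set,
# otherwise the 'unset' default (coerced to character);
# as.numeric() then turns the result back into a number.
detect_job_cores <- function() {
  as.numeric(Sys.getenv("PBS_NP", unset = parallel::detectCores() / 2))
}
```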
Contributor:

Why not extend this to
Sys.getenv("PBS_NP", Sys.getenv("SLURM_CPUS_ON_NODE", parallel::detectCores() / 2))

Contributor:

Or possibly the other way around, seeing that Slurm is more common than PBS these days.

Collaborator (Author):

I have added num_cores <- as.numeric(Sys.getenv("SLURM_NTASKS", Sys.getenv("PBS_NP", parallel::detectCores() / 2))
I think SLURM_NTASKS is closer in meaning to PBS_NP, which is the total number of processors (nodes=x:ppn=y gives PBS_NP = x*y).

Contributor:

Ah, so it is MPI aware?

Collaborator (Author):

no.

Contributor:

Then SLURM_CPUS_ON_NODE is probably more correct.

Collaborator (Author):

Honestly, I find this hair-splitting, but so be it. If a user incorrectly assumes that this OpenMP-only software is MPI-capable, then both versions are wrong; you can argue about which is worse, but both are incorrect either way. If the software is used correctly, both versions are fine.
Just to say in advance: I will not compare SLURM_CPUS_ON_NODE and SLURM_NTASKS against each other, and if they differ the program will neither stop nor print a friendly message asking the user not to request multiple nodes, even though this software is single-node only.

@boegel boegel added this to the next release (4.3.5?) milestone Apr 20, 2021
@hajgato (Collaborator, Author) commented Apr 20, 2021

@verdurin I am open to suggestions, but I only have access to Slurm with PBS compatibility enabled.

@boegel (Member) commented May 5, 2021

@hajgato The patch seems to introduce broken code; DMCfun fails to reinstall:

Error in parse(outFile) :
  /tmp/boegelbot/R/4.0.3/foss-2020b/DMCfun/DMCfun/R/dmcFit.R:163:5: unexpected '}'
162:       num_cores <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE", Sys.getenv("PBS_NUM_PPN", parallel::detectCores() / 2))
163:     }
         ^
ERROR: unable to collate and parse R files for package DMCfun
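
For what it's worth, the parse error above is an unbalanced-parenthesis bug: the quoted line opens four parentheses but closes only three. A balanced version of that line, shown here for illustration only (the actual fix landed in the later commits), would read:

```r
# Same logic as line 162 in the error above, with the missing
# closing parenthesis restored so the nested calls all balance.
num_cores <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE",
                                   Sys.getenv("PBS_NUM_PPN",
                                              parallel::detectCores() / 2)))
```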

@boegel (Member) commented May 5, 2021

@boegelbot please test @ generoso
EB_ARGS="--skip R-4.0.3-foss-2020b.eb R-4.0.4-foss-2020b.eb"

@boegelbot (Collaborator):

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12583 EB_ARGS="--skip R-4.0.3-foss-2020b.eb R-4.0.4-foss-2020b.eb" /apps/slurm/default/bin/sbatch --job-name test_PR_12583 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 17037

Test results coming soon (I hope)...

Details

- notification for comment with ID 832597968 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator):

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
generoso-x-3 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/04e3b9bc7cb2534545df95a7b337ba7d for a full test report.

@boegel (Member) commented May 5, 2021

I've tested the patched installation of DMCfun on a Slurm system, and it indeed behaves as expected now: it only starts as many threads as there are cores available in the job 👍

@boegel (Member) left a comment:

lgtm

@boegel changed the title from "{lang} R/4.0.x Fix DMCfun hard-coded num_cores" to "add patch to fix hardcoded num_cores in DMCfun extension included with R 4.0.x" May 5, 2021
@boegel (Member) commented May 5, 2021

Going in, thanks @hajgato!

@boegel boegel merged commit 616e4b0 into easybuilders:develop May 5, 2021
robert-mijakovic pushed a commit to robert-mijakovic/easybuild-easyconfigs that referenced this pull request Jun 8, 2021

Successfully merging this pull request may close these issues:

R's DMCfun might detect the number of (available) cores for a parallel run incorrectly under PBS

6 participants