also try collecting AMD GPU info (via rocm-smi) for --show-system-info #3978

Merged
bartoldeman merged 3 commits into easybuilders:develop from SebastianAchilles:gpuinfo
Mar 23, 2022

@SebastianAchilles
Member

Extends the GPU info reported by --show-system-info to also cover AMD GPUs (collected via rocm-smi).

@easybuilders easybuilders deleted a comment from boegelbot Mar 16, 2022
@boegel boegel modified the milestones: 5.0, 4.x Mar 16, 2022
Contributor

@bartoldeman bartoldeman left a comment

LGTM

It's hard to add a test case and there isn't one for Nvidia GPUs either, so this should be ok.

@bartoldeman bartoldeman merged commit 74c9726 into easybuilders:develop Mar 23, 2022
@boegel boegel modified the milestones: 4.x, next release (4.5.4) Mar 25, 2022
@boegel
Member

boegel commented Mar 26, 2022

Works like a charm on the MI100 system I have access to @ RUG:

$ eb --show-system-info
System information (pg-lab02.hpc.rug.nl):

* OS:
  -> name: RHEL
  -> type: Linux
  -> version: 8.4
  -> platform name: x86_64-unknown-linux

* CPU:
  -> vendor: AMD
  -> architecture: x86_64
  -> family: AMD
  -> arch name: UNKNOWN (archspec is not installed?)
  -> model: AMD EPYC 7542 32-Core Processor
  -> speed: 2894.608
  -> cores: 64
  -> features: 3dnowprefetch,abm,adx,aes,amd_ppin,aperfmperf,apic,arat,avic,avx,avx2,bmi1,bmi2,bpext,cat_l3,cdp_l3,clflush,clflushopt,clwb,clzero,cmov,cmp_legacy,constant_tsc,cpb,cpuid,cqm,cqm_llc,cqm_mbm_local,cqm_mbm_total,cqm_occup_llc,cr8_legacy,cx16,cx8,de,decodeassists,extapic,extd_apicid,f16c,flushbyasid,fma,fpu,fsgsbase,fxsr,fxsr_opt,ht,hw_pstate,ibpb,ibrs,ibs,irperf,lahf_lm,lbrv,lm,mba,mca,mce,misalignsse,mmx,mmxext,monitor,movbe,msr,mtrr,mwaitx,nonstop_tsc,nopl,npt,nrip_save,nx,osvw,overflow_recov,pae,pat,pausefilter,pclmulqdq,pdpe1gb,perfctr_core,perfctr_llc,perfctr_nb,pfthreshold,pge,pni,popcnt,pse,pse36,rdpid,rdrand,rdseed,rdt_a,rdtscp,rep_good,sep,sev,sev_es,sha_ni,skinit,smap,smca,sme,smep,ssbd,sse,sse2,sse4_1,sse4_2,sse4a,ssse3,stibp,succor,svm,svm_lock,syscall,tce,topoext,tsc,tsc_scale,umip,v_spec_ctrl,v_vmsave_vmload,vgif,vmcb_clean,vme,vmmcall,wbnoinvd,wdt,x2apic,xgetbv1,xsave,xsavec,xsaveerptr,xsaveopt,xsaves

* GPU:
  -> AMD
    -> 2x 0x0c34, 4.18.0-348.12.2.el8_5.x86_64

* software:
  -> glibc version: 2.28
  -> Python binary: /usr/bin/python3
  -> Python version: 3.6.8

@branfosj
Member

* GPU:
  -> AMD
    -> 2x 0x0c34, 4.18.0-348.12.2.el8_5.x86_64

Is that the right part of the GPU info output?

@boegel
Member

boegel commented Mar 26, 2022

Yeah, I just realised that could be better ^_^

Here's the raw output:

$ rocm-smi --showdriverversion --csv
device,Driver version
cardsystem,4.18.0-348.12.2.el8_5.x86_64
$ rocm-smi --showproductname --csv
device,Card series,Card model,Card vendor,Card SKU
card0,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316
card1,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316

@SebastianAchilles So from this output, it should be taking the 2nd field ([1]), not the 3rd ([2]) as it is now?
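To make the field question concrete, here is a minimal sketch of how the two CSV outputs above could be reduced to the summary line shown under "* GPU:". It is not EasyBuild's actual implementation: the helper names (`count_by_field`, `driver_version`, `summarize`) are hypothetical, and the sample text is copied from the raw output in this comment.

```python
import csv
from collections import Counter

# Raw output of 'rocm-smi --showproductname --csv' as posted above
PRODUCT_CSV = """\
device,Card series,Card model,Card vendor,Card SKU
card0,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316
card1,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316
"""

# Raw output of 'rocm-smi --showdriverversion --csv' as posted above
DRIVER_CSV = """\
device,Driver version
cardsystem,4.18.0-348.12.2.el8_5.x86_64
"""


def count_by_field(product_csv, index):
    """Count cards per value of the CSV column at the given index (hypothetical helper)."""
    rows = list(csv.reader(product_csv.strip().splitlines()))
    # rows[0] is the header line; skip it
    return Counter(row[index] for row in rows[1:] if len(row) > index)


def driver_version(driver_csv):
    """Return the driver version from the second column of the first data row."""
    rows = list(csv.reader(driver_csv.strip().splitlines()))
    return rows[1][1]


def summarize(product_csv, driver_csv, index):
    """Build '<count>x <name>, <driver version>' lines for the chosen column."""
    version = driver_version(driver_csv)
    counts = count_by_field(product_csv, index)
    return ["%dx %s, %s" % (num, name, version) for name, num in counts.items()]


by_model = summarize(PRODUCT_CSV, DRIVER_CSV, 2)   # 'Card model' column
by_series = summarize(PRODUCT_CSV, DRIVER_CSV, 1)  # 'Card series' column
print(by_model)
print(by_series)
```

With index `[2]` the MI100 cards show up only as `0x0c34`, while index `[1]` yields the more readable card series string.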

@SebastianAchilles
Member Author

The idea was to print the card model, which is the 3rd field ([2]):

$ rocm-smi --showproductname


======================= ROCm System Management Interface =======================
================================= Product Info =================================
GPU[0]          : Card series:          Fiji [Radeon R9 FURY / NANO Series]
GPU[0]          : Card model:           Radeon R9 FURY X
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             C88001
================================================================================
============================= End of ROCm SMI Log ==============================
$ rocm-smi --showproductname --csv
device,Card series,Card model,Card vendor,Card SKU
card0,Fiji [Radeon R9 FURY / NANO Series],Radeon R9 FURY X,Advanced Micro Devices Inc. [AMD/ATI],C88001

If Card model does not produce informative output on some cards, we could consider changing it to the Card series, the 2nd field ([1]), instead.

@boegel Could you share the output of rocm-smi --showproductname?
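One way the fallback suggested in this comment could look is sketched below. This is only an illustration under stated assumptions: the helper `gpu_labels` and the "looks like a hex ID" heuristic are hypothetical, and this is not necessarily what EasyBuild or the follow-up PR does. The column names are taken from the CSV header shown in this thread.

```python
import csv
import re

# Assumption: an uninformative card model looks like a bare hex ID, e.g. '0x0c34'
HEX_ID = re.compile(r"^0x[0-9a-fA-F]+$")


def gpu_labels(product_csv):
    """Return one label per card: the card model, or the card series as fallback."""
    reader = csv.DictReader(product_csv.strip().splitlines())
    labels = []
    for row in reader:
        model = (row.get("Card model") or "").strip()
        series = (row.get("Card series") or "").strip()
        if not model or HEX_ID.match(model):
            # model is missing or just a hex ID: prefer the series if available
            labels.append(series or model)
        else:
            labels.append(model)
    return labels


# Sample outputs of 'rocm-smi --showproductname --csv' copied from this thread
FIJI_CSV = """\
device,Card series,Card model,Card vendor,Card SKU
card0,Fiji [Radeon R9 FURY / NANO Series],Radeon R9 FURY X,Advanced Micro Devices Inc. [AMD/ATI],C88001
"""

MI100_CSV = """\
device,Card series,Card model,Card vendor,Card SKU
card0,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316
card1,Arcturus GL-XL [AMD Instinct MI100],0x0c34,Advanced Micro Devices Inc. [AMD/ATI],D34316
"""

fiji_labels = gpu_labels(FIJI_CSV)
mi100_labels = gpu_labels(MI100_CSV)
print(fiji_labels)
print(mi100_labels)
```

Looking columns up by header name rather than by position also avoids the `[1]` versus `[2]` ambiguity, at the cost of depending on the exact header text that rocm-smi prints.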

@SebastianAchilles
Member Author

The output above is for a newer driver version:

$ rocm-smi --showdriverversion --csv
device,Driver version
cardsystem,5.11.14

@SebastianAchilles
Member Author

I opened a follow-up PR to also add the AMD card series: #3982

@SebastianAchilles SebastianAchilles changed the title from "add support for collecting GPU info (via rocm-smi)" to "add support for collecting AMD GPU info (via rocm-smi)" Mar 28, 2022
@boegel boegel changed the title from "add support for collecting AMD GPU info (via rocm-smi)" to "also try collecting AMD GPU info (via rocm-smi) for --show-system-info" Mar 29, 2022