@luqman Musings and posts from Luqman. Zola 2022-10-23T15:00:00+00:00 https://luqman.ca/atom.xml Multi-Kernel Drifting 2022-10-23T15:00:00+00:00 2022-10-23T15:00:00+00:00 https://luqman.ca/blog/multi-kernel-drifting/ <p>I was setting up some automation to build Windows images pre-loaded with some drivers and software (a story for another day). I had already gotten it working with QEMU under KVM on Linux but wanted to port it to <a href="https://github.com/oxidecomputer/propolis">propolis</a> on our illumos distro, Helios. I figured it should be mostly straightforward; maybe a couple different flags or utilities to futz around with the disk images and mount them. Which was the case. Mostly. That is except for the one minor detail of not being able to mount an NTFS image.</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec mount -F ntfs-3g $LOOPBACK_DEV /mnt/test </span><span>fuse: mount failed: Not a directory </span></code></pre> <span id="continue-reading"></span><h2 id="the-setup">the setup</h2> <p>Ok, let's step back a second. To give some context, I was trying to create a raw image that contained an NTFS partition. Maybe it didn't like the way I created the GPT? 
Ok, let's try something simpler and forget partitions for a moment and just try solely creating an NTFS file system:</p> <ol> <li>Create an empty disk image:</li> </ol> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ qemu-img create -f raw test.img 8G </span><span>Formatting &#39;test.img&#39;, fmt=raw size=8589934592 </span></code></pre> <ol start="2"> <li>Create loopback device:</li> </ol> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec lofiadm -l -a test.img </span><span>/dev/dsk/c2t1d0p0 </span></code></pre> <ol start="3"> <li>Create an NTFS file system:</li> </ol> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ mkntfs -Q /dev/dsk/c2t1d0p0 </span><span>The sector size was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 512 bytes. </span><span>The partition start sector was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0. </span><span>The number of sectors per track was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0. </span><span>The number of heads was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0. </span><span>Cluster size has been automatically set to 4096 bytes. </span><span>To boot from a device, Windows needs the &#39;partition start sector&#39;, the &#39;sectors per track&#39; and the &#39;number of heads&#39; to be set. </span><span>Windows will not be able to boot from this device. </span><span>Creating NTFS volume structures. </span><span>mkntfs completed successfully. Have a nice day. 
</span></code></pre> <p>Ok, that's a lot of warnings (that I did handle properly in the real scenario!) but they shouldn't be relevant right now. We don't care if Windows can't boot off of this image.</p> <ol start="4"> <li>Mount the file system:</li> </ol> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec mount -F ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test </span><span>fuse: mount failed: Not a directory </span></code></pre> <p>Something's clearly wrong.</p> <h3 id="linux">linux</h3> <p>Mind you, this same scenario using the FUSE-based ntfs-3g driver works on Linux:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>➜ ~ qemu-img create -f raw test.img 8G </span><span>Formatting &#39;test.img&#39;, fmt=raw size=8589934592 </span><span>➜ ~ sudo losetup -f --show test.img </span><span>/dev/loop1 </span><span>➜ ~ sudo mkntfs -Q /dev/loop1 </span><span>The partition start sector was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0. </span><span>The number of sectors per track was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0. </span><span>The number of heads was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0. </span><span>Cluster size has been automatically set to 4096 bytes. </span><span>To boot from a device, Windows needs the &#39;partition start sector&#39;, the &#39;sectors per track&#39; and the &#39;number of heads&#39; to be set. </span><span>Windows will not be able to boot from this device. </span><span>Creating NTFS volume structures. </span><span>mkntfs completed successfully. Have a nice day. 
</span><span>➜ ~ mkdir test </span><span>➜ ~ sudo mount -t ntfs-3g /dev/loop1 test </span><span>➜ ~ echo hello &gt; test/world </span><span>➜ ~ cat test/world </span><span>hello </span></code></pre> <h2 id="ntfs-3g">ntfs-3g</h2> <p>If you're using NTFS on a non-Windows system, chances are you're using some <a href="https://github.com/tuxera/ntfs-3g">ntfs-3g</a>-based driver. It is used along with Filesystem in USErspace (FUSE) to provide access to NTFS volumes. This arrangement consists of two parts: the FUSE kernel driver and the userspace application that links against <code>libfuse</code>. In this case, that is <code>ntfs-3g</code>.</p> <p>Let's skip the <code>mount</code> wrapper and just ask <code>ntfs-3g</code> directly to mount our image:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test </span><span>fuse: mount failed: Not a directory </span></code></pre> <p>Alas, still no good. I guess we gotta dig.</p> <h3 id="truss">truss</h3> <p>On Linux you have <code>strace</code> to trace system calls. On illumos there's <code>truss</code>:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec truss ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test </span><span>execve(&quot;/opt/ooce/ntfs-3g/bin/ntfs-3g&quot;, 0xFFFFFC7FFFDFDE48, 0xFFFFFC7FFFDFDE68) argc = 3 </span><span>sysinfo(SI_MACHINE, &quot;i86pc&quot;, 257) = 6 </span><span>[...snip...] 
</span><span>mount(&quot;/devices/pseudo/lofi@1:q&quot;, &quot;/mnt/test&quot;, MS_NOSUID|MS_OPTIONSTR, &quot;fuse&quot;, 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR </span><span>open(&quot;/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSLIB.mo&quot;, O_RDONLY) Err#2 ENOENT </span><span>fstat(2, 0xFFFFFC7FFFDFC720) = 0 </span><span>fuse: mount failed: write(2, &quot; f u s e : m o u n t &quot;.., 20) = 20 </span><span>Not a directorywrite(2, &quot; N o t a d i r e c t&quot;.., 15) = 15 </span><span> </span><span>write(2, &quot;\n&quot;, 1) = 1 </span><span>close(5) = 0 </span><span>fdsync(4, FSYNC) = 0 </span><span>fcntl(4, F_SETLK, 0xFFFFFC7FFFDFDBD0) = 0 </span><span>close(4) = 0 </span><span>_exit(21) </span><span> </span></code></pre> <p>Hmmm, <code>mount(&quot;/devices/pseudo/lofi@1:q&quot;, &quot;/mnt/test&quot;, MS_NOSUID|MS_OPTIONSTR, &quot;fuse&quot;, 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR</code>.</p> <p>That <code>ENOTDIR</code> error is not from <code>ntfs-3g</code>, in fact we see it returns an exit code of <code>21</code> and the manual page tells us that's an &quot;Unclassified FUSE error&quot;.</p> <p>The <code>mount</code> syscall here is what returned that <code>ENOTDIR</code> and its manual page says:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span> ENOTDIR </span><span> The dir argument is not a directory, or a component of </span><span> a path prefix is not a directory. </span></code></pre> <p>Not a directory you say?</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ file /mnt/test </span><span>/mnt/test: directory </span></code></pre> <p>Presumably it is the fuse kernel driver which is handling the <code>mount</code> syscall in this case. 
One quick way to check: <a href="http://dtrace.org/blogs/about/">DTrace</a>.</p> <h3 id="dtrace">dtrace</h3> <p><a href="https://illumos.org/books/dtrace/chp-intro.html#chp-intro">DTrace on illumos</a> offers a wealth of information on a live system. With a lot of introspection capabilities, it makes for a great debugging tool. I'm still learning to reach for it, but it works perfectly here:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>$ pfexec dtrace -n &#39;fuse::return /arg1 == ENOTDIR &amp;&amp; pid == $target/ { stack(); }&#39; -c &quot;ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test&quot; </span><span>dtrace: description &#39;fuse::return &#39; matched 98 probes </span><span>fuse: mount failed: Not a directory </span><span>dtrace: pid 11852 has exited </span><span>CPU ID FUNCTION:NAME </span><span> 1 17911 fuse_mount:return </span><span> genunix`fsop_mount+0x14 </span><span> genunix`domount+0x948 </span><span> genunix`mount+0xfe </span><span> genunix`syscall_ap+0x98 </span><span> unix`sys_syscall+0x17d </span></code></pre> <p>So what did we do there? We ran dtrace (<code>pfexec dtrace</code>) and:</p> <ol> <li> <p>told it what probes to match (<code>fuse::return</code>)</p> <p>Probes are specified as <code>[[[provider:] module:] function:] name</code>, where any unspecified field acts as a wildcard.</p> <p>We want to match the exit (<code>return</code>) probe of any function in the <code>fuse</code> kernel module.</p> </li> <li> <p>for any such probes, a predicate to further filter them (<code>/arg1 == ENOTDIR &amp;&amp; pid == $target/</code>)</p> <p><code>arg1</code> for a <code>return</code> probe corresponds to its return value. 
Here we compare to the error we're looking for: <code>ENOTDIR</code>.</p> <p><code>pid == $target</code> is to further constrain it by using the provided <code>$target</code> macro which refers to:</p> </li> <li> <p>the command we want to trace (<code>-c &quot;ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test&quot;</code>)</p> </li> <li> <p>and what actions to take for any matches (<code>{ stack(); }</code>)</p> <p>DTrace has a number of actions to inspect the system, here we use the <code>stack()</code> to record and print out the kernel stack trace for our matched probes.</p> </li> </ol> <h2 id="fuse">FUSE</h2> <p>FUSE seems to be the one stumbling over our supposedly &quot;not a directory&quot; directory. Now that <code>dtrace</code> was helpful enough to point out where the error comes from, let's take a look at the <a href="https://github.com/jurikm/illumos-fusefs/blob/ef9a33d4a18131a8c0e50002b138b0431e5db616/kernel/fuse_vfsops.c#L367">code</a>.</p> <p>It certainly doesn't take long to find the spot:</p> <pre data-lang="c" style="background-color:#151515;color:#e8e8d3;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#8fbfdc;">static int </span><span style="color:#fad07a;">fuse_mount</span><span>(</span><span style="color:#8fbfdc;">struct</span><span> vfs *</span><span style="color:#ffb964;">vfsp</span><span>, </span><span style="color:#8fbfdc;">struct</span><span> vnode *</span><span style="color:#ffb964;">mvp</span><span>, </span><span style="color:#8fbfdc;">struct</span><span> mounta *</span><span style="color:#ffb964;">uap</span><span>, </span><span> </span><span style="color:#8fbfdc;">struct</span><span> cred *</span><span style="color:#ffb964;">cr</span><span>) </span><span>{ </span><span> fuse_vfs_data_t *vfsdata; </span><span> fuse_session_t *se; </span><span> dev_t dev; </span><span> </span><span style="color:#8fbfdc;">char </span><span>*fdstr; </span><span> </span><span style="color:#8fbfdc;">int</span><span> err; </span><span> 
</span><span> </span><span style="color:#8fbfdc;">if </span><span>(</span><span style="color:#ffb964;">secpolicy_fs_mount</span><span>(cr, mvp, vfsp) != </span><span style="color:#cf6a4c;">0</span><span>) </span><span> </span><span style="color:#8fbfdc;">return </span><span>(EPERM); </span><span> </span><span> </span><span style="color:#8fbfdc;">if </span><span>(mvp-&gt;v_type != VDIR) </span><span> </span><span style="color:#8fbfdc;">return </span><span>(ENOTDIR); </span></code></pre> <p>Every file is allocated a <code>vnode</code> and <code>mvp</code> here should represent the one for our mountpoint (<code>/mnt/test</code>). <code>mount</code> understandably requires you only mount things at a directory and so every file system driver should verify that is the case, just as FUSE does here. But if <code>/mnt/test</code> isn't a directory (<code>VDIR</code>), what is it?</p> <p>Back to dtrace!</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ pfexec dtrace -n &#39;fuse_mount:entry /pid == $target/ { printf(&quot;v_type = %d&quot;, args[1]-&gt;v_type); }&#39; -c &quot;ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test&quot; </span><span>dtrace: description &#39;fuse_mount:entry &#39; matched 1 probe </span><span>fuse: mount failed: Not a directory </span><span>dtrace: pid 11921 has exited </span><span>CPU ID FUNCTION:NAME </span><span> 6 17910 fuse_mount:entry v_type = 2 </span></code></pre> <p>This time we match just on entry to <code>fuse_mount</code> and for an entry probe we have access to <code>args</code>, which allows typed access to the function arguments. In this case we print out the <code>v_type</code> field of the second arg (<code>mvp = args[1]</code>). 
Let's take a look at the <a href="https://github.com/illumos/illumos-gate/blob/1a613b61205f4ee9a9fb00184dbe6cae17a6ede7/usr/src/uts/common/sys/vnode.h#L161-L174">enum definition</a>:</p> <pre data-lang="c" style="background-color:#151515;color:#e8e8d3;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#8fbfdc;">typedef enum</span><span> vtype { </span><span> VNON = </span><span style="color:#cf6a4c;">0</span><span>, </span><span> VREG = </span><span style="color:#cf6a4c;">1</span><span>, </span><span> VDIR = </span><span style="color:#cf6a4c;">2</span><span>, </span><span> VBLK = </span><span style="color:#cf6a4c;">3</span><span>, </span><span> VCHR = </span><span style="color:#cf6a4c;">4</span><span>, </span><span> VLNK = </span><span style="color:#cf6a4c;">5</span><span>, </span><span> VFIFO = </span><span style="color:#cf6a4c;">6</span><span>, </span><span> VDOOR = </span><span style="color:#cf6a4c;">7</span><span>, </span><span> VPROC = </span><span style="color:#cf6a4c;">8</span><span>, </span><span> VSOCK = </span><span style="color:#cf6a4c;">9</span><span>, </span><span> VPORT = </span><span style="color:#cf6a4c;">10</span><span>, </span><span> VBAD = </span><span style="color:#cf6a4c;">11 </span><span>} </span><span style="color:#ffb964;">vtype_t</span><span>; </span></code></pre> <p>...it's <code>VDIR</code>?</p> <p>This is when I started questioning my sanity a little. Theories of weird corruption happening between function entry and the condition check. Was <code>secpolicy_fs_mount</code> secretly modifying it? 
(No.)</p> <p>Eventually I decide to look at the actual code running on my machine and use the kernel debugger to disassemble the <code>fuse</code> module in-memory:</p> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">$ pfexec mdb </span><span>-</span><span style="color:#fad07a;">k </span><span style="color:#fad07a;">Loading modules: </span><span>[ </span><span style="color:#fad07a;">unix genunix specfs dtrace mac </span><span>cpu</span><span style="color:#fad07a;">.generic uppc apix scsi_vhci zfs sata </span><span style="color:#ffb964;">ip </span><span style="color:#fad07a;">hook neti sockfs arp usba xhci mm smbios stmf stmf_sbd lofs crypto random cpc ufs logindmux nsmb ptm smbsrv nf </span><span style="color:#fad07a;">s </span><span>] </span><span style="color:#fad07a;">&gt; fuse`fuse_mount::dis </span><span>[...snip...] </span><span style="color:#fad07a;">fuse_mount</span><span>+0x28</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">call </span><span>+0x3a21ec3 </span><span style="color:#fad07a;">&lt;secpolicy_fs_mount&gt; </span><span style="color:#fad07a;">fuse_mount</span><span>+0x2d</span><span style="color:#fad07a;">: testl %</span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">fuse_mount</span><span>+0x2f</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jne </span><span>+0x110 </span><span style="color:#fad07a;">&lt;fuse_mount</span><span>+0x145</span><span style="color:#fad07a;">&gt; </span><span style="color:#fad07a;">fuse_mount</span><span>+0x35</span><span style="color:#fad07a;">: cmpl </span><span>$0x2,0x30</span><span style="color:#fad07a;">(%</span><span style="color:#ffb964;">r13</span><span style="color:#fad07a;">) </span><span 
style="color:#fad07a;">fuse_mount</span><span>+0x3a</span><span style="color:#fad07a;">: movl </span><span>$0x14,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">ebx </span><span style="color:#fad07a;">fuse_mount</span><span>+0x3f</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jne </span><span>+0x100 </span><span style="color:#fad07a;">&lt;fuse_mount</span><span>+0x145</span><span style="color:#fad07a;">&gt; </span></code></pre> <p><code>0x14</code> is <code>ENOTDIR</code> and the <code>cmpl $0x2,0x30(%r13)</code> would line up with the <code>v_type != VDIR</code> check. But a quick <a href="https://godbolt.org/z/axM6bMPx4">hacked-up</a> validation of that offset does not square:</p> <blockquote> <p>v_type is at: 0x28</p> </blockquote> <p>How about comparing other file system drivers since they all have the same check:</p> <p>NFS?</p> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">nfs_mount</span><span>+0x66</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">call </span><span>+0x3678985 </span><span style="color:#fad07a;">&lt;secpolicy_fs_mount&gt; </span><span style="color:#fad07a;">nfs_mount</span><span>+0x6b</span><span style="color:#fad07a;">: testl %</span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">nfs_mount</span><span>+0x6d</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jne </span><span>+0x2fd </span><span style="color:#fad07a;">&lt;nfs_mount</span><span>+0x370</span><span style="color:#fad07a;">&gt; </span><span style="color:#fad07a;">nfs_mount</span><span>+0x73</span><span style="color:#fad07a;">: cmpl </span><span>$0x2,0x28</span><span style="color:#fad07a;">(%</span><span style="color:#ffb964;">rbx</span><span 
style="color:#fad07a;">) </span><span style="color:#fad07a;">nfs_mount</span><span>+0x77</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jne </span><span>+0x31b </span><span style="color:#fad07a;">&lt;nfs_mount</span><span>+0x398</span><span style="color:#fad07a;">&gt; </span><span>[...snip...] </span><span style="color:#fad07a;">nfs_mount</span><span>+0x398</span><span style="color:#fad07a;">: movl </span><span>$0x14,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;"># ENOTDIR </span><span style="color:#fad07a;">nfs_mount</span><span>+0x39d</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jmp </span><span>-0x2f </span><span style="color:#fad07a;">&lt;nfs_mount</span><span>+0x370</span><span style="color:#fad07a;">&gt; </span></code></pre> <p>It uses an offset of <code>0x28</code>. We need another data point, tmpfs?</p> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">tmp_mount</span><span>+0x58</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">call </span><span>+0x3f1a993 </span><span style="color:#fad07a;">&lt;secpolicy_fs_mount&gt; </span><span style="color:#fad07a;">tmp_mount</span><span>+0x5d</span><span style="color:#fad07a;">: testl %</span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">tmp_mount</span><span>+0x5f</span><span style="color:#fad07a;">: movl %</span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">r15d </span><span style="color:#fad07a;">tmp_mount</span><span>+0x62</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">jne </span><span>+0xc </span><span 
style="color:#fad07a;">&lt;tmp_mount</span><span>+0x70</span><span style="color:#fad07a;">&gt; </span><span style="color:#fad07a;">tmp_mount</span><span>+0x64</span><span style="color:#fad07a;">: cmpl </span><span>$0x2,0x28</span><span style="color:#fad07a;">(%</span><span style="color:#ffb964;">rbx</span><span style="color:#fad07a;">) </span><span style="color:#fad07a;">tmp_mount</span><span>+0x68</span><span style="color:#fad07a;">: movl </span><span>$0x14,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">r15d </span><span style="color:#fad07a;"># ENOTDIR </span><span style="color:#fad07a;">tmp_mount</span><span>+0x6e</span><span style="color:#fad07a;">: </span><span style="color:#8fbfdc;">je </span><span>+0x30 </span><span style="color:#fad07a;">&lt;tmp_mount</span><span>+0xa0</span><span style="color:#fad07a;">&gt; </span></code></pre> <p>It also uses an offset of <code>0x28</code>.</p> <h3 id="local-build">local build</h3> <p>Something's definitely going on here. At this point it's looking like the fuse driver has a different idea of what the <code>vnode</code> struct looks like. If for some reason it was compiled without <code>_LP64</code> defined then an offset of <code>0x30</code> could make sense but would certainly lead to other issues. And this is definitely a 64-bit module:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ file /usr/kernel/drv/amd64/fuse </span><span>/usr/kernel/drv/amd64/fuse: ELF 64-bit LSB relocatable AMD64 Version 1 </span></code></pre> <p>This is the point I took a detour into building the module locally. After some time trawling through build scripts I got it built. 
TL;DR:</p> <pre data-lang="bash" style="background-color:#151515;color:#e8e8d3;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#ffb964;">wget</span><span> https://mirrors.omnios.org/fuse/Version-1.4.tar.gz</span><span style="color:#ffb964;"> -O</span><span> illumos-fusefs-Version-1.4.tar.gz </span><span style="color:#ffb964;">gtar</span><span> xf illumos-fusefs-Version-1.4.tar.gz </span><span>cd illumos-fusefs-Version-1.4/kernel/amd64 </span><span style="color:#ffb964;">PATH</span><span>=</span><span style="color:#99ad6a;">$</span><span style="color:#ffb964;">PATH</span><span style="color:#99ad6a;">:/opt/onbld/bin/i386 </span><span style="color:#ffb964;">dmake</span><span> CC=gcc CFLAGS=</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">-fident -fno-builtin -fno-asm -nodefaultlibs -Wall -Wno-unknown-pragmas -Wno-unused -fno-inline-functions -m64 -mcmodel=kernel -g -O2 -fno-inline -ffreestanding -fno-strict-aliasing -Wpointer-arith -gdwarf-2 -std=gnu99 -mno-red-zone -D_KERNEL -D__SOLARIS__ -mindirect-branch=thunk-extern -mindirect-branch-register</span><span style="color:#556633;">&quot; </span></code></pre> <p>Now to check what offset our newly built driver uses:</p> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">$ objdump </span><span>-</span><span style="color:#fad07a;">D fuse | less </span><span>0000000000008040 </span><span style="color:#fad07a;">&lt;fuse_mount&gt;: </span><span>[...snip...] 
</span><span style="color:#fad07a;"> </span><span>8069</span><span style="color:#fad07a;">: e8 </span><span>00 00 00 00 </span><span style="color:#8fbfdc;">call </span><span style="color:#fad07a;">806e &lt;fuse_mount</span><span>+0x2e</span><span style="color:#fad07a;">&gt; </span><span style="color:#fad07a;"> 806e: </span><span>85 </span><span style="color:#fad07a;">c0 </span><span style="color:#8fbfdc;">test </span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;"> </span><span>8070</span><span style="color:#fad07a;">: 0f </span><span>85 </span><span style="color:#fad07a;">0c </span><span>01 00 00 </span><span style="color:#8fbfdc;">jne </span><span>8182 </span><span style="color:#fad07a;">&lt;fuse_mount</span><span>+0x142</span><span style="color:#fad07a;">&gt; </span><span style="color:#fad07a;"> </span><span>8076</span><span style="color:#fad07a;">: </span><span>83 </span><span style="color:#fad07a;">7b </span><span>28 02 </span><span style="color:#fad07a;">cmpl </span><span>$0x2,0x28</span><span style="color:#fad07a;">(%</span><span style="color:#ffb964;">rbx</span><span style="color:#fad07a;">) </span><span style="color:#fad07a;"> 807a: </span><span>41 </span><span style="color:#fad07a;">be </span><span>14 00 00 00 </span><span style="color:#8fbfdc;">mov </span><span>$0x14,</span><span style="color:#fad07a;">%</span><span style="color:#ffb964;">r14d </span><span style="color:#fad07a;"> </span><span>8080</span><span style="color:#fad07a;">: 0f </span><span>85 </span><span style="color:#fad07a;">fc </span><span>00 00 00 </span><span style="color:#8fbfdc;">jne </span><span>8182 </span><span style="color:#fad07a;">&lt;fuse_mount</span><span>+0x142</span><span style="color:#fad07a;">&gt; </span></code></pre> <p>It's <code>0x28</code>! Not <code>0x30</code>! Does that mean it would work? 
Let's try:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span># unload the current driver </span><span>$ modinfo | grep fuse </span><span>257 fffffffff800d000 d188 284 1 fuse (fuse driver) </span><span>257 fffffffff800d000 d188 28 1 fuse (filesystem for fuse) </span><span>$ pfexec modunload -i 257 </span><span> </span><span># load the newly built one </span><span>$ pfexec modload ./fuse </span><span> </span><span># try mounting the image again </span><span>$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test </span><span>$ echo hello &gt; /mnt/test/world </span><span>$ cat /mnt/test/world </span><span>hello </span></code></pre> <p>🎉 Success! 🎉</p> <p>Although, in this case success just brings more questions than answers.</p> <h2 id="breakthrough">breakthrough</h2> <p>At this point I'm really confused. Using the pre-built binary package fails on Helios. Building the driver locally works on Helios.</p> <p>I also gave <a href="https://omnios.org/">OmniOS</a> (a different illumos distro and the source for a lot of the build scripts used for Helios packages) a try. The pre-built packages worked there too. But it was on OmniOS that I discovered that the &quot;bad&quot; offset of <code>0x30</code> was actually fine!? And not just for the FUSE driver but also NFS and tmpfs.</p> <p>Eventually, while trying to figure out this odd difference between Helios and OmniOS, came the breakthrough. Recall we were able to use typed arguments in our DTrace commands; that is enabled by the fact that a lot of software on illumos comes with Compressed Type Format (CTF) data. CTF is a compact representation of data types and function signatures stored inside ELF objects. It is much smaller than the DWARF it is derived from. 
The smaller footprint makes it easy to ship by default and enable rich use cases like DTrace.</p> <p>We can use <code>ctfdump</code> to print out all the CTF data in our pre-built vs locally built driver and compare the <code>vnode</code> definitions used:</p> <p>First for the local build:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ctfdump -t ./fuse | grep -A 8 &#39;struct vnode (&#39; </span><span> &lt;208&gt; struct vnode (216 bytes) </span><span> v_lock type=98 off=0 </span><span> v_flag type=28 off=64 </span><span> v_count type=28 off=96 </span><span> v_data type=36 off=128 </span><span> v_vfsp type=735 off=192 </span><span> v_stream type=736 off=256 </span><span> v_type type=230 off=320 </span><span> v_rdev type=65 off=384 </span></code></pre> <p><code>v_type</code> is at offset 320 bits = 40 / 0x28 bytes, as expected. What about the pre-built?</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ctfdump -t /usr/kernel/drv/amd64/fuse | grep -A 8 &#39;struct vnode (&#39; </span><span> &lt;229&gt; struct vnode (224 bytes) </span><span> v_lock type=103 off=0 </span><span> v_flag type=28 off=64 </span><span> v_count type=28 off=96 </span><span> v_phantom_count type=28 off=128 </span><span> v_data type=37 off=192 </span><span> v_vfsp type=830 off=256 </span><span> v_stream type=831 off=320 </span><span> v_type type=255 off=384 </span></code></pre> <p>Would you look at that: <code>v_type</code> shows an offset of 384 bits = 48 / 0x30 bytes. The even more suspicious line is this field that's not present in our local version: <code>v_phantom_count</code> (aptly named in this instance).</p> <p>So uh, what gives? 
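</p> <p>To convince myself that a single extra 4-byte field really is enough to move <code>v_type</code> from <code>0x28</code> to <code>0x30</code>, here is a rough sketch that models only the leading fields from the two <code>ctfdump</code> listings above. The class names <code>UpstreamVnode</code> and <code>ForkVnode</code> are made up for illustration, the field types are stand-ins (an 8-byte integer for the mutex and for each 64-bit pointer), and it assumes an LP64 machine where 8-byte fields are 8-byte aligned:</p>

```python
import ctypes


class UpstreamVnode(ctypes.Structure):
    """Leading fields of struct vnode as seen in the locally built module."""
    _fields_ = [
        ("v_lock", ctypes.c_uint64),    # stand-in for the 8-byte mutex
        ("v_flag", ctypes.c_uint32),
        ("v_count", ctypes.c_uint32),
        ("v_data", ctypes.c_uint64),    # 64-bit pointers modelled as integers
        ("v_vfsp", ctypes.c_uint64),
        ("v_stream", ctypes.c_uint64),
        ("v_type", ctypes.c_int32),     # enum vtype
    ]


class ForkVnode(ctypes.Structure):
    """Same layout with the extra v_phantom_count field from the pre-built module."""
    _fields_ = [
        ("v_lock", ctypes.c_uint64),
        ("v_flag", ctypes.c_uint32),
        ("v_count", ctypes.c_uint32),
        ("v_phantom_count", ctypes.c_uint32),  # 4 bytes, then 4 bytes of padding
        ("v_data", ctypes.c_uint64),
        ("v_vfsp", ctypes.c_uint64),
        ("v_stream", ctypes.c_uint64),
        ("v_type", ctypes.c_int32),
    ]


for layout in (UpstreamVnode, ForkVnode):
    # Each field descriptor exposes its byte offset within the structure.
    print(f"{layout.__name__}: v_type is at {layout.v_type.offset:#x}")
```

<p>The new field only needs 4 bytes, but the pointer that follows it has to sit on an 8-byte boundary, so everything after it slides down by 8 bytes. A module compiled against one layout and loaded into a kernel using the other ends up reading <code>v_type</code> from the wrong place.</p> <p>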
The <a href="https://github.com/illumos/illumos-gate/blob/1a613b61205f4ee9a9fb00184dbe6cae17a6ede7/usr/src/uts/common/sys/vnode.h#L284-L292">upstream</a> header (which we're more or less using in Helios) certainly doesn't contain it. A little searching leads us to <a href="https://github.com/TritonDataCenter/illumos-joyent/pull/305">this PR</a> adding it to SmartOS's (another distro) illumos fork. But what's probably more relevant in this case is that it also exists in the <a href="https://github.com/omniosorg/illumos-omnios/blob/f8bf0ba10cd0088767e6da200297cfe385ae0ac3/usr/src/uts/common/sys/vnode.h#L291">OmniOS fork</a>.</p> <p>A couple of messages later and that more-or-less explains it: this package meant for Helios was accidentally built on an OmniOS box, which has a slightly different definition of some kernel structures.</p> <p><img src="https://luqman.ca/blog/multi-kernel-drifting/images/multi-track-drifting-meme.jpg" alt="Multi-Track Drifting Meme: Train labeled &quot;FUSE&quot; with front wheels on track labeled &quot;OmniOS Kernel&quot; and back labeled &quot;Helios Kernel&quot;. Bottom panel is close-up of manga character's eyes looking surprised/intense with action bubble to the left, &quot;MULTI-KERNEL DRIFTING!!!&quot;" /></p> Windows NVMe Blues 2022-05-01T15:00:00+00:00 2022-07-07T15:00:00+00:00 https://luqman.ca/blog/windows-nvme-blues/ <p>Emboldened by the newly landed <a href="https://github.com/oxidecomputer/propolis/pull/113">VNC support</a>, we decided to give booting Windows in Propolis a go. Unfortunately, it didn't quite work right away.</p> <span id="continue-reading"></span><h2 id="provisioning">Provisioning</h2> <h3 id="creating-a-windows-vm">Creating a Windows VM</h3> <p>The VNC server built into Propolis is currently only one-way: it will show you the guest framebuffer but doesn't relay any input from a client (e.g. mouse or keyboard) back to the guest. 
This poses a challenge when it comes to installing Windows since we kinda need to be able to poke at the installer. But we can make our way around that easily enough by trying to boot an existing image instead. To create such an image is simple enough with QEMU:</p> <pre data-lang="bash" style="background-color:#151515;color:#e8e8d3;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#888888;"># Create an empty disk image to install to </span><span style="color:#ffb964;">truncate -s</span><span> 40GiB $</span><span style="color:#ffb964;">WIN_IMAGE </span><span> </span><span style="color:#ffb964;">QEMU_ARGS</span><span>=( </span><span> -nodefaults </span><span> -name guest=wintest,debug-threads=on </span><span> -enable-kvm </span><span> -M pc </span><span> -m 2048 </span><span> -cpu host </span><span> -smp 4,sockets=1,cores=4 </span><span> -rtc base=localtime </span><span> </span><span style="color:#888888;"># OVMF UEFI firmware (See Propolis README) </span><span> -drive if=pflash,format=raw,readonly=on,file=$</span><span style="color:#ffb964;">OVMF_CODE </span><span> </span><span style="color:#888888;"># Boot Drive backed by $WIN_IMAGE </span><span> -device nvme,drive=drivec,serial=deadbeef </span><span> -drive if=none,id=drivec,file=$</span><span style="color:#ffb964;">WIN_IMAGE</span><span>,format=raw </span><span> </span><span style="color:#888888;"># Virtio-based NIC </span><span> -netdev tap,ifname=wintestnet,id=net0,script=no,downscript=no </span><span> -device virtio-net-pci,netdev=net0 </span><span> </span><span style="color:#888888;"># RAMFB Display Device </span><span> -device ramfb </span><span> </span><span style="color:#888888;"># Windows Installer ISO (not needed after install) </span><span> -device ide-cd,drive=win-disk,id=cd-disk1,unit=0,bus=ide.0 </span><span> -drive file=$</span><span style="color:#ffb964;">WIN_ISO</span><span>,if=none,id=win-disk,media=cdrom </span><span> </span><span style="color:#888888;"># 
Virtio Drivers ISO - For Virtio NIC support (not needed after install) </span><span> -device ide-cd,drive=virtio-disk,id=cd-disk2,unit=0,bus=ide.1 </span><span> -drive file=$</span><span style="color:#ffb964;">VIRTIO_ISO</span><span>,if=none,id=virtio-disk,media=cdrom </span><span>) </span><span style="color:#ffb964;">qemu-system-x86_64 </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">${</span><span style="color:#ffb964;">QEMU_ARGS[@]</span><span style="color:#99ad6a;">}</span><span style="color:#556633;">&quot; </span></code></pre> <p>With this we can create a VM in QEMU that mostly matches the virtual hardware supported by Propolis (e.g., i440FX chipset, NVMe boot drive, VirtIO NIC). A little bit later and we're greeted with a newly installed Windows system.</p> <h3 id="serial-console">Serial Console</h3> <p>Windows has a serial console (aka <code>Emergency Management Services</code>) which can print out early boot errors. Before we shut down and try booting it in Propolis, let's enable the serial console:</p> <pre data-lang="powershell" style="background-color:#151515;color:#e8e8d3;" class="language-powershell "><code class="language-powershell" data-lang="powershell"><span style="color:#888888;"># Admin prompt </span><span>PS &gt; bcdedit /ems on </span><span>PS &gt; bcdedit /emssettings EMSPORT:</span><span style="color:#cf6a4c;">1</span><span> EMSBAUDRATE:</span><span style="color:#cf6a4c;">115200 </span></code></pre> <p>By default, Windows Server SKUs also include the <code>Special Administration Console</code> (<code>SAC</code>) via <code>EMS</code> which offers an interactive session as well as letting you drop into a <code>CMD</code> prompt over serial. 
But, fret not if you don't have a Windows Server image handy, we can easily enable it for Desktop SKUs too:</p> <pre data-lang="powershell" style="background-color:#151515;color:#e8e8d3;" class="language-powershell "><code class="language-powershell" data-lang="powershell"><span style="color:#888888;"># Admin prompt </span><span>PS &gt; Add-WindowsCapability -Online -Name Windows.Desktop.EMS-SAC.Tools~~~~</span><span style="color:#cf6a4c;">0.0</span><span>.</span><span style="color:#cf6a4c;">1.0 </span></code></pre> <p>We need to reboot to complete the installation but that'll give us an opportunity to see the console. To do that we'll need to slightly modify the QEMU command line by adding an extra flag: <code>-serial stdio</code>.</p> <p>With that, on next boot we should see something like:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>BdsDxe: loading Boot0007 &quot;Windows Boot Manager&quot; from HD(1,GPT,78455C93-77D8-4B62-8D6E-588FB0E91060,0x800,0x32000)/\EFI\Microsoft\Boot\bootmgfw.efi </span><span>BdsDxe: starting Boot0007 &quot;Windows Boot Manager&quot; from HD(1,GPT,78455C93-77D8-4B62-8D6E-588FB0E91060,0x800,0x32000)/\EFI\Microsoft\Boot\bootmgfw.efi </span><span> </span><span>&lt;?xml version=&quot;1.0&quot;?&gt; </span><span>&lt;machine-info&gt; </span><span>&lt;name&gt;WINTEST&lt;/name&gt; </span><span>&lt;guid&gt;00000000-0000-0000-0000-000000000000&lt;/guid&gt; </span><span>&lt;processor-architecture&gt;AMD64&lt;/processor-architecture&gt; </span><span>&lt;os-version&gt;10.0&lt;/os-version&gt; </span><span>&lt;os-build-number&gt;19041&lt;/os-build-number&gt; </span><span>&lt;os-product&gt;Windows 10&lt;/os-product&gt; </span><span>&lt;os-service-pack&gt;None&lt;/os-service-pack&gt; </span><span>&lt;/machine-info&gt; </span><span> </span><span>Computer is booting, SAC started and initialized. </span><span> </span><span>Use the &quot;ch -?&quot; command for information about using channels. 
</span><span>Use the &quot;?&quot; command for general help. </span><span> </span><span> </span><span>SAC&gt; </span><span>EVENT: The CMD command is now available. </span><span>SAC&gt; </span></code></pre> <h3 id="remote-access">Remote Access</h3> <p>For some more fun, let's also enable RDP and SSH which will give us more options to interact with the VM:</p> <pre data-lang="powershell" style="background-color:#151515;color:#e8e8d3;" class="language-powershell "><code class="language-powershell" data-lang="powershell"><span style="color:#888888;"># Admin prompt </span><span> </span><span style="color:#888888;"># Enable RDP </span><span>PS &gt; Set-ItemProperty </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">HKLM:\SYSTEM\CurrentControlSet\Control\Terminal Server\</span><span style="color:#556633;">&quot; </span><span>-Name </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">fDenyTSConnections</span><span style="color:#556633;">&quot; </span><span>-Value </span><span style="color:#cf6a4c;">0 </span><span> </span><span style="color:#888888;"># Install SSH Server and make it autostart </span><span>PS &gt; Add-WindowsCapability -Online -Name OpenSSH.Server~~~~</span><span style="color:#cf6a4c;">0.0</span><span>.</span><span style="color:#cf6a4c;">1.0 </span><span>PS &gt; Start-Service sshd </span><span>PS &gt; Set-Service -Name sshd -StartupType Automatic </span></code></pre> <p>(As an aside, so glad Windows includes SSH nowadays. The <code>ssh.exe</code> client is enabled by default too!)</p> <h2 id="booting-atop-propolis">Booting atop Propolis</h2> <p>Ok, so we've created our Windows VM image, setup the serial console and even enabled SSH &amp; RDP. But all that is for naught if we don't tell Propolis about it.</p> <h3 id="propolis-setup">Propolis Setup</h3> <p>Since the VNC support in Propolis is only exposed in the <code>propolis-server</code> frontend, that's what we'll use. 
First off, we'll just create the TOML describing the VM. We hardcode all the devices we want rather than creating them via REST API calls.</p> <p>After substituting the paths to the OVMF blob and the windows image we created, we're ready to rumble:</p> <pre data-lang="toml" style="background-color:#151515;color:#e8e8d3;" class="language-toml "><code class="language-toml" data-lang="toml"><span style="color:#888888;"># windows.toml </span><span> </span><span style="color:#ffb964;">bootrom </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">/path/to/$OVMF_CODE</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">block_dev</span><span>.</span><span style="color:#ffb964;">c_drive</span><span>] </span><span style="color:#ffb964;">type </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">file</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">path </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">$WIN_IMAGE</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">dev</span><span>.</span><span style="color:#ffb964;">block0</span><span>] </span><span style="color:#ffb964;">driver </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">pci-nvme</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">block_dev </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">c_drive</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">pci-path </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">0.5.0</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">dev</span><span>.</span><span style="color:#ffb964;">net0</span><span>] 
</span><span style="color:#ffb964;">driver </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">pci-virtio-viona</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">vnic </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">vnic0</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">pci-path </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">0.6.0</span><span style="color:#556633;">&quot; </span></code></pre> <p>Let's also create the host-side of the vNIC the VM will bind to:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ dladm create-vnic -t -l $(dladm show-phys -p -o LINK) vnic0 </span></code></pre> <p>This creates a new virtual NIC atop the existing physical link so the guest will appear as just another device on the LAN. 
This also means it should acquire an IP address via DHCP if that's setup for the network.</p> <h3 id="setup">Setup</h3> <p>To get everything up and running, we're going to need a couple of terminals.</p> <p>First, we need to start <code>propolis-server</code> which exposes a REST API for creating and managing VMs:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ sudo cargo run --bin propolis-server -- run windows.toml 0.0.0.0:12400 </span></code></pre> <p>(We need <code>sudo</code>/<code>pfexec</code> because illumos <a href="https://www.illumos.org/issues/12714">does not have</a> a privilege to allow access to hypervisor resources currently.)</p> <p>Leaving propolis running, we now need to tell it to actually create the VM instance via the REST API which we can do with our command line tool:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span># Create a 4 core VM w/ 2GiB of RAM </span><span>$ cargo run --bin propolis-cli -- -s 127.0.0.1 new -c 4 -m 2048 wintest </span></code></pre> <p>Looking back at the <code>propolis-server</code> terminal, we should see it having successfully created the VM. But it is currently in a stopped state so before we tell it to run, let's attach our serial console and VNC client.</p> <p>The serial console is served by <code>propolis-server</code> over a WebSocket but the <code>propolis-cli</code> tool makes it simple to access and interact with:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ cargo run --bin propolis-cli -- -s 127.0.0.1 serial </span></code></pre> <p>(<strong>Note</strong>: <kbd>Ctrl</kbd>-<kbd>C</kbd> will exit the serial console. 
To pass it along to the guest instead, we first prefix it with <kbd>Ctrl</kbd>-<kbd>A</kbd>.)</p> <p>We can use a VNC client (like <code>noVNC</code> or <code>vncviewer</code>) by pointing it at the VNC server, which is listening on port <code>5900</code>.</p> <h3 id="running">Running</h3> <p>With that, all our ducks have been lined up. Let's tell Propolis to run the VM!</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ cargo run --bin propolis-cli -- -s 127.0.0.1 state run </span></code></pre> <p>Things start off kinda well! We're greeted with the serial console actually outputting stuff:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>BdsDxe: loading Boot0007 &quot;Windows Boot Manager&quot; from HD(1,GPT,78455C93-77D8-4B62-8D6E-588FB0E91060,0x800,0x32000)/\EFI\Microsoft\Boot\bootmgfw.efi </span><span>BdsDxe: starting Boot0007 &quot;Windows Boot Manager&quot; from HD(1,GPT,78455C93-77D8-4B62-8D6E-588FB0E91060,0x800,0x32000)/\EFI\Microsoft\Boot\bootmgfw.efi </span><span>&lt;?xml version=&quot;1.0&quot;?&gt; </span><span>&lt;machine-info&gt; </span><span>&lt;name&gt;WINTEST&lt;/name&gt; </span><span>&lt;guid&gt;00000000-0000-0000-0000-000000000000&lt;/guid&gt; </span><span>&lt;processor-architecture&gt;AMD64&lt;/processor-architecture&gt; </span><span>&lt;os-version&gt;10.0&lt;/os-version&gt; </span><span>&lt;os-build-number&gt;19041&lt;/os-build-number&gt; </span><span>&lt;os-product&gt;Windows 10&lt;/os-product&gt; </span><span>&lt;os-service-pack&gt;None&lt;/os-service-pack&gt; </span><span>&lt;/machine-info&gt; </span><span>Computer is booting, SAC started and initialized. </span><span> </span><span>Use the &quot;ch -?&quot; command for information about using channels. </span><span>Use the &quot;?&quot; command for general help. 
</span><span> </span><span>SAC&gt; </span></code></pre> <p>We even sorta see the little windows boot animation over VNC:</p> <p><img src="https://luqman.ca/blog/windows-nvme-blues/images/boot-spinning.png" alt="A screenshot of Windows booting in Propolis over VNC, albeit a bit garbled" /></p> <p>But soon enough we run into a “bluescreen”:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>&lt;?xml&gt;&lt;BP&gt; </span><span>&lt;INSTANCE CLASSNAME=&quot;BLUESCREEN&quot;&gt; </span><span>&lt;PROPERTY NAME=&quot;STOPCODE&quot; TYPE=&quot;string&quot;&gt;&lt;VALUE&gt;&quot;0x7B&quot;&lt;/VALUE&gt;&lt;/PROPERTY&gt;&lt;machine-info&gt; </span><span>&lt;name&gt;WINTEST&lt;/name&gt; </span><span>&lt;guid&gt;00000000-0000-0000-0000-000000000000&lt;/guid&gt; </span><span>&lt;processor-architecture&gt;AMD64&lt;/processor-architecture&gt; </span><span>&lt;os-version&gt;10.0&lt;/os-version&gt; </span><span>&lt;os-build-number&gt;19041&lt;/os-build-number&gt; </span><span>&lt;os-product&gt;Windows 10&lt;/os-product&gt; </span><span>&lt;os-service-pack&gt;None&lt;/os-service-pack&gt; </span><span>&lt;/machine-info&gt; </span><span>&lt;/INSTANCE&gt; </span><span>&lt;/BP&gt; </span><span> </span><span>!SAC&gt; </span><span>Your device ran into a problem and needs to restart. </span><span>If you call a support person, give them this info: </span><span>INACCESSIBLE_BOOT_DEVICE </span><span> </span><span> </span><span>0xFFFFCE0B72206868 </span><span>0xFFFFFFFFC0000034 </span><span>0x0000000000000000 </span><span>0x0000000000000001 </span><span> </span><span>!SAC&gt;? </span><span>d Display all log entries, paging is on. </span><span>help Display this list. </span><span>restart Restart the system immediately. </span><span>? Display this list. </span><span> </span><span>!SAC&gt;d </span><span>20:06:49.065 : KRNL: Loading \Driver\Wdf01000. </span><span>20:06:49.065 : KRNL: Load succeeded. 
</span><span>20:06:49.065 : KRNL: Loading \Driver\acpiex </span><span>20:06:49.065 : KRNL: Load succeeded. </span><span>[...snip...] </span><span>20:07:23.252 : KRNL: Loading \Driver\hwpolicy </span><span>20:07:23.252 : KRNL: Load failed. </span><span>20:07:23.252 : KRNL: Loading \Driver\disk </span><span>20:07:23.252 : KRNL: Load succeeded. </span><span>20:07:53.846 : KRNL: Failed marking boot partition. </span></code></pre> <figure class="center" > <img src="images/OSOD.png" alt="Windows displaying the bugcheck screen known as the Bluescreen. But due to VNC bugs it shows up as orange." /> <figcaption class="center">(Or perhaps, an orangescreen? 🤨)</figcaption> </figure> <p>Well, what now? For whatever reason, Windows is unable to access the boot disk. Definitely makes you go &quot;wait a second&quot; as it had to have recognized the disk enough to find the boot loader and make it this far. But at least this tells us we're failing in Windows proper and not much earlier like the boot manager or OS loader.</p> <p><strong>EDIT</strong>: We see an orange screen instead of blue because the VNC client and server do not agree on the pixel format. The client side thinks the framebuffer is encoded as <code>XBGR8888</code> whereas the server side is actually sending pixel values encoded as <code>XRGB8888</code>, i.e., the red and blue components are getting swapped.</p> <p><strong>EDIT (July 7, 2022)</strong>: The Orange Screen of Death is no more! Thanks to some work from my colleague to better handle multiple pixel formats [<a href="https://github.com/oxidecomputer/propolis/pull/151">PR</a>].</p> <h2 id="debugging">Debugging</h2> <h3 id="bluescreen">Bluescreen</h3> <p>So what are bluescreens anyways? A bluescreen is the screen displayed by Windows when it encounters some sort of unrecoverable error. Such a crash, aka bug check, can occur due to hardware or software issues.</p> <p>Notwithstanding any bugs in rendering, the bugcheck screen is not always blue! 
If you encounter a bugcheck, for example, on a Windows Insider build, you'll be greeted by a Green Screen of Death!</p> <p><del>If you wanna have some fun, you can change it by modifying <code>%SystemRoot%\SYSTEM.INI</code>.</del></p> <p><strong>EDIT</strong>: Ok, that's not really true anymore for modern versions of Windows. But of course Raymond Chen's blog <em>The Old New Thing</em> has a <a href="https://devblogs.microsoft.com/oldnewthing/20220201-00/?p=106209">post</a> on exactly this.</p> <h3 id="bug-check-code-7b">Bug Check Code 7B</h3> <p>Back to our crash at hand. Both the serial console and VNC give us a stop code of <code>INACCESSIBLE_BOOT_DEVICE</code>. Microsoft does provide a list of bug checks along with their possible causes and resolutions. We can find the relevant one <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x7b--inaccessible-boot-device">here</a>.</p> <p>Well, the possible causes sure are vague. Clearly, something changed about our virtual hardware platform between Propolis and QEMU that Windows is not happy with. Unfortunately, there isn't much else to go on.</p> <p>Every bug check may also include some parameters with more details about the crash. Those are the four hex values that were printed to the serial console. The Microsoft docs, though, only tell us what parameter 1 may be:</p> <blockquote> <p>The address of a <code>UNICODE_STRING</code> structure, or the address of the device object that could not be mounted.</p> </blockquote> <p>To make sense of that we'll have to break out the kernel debugger!</p> <h3 id="debug-setup">Debug Setup</h3> <p>The basic setup here is connecting the target Windows machine (Propolis VM) to another Windows machine (the debugger) in some way supported by the Windows Kernel Debugger. 
For that, there are a couple of options:</p> <ol> <li>kdnet: the recommended (and fastest) option, but you need a NIC that supports it (VirtIO does not).</li> <li>kdusb: requires special USB debug cables, not to mention the complete lack of USB support in Propolis.</li> <li>kdcom: over a serial port. A classic.</li> </ol> <p>Propolis basically constrains us to option (3): debugging over serial. To do that, we're going to switch from <code>propolis-server</code> to <code>propolis-standalone</code> because it conveniently pipes the first serial port to a unix domain socket. We can connect that to a different Windows machine that will act as the debugger.</p> <h4 id="debugger">Debugger</h4> <p>Since our serial is virtual anyways and I also don't have a bare metal machine with a serial port, the debugger will just be another VM in QEMU. The setup is similar to above except we make sure to include the argument <code>-serial tcp::9999,server,nowait</code>. This tells QEMU to listen on port <code>9999</code> (arbitrary) and once a connection is made, to proxy the virtual serial port over a TCP socket.</p> <h5 id="windbg">WinDbg</h5> <p>Inside the debugger VM, we need to install the necessary debugging components. I've found <a href="https://apps.microsoft.com/store/detail/windbg-preview/9PGJGD53TN86?hl=en-us&amp;gl=US">WinDbg Preview</a> to work well enough for me but <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools">Classic WinDbg</a> still lives on (they both use the same underlying debugger engine).</p> <p>Once installed, we start debugging by choosing &quot;Attach to kernel&quot; using <code>COM1</code> as the port and a baud rate of <code>115200</code>. Also enable &quot;Break on connection&quot;/&quot;Initial Break&quot;. With that, the debugger side is ready. 
You can leave it running throughout and restart the debuggee as needed.</p> <h4 id="debuggee">Debuggee</h4> <p>Our Propolis VM will be the debugging target but we first need to pop back over to QEMU so we can tell Windows to connect to the debugger on boot.</p> <p>Since we're going to be using the serial port for debugging, let's disable the serial console:</p> <pre data-lang="powershell" style="background-color:#151515;color:#e8e8d3;" class="language-powershell "><code class="language-powershell" data-lang="powershell"><span style="color:#888888;"># Admin prompt </span><span>PS &gt; bcdedit /ems off </span></code></pre> <p>Then we can enable debugging over the serial port instead, matching the baud rate we chose on the debugger side:</p> <pre data-lang="powershell" style="background-color:#151515;color:#e8e8d3;" class="language-powershell "><code class="language-powershell" data-lang="powershell"><span style="color:#888888;"># Admin prompt </span><span>PS &gt; bcdedit /debug on </span><span>PS &gt; bcdedit /dbgsettings serial debugport:</span><span style="color:#cf6a4c;">1</span><span> baudrate:</span><span style="color:#cf6a4c;">115200 </span></code></pre> <p>Before trying it out in Propolis, we can make sure we set it up right with QEMU. Just replace any <code>-serial</code> argument with <code>-serial tcp:host-or-ip-of-debugger-vm:9999</code>. This will have QEMU initiate a TCP connection to the debugger VM and proxy its serial port over it. In this way, we should have both VMs hooked up together and should see signs of life in WinDbg shortly afterwards:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>Microsoft (R) Windows Debugger Version 10.0.22549.1000 AMD64 </span><span>Copyright (c) Microsoft Corporation. All rights reserved. </span><span> </span><span>Opened \\.\com1 </span><span>Waiting to reconnect... 
</span><span>Connected to Windows 10 19041 x64 target at (Sun May 1 07:32:44.449 2022 (UTC - 7:00)), ptr64 TRUE </span><span>Kernel Debugger connection established. (Initial Breakpoint requested) </span><span> </span><span>************* Path validation summary ************** </span><span>Response Time (ms) Location </span><span>Deferred srv* </span><span>Symbol search path is: srv* </span><span>Executable search path is: </span><span>Windows 10 Kernel Version 19041 MP (1 procs) Free x64 </span><span>Edition build lab: 19041.1.amd64fre.vb_release.191206-1406 </span><span>Machine Name: </span><span>Kernel base = 0xfffff800`14601000 PsLoadedModuleList = 0xfffff800`1522b3b0 </span><span>System Uptime: 0 days 0:00:00.000 </span><span>nt!DebugService2+0x5: </span><span>fffff800`149fe105 cc int 3 </span></code></pre> <h4 id="debugging-propolis-vm">Debugging Propolis VM</h4> <p>Ok, our Windows image is setup for kernel debugging, now to try launching it under Propolis. As mentioned, we're going to switch over to <code>propolis-standalone</code> so we can get at the serial port via a unix socket. 
For that, we're gonna need a slightly different <code>windows.toml</code>:</p> <pre data-lang="toml" style="background-color:#151515;color:#e8e8d3;" class="language-toml "><code class="language-toml" data-lang="toml"><span>[</span><span style="color:#ffb964;">main</span><span>] </span><span style="color:#ffb964;">name </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">wintest</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">cpus </span><span>= </span><span style="color:#cf6a4c;">4 </span><span style="color:#ffb964;">memory </span><span>= </span><span style="color:#cf6a4c;">2048 </span><span style="color:#ffb964;">bootrom </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">/path/to/$OVMF_CODE</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">block_dev</span><span>.</span><span style="color:#ffb964;">c_drive</span><span>] </span><span style="color:#ffb964;">type </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">file</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">path </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">$WIN_IMAGE</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">dev</span><span>.</span><span style="color:#ffb964;">block0</span><span>] </span><span style="color:#ffb964;">driver </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">pci-nvme</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">block_dev </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">c_drive</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">pci-path </span><span>= </span><span style="color:#556633;">&quot;</span><span 
style="color:#99ad6a;">0.5.0</span><span style="color:#556633;">&quot; </span><span> </span><span>[</span><span style="color:#ffb964;">dev</span><span>.</span><span style="color:#ffb964;">net0</span><span>] </span><span style="color:#ffb964;">driver </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">pci-virtio-viona</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">vnic </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">vnic0</span><span style="color:#556633;">&quot; </span><span style="color:#ffb964;">pci-path </span><span>= </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">0.6.0</span><span style="color:#556633;">&quot; </span></code></pre> <p>This is basically the same as our earlier TOML with the addition of the first 4 lines where we've hardcoded the name, vCPU count and memory instead of providing it via a REST API. With that we can run Propolis and we should see it paused waiting for a connection to the serial port (a unix socket created in the same directory named <code>ttya</code>):</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ sudo cargo run --release --bin propolis-standalone -- windows.toml </span><span>May 01 14:40:12.968 INFO VM created, name: wintest </span><span>- 1: lpc-bhyve-atpic </span><span>- 2: lpc-bhyve-atpit </span><span>- 3: lpc-bhyve-hpet </span><span>- 4: lpc-bhyve-ioapic </span><span>- 5: lpc-bhyve-rtc </span><span>- 6: chipset-i440fx </span><span> - 7: pci-piix4-hb </span><span> - 8: pci-piix3-lpc </span><span> - 9: pci-piix3-pm </span><span> - 10: lpc-bhyve-pmtimer </span><span>- 11: lpc-uart-com1 </span><span>- 12: lpc-uart-com2 </span><span>- 13: lpc-uart-com3 </span><span>- 14: lpc-uart-com4 </span><span>- 15: lpc-ps2ctrl </span><span>- 16: qemu-lpc-debug </span><span>- 17: 
pci-nvme-0.5.0 </span><span> - 18: block-file-/home/luqman/VMs/IMGs/windows.img </span><span>- 19: pci-virtio-viona-0.6.0 </span><span>- 20: qemu-fwcfg </span><span>- 21: qemu-ramfb </span><span>May 01 14:40:12.997 ERRO Waiting for a connection to ttya </span></code></pre> <p>We'll use <code>socat</code> to proxy the unix socket attached to Propolis' serial port to our debugger VM:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ sudo socat UNIX-CONNECT:./ttya TCP-CONNECT:host-or-ip-of-debugger-vm:9999 </span></code></pre> <p><code>sudo</code> here is needed since <code>ttya</code> was created by <code>propolis-standalone</code> which we also ran with <code>sudo</code>.</p> <p>(Ideally, it'd be nice to have propolis just learn how to make a direct connection rather than needing <code>socat</code>. But also, I love <code>socat</code>! It's one of my favourite tools.)</p> <p>With that, we should be in business and kernel debugging a Windows guest running in Propolis!</p> <h2 id="investigating">Investigating</h2> <p>We now have a kernel debugger, but where to begin? Well, why not see if we still hit the same bug check when running under the debugger. The debugger should be waiting for input so just type <code>g</code> (Go) to let the target continue running. We did request an initial break so it may stop again early in kernel initialization, so just hit <code>g</code> again.</p> <p>It is at this point we remember how slow kernel debugging makes things, not to mention over emulated serial and a VM at that.</p> <h3 id="bug-check">Bug Check</h3> <p>Huzzah! 
We hit the same error (believe me, it'd be worse if it worked under the debugger!)</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; g </span><span>IOINIT: Built-in driver \Driver\sacdrv failed to initialize with status - 0xC0000037 </span><span>We are running at normal mode. </span><span>KDTARGET: Refreshing KD connection </span><span> </span><span>*** Fatal System Error: 0x0000007b </span><span> (0xFFFFF68C42606868,0xFFFFFFFFC0000034,0x0000000000000000,0x0000000000000001) </span><span> </span><span>Break instruction exception - code 80000003 (first chance) </span><span> </span><span>A fatal system error has occurred. </span><span>Debugger entered on first try; Bugcheck callbacks have not been invoked. </span><span> </span><span>A fatal system error has occurred. </span><span> </span><span>For analysis of this file, run !analyze -v </span><span>nt!DbgBreakPointWithStatus: </span><span>fffff803`27a040b0 cc int 3 </span></code></pre> <p>The first line about <code>sacdrv</code> failing to initialize is innocuous. That's just because we disabled the EMS console but left the SAC components installed.</p> <p>Other than that we've got a lot of the same info we saw before, but it does give us a command to run: <code>!analyze -v</code>. (WinDbg Preview even helpfully lets you just click on it.)</p> <details> <summary><code>!analyze -v</code></summary> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !analyze -v </span><span>Connected to Windows 10 19041 x64 target at (Sun May 1 07:59:23.729 2022 (UTC - 7:00)), ptr64 TRUE </span><span>Loading Kernel Symbols </span><span>......... </span><span> </span><span>Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long. 
</span><span>Run !sym noisy before .reload to track down problems loading symbols. </span><span> </span><span>...................................................... </span><span>................................................................ </span><span>.... </span><span>Loading User Symbols </span><span> </span><span>Loading unloaded module list </span><span>... </span><span>******************************************************************************* </span><span>* * </span><span>* Bugcheck Analysis * </span><span>* * </span><span>******************************************************************************* </span><span> </span><span>INACCESSIBLE_BOOT_DEVICE (7b) </span><span>During the initialization of the I/O system, it is possible that the driver </span><span>for the boot device failed to initialize the device that the system is </span><span>attempting to boot from, or it is possible for the file system that is </span><span>supposed to read that device to either fail its initialization or to simply </span><span>not recognize the data on the boot device as a file system structure that </span><span>it recognizes. In the former case, the argument (#1) is the address of a </span><span>Unicode string data structure that is the ARC name of the device from which </span><span>the boot was being attempted. In the latter case, the argument (#1) is the </span><span>address of the device object that could not be mounted. </span><span>If this is the initial setup of the system, then this error can occur if </span><span>the system was installed on an unsupported disk or SCSI controller. Note </span><span>that some controllers are supported only by drivers which are in the Windows </span><span>Driver Library (WDL) which requires the user to do a custom install. See </span><span>the Windows Driver Library for more information. 
</span><span>This error can also be caused by the installation of a new SCSI adapter or </span><span>disk controller or repartitioning the disk with the system partition. If </span><span>this is the case, on x86 systems the boot.ini file must be edited or on ARC </span><span>systems setup must be run. See the &quot;Advanced Server System Administrator&#39;s </span><span>User Guide&quot; for information on changing boot.ini. </span><span>If the argument is a pointer to an ARC name string, then the format of the </span><span>first two (and in this case only) longwords will be: </span><span> USHORT Length; </span><span> USHORT MaximumLength; </span><span> PWSTR Buffer; </span><span>That is, the first longword will contain something like 00800020 where 20 </span><span>is the actual length of the Unicode string, and the next longword will </span><span>contain the address of buffer. This address will be in system space, so </span><span>the high order bit will be set. </span><span>If the argument is a pointer to a device object, then the format of the first </span><span>word will be: </span><span> USHORT Type; </span><span>That is, the first word will contain a 0003, where the Type code will ALWAYS </span><span>be 0003. </span><span>Note that this makes it immediately obvious whether the argument is a pointer </span><span>to an ARC name string or a device object, since a Unicode string can never </span><span>have an odd number of bytes, and a device object will always have a Type </span><span>code of 3. 
</span><span>Arguments: </span><span>Arg1: fffff68c42606868, Pointer to the device object or Unicode string of ARC name </span><span>Arg2: ffffffffc0000034, (reserved) </span><span>Arg3: 0000000000000000, (reserved) </span><span>Arg4: 0000000000000001, (reserved) </span><span> </span><span>Debugging Details: </span><span>------------------ </span><span> </span><span>KEY_VALUES_STRING: 1 </span><span> Key : Analysis.CPU.mSec </span><span> Value: 1937 </span><span> Key : Analysis.DebugAnalysisManager </span><span> Value: Create </span><span> Key : Analysis.Elapsed.mSec </span><span> Value: 6825 </span><span> Key : Analysis.Init.CPU.mSec </span><span> Value: 5905 </span><span> Key : Analysis.Init.Elapsed.mSec </span><span> Value: 1713912 </span><span> Key : Analysis.Memory.CommitPeak.Mb </span><span> Value: 110 </span><span> Key : WER.OS.Branch </span><span> Value: vb_release </span><span> Key : WER.OS.Timestamp </span><span> Value: 2019-12-06T14:06:00Z </span><span> Key : WER.OS.Version </span><span> Value: 10.0.19041.1 </span><span> </span><span>BUGCHECK_CODE: 7b </span><span>BUGCHECK_P1: fffff68c42606868 </span><span>BUGCHECK_P2: ffffffffc0000034 </span><span>BUGCHECK_P3: 0 </span><span>BUGCHECK_P4: 1 </span><span> </span><span>PROCESS_NAME: System </span><span> </span><span>STACK_TEXT: </span><span>fffff68c`42606078 fffff803`27b18882 : fffff68c`426061e0 fffff803`27983940 00000000`00000000 00000000`00000000 : nt!DbgBreakPointWithStatus </span><span>fffff68c`42606080 fffff803`27b17e66 : 00000000`00000003 fffff68c`426061e0 fffff803`27a110c0 00000000`0000007b : nt!KiBugCheckDebugBreak+0x12 </span><span>fffff68c`426060e0 fffff803`279fc317 : fffff803`261dccd0 fffff803`27beef3e ffffffff`c0000034 00000000`000000c8 : nt!KeBugCheck2+0x946 </span><span>fffff68c`426067f0 fffff803`27aabe0e : 00000000`0000007b fffff68c`42606868 ffffffff`c0000034 00000000`00000000 : nt!KeBugCheckEx+0x107 </span><span>fffff68c`42606830 fffff803`2805b69d : ffffe08b`e44a09c0 fffff803`261dccd0 
ffffffff`8000036c 00000000`00000001 : nt!PnpBootDeviceWait+0xf1eca </span><span>fffff68c`426068c0 fffff803`28042c20 : fffff803`00000000 fffff803`2824c700 00000000`00000006 fffff803`261dccd0 : nt!IopInitializeBootDrivers+0x511 </span><span>fffff68c`42606a70 fffff803`28067abd : fffff803`2ba6cfc0 fffff803`261dccd0 fffff803`27d9b6a0 fffff803`261dcc00 : nt!IoInitSystemPreDrivers+0xb24 </span><span>fffff68c`42606bb0 fffff803`27d9b6db : fffff803`261dccd0 fffff803`2824e068 fffff803`27d9b6a0 fffff803`261dccd0 : nt!IoInitSystem+0x15 </span><span>fffff68c`42606be0 fffff803`278a99a5 : ffffca03`daca95c0 fffff803`27d9b6a0 fffff803`261dccd0 00000000`00000000 : nt!Phase1Initialization+0x3b </span><span>fffff68c`42606c10 fffff803`27a03868 : fffff803`2654f180 ffffca03`daca95c0 fffff803`278a9950 00000000`00000000 : nt!PspSystemThreadStartup+0x55 </span><span>fffff68c`42606c60 00000000`00000000 : fffff68c`42607000 fffff68c`42601000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28 </span><span> </span><span>SYMBOL_NAME: nt!PnpBootDeviceWait+f1eca </span><span>MODULE_NAME: nt </span><span>IMAGE_NAME: ntkrnlmp.exe </span><span>IMAGE_VERSION: 10.0.19041.630 </span><span>STACK_COMMAND: .cxr; .ecxr ; kb </span><span>BUCKET_ID_FUNC_OFFSET: f1eca </span><span>FAILURE_BUCKET_ID: 0x7B_nt!PnpBootDeviceWait </span><span>OS_VERSION: 10.0.19041.1 </span><span>BUILDLAB_STR: vb_release </span><span>OSPLATFORM_TYPE: x64 </span><span>OSNAME: Windows 10 </span><span>FAILURE_ID_HASH: {135d3c47-59ae-2dc5-ff32-063555fd22bf} </span><span>Followup: MachineOwner </span><span>--------- </span></code></pre> </details> <p>Well, that dumped a whole lotta info but no smoking gun exactly 😅 But we press on.</p> <p>Let's check out parameter 1 of our bug check, which remember will either be a device object or some string. If it is a string, it won't be your standard nul-terminated string but rather something of the type <code>UNICODE_STRING</code> which is a length-delimited string. 
We can use the <code>dt</code> (Display Type) command to see what it looks like:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dt _UNICODE_STRING </span><span>nt!_UNICODE_STRING </span><span> +0x000 Length : Uint2B </span><span> +0x002 MaximumLength : Uint2B </span><span> +0x008 Buffer : Ptr64 Wchar </span></code></pre> <p><strong>Note</strong>: We gave <code>dt</code> an underscore-prefixed type name here because most Windows types are declared such that the struct name has an underscore and there's a typedef without, e.g.:</p> <pre data-lang="C" style="background-color:#151515;color:#e8e8d3;" class="language-C "><code class="language-C" data-lang="C"><span style="color:#8fbfdc;">typedef struct</span><span> _FOO { </span><span style="color:#888888;">/* fields */ </span><span>} FOO, *</span><span style="color:#ffb964;">PFOO</span><span>; </span></code></pre> <p>Looking at parameter 1 we see what does appear to be a valid string!</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dS fffff68c42606868 </span><span>ffffe08b`e45a1ae0 &quot;\ArcName\multi(0)disk(0)rdisk(0)&quot; </span><span>ffffe08b`e45a1b20 &quot;partition(3)&quot; </span></code></pre> <p>Ok, cool. But doesn't really help us. It's not like we need to identify which disk is failing. 
There's only but the one 😛 So what now?</p> <p><strong>EDIT</strong>: If you're curious what <code>ArcName</code> is or the details of this path format, the really old Microsoft Knowledge Base Article <a href="https://jeffpar.github.io/kbarchive/kb/102/Q102873/"><code>Q102873: BOOT.INI and ARC Path Naming Conventions and Usage</code></a> (very kindly archived by <a href="https://twitter.com/jeffpar">@jeffpar</a>) has a fantastic explanation.</p> <h3 id="tracing">Tracing</h3> <p>When confronted with a problem in a new area I like to get all the information I possibly can. 99% of it might be useless but I don't know enough to know that just yet. To that end, tracing can be really helpful to get the lay of the land.</p> <p>In our case here, one place to start is any kernel and driver traces we can find. Kernel drivers on Windows often log via <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-dbgprint"><code>DbgPrint</code></a>. We're going to try to tease out as many of those messages as we can. But first, let's restart the VM to go back to before the bug check.</p> <p>Once we're back in the debugger at the beginning again we can begin messing around. 
First step is enabling the global default mask for debug prints using the <code>ed</code> (Enter Value [Double Word]) command:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; ed nt!Kd_DEFAULT_Mask 0xFFFFFFFF </span></code></pre> <p>But we can also enable specific components which we can find like so:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; x nt!Kd_*_Mask </span></code></pre> <p>This will print a big ol list of which we select some relevant looking ones to start with:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; ed nt!Kd_PNPMGR_Mask 0xFFFFFFFF </span><span>kd&gt; ed nt!Kd_PCI_Mask 0xFFFFFFFF </span><span>kd&gt; ed nt!Kd_STORMINIPORT_Mask 0xFFFFFFFF </span><span>kd&gt; ed nt!Kd_STORPORT_Mask 0xFFFFFFFF </span></code></pre> <p>Already, this provides some gems:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>Intel Storage Driver Ver: 8.6.2.1019 </span><span> </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>totally need 0x6438 bytes for deviceExt memory </span><span>Enter DriverEntry(FFFF840C971148F0,FFFFF8077A974DE0) </span><span>Required extension size: max: 7976928 Min: 71960 </span><span>10156250 - STORMINI: Arcsas Driver entry rtnval = 0 </span></code></pre> <p>With some prior knowledge, we 
know that <code>storport</code> will log ETW traces as well. WinDbg comes with WMI tracing extensions that let us collect those traces. Some Googling <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/storport-event-log-extensions">leads</a> us to the <code>Microsoft-Windows-Storage-Storport</code> provider (with GUID <code>{c4636a1e-7986-4646-bf10-7bc3b4a76e8e}</code>). To start collecting those events, we create a new trace session:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !wmitrace.start mylogger -kd </span><span>kd&gt; !wmitrace.enable mylogger c4636a1e-7986-4646-bf10-7bc3b4a76e8e -level 0xFF -flag 0xFFFFFFFF </span><span>kd&gt; !wmitrace.dynamicprint 1 </span></code></pre> <p>Now, just let 'er rip, <code>g⏎</code>. And jackpot! We got something:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; g </span><span>[0]0004.0008::1601-01-01T00:04:56.4564911Z [Microsoft-Windows-StorPort/212v1]Dispatching an IOCTL. </span><span>[1]0004.0008::1601-01-01T00:04:56.5254670Z [Microsoft-Windows-StorPort/22v1]Initial PORT_CONFIGURATION_INFORMATION data </span><span>[1]0004.0008::1601-01-01T00:04:56.5635048Z [Microsoft-Windows-StorPort/558v2]Miniport notifies device(Port = 4294967295, Path = 255, Target = 255, Lun = 255) failed. 
</span><span>Corresponding Class Disk Device Guid: {00000000-0000-0000-0000-000000000000} </span><span>Adapter Guid: {2473ba50-c9a3-11ec-94f1-806e6f6e6963} </span><span>Miniport driver name: stornvme </span><span>VendorId: </span><span>ProductId: </span><span>SerialNumber: </span><span>AdapterSerialNumber: </span><span>Fault Code: 4 </span><span>Fault Description: MLBAR/MUBAR is not valid </span><span>[1]0004.0008::1601-01-01T00:04:56.5849092Z [Microsoft-Windows-StorPort/23v1]Final PORT_CONFIGURATION_INFORMATION data </span></code></pre> <p>Looks like <code>stornvme</code> (the in-box Windows NVMe driver) is not happy about the BARs on the NVMe controller. Based on the NVMe spec, <code>MLBAR</code>/<code>MUBAR</code> should correspond to <code>BAR0</code>/<code>BAR1</code>. We can take a quick look at what they're set to since we know the Bus.Device.Function representing the NVMe controller (it's hardcoded to <code>0.5.0</code> in the TOML we gave Propolis).</p> <p>We use the <code>!pci</code> extension to verbosely (<code>flags |= 0x1</code>) print the configuration space (<code>flags |= 0x100</code>) of the device at 0.5.0:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !pci 0x101 0 5 0 </span><span> </span><span>PCI Configuration Space (Segment:0000 Bus:00 Device:05 Function:00) </span><span>Common Header: </span><span> 00: VendorID 01de </span><span> 02: DeviceID 1000 </span><span> 04: Command 0400 InterruptDis </span><span> 06: Status 0010 CapList </span><span> 08: RevisionID 00 </span><span> 09: ProgIF 02 </span><span> 0a: SubClass 08 </span><span> 0b: BaseClass 01 </span><span> 0c: CacheLineSize 0000 </span><span> 0d: LatencyTimer 00 </span><span> 0e: HeaderType 00 </span><span> 0f: BIST 00 </span><span> 10: BAR0 fedfe004 </span><span> 14: BAR1 00000000 </span><span> 18: BAR2 00000000 </span><span> 1c: BAR3 00000000 </span><span> 20: BAR4 
80000000 </span><span> 24: BAR5 00000000 </span><span> 28: CBCISPtr 00000000 </span><span> 2c: SubSysVenID 01de </span><span> 2e: SubSysID 1000 </span><span> 30: ROMBAR 00000000 </span><span> 34: CapPtr 40 </span><span> 3c: IntLine 00 </span><span> 3d: IntPin 00 </span><span> 3e: MinGnt 00 </span><span> 3f: MaxLat 00 </span><span>Device Private: </span><span> 40: 03ff0011 00000004 00004004 ffffffff </span><span> 50: ffffffff ffffffff ffffffff ffffffff </span><span> 60: ffffffff ffffffff ffffffff ffffffff </span><span> 70: ffffffff ffffffff ffffffff ffffffff </span><span> 80: ffffffff ffffffff ffffffff ffffffff </span><span> 90: ffffffff ffffffff ffffffff ffffffff </span><span> a0: ffffffff ffffffff ffffffff ffffffff </span><span> b0: ffffffff ffffffff ffffffff ffffffff </span><span> c0: ffffffff ffffffff ffffffff ffffffff </span><span> d0: ffffffff ffffffff ffffffff ffffffff </span><span> e0: ffffffff ffffffff ffffffff ffffffff </span><span> f0: ffffffff ffffffff ffffffff ffffffff </span><span>Capabilities: </span><span> 40: CapID 11 MSI-X Capability </span><span> 41: NextPtr 00 </span><span> 42: MsgCtrl TableSize:0x3ff FuncMask:0 MSIXEnable:0 </span><span> 44: MSIXTable 00000004 ( BIR:4 Offset:0x0 ) </span><span> 48: PBATable 00004004 ( BIR:4 Offset:0x4000 ) </span></code></pre> <p>As expected, that is our NVMe controller (vendor = <code>0x1de</code> and device = <code>0x1000</code>). Furthermore, we have what seems like a reasonable <code>BAR0</code> of <code>fedfe004</code> and in fact, the low nibble being <code>4</code> (a memory BAR whose type bits are <code>0b10</code>) means we should treat <code>BAR0</code> and <code>BAR1</code> as a single 64-bit address (lower and upper 32 bits, respectively). We also have another entry at <code>BAR4</code>.</p> <p>This all certainly tracks with how the <a href="https://github.com/oxidecomputer/propolis/blob/4c9fbd1b3cd75896308264a60e6df3a011797807/propolis/src/hw/nvme/mod.rs#L541-L545">code</a> is set up in Propolis. 
<code>BAR0</code>/<code>BAR1</code> are used for the NVMe controller registers and IO doorbells whereas <code>BAR4</code> holds the MSI-X Table and Pending Bit Array.</p> <h3 id="code-inspection">Code Inspection</h3> <p>At this point we've enabled some tracing and gotten some hints but still not quite enough to really figure it out. We could try searching for more traces to enable (idea: mess with <code>storport</code>/<code>stornvme</code>'s WPP tracing control block?) but let's try a different tack now.</p> <h4 id="driverentry">DriverEntry</h4> <p>We know that <code>stornvme</code> is responsible for NVMe devices but since it is a miniport driver, it works with the <code>storport</code> port driver to accomplish a lot of tasks. From my own experience, it is common for the miniport driver to pass callbacks for adding and starting devices to its port driver. To that end, let's restart the target and try to break on <code>stornvme</code>'s entry point:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; bu stornvme!DriverEntry </span><span>kd&gt; g </span><span>IOINIT: Built-in driver \Driver\sacdrv failed to initialize with status - 0xC0000037 </span><span>We are running at normal mode. 
</span><span>Breakpoint 0 hit </span><span>stornvme!DriverEntry: </span><span>fffff801`24928fdc 48895c2408 mov qword ptr [rsp+8],rbx </span></code></pre> <p>Great, we're at the start of the NVMe driver, let's take a peek at what it does with the <code>uf</code> (Unassemble Function) command:</p> <details> <summary><code>kd> uf stornvme!DriverEntry</code></summary> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">fffff801`24928fdc 48895c2408 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rsp</span><span>+8],</span><span style="color:#ffb964;">rbx </span><span style="color:#fad07a;">fffff801`24928fe1 48897c2410 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rsp</span><span>+10h],</span><span style="color:#ffb964;">rdi </span><span style="color:#fad07a;">fffff801`24928fe6 </span><span>55 </span><span style="color:#8fbfdc;">push </span><span style="color:#ffb964;">rbp </span><span style="color:#fad07a;">fffff801`24928fe7 488d6c24a9 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rbp</span><span>,[</span><span style="color:#ffb964;">rsp</span><span>-57h] </span><span style="color:#fad07a;">fffff801`24928fec 4881ecf0000000 </span><span style="color:#8fbfdc;">sub </span><span style="color:#ffb964;">rsp</span><span>,0F0h </span><span style="color:#fad07a;">fffff801`24928ff3 488bda </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rbx</span><span>,</span><span style="color:#ffb964;">rdx </span><span style="color:#fad07a;">fffff801`24928ff6 488bf9 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rdi</span><span>,</span><span style="color:#ffb964;">rcx </span><span style="color:#fad07a;">fffff801`24928ff9 33d2 </span><span style="color:#8fbfdc;">xor </span><span 
style="color:#ffb964;">edx</span><span>,</span><span style="color:#ffb964;">edx </span><span style="color:#fad07a;">fffff801`24928ffb 488d4d87 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rcx</span><span>,[</span><span style="color:#ffb964;">rbp</span><span>-79h] </span><span style="color:#fad07a;">fffff801`24928fff 41b8d0000000 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r8d</span><span>,0D0h </span><span style="color:#fad07a;">fffff801`</span><span>24929005 </span><span style="color:#fad07a;">e8b6c4ffff </span><span style="color:#8fbfdc;">call </span><span style="color:#fad07a;">stornvme!memset (fffff801`249254c0) </span><span style="color:#fad07a;">fffff801`2492900a 814d3fb8110000 </span><span style="color:#8fbfdc;">or </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>+3Fh],11B8h </span><span style="color:#fad07a;">fffff801`</span><span>24929011 </span><span style="color:#fad07a;">488d05b8070000 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwInitialize (fffff801`249297d0)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929018 </span><span style="color:#fad07a;">4889458f </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-71h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`2492901c 4c8d4587 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">r8</span><span>,[</span><span style="color:#ffb964;">rbp</span><span>-79h] </span><span style="color:#fad07a;">fffff801`</span><span>24929020 </span><span style="color:#fad07a;">488d05e990ffff </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwStartIo (fffff801`</span><span>24922110</span><span 
style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929027 </span><span style="color:#fad07a;">c74587d0000000 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-79h],0D0h </span><span style="color:#fad07a;">fffff801`2492902e </span><span>48894597 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-69h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`</span><span>24929032 </span><span style="color:#fad07a;">4533c9 </span><span style="color:#8fbfdc;">xor </span><span style="color:#ffb964;">r9d</span><span>,</span><span style="color:#ffb964;">r9d </span><span style="color:#fad07a;">fffff801`</span><span>24929035 </span><span style="color:#fad07a;">488d05f4070000 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwInterrupt (fffff801`</span><span>24929830</span><span style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`2492903c c745df02010101 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-21h],1010102h </span><span style="color:#fad07a;">fffff801`</span><span>24929043 </span><span style="color:#fad07a;">4889459f </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-61h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`</span><span>24929047 </span><span style="color:#fad07a;">488bd3 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rdx</span><span>,</span><span style="color:#ffb964;">rbx </span><span style="color:#fad07a;">fffff801`2492904a 488d059f010000 </span><span style="color:#8fbfdc;">lea </span><span 
style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwFindAdapter (fffff801`249291f0)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929051 </span><span style="color:#fad07a;">c7458b05000000 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-75h],5 </span><span style="color:#fad07a;">fffff801`</span><span>24929058 </span><span style="color:#fad07a;">488945a7 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-59h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`2492905c 488bcf </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rcx</span><span>,</span><span style="color:#ffb964;">rdi </span><span style="color:#fad07a;">fffff801`2492905f 488d05fa080000 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwResetBus (fffff801`</span><span>24929960</span><span style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929066 </span><span style="color:#fad07a;">c645e301 </span><span style="color:#8fbfdc;">mov </span><span>byte ptr [</span><span style="color:#ffb964;">rbp</span><span>-1Dh],1 </span><span style="color:#fad07a;">fffff801`2492906a 488945af </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-51h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`2492906e 488d051bbaffff </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwAdapterControl (fffff801`24924a90)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929075 </span><span style="color:#fad07a;">488945ff 
</span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>-1],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`</span><span>24929079 </span><span style="color:#fad07a;">488d05e0a8ffff </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwBuildIo (fffff801`</span><span>24923960</span><span style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929080 48894507 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>+7],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`</span><span>24929084 </span><span style="color:#fad07a;">488d0535c0ffff </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwTracingEnabled (fffff801`249250c0)</span><span>] </span><span style="color:#fad07a;">fffff801`2492908b </span><span>48894537 </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>+37h],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`2492908f 488d050a090000 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">rax</span><span>,[</span><span style="color:#fad07a;">stornvme!NVMeHwUnitControl (fffff801`249299a0)</span><span>] </span><span style="color:#fad07a;">fffff801`</span><span>24929096 </span><span style="color:#fad07a;">4889454f </span><span style="color:#8fbfdc;">mov </span><span>qword ptr [</span><span style="color:#ffb964;">rbp</span><span>+4Fh],</span><span style="color:#ffb964;">rax </span><span style="color:#fad07a;">fffff801`2492909a b802000000 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">eax</span><span>,2 
</span><span style="color:#fad07a;">fffff801`2492909f 8945d3 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-2Dh],</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">fffff801`249290a2 </span><span>894543 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>+43h],</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">fffff801`249290a5 c745c7980f0000 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-39h],0F98h </span><span style="color:#fad07a;">fffff801`249290ac c745cfa0200000 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rbp</span><span>-31h],20A0h </span><span style="color:#fad07a;">fffff801`249290b3 4c8b155ecf0100 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r10</span><span>,qword ptr [</span><span style="color:#fad07a;">stornvme!_imp_StorPortInitialize (fffff801`</span><span>24946018</span><span style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`249290ba e821a792ff </span><span style="color:#8fbfdc;">call </span><span style="color:#fad07a;">storport!StorPortInitialize (fffff801`242537e0) </span><span style="color:#fad07a;">fffff801`249290bf 4c8d9c24f0000000 </span><span style="color:#8fbfdc;">lea </span><span style="color:#ffb964;">r11</span><span>,[</span><span style="color:#ffb964;">rsp</span><span>+0F0h] </span><span style="color:#fad07a;">fffff801`249290c7 498b5b10 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rbx</span><span>,qword ptr [</span><span style="color:#ffb964;">r11</span><span>+10h] </span><span style="color:#fad07a;">fffff801`249290cb 498b7b18 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rdi</span><span>,qword ptr [</span><span 
style="color:#ffb964;">r11</span><span>+18h] </span><span style="color:#fad07a;">fffff801`249290cf 498be3 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">rsp</span><span>,</span><span style="color:#ffb964;">r11 </span><span style="color:#fad07a;">fffff801`249290d2 5d </span><span style="color:#8fbfdc;">pop </span><span style="color:#ffb964;">rbp </span><span style="color:#fad07a;">fffff801`249290d3 c3 </span><span style="color:#8fbfdc;">ret </span></code></pre> </details> <p>From a quick glance, we can see it's not too different from the miniport model we had in mind. It seems to save a couple of function pointers before passing them off to the port driver (<code>storport!StorPortInitialize</code>). This is further confirmed by Microsoft's own <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/hardware-initialization-with-storport">docs</a>.</p> <p>In fact, the docs stopped me from making the (totally reasonable in my opinion) assumption that <code>NVMeHwInitialize</code> would be the first callback invoked:</p> <blockquote> <p>Later, when the PnP manager calls the port driver's <code>StartIo</code> routine, the port driver calls the miniport driver's <code>HwStorFindAdapter</code> routine with a <code>PORT_CONFIGURATION_INFORMATION</code> (<code>STORPORT</code>) structure, followed by a call to the miniport driver's <code>HwStorInitialize</code> routine to initialize the adapter.</p> </blockquote> <p>So yea, looks like it is <code>NVMeHwFindAdapter</code> (<code>HwStorFindAdapter</code>) first and then <code>NVMeHwInitialize</code> (<code>HwStorInitialize</code>). 
That gives us our next target to break on: <code>bu stornvme!NVMeHwFindAdapter</code>.</p> <h4 id="nvmehwfindadapter">NVMeHwFindAdapter</h4> <p>We soon enough hit our breakpoint:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>Breakpoint 1 hit </span><span>stornvme!NVMeHwFindAdapter: </span><span>fffff801`249291f0 48895c2410 mov qword ptr [rsp+10h],rbx </span></code></pre> <p>We know the prototype of this function should look like:</p> <pre data-lang="C" style="background-color:#151515;color:#e8e8d3;" class="language-C "><code class="language-C" data-lang="C"><span>ULONG </span><span style="color:#fad07a;">HwFindAdapter</span><span>( </span><span> PVOID </span><span style="color:#ffb964;">DeviceExtension</span><span>, </span><span> [in] PVOID </span><span style="color:#ffb964;">HwContext</span><span>, </span><span> [in] PVOID </span><span style="color:#ffb964;">BusInformation</span><span>, </span><span> [in] PCHAR </span><span style="color:#ffb964;">ArgumentString</span><span>, </span><span> [in/out] PPORT_CONFIGURATION_INFORMATION </span><span style="color:#ffb964;">ConfigInfo</span><span>, </span><span> [in] PBOOLEAN Reserved3 </span><span>) </span><span>{...} </span></code></pre> <p>If you recall, our tracing escapades made mention of <code>PORT_CONFIGURATION_INFORMATION</code> so I'm interested in what it says. The Windows x64 calling convention passes the first four arguments in <code>RCX, RDX, R8, R9</code> with additional arguments on the stack. Sadly for us, <code>ConfigInfo</code> is the fifth argument, so we need to go trawling through the stack. We skip past the function prologue with the <code>p</code> (Step) command a couple times and use <code>kv</code> (Display Stack Backtrace) to find it. 
<code>dt</code> (Display Type) then lets us interpret it as the specified type.</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; p </span><span>stornvme!NVMeHwFindAdapter+0x5: </span><span>fffff801`249291f5 4889742418 mov qword ptr [rsp+18h],rsi </span><span>[...snip...] </span><span>kd&gt; p </span><span>stornvme!NVMeHwFindAdapter+0x17: </span><span>fffff801`24929207 4881ec10010000 sub rsp,110h </span><span> </span><span>kd&gt; kv </span><span> # Child-SP RetAddr : Args to Child : Call Site </span><span>00 ffff808d`35e06160 fffff801`242546df : 00000000`00000000 ffff8705`c4c7b2d8 00000000`00000003 ffff8705`c4c7c4b0 : stornvme!NVMeHwFindAdapter+0x17 </span><span> ^^ Reserved3 ^^ ^^ ConfigInfo ^^ </span><span> </span><span>kd&gt; dt storport!_PORT_CONFIGURATION_INFORMATION ffff8705`c4c7b2d8 </span><span> +0x000 Length : 0xe0 </span><span> +0x004 SystemIoBusNumber : 0 </span><span> +0x008 AdapterInterfaceType : 5 ( PCIBus ) </span><span> +0x00c BusInterruptLevel : 0 </span><span> +0x010 BusInterruptVector : 0xfffffffa </span><span> +0x014 InterruptMode : 1 ( Latched ) </span><span> +0x018 MaximumTransferLength : 0xffffffff </span><span> +0x01c NumberOfPhysicalBreaks : 0x11 </span><span> +0x020 DmaChannel : 0xffffffff </span><span> +0x024 DmaPort : 0xffffffff </span><span> +0x028 DmaWidth : 0 ( Width8Bits ) </span><span> +0x02c DmaSpeed : 0 ( Compatible ) </span><span> +0x030 AlignmentMask : 0 </span><span> +0x034 NumberOfAccessRanges : 2 </span><span> +0x038 AccessRanges : 0xffff8705`c4ca1960 [0] _ACCESS_RANGE </span><span> +0x040 MiniportDumpData : (null) </span><span> +0x048 NumberOfBuses : 0 &#39;&#39; </span><span> +0x049 InitiatorBusId : [8] &quot;???&quot; </span><span> +0x051 ScatterGather : 0x1 &#39;&#39; </span><span> +0x052 Master : 0x1 &#39;&#39; </span><span> +0x053 CachesData : 0 &#39;&#39; </span><span> +0x054 AdapterScansDown : 0 &#39;&#39; 
</span><span> +0x055 AtdiskPrimaryClaimed : 0 &#39;&#39; </span><span> +0x056 AtdiskSecondaryClaimed : 0 &#39;&#39; </span><span> +0x057 Dma32BitAddresses : 0x1 &#39;&#39; </span><span> +0x058 DemandMode : 0 &#39;&#39; </span><span> +0x059 MapBuffers : 0x2 &#39;&#39; </span><span> +0x05a NeedPhysicalAddresses : 0x1 &#39;&#39; </span><span> +0x05b TaggedQueuing : 0x1 &#39;&#39; </span><span> +0x05c AutoRequestSense : 0x1 &#39;&#39; </span><span> +0x05d MultipleRequestPerLu : 0x1 &#39;&#39; </span><span> +0x05e ReceiveEvent : 0 &#39;&#39; </span><span> +0x05f RealModeInitialized : 0 &#39;&#39; </span><span> +0x060 BufferAccessScsiPortControlled : 0x1 &#39;&#39; </span><span> +0x061 MaximumNumberOfTargets : 0x80 &#39;&#39; </span><span> +0x062 SrbType : 0x1 &#39;&#39; </span><span> +0x063 AddressType : 0 &#39;&#39; </span><span> +0x064 SlotNumber : 5 </span><span> +0x068 BusInterruptLevel2 : 0 </span><span> +0x06c BusInterruptVector2 : 0 </span><span> +0x070 InterruptMode2 : 0 ( LevelSensitive ) </span><span> +0x074 DmaChannel2 : 0 </span><span> +0x078 DmaPort2 : 0 </span><span> +0x07c DmaWidth2 : 0 ( Width8Bits ) </span><span> +0x080 DmaSpeed2 : 0 ( Compatible ) </span><span> +0x084 DeviceExtensionSize : 0 </span><span> +0x088 SpecificLuExtensionSize : 0 </span><span> +0x08c SrbExtensionSize : 0x20a0 </span><span> +0x090 Dma64BitAddresses : 0x80 &#39;&#39; </span><span> +0x091 ResetTargetSupported : 0 &#39;&#39; </span><span> +0x092 MaximumNumberOfLogicalUnits : 0x8 &#39;&#39; </span><span> +0x093 WmiDataProvider : 0x1 &#39;&#39; </span><span> +0x094 SynchronizationModel : 0 ( StorSynchronizeHalfDuplex ) </span><span> +0x098 HwMSInterruptRoutine : (null) </span><span> +0x0a0 InterruptSynchronizationMode : 0 ( InterruptSupportNone ) </span><span> +0x0a8 DumpRegion : _MEMORY_REGION </span><span> +0x0c0 RequestedDumpBufferSize : 0 </span><span> +0x0c4 VirtualDevice : 0 &#39;&#39; </span><span> +0x0c5 DumpMode : 0 &#39;&#39; </span><span> +0x0c6 DmaAddressWidth : 0 
&#39;&#39; </span><span> +0x0c8 ExtendedFlags1 : 0 </span><span> +0x0cc MaxNumberOfIO : 0x3e8 </span><span> +0x0d0 MaxIOsPerLun : 0xff </span><span> +0x0d4 InitialLunQueueDepth : 0x14 </span><span> +0x0d8 BusResetHoldTime : 0x3d0900 </span><span> +0x0dc FeatureSupport : 0 </span></code></pre> <p>Looks like we got the right value! <code>SystemIoBusNumber</code> (0) and <code>SlotNumber</code> (5) certainly match up with our setup.</p> <p><code>AccessRanges</code> sounds interesting:</p> <blockquote> <p>Contains a physical address that specifies the bus-relative base address of a range used by the HBA.</p> </blockquote> <p>That sounds basically like our PCI BARs and there are two of them, as we would expect. We can use <code>dt</code> to print out subfields and arrays:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dt storport!_PORT_CONFIGURATION_INFORMATION -a2 AccessRanges AccessRanges. ffff8705`c4c7b2d8 </span><span> +0x038 AccessRanges : 0xffff8705`c4ca1960 </span><span> [00] _ACCESS_RANGE </span><span> +0x000 RangeStart : _LARGE_INTEGER 0xfedfe000 </span><span> +0x008 RangeLength : 0x2000 </span><span> +0x00c RangeInMemory : 0x1 &#39;&#39; </span><span> [01] </span><span> +0x000 RangeStart : _LARGE_INTEGER 0x80000000 </span><span> +0x008 RangeLength : 0x8000 </span><span> +0x00c RangeInMemory : 0x1 &#39;&#39; </span></code></pre> <p>A memory range starting at <code>0xfedfe000</code> that's <code>0x2000</code> bytes? Certainly sounds like our <code>BAR0/1</code>. Note it doesn't have the <code>4</code> in the low bits (the flag marking a 64-bit memory BAR) because this is the actual combined <code>BAR0</code> &amp; <code>BAR1</code> address. The <code>BAR4</code> equivalent looks correct too. Ok, so all good so far. Let us keep stepping through.</p> <p>We get to a call made to <code>StorPortGetBusData</code> which is a way for the driver to get bus-specific information it needs while initializing. 
In our case, the bus is PCI. So in pseudo code we have:</p> <pre data-lang="C" style="background-color:#151515;color:#e8e8d3;" class="language-C "><code class="language-C" data-lang="C"><span>_PCI_COMMON_HEADER pci_cfg[</span><span style="color:#cf6a4c;">0x40</span><span>] = {</span><span style="color:#cf6a4c;">0</span><span>} </span><span>pci_cfg_len = </span><span style="color:#ffb964;">StorPortGetBusData</span><span>( </span><span> adapt_ext, </span><span style="color:#888888;">// driver-specific per drive data </span><span> PCIConfiguration, </span><span style="color:#888888;">// We&#39;re asking specifically for bus type PCI here </span><span> port_cfg-&gt;SystemIoBusNumber, </span><span style="color:#888888;">// PCI Bus </span><span> port_cfg-&gt;SlotNumber, </span><span style="color:#888888;">// PCI Device </span><span> pci_cfg, </span><span style="color:#888888;">// Output buffer for PCI configuration info </span><span> </span><span style="color:#cf6a4c;">0x40 </span><span style="color:#888888;">// Output buffer len </span><span>) </span><span style="color:#8fbfdc;">if </span><span>(pci_cfg_len != </span><span style="color:#cf6a4c;">0x40</span><span>) </span><span> </span><span style="color:#888888;">// goto error handling </span></code></pre> <p>Let's step past that and look at the return value:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>stornvme!NVMeHwFindAdapter+0x16b: </span><span>fffff801`2492935b e870da92ff call storport!StorPortGetBusData (fffff801`24256dd0) </span><span>kd&gt; p </span><span>stornvme!NVMeHwFindAdapter+0x170: </span><span>fffff801`24929360 488bcb mov rcx,rbx </span><span>kd&gt; r eax </span><span>eax=40 </span></code></pre> <p>Looks like it completed successfully and the returned length was as expected <code>0x40</code>. 
<code>pci_cfg</code> from the pseudo-code corresponds to <code>rbp-0x40</code> so let's print the PCI config structure we got back:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dt _PCI_COMMON_HEADER @rbp-40h </span><span>storport!_PCI_COMMON_HEADER </span><span> +0x000 VendorID : 0x1de </span><span> +0x002 DeviceID : 0x1000 </span><span> +0x004 Command : 0x406 </span><span> +0x006 Status : 0x10 </span><span> +0x008 RevisionID : 0 &#39;&#39; </span><span> +0x009 ProgIf : 0x2 &#39;&#39; </span><span> +0x00a SubClass : 0x8 &#39;&#39; </span><span> +0x00b BaseClass : 0x1 &#39;&#39; </span><span> +0x00c CacheLineSize : 0 &#39;&#39; </span><span> +0x00d LatencyTimer : 0 &#39;&#39; </span><span> +0x00e HeaderType : 0 &#39;&#39; </span><span> +0x00f BIST : 0 &#39;&#39; </span><span> +0x010 u : &lt;anonymous-tag&gt; </span></code></pre> <p>Looks legit, the Vendor and Device IDs match. Let's peer into the union at the end there. Our NVMe controller is not a PCI bridge so we dig into <code>type0</code> here.</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dt _PCI_COMMON_HEADER -a u.type0. 
@rbp-40h </span><span>storport!_PCI_COMMON_HEADER </span><span> +0x010 u : </span><span> +0x000 type0 : </span><span> +0x000 BaseAddresses : </span><span> [00] 0xfedfe004 </span><span> [01] 0 </span><span> [02] 0 </span><span> [03] 0 </span><span> [04] 0x80000000 </span><span> [05] 0 </span><span> +0x018 CIS : 0 </span><span> +0x01c SubVendorID : 0x1de </span><span> +0x01e SubSystemID : 0x1000 </span><span> +0x020 ROMBaseAddress : 0 </span><span> +0x024 CapabilitiesPtr : 0x40 &#39;@&#39; </span><span> +0x025 Reserved1 : &quot;&quot; </span><span> [00] 0 &#39;&#39; </span><span> [01] 0 &#39;&#39; </span><span> [02] 0 &#39;&#39; </span><span> +0x028 Reserved2 : 0 </span><span> +0x02c InterruptLine : 0 &#39;&#39; </span><span> +0x02d InterruptPin : 0 &#39;&#39; </span><span> +0x02e MinimumGrant : 0 &#39;&#39; </span><span> +0x02f MaximumLatency : 0 &#39;&#39; </span></code></pre> <p>We found our BARs again! They match exactly with what the <code>!pci</code> command gave us. So, still not seeing what's wrong with them.</p> <h4 id="getnvmeregisteraddress">GetNVMeRegisterAddress</h4> <p>Looking forward a bit we see a promising lead, a call to <code>GetNVMeRegisterAddress</code>. That certainly sounds like something that would be related to the BARs as that is how one would interact with them. So the driver needs to be able to get a virtual mapping for those bus addresses. 
Let's take a peek at what this function does:</p> <details> <summary>kd> uf stornvme!GetNVMeRegisterAddress</summary> <pre data-lang="asm" style="background-color:#151515;color:#e8e8d3;" class="language-asm "><code class="language-asm" data-lang="asm"><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress: </span><span style="color:#fad07a;">fffff801`2493bccc </span><span>4053 </span><span style="color:#8fbfdc;">push </span><span style="color:#ffb964;">rbx </span><span style="color:#fad07a;">fffff801`2493bcce 4883ec30 </span><span style="color:#8fbfdc;">sub </span><span style="color:#ffb964;">rsp</span><span>,30h </span><span style="color:#fad07a;">fffff801`2493bcd2 8b5a34 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">ebx</span><span>,dword ptr [</span><span style="color:#ffb964;">rdx</span><span>+34h] </span><span style="color:#fad07a;">fffff801`2493bcd5 33c0 </span><span style="color:#8fbfdc;">xor </span><span style="color:#ffb964;">eax</span><span>,</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">fffff801`2493bcd7 4c8bd2 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r10</span><span>,</span><span style="color:#ffb964;">rdx </span><span style="color:#fad07a;">fffff801`2493bcda 85db </span><span style="color:#8fbfdc;">test </span><span style="color:#ffb964;">ebx</span><span>,</span><span style="color:#ffb964;">ebx </span><span style="color:#fad07a;">fffff801`2493bcdc 744a </span><span style="color:#8fbfdc;">je </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x5c </span><span style="color:#fad07a;">(fffff801`2493bd28) Branch </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x12</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bcde 4c8b5a38 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r11</span><span>,qword ptr [</span><span 
style="color:#ffb964;">rdx</span><span>+38h] </span><span style="color:#fad07a;">fffff801`2493bce2 448bc8 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r9d</span><span>,</span><span style="color:#ffb964;">eax </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x19</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bce5 418bd1 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">edx</span><span>,</span><span style="color:#ffb964;">r9d </span><span style="color:#fad07a;">fffff801`2493bce8 4803d2 </span><span style="color:#8fbfdc;">add </span><span style="color:#ffb964;">rdx</span><span>,</span><span style="color:#ffb964;">rdx </span><span style="color:#fad07a;">fffff801`2493bceb 4d3904d3 </span><span style="color:#8fbfdc;">cmp </span><span>qword ptr [</span><span style="color:#ffb964;">r11</span><span>+</span><span style="color:#ffb964;">rdx</span><span>*8],</span><span style="color:#ffb964;">r8 </span><span style="color:#fad07a;">fffff801`2493bcef 740a </span><span style="color:#8fbfdc;">je </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x2f </span><span style="color:#fad07a;">(fffff801`2493bcfb) Branch </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x25</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bcf1 41ffc1 </span><span style="color:#8fbfdc;">inc </span><span style="color:#ffb964;">r9d </span><span style="color:#fad07a;">fffff801`2493bcf4 443bcb </span><span style="color:#8fbfdc;">cmp </span><span style="color:#ffb964;">r9d</span><span>,</span><span style="color:#ffb964;">ebx </span><span style="color:#fad07a;">fffff801`2493bcf7 72ec </span><span style="color:#8fbfdc;">jb </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x19 </span><span style="color:#fad07a;">(fffff801`2493bce5) 
Branch </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x2d</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bcf9 eb2d </span><span style="color:#8fbfdc;">jmp </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x5c </span><span style="color:#fad07a;">(fffff801`2493bd28) Branch </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x2f</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bcfb 413844d30c </span><span style="color:#8fbfdc;">cmp </span><span>byte ptr [</span><span style="color:#ffb964;">r11</span><span>+</span><span style="color:#ffb964;">rdx</span><span>*8+0Ch],</span><span style="color:#ffb964;">al </span><span style="color:#fad07a;">fffff801`2493bd00 4d8b0cd3 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r9</span><span>,qword ptr [</span><span style="color:#ffb964;">r11</span><span>+</span><span style="color:#ffb964;">rdx</span><span>*8] </span><span style="color:#fad07a;">fffff801`2493bd04 458b4204 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r8d</span><span>,dword ptr [</span><span style="color:#ffb964;">r10</span><span>+4] </span><span style="color:#fad07a;">fffff801`2493bd08 0f94c0 </span><span style="color:#8fbfdc;">sete </span><span style="color:#ffb964;">al </span><span style="color:#fad07a;">fffff801`2493bd0b </span><span>88442428 </span><span style="color:#8fbfdc;">mov </span><span>byte ptr [</span><span style="color:#ffb964;">rsp</span><span>+28h],</span><span style="color:#ffb964;">al </span><span style="color:#fad07a;">fffff801`2493bd0f 418b44d308 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">eax</span><span>,dword ptr [</span><span style="color:#ffb964;">r11</span><span>+</span><span style="color:#ffb964;">rdx</span><span>*8+8] </span><span 
style="color:#fad07a;">fffff801`2493bd14 418b5208 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">edx</span><span>,dword ptr [</span><span style="color:#ffb964;">r10</span><span>+8] </span><span style="color:#fad07a;">fffff801`2493bd18 </span><span>89442420 </span><span style="color:#8fbfdc;">mov </span><span>dword ptr [</span><span style="color:#ffb964;">rsp</span><span>+20h],</span><span style="color:#ffb964;">eax </span><span style="color:#fad07a;">fffff801`2493bd1c 4c8b1565a30000 </span><span style="color:#8fbfdc;">mov </span><span style="color:#ffb964;">r10</span><span>,qword ptr [</span><span style="color:#fad07a;">stornvme!_imp_StorPortGetDeviceBase (fffff801`</span><span>24946088</span><span style="color:#fad07a;">)</span><span>] </span><span style="color:#fad07a;">fffff801`2493bd23 e8e8b091ff </span><span style="color:#8fbfdc;">call </span><span style="color:#fad07a;">storport!StorPortGetDeviceBase (fffff801`24256e10) </span><span> </span><span style="color:#fad07a;">stornvme!GetNVMeRegisterAddress</span><span>+0x5c</span><span style="color:#fad07a;">: </span><span style="color:#fad07a;">fffff801`2493bd28 4883c430 </span><span style="color:#8fbfdc;">add </span><span style="color:#ffb964;">rsp</span><span>,30h </span><span style="color:#fad07a;">fffff801`2493bd2c 5b </span><span style="color:#8fbfdc;">pop </span><span style="color:#ffb964;">rbx </span><span style="color:#fad07a;">fffff801`2493bd2d c3 </span><span style="color:#8fbfdc;">ret </span></code></pre> </details> <p>A fairly short function, there's enough context to get it pretty readable in pseudo-code:</p> <pre data-lang="C" style="background-color:#151515;color:#e8e8d3;" class="language-C "><code class="language-C" data-lang="C"><span>PVOID </span><span style="color:#fad07a;">GetNVMeRegisterAddress</span><span>(</span><span style="color:#ffb964;">adapt_ext</span><span>, </span><span style="color:#ffb964;">port_cfg</span><span>, </span><span 
style="color:#ffb964;">addr</span><span>) { </span><span> num_ranges = port_cfg-&gt;NumberOfAccessRanges; </span><span> </span><span style="color:#8fbfdc;">if </span><span>(num_ranges == </span><span style="color:#cf6a4c;">0</span><span>) </span><span> </span><span style="color:#8fbfdc;">return </span><span>NULL </span><span> </span><span> ranges = port_cfg-&gt;AccessRanges </span><span> </span><span style="color:#8fbfdc;">for </span><span>(i = </span><span style="color:#cf6a4c;">0</span><span>; i &lt; num_ranges; i++) { </span><span> </span><span style="color:#8fbfdc;">if </span><span>(ranges[i].</span><span style="color:#ffb964;">RangeStart </span><span>== addr) { </span><span> range_in_io_space = !ranges[i].</span><span style="color:#ffb964;">RangeInMemory </span><span> range_start = ranges[i].</span><span style="color:#ffb964;">RangeStart </span><span> bus_num = port_cfg-&gt;SystemIoBusNumber </span><span> range_len = ranges[i].</span><span style="color:#ffb964;">RangeLength </span><span> bus_type = port_cfg-&gt;AdapterInterfaceType </span><span> </span><span style="color:#8fbfdc;">return </span><span style="color:#ffb964;">StorPortGetDeviceBase</span><span>(adapt_ext, bus_type, bus_num, range_start, range_len, range_in_io_space) </span><span> } </span><span> } </span><span> </span><span style="color:#8fbfdc;">return </span><span>NULL </span><span>} </span></code></pre> <p>It does exactly as expected: it maps a bus address into kernel virtual memory (system space), but only if the address matches the start of one of the memory ranges populated in the port configuration struct.</p> <p>So let's see what happens after running this. 
Stepping forward to that call (the <code>pct</code> command is great for stepping until the next <code>call</code> or <code>ret</code> instruction):</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; pct </span><span>stornvme!NVMeHwFindAdapter+0x1d1: </span><span>fffff801`249293c1 e806290100 call stornvme!GetNVMeRegisterAddress (fffff801`2493bccc) </span><span>kd&gt; p; r rax </span><span>rax=0000000000000000 </span></code></pre> <p>Oh no, it returned <code>NULL</code> 🥲. On the bright side I guess we found the problem.</p> <p>The driver falls off the happy path here and begins notifying the port driver about the error. We also end up finding where the trace message we saw earlier originated from:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; pct </span><span>stornvme!NVMeHwFindAdapter+0x243: </span><span>fffff801`24929433 e8b83d90ff call storport!StorPortNotification (fffff801`2422d1f0) </span><span>kd&gt; du @rax </span><span>fffff801`2493ee00 &quot;MLBAR/MUBAR is not valid&quot; </span></code></pre> <p>This causes <code>NVMeHwFindAdapter</code> to return <code>SP_RETURN_BAD_CONFIG</code> (3) which will eventually get bubbled up as <code>STATUS_DEVICE_CONFIGURATION_ERROR</code> (<code>0xC0000182</code>).</p> <p>So what went wrong? <code>GetNVMeRegisterAddress</code> just looks to see if the given address matches any of the regions in the port config. So that means none of them matched? Let's look at the calling code to get a better idea:</p> <pre data-lang="C" style="background-color:#151515;color:#e8e8d3;" class="language-C "><code class="language-C" data-lang="C"><span style="color:#ffb964;">NVMeHwFindAdapter</span><span>() { </span><span> </span><span style="color:#888888;">// ...snip... 
</span><span> </span><span> </span><span style="color:#888888;">// Earlier pci_cfg lookup we saw </span><span> </span><span style="color:#8fbfdc;">if </span><span>(</span><span style="color:#ffb964;">StorPortGetBusData</span><span>(...) != </span><span style="color:#cf6a4c;">0x40</span><span>) </span><span> </span><span style="color:#8fbfdc;">goto</span><span> err_notification_path </span><span> </span><span> *(adapt_ext + </span><span style="color:#cf6a4c;">4</span><span>) = pci_cfg-&gt;VendorID </span><span> *(adapt_ext + </span><span style="color:#cf6a4c;">6</span><span>) = pci_cfg-&gt;DeviceID </span><span> *(adapt_ext + </span><span style="color:#cf6a4c;">8</span><span>) = pci_cfg-&gt;RevisionID </span><span> </span><span> </span><span style="color:#8fbfdc;">if </span><span>(</span><span style="color:#ffb964;">IsIntelChatham</span><span>(adapt_ext)) { </span><span> mlbar = pci_cfg-&gt;u.</span><span style="color:#ffb964;">type0</span><span>.</span><span style="color:#ffb964;">BaseAddresses</span><span>[</span><span style="color:#cf6a4c;">2</span><span>] </span><span> mubar = pci_cfg-&gt;u.</span><span style="color:#ffb964;">type0</span><span>.</span><span style="color:#ffb964;">BaseAddresses</span><span>[</span><span style="color:#cf6a4c;">3</span><span>] </span><span> mask = </span><span style="color:#cf6a4c;">0x0FFFFF000 </span><span> } </span><span style="color:#8fbfdc;">else </span><span>{ </span><span> mlbar = pci_cfg-&gt;u.</span><span style="color:#ffb964;">type0</span><span>.</span><span style="color:#ffb964;">BaseAddresses</span><span>[</span><span style="color:#cf6a4c;">0</span><span>] </span><span> mubar = pci_cfg-&gt;u.</span><span style="color:#ffb964;">type0</span><span>.</span><span style="color:#ffb964;">BaseAddresses</span><span>[</span><span style="color:#cf6a4c;">1</span><span>] </span><span> mask = </span><span style="color:#cf6a4c;">0xFFFFC000 </span><span> } </span><span> </span><span> addr = (mubar &lt;&lt; </span><span 
style="color:#cf6a4c;">32</span><span>) | (mlbar &amp; mask) </span><span> *(adapt_ext + </span><span style="color:#cf6a4c;">0x90</span><span>) = addr </span><span> p_addr = </span><span style="color:#ffb964;">GetNVMeRegisterAddress</span><span>(adapt_ext, port_cfg, addr) </span><span> </span><span style="color:#8fbfdc;">if </span><span>(p_addr == NULL) </span><span> </span><span style="color:#8fbfdc;">goto</span><span> slightly_diff_err_notification_path </span><span> </span><span> </span><span style="color:#888888;">// ...snip... </span><span>} </span></code></pre> <p>Right off the bat we have some sort of non-standard behaviour if the device matches whatever an Intel Chatham is: we use different BARs than the standard ones. Google says Chatham was a prototype NVMe device so I suppose it makes sense it might've predated the current spec-mandated BAR choices. I guess that also explains why they consider the <code>RevisionID</code> as well.</p> <p>In any case, we clearly aren't an Intel Chatham device so we fall into the other arm of the conditional. We grab <code>BAR0</code> and <code>BAR1</code>, combine them to form the 64-bit physical address but inexplicably seem to mask off the bottom 14 bits of <code>BAR0</code>?</p> <p>Recall, <code>BAR0</code> was <code>0xFEDFE004</code>, which means the final address that gets passed to <code>GetNVMeRegisterAddress</code> is <code>0xFEDFC000</code>. That doesn't match the <code>0xFEDFE000</code> range start we saw in <code>AccessRanges</code>, so no wonder it fails.</p> <h3 id="pci-bars">PCI BARs</h3> <p>Ok, so we've found what looks to be the problem. But let's back up a bit first. What is a BAR to begin with? This is based on my own rudimentary understanding but a <code>Base Address Register</code> entry is how the software on your machine can interact with PCI devices connected to it. The BIOS/firmware or OS will assign addresses for every valid BAR on each PCI device. So, when a driver or what have you attempts to read or write a certain address, the corresponding PCI device will recognize such an access and service it appropriately. 
Note, this is papering over Port I/O vs Memory I/O and mostly assuming the latter.</p> <h4 id="bar-size">BAR Size</h4> <p>Looking through the PCI configuration space, one will notice there's no mention of the size of the region described by a particular BAR. Instead, it's done with a pretty cool trick. To figure out the size the steps are:</p> <ol> <li>Read and save the current BAR value.</li> <li>Write a new BAR value with all bits set to 1.</li> <li>Read back the BAR; the least significant set bit (ignoring the lower 4 bits) gives you the region size.</li> <li>Restore the old value.</li> </ol> <p>A worked example using the BAR we saw for the propolis NVMe controller is:</p> <ol> <li>read(<code>BAR0</code>) = <code>0xFEDFE004</code></li> <li>write(<code>BAR0</code>, <code>0xFFFFFFFF</code>)</li> <li>read(<code>BAR0</code>) = <code>0xFFFFE004</code> =&gt; (ignoring lower 4 bits) lowest set bit = bit 13 =&gt; size = <code>0x2000</code></li> <li>write(<code>BAR0</code>, <code>0xFEDFE000</code>)</li> </ol> <p>This is exactly what Propolis <a href="https://github.com/oxidecomputer/propolis/blob/4c9fbd1b3cd75896308264a60e6df3a011797807/propolis/src/hw/pci/bar.rs#L127-L130">does</a>.</p> <h4 id="bars-on-boot">BARs on Boot</h4> <p>One question I had was: is it the firmware assigning these BARs or Windows itself? If the former, then why does QEMU not trip over this, given we used the same OVMF blob in both cases? 
To answer that, WinDbg can at least tell us what the BARs were at boot.</p> <p>But first, we can use the <code>!arbiter</code> extension to see if there's any clashes or such going on:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !arbiter 2 </span><span>DEVNODE ffff8705c4572a20 (HTREE\ROOT\0) </span><span> Memory Arbiter &quot;RootMemory&quot; at fffff80122845780 </span><span> Allocated ranges: </span><span> 0000000000000000 - 0000000000000fff 00000000 &lt;Not on bus&gt; </span><span> 00000000000a0000 - 00000000000bffff S ffff8705c449f8f0 (pci) </span><span> 0000000080000000 - 00000000feefffff </span><span> 0000000080000000 - 00000000feefffff SC ffff8705c449f8f0 (pci) </span><span> 00000000fec00000 - 00000000fec003ff CB ffff8705c45dbda0 </span><span> 00000000fee00000 - 00000000fee003ff CB ffff8705c45dbda0 </span><span> 0001000000000000 - ffffffffffffffff 00000000 &lt;Not on bus&gt; </span><span> Possible allocation: </span><span> &lt; none &gt; </span><span> </span><span> DEVNODE ffff8705c44c9ca0 (ACPI\PNP0A03\0) </span><span> Memory Arbiter &quot;PCI Memory (b=0)&quot; at ffff8705c4c52610 </span><span> Allocated ranges: </span><span> 0000000000000000 - 000000000009ffff 00000000 &lt;Not on bus&gt; </span><span> 00000000000c0000 - 000000007fffffff 00000000 &lt;Not on bus&gt; </span><span> 0000000080000000 - 0000000080007fff ffff8705c4c5c360 (stornvme) </span><span> 00000000fec00000 - 00000000fec00fff BA ffff8705c4c19a70 </span><span> 00000000fedfe000 - 00000000fedfffff ffff8705c4c5c360 (stornvme) </span><span> 00000000fee00000 - 00000000feefffff BA ffff8705c4c19a70 </span><span> 00000000fef00000 - ffffffffffffffff </span><span> 00000000fef00000 - ffffffffffffffff C 00000000 &lt;Not on bus&gt; </span><span> 0001000000000000 - ffffffffffffffff C 00000000 &lt;Not on bus&gt; </span><span> Possible allocation: </span><span> &lt; none &gt; 
</span></code></pre> <p>We pass (2) to specifically look at memory arbiters. Ok, while there do seem to be some conflicts going on (see the lines with <code>C</code> in the third column), none are with the BARs on our NVMe. Curiously though, we don't see <code>B</code> (Boot Allocated) for either of the ranges on the NVMe. Helpfully, it includes the address of the PDO (Physical Device Object) created by the PCI bus driver for the device. That with the <code>!devobj</code> extension gives us the Device Node. Once we have the Device Node address we can use <code>!devnode &lt;addr&gt; 0x2</code> to get the resources allocated to the device.</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !devobj ffff8705c4c5c360 </span><span>Device object (ffff8705c4c5c360) is for: </span><span> NTPNP_PCI0003 \Driver\pci DriverObject ffff8705c452ace0 </span><span>Current Irp 00000000 RefCount 0 Type 00000004 Flags 00001040 </span><span>SecurityDescriptor ffffe3819b9f8c60 DevExt ffff8705c4c5c4b0 DevObjExt ffff8705c4c5cbd8 DevNode ffff8705c4c5cca0 </span><span>ExtensionFlags (0x00000010) DOE_START_PENDING ^^^^^^^^^^^^^^^^ </span><span>Characteristics (0x00000100) FILE_DEVICE_SECURE_OPEN </span><span>AttachedDevice (Upper) ffff8705c4c7b050 \Driver\stornvme </span><span>Device queue is not busy. 
</span><span> </span><span>kd&gt; !devnode ffff8705c4c5cca0 0x2 </span><span>DevNode 0xffff8705c4c5cca0 for PDO 0xffff8705c4c5c360 </span><span> Parent 0xffff8705c44c9ca0 Sibling 0xffff8705c4c5eca0 Child 0000000000 </span><span> InstancePath is &quot;PCI\VEN_01DE&amp;DEV_1000&amp;SUBSYS_100001DE&amp;REV_00\3&amp;267a616a&amp;0&amp;28&quot; </span><span> ServiceName is &quot;stornvme&quot; </span><span> State = DeviceNodeStartPending (0x305) </span><span> Previous State = DeviceNodeResourcesAssigned (0x304) </span><span> StateHistory[03] = DeviceNodeResourcesAssigned (0x304) </span><span> StateHistory[02] = DeviceNodeDriversAdded (0x303) </span><span> StateHistory[01] = DeviceNodeInitialized (0x302) </span><span> StateHistory[00] = DeviceNodeUninitialized (0x301) </span><span> StateHistory[19] = Unknown State (0x0) </span><span> StateHistory[18] = Unknown State (0x0) </span><span> StateHistory[17] = Unknown State (0x0) </span><span> StateHistory[16] = Unknown State (0x0) </span><span> StateHistory[15] = Unknown State (0x0) </span><span> StateHistory[14] = Unknown State (0x0) </span><span> StateHistory[13] = Unknown State (0x0) </span><span> StateHistory[12] = Unknown State (0x0) </span><span> StateHistory[11] = Unknown State (0x0) </span><span> StateHistory[10] = Unknown State (0x0) </span><span> StateHistory[09] = Unknown State (0x0) </span><span> StateHistory[08] = Unknown State (0x0) </span><span> StateHistory[07] = Unknown State (0x0) </span><span> StateHistory[06] = Unknown State (0x0) </span><span> StateHistory[05] = Unknown State (0x0) </span><span> StateHistory[04] = Unknown State (0x0) </span><span> Flags (0x6c0000f0) DNF_ENUMERATED, DNF_IDS_QUERIED, </span><span> DNF_HAS_BOOT_CONFIG, DNF_BOOT_CONFIG_RESERVED, </span><span> DNF_NO_LOWER_DEVICE_FILTERS, DNF_NO_LOWER_CLASS_FILTERS, </span><span> DNF_NO_UPPER_DEVICE_FILTERS, DNF_NO_UPPER_CLASS_FILTERS </span><span> CapabilityFlags (0x00400000) </span><span> Unknown flags 0x00400000 </span><span> CmResourceList 
at 0xffffe3819bc0b640 Version 1.1 Interface 0x5 Bus #0 </span><span> Entry 0 - Memory (0x3) Device Exclusive (0x1) </span><span> Flags ( </span><span> Range starts at 0x00000000fedfe000 for 0x2000 bytes </span><span> Entry 1 - DevicePrivate (0x81) Device Exclusive (0x1) </span><span> Flags ( </span><span> Data - {0x00000001, 0000000000, 0000000000} </span><span> Entry 2 - Memory (0x3) Device Exclusive (0x1) </span><span> Flags ( </span><span> Range starts at 0x0000000080000000 for 0x8000 bytes </span><span> Entry 3 - DevicePrivate (0x81) Device Exclusive (0x1) </span><span> Flags ( </span><span> Data - {0x00000001, 0x00000004, 0000000000} </span><span> Entry 4 - Interrupt (0x2) Device Exclusive (0x1) </span><span> Flags (LATCHED MESSAGE </span><span> Message Count 1, Vector 0xfffffffe, Group 0, Affinity 0x1 </span><span> Entry 5 - Interrupt (0x2) Device Exclusive (0x1) </span><span> Flags (LATCHED MESSAGE </span><span> Message Count 1, Vector 0xfffffffd, Group 0, Affinity 0x2 </span><span> Entry 6 - Interrupt (0x2) Device Exclusive (0x1) </span><span> Flags (LATCHED MESSAGE </span><span> Message Count 1, Vector 0xfffffffc, Group 0, Affinity 0x4 </span><span> Entry 7 - Interrupt (0x2) Device Exclusive (0x1) </span><span> Flags (LATCHED MESSAGE </span><span> Message Count 1, Vector 0xfffffffb, Group 0, Affinity 0x8 </span><span> Entry 8 - Interrupt (0x2) Device Exclusive (0x1) </span><span> Flags (LATCHED MESSAGE </span><span> Message Count 1, Vector 0xfffffffa, Group 0, Affinity 0x1 </span><span> </span><span> BootResourcesList at 0xffffe3819b92aec0 Version 1.1 Interface 0x5 Bus #0 </span><span> Entry 0 - Memory (0x3) Device Exclusive (0x1) </span><span> Flags ( </span><span> Range starts at 0x0000000800000000 for 0x2000 bytes </span><span> Entry 1 - Memory (0x3) Device Exclusive (0x1) </span><span> Flags ( </span><span> Range starts at 0x0000000080000000 for 0x8000 bytes </span></code></pre> <p>Ok, Entry 0 and Entry 2 clearly correspond to our BARs but take a look 
at the entries under <code>BootResourcesList</code>. We have 2 regions; Entry 1 there is clearly our <code>BAR4</code>, which matches Entry 2 in <code>CmResourceList</code>. That means boot Entry 0 would be our <code>BAR0</code>/<code>BAR1</code>, which was apparently mapped at <code>0x0000000800000000</code> by the firmware, but it seems like Windows had a different idea about where it should go.</p> <p>AFAIK, Windows may remap things but it still has to do so within the confines of the ranges assigned to the root PCI bus. I guess that tracks as it doesn't even seem to be in the list of allocated regions <code>!arbiter</code> told us about.</p> <h3 id="nvme-bars">NVMe BARs</h3> <p>Taking a look at the issue from another perspective, is this behaviour spec mandated? Breaking out the NVMe spec (1.0e which is what Propolis mostly implements), we find:</p> <table><thead><tr><th style="text-align: center"><strong>Bits</strong></th><th style="text-align: center"><strong>Type</strong></th><th style="text-align: center"><strong>Reset</strong></th><th style="text-align: center"><strong>Description</strong></th></tr></thead><tbody> <tr><td style="text-align: center">31:14</td><td style="text-align: center">RW</td><td style="text-align: center">0h</td><td style="text-align: center"><strong>Base Address (BA)</strong>: Base address of register memory space. For controllers that support a larger number of doorbell registers or have vendor specific space following the doorbell registers, more bits are allowed to be RO such that more memory space is consumed.</td></tr> <tr><td style="text-align: center">13:04</td><td style="text-align: center">RO</td><td style="text-align: center">0h</td><td style="text-align: center">Reserved</td></tr> </tbody></table> <p>With bits 13:04 read-only (and the low 4 bits taken up by the BAR flags), the lowest writable address bit is bit 14, so the smallest BAR a compliant controller can expose is <code>0x4000</code> bytes, twice the <code>0x2000</code> Propolis reports. I guess Propolis is technically afoul of the spec then. Well, at least I know who to blame. 
😅</p> <figure class="center" > <img src="images/git-blame.jpg" /> <figcaption class="center">Spiderman Pointing at Himself Meme: Git Blame Edition</figcaption> </figure> <h2 id="solution">Solution</h2> <p>Is it truly that easy? Do we just need to double our reported BAR size? Everything leading up to this seems to imply so. Let's give it a try:</p> <pre data-lang="diff" style="background-color:#151515;color:#e8e8d3;" class="language-diff "><code class="language-diff" data-lang="diff"><span>diff --git a/propolis/src/hw/nvme/mod.rs b/propolis/src/hw/nvme/mod.rs </span><span>index 59469fc..1fbca34 100644 </span><span style="background-color:#4e738a;color:#ffffff;">--- a/propolis/src/hw/nvme/mod.rs </span><span style="background-color:#4e738a;color:#ffffff;">+++ b/propolis/src/hw/nvme/mod.rs </span><span style="font-style:italic;color:#888888;">@@ -976,7 +976,7 @@ </span><span style="font-style:italic;color:#ffb964;">enum CtrlrReg { </span><span> } </span><span> </span><span> /// Size of the Controller Register space </span><span style="color:#a1000d;">-const CONTROLLER_REG_SZ: usize = 0x2000; </span><span style="color:#558f1f;">+const CONTROLLER_REG_SZ: usize = 0x4000; </span><span> </span><span> lazy_static! { </span><span> static ref CONTROLLER_REGS: (RegMap&lt;CtrlrReg&gt;, usize) = { </span><span style="font-style:italic;color:#888888;">@@ -1005,7 +1005,7 @@ </span><span style="font-style:italic;color:#ffb964;">lazy_static! 
{ </span><span> // Pad out to the next power of two </span><span> let regs_sz = layout.iter().map(|(_, sz)| sz).sum::&lt;usize&gt;(); </span><span> assert!(regs_sz.next_power_of_two() &lt;= CONTROLLER_REG_SZ); </span><span style="color:#a1000d;">- layout.last_mut().unwrap().1 = regs_sz.next_power_of_two() - regs_sz; </span><span style="color:#558f1f;">+ layout.last_mut().unwrap().1 = CONTROLLER_REG_SZ - regs_sz; </span><span> </span><span> // Find the offset of IOQueueDoorBells </span><span> let db_offset = layout </span><span> </span></code></pre> <p>A quick detour through <code>cargo build --release</code> and we're ready to get things going!</p> <p>We're just gonna jump straight to the meat of it and set a breakpoint at <code>GetNVMeRegisterAddress</code>. But before we go to it, let's take a peek at our BARs:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; !pci 1 0 5 0 </span><span>PCI Segment 0 Bus 0 </span><span>05:0 01de:1000.00 Cmd[0406:.mb...] Sts[0010:c....] Class:1:8:2 SubID:01de:1000 </span><span> cf8:80002800 IntPin:0 IntLine:0 Rom:0 cis:0 cap:40 </span><span> MEM[0]:fedfc004 MEM[4]:80000000 </span></code></pre> <p>That's promising, our BAR now reads <code>fedfc004</code> instead of <code>fedfe004</code>.</p> <p>Ok, let's see if <code>stornvme</code> is happy this time.</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; g </span><span>Breakpoint 0 hit </span><span>stornvme!GetNVMeRegisterAddress: </span><span>fffff801`7b33bccc 4053 push rbx </span><span>kd&gt; gu </span><span>stornvme!NVMeHwFindAdapter+0x1d6: </span><span>fffff801`7b3293c6 48898398000000 mov qword ptr [rbx+98h],rax </span><span>kd&gt; r @rax </span><span>rax=ffffc781fbb60000 </span></code></pre> <p>🎉! No more <code>NULL</code> pointer! 
And since <code>rax</code> here is a kernel virtual address mapped to our NVMe controller registers, we should be able to read the version field (@ offset 8) from it:</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; dd @rax+8 L1 </span><span>ffffc781`fbb60008 00010000 </span></code></pre> <p><code>00010000</code> is exactly what <a href="https://github.com/oxidecomputer/propolis/blob/4c9fbd1b3cd75896308264a60e6df3a011797807/propolis/src/hw/nvme/mod.rs#L612">Propolis gives us</a> which is <code>NVME_VER_1_0 = 0x10000</code>.</p> <p>Ok, but to be 100% sure, <code>NVMeHwFindAdapter</code> should no longer return <code>SP_RETURN_BAD_CONFIG</code> (3) but <code>SP_RETURN_FOUND</code> (1):</p> <pre data-lang="WinDbg" style="background-color:#151515;color:#e8e8d3;" class="language-WinDbg "><code class="language-WinDbg" data-lang="WinDbg"><span>kd&gt; gu </span><span>storport!RaCallMiniportFindAdapter+0x193: </span><span>fffff801`7ac546df 8bf0 mov esi,eax </span><span>kd&gt; r rax </span><span>rax=0000000000000001 </span></code></pre> <p>Woo! 🛳 it! If we let it go on its merry way now it just boots up fine. We're even able to RDP in (albeit it drops sporadically):</p> <figure class="center" > <img src="images/rdp.png" /> <figcaption class="center">RDP Session to a Windows Guest running as a Propolis VM</figcaption> </figure> <p><strong>EDIT</strong>: My colleague dug in and figured out the reason for the RDP instability [<a href="https://www.illumos.org/issues/14668#note-1">details</a>].</p> <h3 id="why-does-qemu-work">Why does QEMU work?</h3> <p>Propolis is not alone in this. QEMU was actually susceptible to this bug as well at one point. In fact the version (<code>QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.21)</code>) I was testing with is buggy in the same way. I just happened to get lucky about where the BARs got placed. 
After creating another test VM with a bunch of emulated NVMe devices attached I immediately ran into the issue (the NVMe devices that landed with unlucky BAR allotments failed to initialize in the same way as Propolis).</p> <p>Up until QEMU v6.0 (so just last year!), the NVMe device it emulated exposed a <code>BAR0</code>/<code>BAR1</code> of size <code>0x2000</code> just like us. If you search, you'll find strangely reminiscent bugs like this: <a href="https://bugs.launchpad.net/qemu/+bug/1576347"> Only one NVMe device is usable in Windows (10) guest</a>.</p> <p>Reading the bug reveals device manager complaining:</p> <blockquote> <p>The I/O device is configured incorrectly or the configuration parameters to the driver are incorrect.</p> </blockquote> <p>I wonder where we've heard that 🤔. That's just the corresponding text for <code>STATUS_DEVICE_CONFIGURATION_ERROR</code>.</p> <p>Then, how did QEMU v6.0 fix it? Well, as far as I can tell: accidentally? The <a href="https://github.com/qemu/qemu/commit/1901b4967c3fdd47e59d9023aea2285d94f3998a">change</a> that effectively bumped up the size was about free'ing up <code>BAR4</code> by just moving the MSI-X Table over to <code>BAR0/1</code>.</p> <h3 id="why-does-linux-work">Why does Linux work?</h3> <p>(<strong>EDIT</strong>: Added this section.)</p> <p>Up until this point we've been testing Propolis primarily with Linux and never ran into this issue. What gives? Well, Linux <a href="https://github.com/torvalds/linux/blob/f47c960e9395743a8aa3bd939d4d3a0f582f565e/drivers/nvme/host/pci.c#L2992-L2993">is not as stringent</a> as Windows about the size of <code>BAR0</code> initially: it only asks for a modest <code>0x2000</code> bytes to begin with and that's exactly what Propolis provided.</p> <p>Note that this doesn't mean Linux doesn't support NVMe controllers that need more space (say due to larger strides between each I/O queue doorbell register). 
It will attempt to remap the BAR before initializing each queue to make sure everything is within bounds.</p> <h2 id="conclusion">Conclusion</h2> <p>It's a wonder things work sometimes.</p> <p><strong>EDIT</strong>: <a href="https://github.com/oxidecomputer/propolis/pull/126">Fix</a> landed in Propolis.</p> <p><sub>Thanks to Jon for helping proofread.</sub></p> Achievement Unlocked: rustc segfault 2022-04-09T15:00:00+00:00 2022-04-24T15:00:00+00:00 https://luqman.ca/blog/achievement-unlocked-rustc-segfault/ <p><em>Originally posted as a <a href="https://gist.github.com/luqmana/be1af5b64d2cda5a533e3e23a7830b44">gist</a>.</em></p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ cargo build --example basic --features usdt-probes </span><span>[...snip...] </span><span>error: could not compile `dropshot` </span><span> </span><span>Caused by: </span><span> process didn&#39;t exit successfully: `rustc [...snip...]` (signal: 11, SIGSEGV: invalid memory reference) </span></code></pre> <p><strong>Achievement unlocked: <code>rustc</code> segfault.</strong></p> <span id="continue-reading"></span><details> <summary>Stack trace</summary> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>fffffc7fce3fcbc0 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityInfo::computeEestimateBlockWeight(llvm::Function const&amp;, llvm::DominatorTree*, llvm::PostDominatorTree*)+0xd84() </span><span>fffffc7fce3fd370 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityInfo::calculate(llvm::Function const&amp;, llvm::LoopInfo const&amp;, llvm::TargetLibraryInfo const*, llvm::DominatorTree*, llvm::PostDominatorTree*)+0x131() </span><span>fffffc7fce3fd3c0 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityAnalysis::run(llvm::Function&amp;, llvm::AnalysisManager&lt;llvm::Function&gt;&amp;)+0x134() </span><span>fffffc7fce3fd5f0 
librustc_driver-77cef3efbfa7284c.so`llvm::detail::AnalysisPassModel&lt;llvm::Function, llvm::BranchProbabilityAnalysis, llvm::PreservedAnalyses, llvm::AnalysisManager&lt;llvm::Function&gt;::Invalidator&gt;::run(llvm::Function&amp;, llvm::AnalysisManager&lt;llvm::Function&gt;&amp;)+0x2f() </span><span>fffffc7fce3fd6a0 librustc_driver-77cef3efbfa7284c.so`llvm::AnalysisManager&lt;llvm::Function&gt;::getResultImpl(llvm::AnalysisKey*, llvm::Function&amp;)+0x2de() </span><span>fffffc7fce3fd6d0 librustc_driver-77cef3efbfa7284c.so`llvm::BlockFrequencyAnalysis::run(llvm::Function&amp;, llvm::AnalysisManager&lt;llvm::Function&gt;&amp;)+0x3f() </span><span>fffffc7fce3fd710 librustc_driver-77cef3efbfa7284c.so`llvm::detail::AnalysisPassModel&lt;llvm::Function, llvm::BlockFrequencyAnalysis, llvm::PreservedAnalyses, llvm::AnalysisManager&lt;llvm::Function&gt;::Invalidator&gt;::run(llvm::Function&amp;, llvm::AnalysisManager&lt;llvm::Function&gt;&amp;)+0x26() </span><span>fffffc7fce3fd7c0 librustc_driver-77cef3efbfa7284c.so`llvm::AnalysisManager&lt;llvm::Function&gt;::getResultImpl(llvm::AnalysisKey*, llvm::Function&amp;)+0x2de() </span><span>fffffc7fce3fdc30 librustc_driver-77cef3efbfa7284c.so`llvm::AlwaysInlinerPass::run(llvm::Module&amp;, llvm::AnalysisManager&lt;llvm::Module&gt;&amp;)+0xa2c() </span><span>fffffc7fce3fdc50 librustc_driver-77cef3efbfa7284c.so`llvm::detail::PassModel&lt;llvm::Module, llvm::AlwaysInlinerPass, llvm::PreservedAnalyses, llvm::AnalysisManager&lt;llvm::Module&gt;&gt;::run(llvm::Module&amp;, llvm::AnalysisManager&lt;llvm::Module&gt;&amp;)+0x15() </span><span>fffffc7fce3fddc0 librustc_driver-77cef3efbfa7284c.so`llvm::PassManager&lt;llvm::Module, llvm::AnalysisManager&lt;llvm::Module&gt;&gt;::run(llvm::Module&amp;, llvm::AnalysisManager&lt;llvm::Module&gt;&amp;)+0x4b5() </span><span>fffffc7fce3ff170 librustc_driver-77cef3efbfa7284c.so`LLVMRustOptimizeWithNewPassManager+0x7f2() </span><span>fffffc7fce3ff3a0 
librustc_driver-77cef3efbfa7284c.so`rustc_codegen_llvm::back::write::optimize_with_new_llvm_pass_manager+0x372() </span><span>fffffc7fce3ff5b0 librustc_driver-77cef3efbfa7284c.so`rustc_codegen_llvm::back::write::optimize+0x388() </span><span>fffffc7fce3ff900 librustc_driver-77cef3efbfa7284c.so`rustc_codegen_ssa::back::write::execute_work_item::&lt;rustc_codegen_llvm::LlvmCodegenBackend&gt;+0x1f3() </span><span>fffffc7fce3ffdb0 librustc_driver-77cef3efbfa7284c.so`std::sys_common::backtrace::__rust_begin_short_backtrace::&lt;&lt;rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_ssa::traits::backend::ExtraBackendMethods&gt;::spawn_named_thread&lt;rustc_codegen_ssa::back::write::spawn_work&lt;rustc_codegen_llvm::LlvmCodegenBackend&gt;::{closure#0}, ()&gt;::{closure#0}, ()&gt;+0xf7() </span><span>fffffc7fce3fff60 librustc_driver-77cef3efbfa7284c.so`&lt;&lt;std::thread::Builder&gt;::spawn_unchecked_&lt;&lt;rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_ssa::traits::backend::ExtraBackendMethods&gt;::spawn_named_thread&lt;rustc_codegen_ssa::back::write::spawn_work&lt;rustc_codegen_llvm::LlvmCodegenBackend&gt;::{closure#0}, ()&gt;::{closure#0}, ()&gt;::{closure#1} as core::ops::function::FnOnce&lt;()&gt;&gt;::call_once::{shim:vtable#0}+0xa9() </span><span>fffffc7fce3fffb0 libstd-ef15f81a900bedf3.so`std::sys::unix::thread::Thread::new::thread_start::h24133bfe318082b5+0x27() </span><span>fffffc7fce3fffe0 libc.so.1`_thrp_setup+0x6c(fffffc7fed642280) </span><span>fffffc7fce3ffff0 libc.so.1`_lwp_start() </span></code></pre> </details> <p>Ok, so we're faulting somewhere in LLVM it seems like. 
From Cliff's initial investigation:</p> <blockquote> <p>Anyway, yeah, something about the CFG construction there is generating either an empty basic block or a basic block ending in an unexpected type of instruction (something that is not an LLVM IR terminator instruction) and triggering https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/BasicBlock.h#L121</p> </blockquote> <p>First order of business then is to just check if the IR is valid. LLVM has a pass to do just that and we can ask <code>rustc</code> to run it first by passing <code>-Z verify-llvm-ir=yes</code> (note we need to switch to nightly to use <code>-Z</code> flags):</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ RUSTFLAGS=&quot;-Z verify-llvm-ir=yes&quot; cargo +nightly build --example basic --features usdt-probes </span></code></pre> <p>Haha, nope:</p> <pre style="background-color:#151515;color:#e8e8d3;"><code><span>Basic Block in function &#39;_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE&#39; does not have terminator! </span><span>label %bb24 </span><span>LLVM ERROR: Broken module found, compilation aborted! </span><span> </span><span># Demangle w/ rustfilt (c++filt works well enough too) </span><span># Single quotes important to not misinterpret $ as shell vars! 
</span><span> </span><span>$ rustfilt &#39;_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE&#39; </span><span>dropshot::server::http_request_handle_wrap::{{closure}} </span></code></pre> <p>The IR generated for a closure in <code>dropshot::server::http_request_handle_wrap</code> is invalid—some basic block is missing a terminator.</p> <p>Ok, is it rustc generating the bad IR directly or the result of some transformation pass miscompiling it?</p> <p>But first, let's cheat and just get the final failing <code>rustc</code> command so we don't need to rebuild all the deps anytime we change <code>RUSTFLAGS</code>. Re-running the failing <code>cargo</code> command should just output the failing <code>rustc</code> invocation:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ cargo +nightly build --example basic --features usdt-probes </span><span> Compiling dropshot v0.6.1-dev (/src/dropshot/dropshot) </span><span>error: could not compile `dropshot` </span><span> </span><span>Caused by: </span><span> process didn&#39;t exit successfully: `rustc [...snip...]` (signal: 11, SIGSEGV: invalid memory reference) </span></code></pre> <p>From this point we can just directly run the <code>rustc</code> command as outputted with a few modifications:</p> <ul> <li>add <code>+nightly</code> otherwise the <code>rustc</code> wrapper will attempt to use the rust version mentioned in <code>rust-toolchain.toml</code></li> <li>remove the <code>--error-format=json</code> and <code>--json=...</code> flags for human-readable output</li> <li>add <code>-Z verify-llvm-ir=yes</code></li> <li>change the <code>--emit</code> argument to <code>--emit=llvm-ir</code> because that should be enough to trigger the issue and we'd like to look at the IR later</li> </ul> <p>Stick this in a simple shell script to easily modify it and run it; call it 
<code>repro.sh</code>. Verify it still fails as expected:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ./repro.sh </span><span>Basic Block in function &#39;_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE&#39; does not have terminator! </span><span>label %bb24 </span><span>LLVM ERROR: Broken module found, compilation aborted! </span></code></pre> <p>Now back to figuring out where this invalid IR is coming from. Even though we're doing a debug build, there are still some LLVM passes that get run. So if we want to verify the IR that <code>rustc</code> directly generated, we need to make sure no LLVM passes are run at all (aside from the <code>verify</code> pass itself). The way to do that is via <code>-C no-prepopulate-passes</code> so let's edit our <code>repro.sh</code> and run it again:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ./repro.sh </span><span>$ echo $? </span><span>0 </span></code></pre> <p>Ok <code>rustc</code> has been proven innocent. Looks like some LLVM pass generates invalid IR which really shouldn't happen! ⚠️</p> <p>Well, now what? Let's try to find out what pass is responsible!</p> <p>Our first attempt is by asking LLVM to print the IR after each pass—maybe we'll get lucky and see the offending pass last. 
We do this by modifying <code>repro.sh</code> again:</p> <ul> <li>remove <code>-C no-prepopulate-passes</code> &amp; <code>-Z verify-llvm-ir=yes</code></li> <li>add <code>-C llvm-args=--print-after-all</code> to print the IR after every pass</li> <li>add <code>-C codegen-units=1 -Z no-parallel-llvm</code> to make the output a bit more readable</li> </ul> <p>Alas, this doesn't go the way we want as we get the same segfault as before without any of the actual output we wanted :(</p> <p>Ok, new attempt. Let's skip <code>rustc</code> and see if we can just invoke the LLVM machinery directly via <code>opt</code>. For that, let's first install it:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ rustup component add --toolchain nightly llvm-tools-preview </span></code></pre> <p>It is not the most discoverable because it just gets plopped somewhere into <code>rustc</code>'s sysroot directory:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ OPT=$(find $(rustc +nightly --print sysroot) -name opt) </span></code></pre> <p>We also need the actual IR to pass to <code>opt</code> so let's go back and modify our <code>repro.sh</code> to only pass <code>-C no-prepopulate-passes</code>. We should find our initial <code>rustc</code> generated IR. 
It's also worth removing the <code>-C debuginfo=2</code> flag to make the IR a bit smaller:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ls ./target/debug/examples/basic*.ll </span><span>./target/debug/examples/basic-5f5f0491fbb5b7d3.ll </span></code></pre> <p>Let's try something simple first and just run the IR through <code>opt</code> without any flags as a smoke test:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ $OPT ./target/debug/examples/basic-5f5f0491fbb5b7d3.ll </span><span>opt: ./target/debug/examples/basic-5f5f0491fbb5b7d3.ll:425470:1: error: expected instruction opcode </span><span>bb25: ; preds = %bb24 </span><span>^ </span></code></pre> <p>😐 Wat. Looking at the IR around that line, we find this:</p> <pre data-lang="llvm" style="background-color:#151515;color:#e8e8d3;" class="language-llvm "><code class="language-llvm" data-lang="llvm"><span>bb24: </span><span style="color:#888888;">; [...snip...] 
</span><span> </span><span style="color:#ffb964;">%186</span><span> = invoke </span><span style="color:#8fbfdc;">i64 asm sideeffect</span><span> inteldialect </span><span style="color:#99ad6a;">&quot;990: clr rax\0A\0A .pushsection set_dtrace_probes,\22aw\22,\22progbits\22\0A .balign 8\0A 991:\0A .4byte 992f-991b // length\0A .byte 1\0A .byte 0\0A .2byte 1\0A .8byte 990b // address\0A .asciz \22dropshot\22\0A .asciz \22request-start\22\0A // null-terminated strings for each argument\0A .balign 8\0A 992: .popsection\0A \0A .pushsection yeet_dtrace_probes\0A .8byte 991b\0A .popsection\0A \0A &quot;</span><span>, </span><span style="color:#99ad6a;">&quot;=&amp;{ax}&quot;</span><span>() #</span><span style="color:#cf6a4c;">23 </span><span> to </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%bb25 </span><span>unwind </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%cleanup26</span><span>, !srcloc </span><span style="color:#ffb964;">!38 </span><span> store </span><span style="color:#8fbfdc;">i64 </span><span style="color:#ffb964;">%186</span><span>, </span><span style="color:#8fbfdc;">i64* </span><span style="color:#ffb964;">%is_enabled</span><span>, </span><span style="color:#8fbfdc;">align </span><span style="color:#cf6a4c;">8 </span><span> </span><span>bb25: </span><span style="color:#888888;">; preds = %bb24 </span><span> </span><span style="color:#ffb964;">%_78</span><span> = load </span><span style="color:#8fbfdc;">i64</span><span>, </span><span style="color:#8fbfdc;">i64* </span><span style="color:#ffb964;">%is_enabled</span><span>, </span><span style="color:#8fbfdc;">align </span><span style="color:#cf6a4c;">8 </span><span> </span><span style="color:#ffb964;">%187</span><span> = icmp eq </span><span style="color:#8fbfdc;">i64 </span><span style="color:#ffb964;">%_78</span><span>, </span><span style="color:#cf6a4c;">0 </span><span> br </span><span style="color:#8fbfdc;">i1 </span><span 
style="color:#ffb964;">%187</span><span>, </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%bb46</span><span>, </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%bb26 </span></code></pre> <p>Well, that looks awfully like the error the LLVM IR verifier was telling us about (<code>%bb24</code> not having a terminator)! So it looks like our assumption about <code>rustc</code> not being the one generating the invalid IR is wrong. Where did we go wrong?</p> <p>Some light digging into the <code>rustc</code> source reveals that using the new LLVM pass manager (default for LLVM &gt;= 13 thus Rust &gt;= 1.56) means <code>-Z verify-llvm-ir=yes</code> is ignored when combined with <code>-C no-prepopulate-passes</code>. Whelp :/</p> <p>(To be clear, it's not really the new pass manager's fault but rather the way it is set up in <code>LLVMRustOptimizeWithNewPassManager</code>).</p> <p>So we've found one (minor) <code>rustc</code> bug so far but that doesn't help solve our original question. No worries, we can either switch to the old pass manager (<code>-Z new-llvm-pass-manager=no</code>) or just manually add the verifier pass (<code>-C passes=&quot;verify&quot;</code>); either way we get the same ol error:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ ./repro.sh </span><span>Basic Block in function &#39;_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE&#39; does not have terminator! </span><span>label %bb24 </span><span>LLVM ERROR: Broken module found, compilation aborted! </span></code></pre> <p>This brings us back to the actual culprit: <code>rustc</code>!</p> <p><img src="https://luqman.ca/blog/achievement-unlocked-rustc-segfault/images/rustc-scooby-meme.jpg" alt="Scooby Doo Mask Reveal Meme: Panel 1 w/ Mask on &quot;Invalid IR Generator&quot;. 
Panel 2 w/ Mask off &quot;LLVM&quot;. Panel 3 w/ Mask under the mask! Panel 4 w/ Mask off &quot;rustc&quot;" /></p> <p>(We really should have suspected this after the brief foray with trying to print the resulting IR after every LLVM pass didn't give us anything: the IR we fed it was botched to begin with!).</p> <p>So back to the invalid IR: at the end of <code>bb24</code> we are using <code>invoke</code> (a terminator) with our inline assembly from the <a href="https://github.com/oxidecomputer/usdt">usdt</a> probes in <code>dropshot</code> followed by a <code>store</code> instruction. Clearly this is wrong because <code>store</code> isn't a terminator and thus we shouldn't end a basic block with it. Let's see what the corresponding MIR (Rust's Mid-level IR) looks like.</p> <p>Since the failing code is coming from <code>dropshot</code> and not the basic example itself, we can't use our <code>repro.sh</code> hack and so back we go to <code>cargo</code> and <code>RUSTFLAGS</code>, using <code>-Z dump-mir='http_request_handle_wrap'</code>:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ RUSTFLAGS=&quot;-Z dump-mir=&#39;http_request_handle_wrap&#39;&quot; cargo +nightly build --example basic --features usdt-probes </span><span>[...snip...] </span><span>(signal: 11, SIGSEGV: invalid memory reference) </span><span>$ ls mir_dump </span><span>mir_dump: No such file or directory </span></code></pre> <p>Ok, that isn't working as expected (<code>-Z dump-mir=F</code> should print just the MIR for functions matching the filter <code>F</code> and place it in a <code>mir_dump</code> folder). 
A bit annoying and we can only shave so many yaks right now but nothing a bigger hammer can't fix (just use <code>--emit=mir</code> to dump out all the MIR into the target folder and find the corresponding one for <code>dropshot</code>):</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ RUSTFLAGS=&quot;--emit=mir&quot; cargo +nightly build --example basic --features usdt-probes </span><span>$ ls target/debug/deps/dropshot-*.mir </span><span>target/debug/deps/dropshot-30d947b7471013cc.mir </span></code></pre> <p>Ok, now this looks reasonable:</p> <pre data-lang="rust" style="background-color:#151515;color:#e8e8d3;" class="language-rust "><code class="language-rust" data-lang="rust"><span>bb24: { </span><span>[...snip...] </span><span> asm!(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">990: clr rax </span><span style="color:#99ad6a;"> </span><span style="color:#99ad6a;"> .pushsection set_dtrace_probes,\&quot;aw\&quot;,\&quot;progbits\&quot; </span><span style="color:#99ad6a;"> .balign 8 </span><span style="color:#99ad6a;"> 991: </span><span style="color:#99ad6a;"> .4byte 992f-991b // length </span><span style="color:#99ad6a;"> .byte 1 </span><span style="color:#99ad6a;"> .byte 0 </span><span style="color:#99ad6a;"> .2byte 1 </span><span style="color:#99ad6a;"> .8byte 990b // address </span><span style="color:#99ad6a;"> .asciz \&quot;dropshot\&quot; </span><span style="color:#99ad6a;"> .asciz \&quot;request-start\&quot; </span><span style="color:#99ad6a;"> // null-terminated strings for each argument </span><span style="color:#99ad6a;"> .balign 8 </span><span style="color:#99ad6a;"> 992: .popsection </span><span style="color:#99ad6a;"> </span><span style="color:#99ad6a;"> .pushsection yeet_dtrace_probes </span><span style="color:#99ad6a;"> .8byte 991b </span><span style="color:#99ad6a;"> .popsection </span><span 
style="color:#99ad6a;"> </span><span style="color:#99ad6a;"> </span><span style="color:#556633;">&quot;</span><span>, out(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">ax</span><span style="color:#556633;">&quot;</span><span>) </span><span style="color:#7697d6;">_77</span><span>, options(</span><span style="color:#7697d6;">NOMEM </span><span>| </span><span style="color:#7697d6;">PRESERVES_FLAGS </span><span>| </span><span style="color:#7697d6;">NOSTACK</span><span>)) -&gt; [</span><span style="color:#8fbfdc;">return</span><span>: bb25, unwind: bb217]; </span><span style="color:#888888;">// scope 10 at dropshot/src/lib.rs:581:1: 581:41 </span><span> } </span><span> </span><span> bb25: { </span><span> </span><span style="color:#7697d6;">_78 </span><span>= </span><span style="color:#7697d6;">_77</span><span>; </span><span style="color:#888888;">// scope 9 at dropshot/src/lib.rs:581:1: 581:41 </span><span> switchInt(</span><span style="color:#8fbfdc;">move </span><span style="color:#7697d6;">_78</span><span>) -&gt; [</span><span style="color:#cf6a4c;">0_</span><span style="color:#8fbfdc;">u64</span><span>: bb46, otherwise: bb26]; </span><span style="color:#888888;">// scope 9 at dropshot/src/lib.rs:581:1: 581:41 </span><span> } </span></code></pre> <p>Note in MIR, <code>asm</code> itself is a terminator and so <code>bb24</code> here appropriately says that under normal control flow to go to <code>bb25</code> or if unwinding go to <code>bb217</code>. In <code>bb25</code> we see a simple statement, <code>_78 = _77;</code>, which is assigning the output (<code>_77</code> i.e. <code>is_enabled</code>) from the <code>asm</code> and this should correspond to the <code>store</code> we saw in the LLVM IR.</p> <p>So looks like something in the lowering from Rust MIR to LLVM IR is not quite right. Just eyeballing it, it seems like the store of the asm output is getting added to the wrong LLVM basic block. 
It should be part of the &quot;normal&quot; basic block taken by the <code>invoke</code> (indicated by the <code>to label %bb25</code> argument), but instead it is incorrectly placed directly after the <code>invoke</code>.</p> <p>That seems to happen <a href="https://github.com/rust-lang/rust/blob/d32ce37a171663048a4c4a536803434e40f52bd6/compiler/rustc_codegen_llvm/src/asm.rs#L293-L306">here</a> in <code>rustc</code>.</p> <p>Now that we have a pretty good idea of why things fail we should be able to make a smaller repro (*):</p> <pre data-lang="rust" style="background-color:#151515;color:#e8e8d3;" class="language-rust "><code class="language-rust" data-lang="rust"><span>#![</span><span style="color:#ffb964;">feature</span><span>(asm_unwind)] </span><span> </span><span style="color:#8fbfdc;">fn </span><span style="color:#fad07a;">main</span><span>() { </span><span> </span><span style="color:#8fbfdc;">let</span><span> _x = String::from(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">string here just cause we need something with a non-trivial drop</span><span style="color:#556633;">&quot;</span><span>); </span><span> </span><span style="color:#8fbfdc;">let</span><span> foo: </span><span style="color:#8fbfdc;">u64</span><span>; </span><span> </span><span style="color:#8fbfdc;">unsafe </span><span>{ </span><span> std::arch::asm!( </span><span> </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">mov {}, 1</span><span style="color:#556633;">&quot;</span><span>, </span><span> out(reg) foo, </span><span> options(may_unwind) </span><span> ); </span><span> } </span><span> println!(</span><span style="color:#556633;">&quot;</span><span style="color:#7697d6;">{}</span><span style="color:#556633;">&quot;</span><span>, foo); </span><span>} </span></code></pre> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ 
rustc +nightly asm-miscompile.rs </span><span>[1] 6057 segmentation fault (core dumped) rustc +nightly asm-miscompile.rs </span></code></pre> <p>Cool, we've got our smaller repro segfaulting, but is it the same issue? Let's take a look at what the LLVM IR says:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ rustc +nightly asm-miscompile.rs --emit=llvm-ir -C no-prepopulate-passes </span></code></pre> <pre data-lang="llvm" style="background-color:#151515;color:#e8e8d3;" class="language-llvm "><code class="language-llvm" data-lang="llvm"><span>$ less </span><span style="color:#8fbfdc;">asm</span><span>-miscompile.ll </span><span> </span><span>[...snip...] </span><span>bb1: </span><span style="color:#888888;">; preds = %start </span><span> </span><span style="color:#ffb964;">%1</span><span> = invoke </span><span style="color:#8fbfdc;">i64 asm sideeffect alignstack</span><span> inteldialect unwind </span><span style="color:#99ad6a;">&quot;mov ${0:q}, 1&quot;</span><span>, </span><span style="color:#99ad6a;">&quot;=&amp;r,~{dirflag},~{fpsr},~{flags},~{memory}&quot;</span><span>() </span><span> to </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%bb2 </span><span>unwind </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%cleanup</span><span>, !srcloc </span><span style="color:#ffb964;">!9 </span><span> store </span><span style="color:#8fbfdc;">i64 </span><span style="color:#ffb964;">%1</span><span>, </span><span style="color:#8fbfdc;">i64* </span><span style="color:#ffb964;">%foo</span><span>, </span><span style="color:#8fbfdc;">align </span><span style="color:#cf6a4c;">8 </span><span> </span><span>bb2: </span><span>[...snip...] 
</span></code></pre> <p>Would you look at that, we've got a <code>store</code> as the last instruction in the basic block with the <code>asm</code>.</p> <p>Now that we have a small repro and know roughly where the issue is in <code>rustc</code> we can try fixing it:</p> <pre data-lang="diff" style="background-color:#151515;color:#e8e8d3;" class="language-diff "><code class="language-diff" data-lang="diff"><span>diff --git a/compiler/rustc_codegen_llvm/src/asm.rs b/compiler/rustc_codegen_llvm/src/asm.rs </span><span>index 03c390b4bd4..91d132eb343 100644 </span><span style="background-color:#4e738a;color:#ffffff;">--- a/compiler/rustc_codegen_llvm/src/asm.rs </span><span style="background-color:#4e738a;color:#ffffff;">+++ b/compiler/rustc_codegen_llvm/src/asm.rs </span><span style="font-style:italic;color:#888888;">@@ -290,6 +290,11 @@ </span><span style="font-style:italic;color:#ffb964;">fn codegen_inline_asm( </span><span> } </span><span> attributes::apply_to_callsite(result, llvm::AttributePlace::Function, &amp;{ attrs }); </span><span> </span><span style="color:#558f1f;">+ // Switch to the &#39;normal&#39; basic block if we did an `invoke` instead of a `call` </span><span style="color:#558f1f;">+ if let Some((dest, _, _)) = dest_catch_funclet { </span><span style="color:#558f1f;">+ self.switch_to_block(dest); </span><span style="color:#558f1f;">+ } </span><span style="color:#558f1f;">+ </span><span> // Write results to outputs </span><span> for (idx, op) in operands.iter().enumerate() { </span><span> if let InlineAsmOperandRef::Out { reg, place: Some(place), .. } </span></code></pre> <p>A little waiting later and we can try compiling our small repro again:</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ rustc +stage1 asm-miscompile.rs </span><span>$ echo $? </span><span>0 </span></code></pre> <p>🎉 Success! 
So what does the LLVM IR look like now?</p> <pre data-lang="console" style="background-color:#151515;color:#e8e8d3;" class="language-console "><code class="language-console" data-lang="console"><span>$ rustc +stage1 asm-miscompile.rs --emit=llvm-ir -C no-prepopulate-passes </span></code></pre> <pre data-lang="llvm" style="background-color:#151515;color:#e8e8d3;" class="language-llvm "><code class="language-llvm" data-lang="llvm"><span>$ less </span><span style="color:#8fbfdc;">asm</span><span>-miscompile.ll </span><span> </span><span>[...snip...] </span><span>bb1: </span><span style="color:#888888;">; preds = %start </span><span> </span><span style="color:#ffb964;">%1</span><span> = invoke </span><span style="color:#8fbfdc;">i64 asm sideeffect alignstack</span><span> inteldialect unwind </span><span style="color:#99ad6a;">&quot;mov ${0:q}, 1&quot;</span><span>, </span><span style="color:#99ad6a;">&quot;=&amp;r,~{dirflag},~{fpsr},~{flags},~{memory}&quot;</span><span>() </span><span> to </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%bb2 </span><span>unwind </span><span style="color:#8fbfdc;">label </span><span style="color:#ffb964;">%cleanup</span><span>, !srcloc </span><span style="color:#ffb964;">!9 </span><span> </span><span>bb2: </span><span> store </span><span style="color:#8fbfdc;">i64 </span><span style="color:#ffb964;">%1</span><span>, </span><span style="color:#8fbfdc;">i64* </span><span style="color:#ffb964;">%foo</span><span>, </span><span style="color:#8fbfdc;">align </span><span style="color:#cf6a4c;">8 </span><span>[...snip...] </span></code></pre> <p>The store has moved down into <code>%bb2</code> right where it should be.</p> <p>(*) <strong>Note</strong>: we had to add <code>options(may_unwind)</code> and an unused <code>String</code> variable to actually get it to fail. Removing either of those will stop it from segfaulting. The difference is in the LLVM IR that <code>rustc</code> generates. 
Without both pieces, <code>rustc</code> just uses a simple <code>call</code> instruction for the inline asm, whereas in the broken case it's using <code>invoke</code> which, unlike <code>call</code>, is considered a terminator. After the <code>invoke</code>, control transfers to either the 'normal' label or the 'unwind' label. By marking our asm with <code>options(may_unwind)</code> we essentially tell <code>rustc</code> to opt our inline assembly into participating in unwinding. The unused string is there so that there's actually something to clean up in the case that we do unwind.</p> <p>But, looking at the asm from the original failing code in <code>dropshot</code> there's no mention of <code>MAY_UNWIND</code>:</p> <pre data-lang="rust" style="background-color:#151515;color:#e8e8d3;" class="language-rust "><code class="language-rust" data-lang="rust"><span>asm!(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">[...snip...]</span><span style="color:#556633;">&quot;</span><span>, out(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">ax</span><span style="color:#556633;">&quot;</span><span>) </span><span style="color:#7697d6;">_77</span><span>, options(</span><span style="color:#7697d6;">NOMEM </span><span>| </span><span style="color:#7697d6;">PRESERVES_FLAGS </span><span>| </span><span style="color:#7697d6;">NOSTACK</span><span>)) -&gt; [</span><span style="color:#8fbfdc;">return</span><span>: bb25, unwind: bb217]; </span><span style="color:#888888;">// scope 10 at dropshot/src/lib.rs:581:1: 581:41 </span></code></pre> <p><a href="https://github.com/oxidecomputer/usdt/blob/c32ec7af3394eee0fbc5c539724f8a89e8a6be48/usdt-impl/src/no-linker.rs#L95">usdt</a> certainly isn't adding it so what gives?</p> <p>Poking through <code>rustc</code>, it seems like that may be due to the way MIR inlining is <a 
href="https://github.com/rust-lang/rust/blob/297dde9b1ad0c28922fac5046f77c2629cebf662/compiler/rustc_mir_transform/src/inline.rs#L963-L971">implemented</a>. A cleanup block may be assigned if a terminator (like inline asm) gets inlined. The lowering pass will use the presence of such a cleanup target to ultimately decide whether to use <code>call</code> or <code>invoke</code>.</p> <p>But that theory is quickly shot down because the MIR inliner is still disabled by default.</p> <p>Just searching through <code>rustc</code> for other places the unwind target of a terminator might be set yields something promising when it comes to <a href="https://github.com/rust-lang/rust/blob/5160f8f843e1dbd43cf341cc8aa5d917d22c98b9/compiler/rustc_mir_transform/src/generator.rs#L1098-L1102">generators</a>. This would track as the original <code>dropshot</code> failure case is in the context of an async method. Further confirmation that this is where our InlineAsm terminator is getting an unwind target set is that the <code>poison_block</code> mentioned in that bit of code lines up. 
It has a <a href="https://github.com/rust-lang/rust/blob/5160f8f843e1dbd43cf341cc8aa5d917d22c98b9/compiler/rustc_mir_transform/src/generator.rs#L1080">single statement</a> and if we hop back to the MIR of the failing <code>dropshot</code> example, lo and behold:</p> <pre data-lang="rust" style="background-color:#151515;color:#e8e8d3;" class="language-rust "><code class="language-rust" data-lang="rust"><span> asm!(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">[...snip...]</span><span style="color:#556633;">&quot;</span><span>, out(</span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">ax</span><span style="color:#556633;">&quot;</span><span>) </span><span style="color:#7697d6;">_77</span><span>, options(</span><span style="color:#7697d6;">NOMEM </span><span>| </span><span style="color:#7697d6;">PRESERVES_FLAGS </span><span>| </span><span style="color:#7697d6;">NOSTACK</span><span>)) -&gt; [</span><span style="color:#8fbfdc;">return</span><span>: bb25, unwind: bb217]; </span><span> </span><span>[...snip...] 
</span><span> </span><span>bb217 (cleanup): { </span><span> discriminant((*(</span><span style="color:#7697d6;">_1</span><span>.</span><span style="color:#cf6a4c;">0</span><span>: &amp;</span><span style="color:#8fbfdc;">mut </span><span>[</span><span style="color:#8fbfdc;">static</span><span> generator@dropshot/src/server.rs:</span><span style="color:#cf6a4c;">651</span><span>:</span><span style="color:#cf6a4c;">43</span><span>: </span><span style="color:#cf6a4c;">741</span><span>:</span><span style="color:#cf6a4c;">2</span><span>]))) = </span><span style="color:#cf6a4c;">2</span><span>; </span><span style="color:#888888;">// scope 0 at dropshot/src/server.rs:651:43: 741:2 </span><span> resume; </span><span style="color:#888888;">// scope 0 at dropshot/src/server.rs:651:43: 741:2 </span><span>} </span></code></pre> <p><code>bb217</code> there is the unwind target and it matches exactly with the <code>poison_block</code> as constructed in <a href="https://github.com/rust-lang/rust/blob/5160f8f843e1dbd43cf341cc8aa5d917d22c98b9/compiler/rustc_mir_transform/src/generator.rs#L1079-L1083">rustc</a>.</p> <p>Armed with some new evidence, we can adapt our simple repro to more closely match the async scenario encountered in <code>dropshot</code>:</p> <pre data-lang="rust" style="background-color:#151515;color:#e8e8d3;" class="language-rust "><code class="language-rust" data-lang="rust"><span>extern crate futures; </span><span style="color:#888888;">// 0.3.21 </span><span> </span><span>async </span><span style="color:#8fbfdc;">fn </span><span style="color:#fad07a;">bar</span><span>() { </span><span> </span><span style="color:#8fbfdc;">let</span><span> foo: </span><span style="color:#8fbfdc;">u64</span><span>; </span><span> </span><span style="color:#8fbfdc;">unsafe </span><span>{ </span><span> std::arch::asm!( </span><span> </span><span style="color:#556633;">&quot;</span><span style="color:#99ad6a;">mov {}, 1</span><span style="color:#556633;">&quot;</span><span>, 
</span><span> out(reg) foo, </span><span> ); </span><span> } </span><span> println!(</span><span style="color:#556633;">&quot;</span><span style="color:#7697d6;">{}</span><span style="color:#556633;">&quot;</span><span>, foo); </span><span>} </span><span> </span><span style="color:#8fbfdc;">fn </span><span style="color:#fad07a;">main</span><span>() { </span><span> futures::executor::block_on(bar()); </span><span>} </span></code></pre> <p>(Segfaults on <a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=1c7781c34dd4a3e80ae4bd936a0c82fc">playground</a>.)</p> <p>Thus all the mysteries are solved:</p> <ol> <li>The MIR -&gt; LLVM IR lowering for inline assembly emitted invalid LLVM IR whenever the asm was lowered to an <code>invoke</code> instruction (Fix submitted <a href="https://github.com/rust-lang/rust/pull/95864">here</a>).</li> <li>Every <code>async fn</code> in Rust is implemented as a generator and, as part of that, terminators in the basic blocks of such a function are modified to include a cleanup target if they can unwind (*) so as to poison the generator (**).</li> <li>We also discovered a <a href="https://github.com/rust-lang/rust/issues/95874">bug</a> with <code>-Z verify-llvm-ir=yes</code> and <code>-C no-prepopulate-passes</code> along the way (Fix submitted <a href="https://github.com/rust-lang/rust/pull/95893">here</a>).</li> </ol> <p>(*) Note that the segfault in the above <a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=1c7781c34dd4a3e80ae4bd936a0c82fc">playground</a> goes away if you remove the <code>println!</code> from <code>bar</code> because that is the only part that may actually unwind.</p> <p>(**) If you've ever seen a panic saying &quot;<code>async fn</code> resumed after panicking&quot;, this is why.</p> <p><strong>Update</strong>: The above bugs are fixed as of <code>rustc 1.62.0-nightly (52ca603da 2022-04-12)</code>.</p>
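<p>To recap the first bug in miniature, here is a toy model of the lowering. It is only a sketch and not <code>rustc</code> or LLVM code: the <code>Inst</code>, <code>Block</code> and <code>Builder</code> types, along with <code>lower_asm</code> and <code>verify</code>, are hypothetical names made up for illustration (only <code>switch_to_block</code> borrows its name from the diff above). It assumes a single asm output and a three-block layout, and shows why emitting the <code>store</code> without first switching blocks leaves a basic block that doesn't end in a terminator.</p>

```rust
// Toy model only: these types are hypothetical, not rustc or LLVM APIs.

#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// `invoke`-like terminator: control leaves for `normal` or `unwind`.
    Invoke { normal: usize, unwind: usize },
    /// Writes the asm output to its destination; not a terminator.
    Store,
    /// `ret`-like terminator.
    Ret,
}

impl Inst {
    fn is_terminator(&self) -> bool {
        matches!(self, Inst::Invoke { .. } | Inst::Ret)
    }
}

#[derive(Debug, Default)]
struct Block {
    insts: Vec<Inst>,
}

/// Appends instructions to whichever block is currently selected.
struct Builder {
    blocks: Vec<Block>,
    current: usize,
}

impl Builder {
    fn new(block_count: usize) -> Self {
        Builder {
            blocks: (0..block_count).map(|_| Block::default()).collect(),
            current: 0,
        }
    }

    fn switch_to_block(&mut self, block: usize) {
        self.current = block;
    }

    fn push(&mut self, inst: Inst) {
        self.blocks[self.current].insts.push(inst);
    }
}

/// Lowers one unwinding asm with a single output.
/// Block 0 holds the asm, block 1 is the normal path, block 2 is cleanup.
/// `switch_after_invoke` models the fix: move to the normal block before storing.
fn lower_asm(switch_after_invoke: bool) -> Vec<Block> {
    let mut b = Builder::new(3);
    b.push(Inst::Invoke { normal: 1, unwind: 2 });
    if switch_after_invoke {
        b.switch_to_block(1);
    }
    b.push(Inst::Store); // lands after the invoke in block 0 when not fixed
    b.switch_to_block(1);
    b.push(Inst::Ret);
    b.switch_to_block(2);
    b.push(Inst::Ret);
    b.blocks
}

/// A block is well formed if its only terminator is its last instruction
/// and any invoke targets refer to existing blocks.
fn verify(blocks: &[Block]) -> bool {
    blocks.iter().all(|blk| match blk.insts.split_last() {
        Some((last, rest)) => {
            let targets_ok = match last {
                Inst::Invoke { normal, unwind } => {
                    *normal < blocks.len() && *unwind < blocks.len()
                }
                _ => true,
            };
            last.is_terminator() && targets_ok && rest.iter().all(|i| !i.is_terminator())
        }
        None => false,
    })
}
```

<p>In this model <code>verify(&amp;lower_asm(false))</code> fails because block 0 ends in a <code>Store</code>, while <code>verify(&amp;lower_asm(true))</code> passes with the <code>Store</code> sitting at the top of block 1, mirroring the before and after LLVM IR shown earlier.</p>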