Junpeng Wan /dʒuːn.pɛŋ wɑːn/ (joon-peng wahn) 万俊鹏 Computer Security & Systems stefan1wan.github.io/about/ Sat, 18 Apr 2026 16:31:17 +0000 Jekyll v3.10.0 ARM BTB reverse engineering <h2 id="update">Update</h2> <p>I reproduced this work on a Raspberry Pi 4B. The report can be found on <a href="https://arxiv.org/abs/2412.05413">arXiv</a>, and it is much clearer than this blog post. The corresponding code is available <a href="https://github.com/stefan1wan/BTB_ARM_RE">here</a>.</p> <h2 id="intro">Intro</h2> <p>A year and a half ago, I needed to figure out the <a href="http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html">BTB (branch target buffer)</a> capacity of an ARM server, but no public documentation could be found at the time. The good news was that previous work had reverse engineered BTB capacity on x86 architectures [1][2]. I was lucky enough to reproduce it, and according to my results on a Kunpeng 920, the BTB capacity is 4K entries. My code can be found <a href="https://github.com/stefan1wan/BTB_ARM_RE">here</a>.</p> <h2 id="details">Details</h2> <p>My method is to count a PMU event called ARM_PMU_BR_MIS_PRED while executing a bunch of branches. This event counts branches that are mispredicted or not predicted [3].</p> <p>The following pseudocode is our test gadget, which consists of fall-through unconditional indirect branches. We control the branch count <strong>B</strong> and the alignment distance <strong>N</strong>, where N is the gap in bytes between two labels (or basic blocks).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adr x0, next1
br x0
nop
nop
.......(some other nops)
nop
next1:
adr x0, next2
br x0
nop
nop
.......(some other nops)
nop
next2:
adr x0, next3
br x0
nop
nop
.......(some other nops)
nop
next3:
ret
</code></pre></div></div> <p>We count branch misprediction rates for different values of B and N.
For each test, we first execute the test gadget 10 times to warm up the BTB; then we run it once more and log the change in the PMU event counter (C). The branch misprediction rate is then C/B. The result is shown in the following diagram. <img src="/images/posts/BTB/ARM-capacity.png" style="zoom:100%" /></p> <p>From the diagram, we can conclude:</p> <ul> <li>The BTB set index starts from bit 5: for B from 3K to 7K, the misprediction rate stays almost the same when N goes from 16 to 32 (32=2^5), but there is a leap from 32 to 64.</li> <li>The BTB holds 4096 entries: when N is 32, the rate is 0.01 for B=4K but 0.39 for B=5K. However, 40-50 branches are still not buffered by the BTB, which may be caused by timer interrupts or other noise.</li> </ul> <h2 id="a-little-bit-more">A little bit more</h2> <p><em>I am not sure about the following content; corrections are welcome.</em></p> <h3 id="ways">Ways</h3> <p>We already know that the set index starts from bit 5. Since the BTB capacity is 4K, there are at most 12 set index bits (2^12=4096). Therefore, if we let N &gt;= 2^17, all branches fall into the same set. By counting branch mispredictions, we can determine the number of ways in each set. Here is our result. <img src="/images/posts/BTB/ARM-setindex.png" style="zoom:70%" /> From the picture, we may conclude that each set has 8 ways, because when B=8 there is always a low miss rate for log(N) &gt;= 17. However, if the BTB has an eviction buffer, the number of ways may be smaller (like 4 or 6). If there are 8 ways, then there are 512 sets (8*512=4096) and the set index should be bits 5-13. However, we observe that when log(N)=14 and B=16, the miss rate is 0, which implies that the 16 branches are held by different sets, so bit 14 is also used for indexing. As the picture shows, each bit from 5 to 14 can influence the set distribution and thus the results. There should be a hash function that maps bits 5-14 to the 512 BTB sets. If there are 4 ways and a BTB eviction buffer exists, then there are 1024 sets (4*1024=4096) and the set index should be bits 5-14.</p> <p>According to Ockham’s razor, I tend to believe the BTB is 4-way and some eviction buffer exists, but more exploration is needed to confirm that.</p> <h2 id="reference">Reference</h2> <ul> <li>[1] <a href="https://ieeexplore.ieee.org/document/4919652">Experiment Flows and Microbenchmarks for Reverse Engineering of Branch Predictor Structures</a></li> <li>[2] <a href="https://xania.org/201602/bpu-part-three">The BTB in contemporary Intel chips</a></li> <li>[3] Armv8-M Architecture Reference Manual</li> </ul> Tue, 14 May 2024 00:00:00 +0000 stefan1wan.github.io/about/2024/05/BTB/ KnowledgeShare Change SSD and Battery for my old MBP <h2 id="intro">Intro</h2> <p>I bought my MacBook Pro five years ago. Now, in the winter of 2021, its battery life is very short, about one hour of typing, and its SSD offers only 256GB of space, so I have to store my virtual machines and some data on an external SSD. Fortunately, I found that for my model, Retina, 15-inch, Mid 2015 (A1398), both the battery and the SSD can be replaced.</p> <h2 id="my-choices">My choices</h2> <p>I learned the SSD strategy from [1] and the battery choice from [2]. It cost me about 1000 RMB altogether, which I think is worth it because this way I do not need to buy a new computer.</p> <h3 id="ssd">SSD</h3> <p>The SSD interface in the Mac is mSATA. However, an mSATA SSD is very expensive, so I chose an M.2 SSD with an <a href="https://item.m.jd.com/product/100017938802.html">M.2-to-mSATA converter</a>, which is much cheaper. Since my model supports PCIe 3.0 x4, whose theoretical maximum bandwidth is 4GB/s, it was better for me to buy a high-speed SSD.
My choice was a Samsung 970 EVO Plus, 512GB, which can achieve sequential read/write speeds of 3500MB/s and 3300MB/s. (This is not an advertisement.) Note that the Samsung 980 cannot be used as the system disk in macOS. (I wasted an afternoon on that.)</p> <h3 id="battery">Battery</h3> <p>I bought a battery kit from <a href="https://item.jd.com/4494203.html">Jingdong</a> that contains a full set of tools, like screwdrivers and anti-static gloves.</p> <h2 id="procedures">Procedures</h2> <p>The following are my procedures:</p> <ul> <li>back up the system using Time Machine</li> <li>remove the screws on the back of the computer and take off the back lid (there is so much dirt!)</li> <li>unplug the battery connector to prevent accidents</li> <li>change the SSD <ul> <li>unscrew the screws of the SSD, like this <img src="/images/posts/SSD&amp;Battery/mSATA_SSD.jpg" style="zoom:20%" /></li> <li>remove the old SSD</li> <li>insert the M.2-to-mSATA converter into the mSATA interface</li> <li>put the new M.2 SSD on the M.2-to-mSATA converter</li> <li>tighten the screws</li> </ul> </li> <li>change the battery (there is a detailed guide in the battery case) <ul> <li>release the wires of the touchpad</li> <li>use a pry tool and ethanol to pry off the old battery (it may take you a while)</li> <li>put the new battery in and plug in the battery connector</li> <li>put back the wires of the touchpad</li> </ul> </li> <li>re-tighten the screws</li> <li>start the system with “cmd+option+r+power”, and it will recover from the Internet</li> <li>erase the disk as APFS with a GUID partition map, then restore the system from Time Machine</li> </ul> <h2 id="ends">Ends</h2> <p>My old battery looks like this: <img src="/images/posts/SSD&amp;Battery/Old_battery.jpg" style="zoom:20%" /> If I had known this earlier, I would not have carried it around every day; it looks like it could explode at any time. Now I have a much better battery life. Most importantly, I no longer need to worry much about disk space.
<img src="/images/posts/SSD&amp;Battery/newstorage.png" style="zoom:50%" /></p> <h2 id="reference">Reference</h2> <ul> <li>[1] <a href="https://post.smzdm.com/p/a783vk9g/">SSD</a></li> <li>[2] <a href="https://post.smzdm.com/p/a78zn859/">Battery</a></li> </ul> Sat, 20 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/SSD&Battery/ Tutorial Drivers <h1 id="drivers">Drivers</h1> <h2 id="intro">Intro</h2> <p>I wrote some <a href="/files/Drivers.key">slides</a> to share basic knowledge about drivers. My main reference is <a href="http://gauss.ececs.uc.edu/Courses/e4022/code/drivers/Kernel/docs.html">Writing Network Device Drivers for Linux</a>.</p> Sun, 14 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/Drivers/ KnowledgeShare RSS <h1 id="rss-you-decide-what-you-read">RSS: You decide what you read</h1> <h3 id="introduction">Introduction</h3> <p>Sometimes I feel bored and want to read and learn something new, but I don’t know what to read: the bookmarks in my Chrome are chaotic, and news apps send me a lot of things I don’t care about. As a result, I end up spending a lot of time on Moments and Weibo. Recently, I found that <a href="https://en.wikipedia.org/wiki/RSS">RSS</a> could solve this problem by making it easy to access new updates from the websites I am interested in. In fact, I have collected a lot of information sources from the Internet, like blogs, tutorials, and some official websites. But in general, they just sit in my bookmark folders in Chrome.
If I use RSS to follow their new content, I will always have something to read.</p> <h3 id="solution">Solution</h3> <p>My solution is quite simple, divided into three steps:</p> <ul> <li>register an account on <a href="https://www.inoreader.com/">Inoreader</a>, which is an RSS reader.</li> <li>add an extension to Chrome: <em><a href="https://chrome.google.com/webstore/detail/rss-reader-extension-by-i/kfimphpokifbjgmjflanmfeppcjimgah">RSS Reader Extension (by Inoreader)</a></em>. If a website has a web feed (supports RSS), you can subscribe to it by clicking this extension.</li> <li>read my subscriptions on the Inoreader website or in the Inoreader application on my phone.</li> </ul> <p>In this way, I can subscribe to anything that attracts me and read the new content when I am bored.</p> <p>By the way, feel free to subscribe to my blog via RSS!</p> Tue, 02 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/RSS/ Misc Mesh Side-Channel Attack <h1 id="mesh-side-channel-attack">Mesh Side-Channel Attack</h1> <h2 id="introduction">Introduction</h2> <p>In this blog, I will briefly introduce a research project done by our group, which we have submitted to a top conference. You can find further details in <a href="/files/MeshUp.pdf">our paper</a>.</p> <p>By accessing cachelines to create directed data flows, we can congest a router of the mesh interconnect on a server-grade CPU, where we obtain a stable delay. If another program accesses memory at the same time and its cachelines are transferred through the router we congested, we observe a higher delay. By recording all the delays, our attack is able to deduce the victim program’s secret information, for example, an RSA private key.
We attacked a Java program running on the JVM and captured its square-and-multiply sequences.</p> <p>Our attack consists of three parts:</p> <ul> <li>Reverse engineering of the mesh NoC topology.</li> <li>Implementing point-to-point mesh accesses to congest the interconnect.</li> <li>Recording the logs and deducing the secrets.</li> </ul> <h2 id="reverse-engineering">Reverse Engineering</h2> <p>Take the Xeon 8260 (our experimental environment) as an example: there are 28 tiles on the CPU chip, and each tile has core and uncore components. The CHA in the uncore is responsible for serving LLC accesses from the cores and managing the LLC slice on its tile. <!-- ![](/images/posts/Mesh_Attack/Xeon_layout.png) --> <img src="/images/posts/Mesh_Attack/Xeon_layout.png" style="zoom:80%" /> To create point-to-point congestion, we need to learn the mapping relationships among tiles, CHAs, and cores. First, 4 tiles of the 8260 were disabled after production. We can confirm which 4 tiles are disabled by reading the MSR register CAPID6. On our machine, bits 2, 3, 21, and 27 are 0, which means these four tiles are disabled. As a result, the layout is as follows: <img src="/images/posts/Mesh_Attack/Tile.png" alt="" /></p> <p>By the way, tile and CHA IDs grow from top to bottom and left to right, so CHA 2 is in tile 4. In this way, we can map the CHA ID to the tile ID for all CHAs. We also need the relationship between core IDs (the physical core ID in the OS) and CHA IDs. We found this information can be obtained from a PMU event, <em>LCORE_PMA GV</em> (Core Power Management Agent Global system state Value). First, we bind a process to a core (ID=X) and perform a lot of operations in that process (e.g., accessing a large volume of memory). At the same time, we monitor the <em>LCORE_PMA GV</em> counter of every CHA. We observe that the counter on one CHA (ID=Y) is higher than on the others.
So we can confirm that core X and CHA Y lie on the same tile, because the activity of core X changes the power management state of CHA Y. Repeating the above procedure for cores 0 through 23, we learn the mapping between cores and CHAs shown in the following picture. <img src="/images/posts/Mesh_Attack/CHA_CORE.png" alt="" /></p> <h2 id="point-to-point-mesh-interconnect-congestion">Point to point mesh interconnect congestion</h2> <!-- ![](/images/posts/Mesh_Attack/Cache_access.png) --> <p><img src="/images/posts/Mesh_Attack/Cache_access.png" style="zoom:60%" /> As the picture shows, the LLC is non-inclusive and shared, while the L1/L2 caches are private to each core. So a core can access LLC slices on any tile. Each LLC slice is managed by the CHA on its tile, and a hash algorithm determines the CHA ID that manages a specific cacheline. By the way, the input of this hash algorithm is bits 6 to 63 of the physical address. <img src="/images/posts/Mesh_Attack/Associative.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Associative.png) --></p> <p>We devise an eviction-based method, L2-evict, to generate the memory access flow, which is similar to the concurrent work <a href="https://arxiv.org/abs/2103.03443">Lord of the Ring(s)</a>. Suppose we want to congest the interconnect between core R and CHA T. First, we find an EV (eviction set): cachelines in one EV will map to one set of core R’s L2 cache and will be managed by CHA T. To find the EV for a specific LLC slice, we use the <em>check_conflict</em> and <em>find_EV</em> functions from <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=8835325&amp;tag=1">Attack Directory</a>. Such information can also be obtained from PMU events.
To get cachelines that map to one specific L2 cache set, we use bits 6-15 of the physical address.</p> <p>If we access the cachelines in the above EV, they will be evicted from L2 to the LLC and then reloaded from the LLC to L2 (the L2’s eviction policy is pseudo-LRU). On our machine, an L2 set has 16 ways and an LLC set has 11 ways. To avoid cachelines in the LLC being evicted to memory, which would introduce a higher delay, we set bit 16 to 0 for half of the EV’s cachelines and to 1 for the other half. In this way, the EV is spread over 2 LLC sets. According to our tests, setting the number of cachelines in the EV to 24 maximizes the congestion of the mesh interconnect.</p> <p><img src="/images/posts/Mesh_Attack/Mapping.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Mapping.png) --></p> <h2 id="recording-and-analysis">Recording and Analysis</h2> <p>We access 20 EVs and record an <em>rdtscp</em> timestamp each time (in fact, we access 10 EVs, each of them twice). From the gaps between <em>rdtscp</em> timestamps, we can infer the program’s secret information.</p> <p>For example, when a Java program running on the JVM decrypts an RSA-encrypted message with a private key, it calls the <em>modPow()</em> method of the JDK’s BigInteger class, which adopts a sliding-window algorithm. Our attack is able to capture the square-and-multiply sequences of the sliding-window algorithm. As the following picture shows, we can capture 3 kinds of memory access patterns. For instance, in pattern <em>B</em> we observe square operations directly, and we can then deduce the multiply operations from the gaps between the captured square operations. By applying the <a href="https://eprint.iacr.org/2017/627.pdf">SRID algorithm</a>, we can recover about 30% of the private key bits.
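</p>
<p>The EV selection described above can be sketched as follows (a simplified illustration under the stated bit positions; <code>build_ev</code> and the candidate pool are hypothetical, and a real implementation must also check the CHA slice hash so that all lines are managed by the target CHA):</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EV_SIZE 24

/* L2 set index: bits 6-15 of the physical address (as described above). */
static unsigned l2_set(uint64_t pa) { return (pa >> 6) & 0x3ff; }

/* Pick EV_SIZE candidate lines that map to `set`, half with bit 16
 * clear and half with it set, so the EV spreads across two LLC sets. */
size_t build_ev(const uint64_t *cand, size_t n, unsigned set, uint64_t *ev) {
    size_t got = 0, low = 0, high = 0;
    for (size_t i = 0; i < n && got < EV_SIZE; i++) {
        if (l2_set(cand[i]) != set) continue;
        int b16 = (cand[i] >> 16) & 1;
        if (b16 == 0 && low < EV_SIZE / 2)       { ev[got++] = cand[i]; low++; }
        else if (b16 == 1 && high < EV_SIZE / 2) { ev[got++] = cand[i]; high++; }
    }
    return got;   /* number of lines collected; EV_SIZE on success */
}
```

<p>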
<img src="/images/posts/Mesh_Attack/Pattern.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Pattern.png) --></p> Fri, 08 Oct 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/10/Mesh_attack/ Research The Network-On-Chip Structure of Skylake and Congestion Monitoring <h1 id="the-network-on-chip-structure-of-skylake-and-congestion-monitoring">The Network-On-Chip Structure of Skylake and Congestion Monitoring</h1> <p>If we want to understand the functions and behaviors of the mesh network in Skylake, one way is via <a href="http://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf">PMON</a>. We can read the counters of specific events through PMON to infer the inner state of the CPU. For example, if we monitor the event HORZ_RING_BL_IN_USE and read the corresponding counters, it tells us for how many uncore cycles the horizontal BL (block) ring is in use. One of our aims is to characterize the degree of congestion by counting PMON events, many of which are related to congestion, but unfortunately Intel did not explain these events clearly. However, if we know some of the design ideas behind the Skylake Network-On-Chip (NoC) structure, especially the design of the routers and the flow control functions, we are able to learn more from these events.</p> <p><img src="/images/posts/Skylake_NOC/mesh.png" alt="" /></p> <h2 id="the-router">The Router</h2> <p>From a macro perspective, the NoC of Skylake is a mesh network. The routing algorithm is Y-X routing, which avoids deadlock and is easy to implement: data travels along the vertical ring first and then along the horizontal ring. The Common Mesh Stop (CMS) is effectively the router of the mesh and connects the rings in four directions. The picture below shows the CMS in the PMON document (Ref. 1). It has two agents with partly different functions, which can transfer data from the AD, AK, BL, and IV rings.
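</p>
<p>The Y-X routing rule mentioned above is simple enough to sketch (a toy model with hypothetical (row, column) tile coordinates, ignoring ring arbitration and buffering):</p>

```c
#include <assert.h>

/* Toy Y-X routing: a packet first travels along the vertical ring to the
 * destination row, then along the horizontal ring to the destination
 * column. Returns the total hop count. */
int yx_route(int src_row, int src_col, int dst_row, int dst_col) {
    int hops = 0;
    while (src_row != dst_row) {                  /* vertical ring first */
        src_row += (dst_row > src_row) ? 1 : -1;
        hops++;
    }
    while (src_col != dst_col) {                  /* then horizontal ring */
        src_col += (dst_col > src_col) ? 1 : -1;
        hops++;
    }
    return hops;
}
```

<p>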
The AD, AK, and BL rings each have two directions, but the IV ring has only one. <img src="/images/posts/Skylake_NOC/cms.png" alt="" /> We can see this from the descriptions of the unit masks for TxR_VERT_CYCLES_FULL. <img src="/images/posts/Skylake_NOC/cycles_full.png" alt="" /> By the way, among the PMON events, Egress has both vertical and horizontal descriptions while Ingress has only the horizontal one, and the EGR area is about twice that of IGR. So we guess that the Egress buffers have a dedicated buffer for each direction, while the Ingress buffer stores all incoming packets.</p> <p>To understand the microarchitecture, one possible source is a paper written by Intel (Ref. 3), published around the time Skylake was designed. The router’s design is as follows. <img src="/images/posts/Skylake_NOC/modex.png" alt="" /></p> <h2 id="flow-control">Flow Control</h2> <p>According to the descriptions of the PMON events, the flow control of Skylake is lossless and, specifically, credit-based.</p> <h2 id="congestion-moniter">Congestion Monitor</h2> <p>We can infer the congestion state from many CMS events, including the following:</p> <ul> <li>RING_IN_USE: the uncore cycles during which the rings are in use</li> <li>NACK: no response received when the CMS sends messages</li> <li>BYPASS: packets that bypass the CMS Ingress or Egress buffer</li> <li>ADS: the Anti-Deadlock Slot was used</li> <li>SINK_STARVED: packets discarded due to starvation</li> <li>STALL: stalled due to a lack of credits</li> </ul> <p>The experimental evaluation can be found in our <a href="https://arxiv.org/pdf/2103.04533.pdf">paper</a>.</p> <h2 id="reference">Reference</h2> <ol> <li><a href="http://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf">Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring</a></li> <li><a href="https://slideplayer.com/slide/14268395/">Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors</a></li> <li>MoDe-X:
Microarchitecture of a Layout-Aware Modular Decoupled Crossbar for On-Chip Interconnects, IEEE Transactions on Computers, Vol. 63, No. 3, March 2014, p. 622.</li> <li><a href="https://patents.google.com/patent/US20150006776">Intel’s patent on on-chip mesh interconnects</a></li> <li><a href="https://patents.google.com/patent/US20170019350A1/en">Intel’s patent on shared mesh</a></li> <li><a href="https://stackoverflow.com/questions/50077189/skylake-and-newer-ring-bus">CMS answer on Stack Overflow</a></li> <li><a href="https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture">Mesh principles</a></li> </ol> Wed, 03 Mar 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/03/Skylake_NOC_functions/ Research Invisible Probe <h1 id="invisible-probe-timing-attacks-with-pcie-congestion-side-channel">Invisible Probe: Timing Attacks with PCIe Congestion Side-channel</h1> <h2 id="introduction">Introduction</h2> <p>In this blog, I will introduce a research project done by our group, <a href="https://www.ieee-security.org/TC/SP2021/program-papers.html"><em>Invisible Probe: Timing Attacks with PCIe Congestion Side-channel</em></a>. <a href="http://homepage.fudan.edu.cn/zz113/">My supervisor</a> led this project, and my contribution is mainly in the experimental part, which involved some exploration. The details are in the paper.</p> <h2 id="attack-surface---pcie-peripheral-component-interconnect-express-link">Attack Surface - PCIe (Peripheral Component Interconnect express) Link</h2> <p>This is the first work to focus on a side-channel attack through congestion in PCIe. If an attacker creates congestion on a PCIe link, she may perceive what is being transferred over that link.
We identify 2 threat scenarios and test them by designing 4 specific experiments.</p> <h2 id="two-threat-scenarios">Two Threat Scenarios</h2> <p><img src="/images/posts/inv_probe/topology.png" alt="" /></p> <h3 id="pch-nvme-ssd--nic">PCH: NVMe SSD &amp; NIC</h3> <p>The PCH (Platform Controller Hub) was designed to connect multiple relatively slow devices, like hard disks, sound cards, and NICs. We assume that an NVMe SSD and a NIC are both connected through the PCH, so the attacker can repeatedly access the SSD via <a href="https://spdk.io/">SPDK</a> to congest the PCH and log the interval between every 2 accesses. If a victim is browsing a website at the same time, the traffic transferred back through the NIC will increase the intervals logged by our attacker. By training deep learning models on the different logs, we can distinguish the different websites the victim is browsing.</p> <h3 id="pcie-switch-rdma-nic--gpu">PCIe Switch: RDMA NIC &amp; GPU</h3> <p>A PCIe switch allows several devices to share one interface offered by the CPU. We assume that an RDMA NIC and a GPU are connected through a PCIe switch. The attacker repeatedly accesses memory through another machine’s RDMA NIC, so traffic transferred across the PCIe switch can be discovered.
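</p>
<p>In both scenarios the raw signal is the sequence of probe intervals; a minimal detector might just flag intervals well above the idle baseline (my own simplification with a hypothetical threshold factor; the paper instead feeds the raw interval traces into deep learning models for most experiments):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Flag probe intervals that exceed baseline * factor, i.e. moments when
 * the shared link looks congested by victim traffic. Returns how many
 * intervals were flagged and stores their indices in out. */
size_t flag_congested(const double *intervals, size_t n,
                      double baseline, double factor, size_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (intervals[i] > baseline * factor)
            out[m++] = i;
    return m;
}
```

<p>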
The victim’s GPU-related activities, like browsing websites, training models, and typing passwords, can be perceived.</p> <h2 id="four-specific-experiment">Four Specific Experiments</h2> <table> <thead> <tr> <th>NO</th> <th>Congested at</th> <th>Attacker operates on</th> <th>Victim operates on</th> <th>Information stolen</th> </tr> </thead> <tbody> <tr> <td>A</td> <td>PCH</td> <td>NVMe SSD</td> <td>NIC</td> <td>Websites</td> </tr> <tr> <td>B</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Websites</td> </tr> <tr> <td>C</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Trained models</td> </tr> <tr> <td>D</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Password keystrokes</td> </tr> </tbody> </table> <p>In the 4 experiments above, we first use our attack scripts to create congestion while logging the access time intervals, and then recover information in 2 ways. For A, B, and C, we collect enough data and train deep learning models. For D, we simply extract the keystrokes from the logs. When the victim types a password in Chrome, the intervals are as in Fig. 3, where the red stars are the keystrokes related to the password. <img src="/images/posts/inv_probe/strokes.png" alt="" /></p> <!-- + Congest PCH with NVMe to distinguish website + Congest PCIe switch with RDMA NIC to distinguish websites + Congest PCIe switch with RDMA NIC to distinguish trained models + Congest PCIe switch with RDMA NIC to distinguish password keystrokes --> Tue, 02 Mar 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/03/invisible_probe/ Research LITE Kernel RDMA <h1 id="paper-read-lite-kernel-rdma">Paper Read: LITE Kernel RDMA</h1> <p>Next week I’ll give a presentation in <em>Advanced Network</em>, a graduate course. Our teacher provided a list of papers on computer networking, from which we each choose a paper to present and introduce in class.
The paper I chose is the 2017 <em><a href="https://www.sigops.org/s/conferences/sosp/2017/program.html">SOSP</a></em> paper <em><a href="https://cseweb.ucsd.edu/~yiying/LITE-sosp17.pdf">LITE Kernel RDMA Support for Datacenter Applications</a></em>. Here are some important points of their work.</p> <h2 id="the-abstraction-mismatch">The Abstraction Mismatch</h2> <p><img src="/images/posts/LITE_RDMA/Native_RDMA.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/Native_RDMA.png) --></p> <p>As the picture shows, with native RDMA the programmer writes code against the libraries provided by the RNIC hardware, totally bypassing the kernel. So native RDMA offers low-level, difficult-to-use APIs, while what developers want are high-level, easy-to-use APIs. Hence there is an abstraction mismatch.</p> <p>Things worked well in HPC (High-Performance Computing), which has special hardware, few applications, and relatively cheap developer effort. In datacenters, by contrast, we have commodity, cheaper hardware and handle a lot of changing applications. Resource sharing and isolation are also problems in this scenario.</p> <p>Hence, things get very complicated when trying to use native RDMA in datacenters.</p> <h2 id="what-this-paper-do-in-general">What This Paper Does In General</h2> <p>This paper adds an indirection tier in the Linux kernel to support RDMA operations at the OS (operating system) level. The name <em>LITE</em> comes from “Local Indirection TiEr”: as it suggests, the designers add one kernel layer on the local node only; the remote side is the same as in native RDMA. With the support of the OS, LITE can provide high-level APIs to userspace, so applications become simpler.
<em>LITE</em> also on-loads the <em>permission check</em> and <em>address mapping</em> operations into the kernel, so <em>LITE</em> needs simpler hardware.</p> <h2 id="design-and-abstraction-principles">Design and Abstraction Principles</h2> <p>If we want features like <em>high-level abstraction</em>, <em>resource sharing</em>, <em>performance isolation</em>, and <em>protection</em>, one easy way is to use the kernel. And as Butler Lampson says, “All problems in computer science can be solved by another level of indirection”. So what <em>LITE</em> does is add an indirection layer in the kernel. <em>LITE</em> is built on RDMA verbs, so it is easy to support different hardware. By the way, verbs are just low-level descriptions of RDMA, not APIs.</p> <p>They list three design principles:</p> <ol> <li>Indirection only at the local side for one-sided RDMA</li> <li>Avoid hardware indirection</li> <li>Hide kernel cost</li> </ol> <p>To avoid the existing hardware indirection, the authors find an API that can register physical addresses in the kernel. In this way, there is no need to cache the PTEs. <em>LITE</em> registers the whole memory at once and manages it in the kernel, so we only need to store one pair of global keys in RNIC SRAM.</p> <h2 id="some-performance">Some Performance</h2> <p>LITE scales much better than native RDMA with respect to MR size and count.
<img src="/images/posts/LITE_RDMA/MR_SC.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/MR_SC.png) --> LITE adds only a very slight overhead even when native RDMA doesn’t have scalability issues. <img src="/images/posts/LITE_RDMA/Latency.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/Latency.png) --></p> <h2 id="in-the-end">In The End</h2> <p>In the <em><a href="https://github.com/WukLab/LITE">code</a></em> of LITE, we can see that it was written and compiled as several kernel modules, but it only supports kernel versions <em>3.11.1</em>, <em>3.10.108</em>, and <em>4.9</em>. If possible, I will read the source code and write another blog post (it’s a flag :)). But since <em>io_uring</em> has already appeared in the kernel, LITE will be harder to put into use (as my advisor says).</p> Thu, 10 Dec 2020 00:00:00 +0000 stefan1wan.github.io/about/2020/12/LITE_Kernel_RDMA/ KnowledgeShare My first blog <!-- # Hello world --> <!-- ![](/images/avatar.jpg) --> <p><img src="/images/avatar.jpg" style="zoom:30%" /></p> <h1 id="my-first-blog">My First Blog</h1> <p>This is my blog, where I share what I learn and how I think in my study and work, maybe a finished project or a recently read paper. It is also for the moments or thoughts that would be awkward to share on <em>Moments</em> or <em>Weibo</em>. Besides, I believe that writing is a good way to test whether you really understand something.
Hope it will be a long journey.</p> Wed, 02 Dec 2020 00:00:00 +0000 stefan1wan.github.io/about/2020/12/start_blog/ Misc GDB Basic Commands <h1 id="gdbbasic-commands">GDB–Basic Commands</h1> <p>Here is a simple GDB tutorial I wrote a while ago.</p> <h3 id="gdb">GDB</h3> <ul> <li>gdb level1: use gdb to debug the binary level1</li> <li>run: execute the binary</li> <li>disas f_A: disassemble function f_A</li> <li>break *0xdeadbeef: set a breakpoint at address 0xdeadbeef</li> <li>info breakpoints: list all breakpoints</li> <li>info registers: check the state of the registers</li> <li>x/wx address: examine the contents at address <ul> <li>w can be b/h/w/g for 1/2/4/8 bytes</li> <li>x/100wx: show 100 four-byte words at a time</li> <li>the second x can be u/d/s/x/i (determines how the memory is displayed) <ul> <li>u: unsigned int</li> <li>d: show as a decimal number</li> <li>x: show as a hexadecimal number</li> <li>s: show as strings</li> <li>i: show as instructions</li> </ul> </li> </ul> </li> <li>ni: execute the next instruction, stepping over calls (if it is a call, run until it returns)</li> <li>si: execute the next instruction, stepping into calls (if it is a call, the function’s first instruction executes next)</li> <li>backtrace: show all the stack frames of the call chain</li> <li>continue: run the process until it ends, crashes, or hits a breakpoint</li> <li>set *address = value <ul> <li>sets 4 bytes at address</li> <li>use char, short, int, or long to write 1, 2, 4, or 8 bytes</li> <li>e.g. set {int}0x80408000 = 666</li> </ul> </li> <li>attach [pid]: attach to a running process</li> </ul> <h3 id="programs-compiled-with-debug-symbols">programs compiled with debug symbols</h3> <ul> <li>list: list the source code</li> <li>b [line]: add a breakpoint at a source line number</li> <li>info locals: list local variables</li> <li>print var: print the value of a variable</li> </ul> <h3 id="gdb-peda">GDB-peda</h3> <ul> <li>checksec: check the protection mechanisms of the binary</li> <li>elfsymbol: get all PLT addresses (useful for ROP)</li> <li>vmmap: check all memory segments and their permissions (read, write, execute)</li> <li>readelf: check the positions of important ELF data structures (.plt, .plt.got, .bss)</li> <li>find /bin/sh: find the address of the string “/bin/sh”</li> </ul> Thu, 31 Oct 2019 00:00:00 +0000 stefan1wan.github.io/about/2019/10/GDB_Basic_Commands/ Tutorial