<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://gilli.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gilli.dev/" rel="alternate" type="text/html" /><updated>2025-08-28T10:35:59+00:00</updated><id>https://gilli.dev/feed.xml</id><title type="html">Christian’s Dev Notes</title><subtitle>This is my personal website where I mostly write about programming stuff and other things that I find interesting.</subtitle><entry><title type="html">Why I Ditched malloc for AI Inference</title><link href="https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html" rel="alternate" type="text/html" title="Why I Ditched malloc for AI Inference" /><published>2025-08-28T00:00:00+00:00</published><updated>2025-08-28T00:00:00+00:00</updated><id>https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc</id><content type="html" xml:base="https://gilli.dev/programming/2025/08/28/why-i-ditched-malloc.html"><![CDATA[<p>In this post, I’ll show the impact poor memory management had on <a href="https://github.com/nirw4nna/dsc">DSC</a>, a tensor library I wrote from scratch in C++ and Python. I’ll then show how I fixed this by implementing a general purpose memory allocator from scratch.
The goal here is not to implement something that is state-of-the-art or novel but rather to show what I went through when working on DSC and what I learned about memory management in the process.</p>

<blockquote>
  <p>Note: all the tests in this post were done on my workstation with an AMD Ryzen 9 3900X CPU, 64GB of RAM, and an NVIDIA RTX 3060 Ti GPU running Linux Mint 22.1 (Linux 6.8).</p>
</blockquote>

<h3 id="the-problem">The Problem</h3>
<p>During a single forward pass of Qwen2.5 0.5B, DSC makes over 2,400 tensor allocations and deallocations. Below is a Perfetto trace obtained by running DSC with tracing enabled <a href="#note1"><sup> 1 </sup></a>. Here, each pink spike is either a tensor allocation or a deallocation, ranging from a few microseconds to hundreds of microseconds and, in very few cases, even milliseconds.</p>

<p><img src="/assets/images/why-i-ditched-malloc/alloc_naive_perfetto.png" alt="Perfetto trace of a single forward step of Qwen2.5 0.5B" /></p>

<p>This overhead compounds quickly but, more importantly, this <em>unpredictability</em> can lead to random performance degradation. For example, at a generation speed of roughly 100 tok/s you have around 10ms to produce each token; a single 1ms spike in your memory allocator can then cost you 10% of your end-to-end performance.</p>

<p>Here’s how I eliminated all the <code class="language-plaintext highlighter-rouge">malloc</code> calls from the hot-path of DSC and how this led to better and more predictable performance.</p>

<h3 id="the-naive-approach">The Naive Approach</h3>
<p>When it comes to manually managing tensors, the most direct approach is to naively call <code class="language-plaintext highlighter-rouge">malloc</code> and <code class="language-plaintext highlighter-rouge">free</code> whenever memory is required. If we want to deal with tensors on the GPU, this also involves calls to <code class="language-plaintext highlighter-rouge">cudaMalloc</code> and <code class="language-plaintext highlighter-rouge">cudaFree</code> or their equivalents. We also probably need a mechanism for sharing the underlying data buffer between tensors (see: <a href="#appendix-a---reference-counting">Appendix A - Reference Counting</a>); this could mean adding another object to the mix that wraps the raw data pointer and is reference-counted.</p>

<p>One way of modelling our data is like this:</p>

<p><img src="/assets/images/why-i-ditched-malloc/tensor_dsc.png" alt="How a tensor is modeled in DSC" /></p>

<p>When it’s time to allocate a new tensor this is the workflow we expect:</p>
<ol>
  <li><code class="language-plaintext highlighter-rouge">malloc</code> the tensor descriptor - the object that holds the tensor attributes like shape, stride and dtype</li>
  <li><code class="language-plaintext highlighter-rouge">malloc</code> the data buffer - the object that wraps the actual data pointer and is reference-counted</li>
  <li><code class="language-plaintext highlighter-rouge">cudaMalloc</code> the actual data block that lives on the device</li>
</ol>

<p>Freeing follows the same steps, just in reverse order.</p>
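<p>The three steps above can be sketched in plain C++. The names below are illustrative stand-ins, not the actual DSC API, and the data block uses <code>malloc</code> where a GPU tensor would use <code>cudaMalloc</code>/<code>cudaFree</code>:</p>

```cpp
#include <cstdlib>

// Hypothetical sketch of the naive workflow: three separate heap
// allocations per tensor (descriptor, refcounted buffer, data block).
struct naive_buffer {
    void *data;   // the actual data block (step 3)
    size_t size;
    int refs;     // reference counter shared between views
};

struct naive_tensor {
    int shape[4];
    int n_dim;
    naive_buffer *buf;
};

naive_tensor *naive_tensor_new(const int *shape, int n_dim, size_t dtype_size) {
    // Step 1: malloc the tensor descriptor
    naive_tensor *t = (naive_tensor *) malloc(sizeof(naive_tensor));
    size_t ne = 1;
    t->n_dim = n_dim;
    for (int i = 0; i < n_dim; ++i) {
        t->shape[i] = shape[i];
        ne *= (size_t) shape[i];
    }
    // Step 2: malloc the reference-counted data buffer
    t->buf = (naive_buffer *) malloc(sizeof(naive_buffer));
    t->buf->size = ne * dtype_size;
    t->buf->refs = 1;
    // Step 3: allocate the actual data block (cudaMalloc on device)
    t->buf->data = malloc(t->buf->size);
    return t;
}

void naive_tensor_free(naive_tensor *t) {
    // Same steps in reverse: data block, data buffer, then descriptor
    if (--t->buf->refs == 0) {
        free(t->buf->data);   // cudaFree on device
        free(t->buf);
    }
    free(t);
}
```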

<p><img src="/assets/images/why-i-ditched-malloc/tensor_alloc_steps.png" alt="The steps required to allocate a new tensor in DSC" /></p>

<p>This pattern has to repeat, once to allocate and once to free, for every intermediate tensor during inference: matmuls need an output buffer, attention mechanisms spawn dozens of intermediate tensors for each layer, and so on.</p>

<p>This can lead to a general performance degradation that is not easy to track<a href="#note2"><sup> 2 </sup></a> and, most importantly, not easy to fix, because fixing it requires a complete redesign of how memory is managed in the entire application.</p>

<p>During my testing with DSC I found that memory allocations and deallocations account for 20-25% of inference time. This is pure overhead with no computational value. This number varies greatly depending on the size and complexity of the model (more layers = more allocations; more complex operations = more intermediates = more allocations), but it also depends on how optimized everything else is: as kernels get faster, allocations account for an even larger share of inference time, becoming the real bottleneck.</p>

<p>Clearly, we can do better.</p>

<h3 id="step-1-identifying-predictable-patterns">Step 1: Identifying Predictable Patterns</h3>
<p>The key insight that made me change how I do memory allocation in DSC is that AI inference has very predictable memory patterns.</p>

<p>Unlike general-purpose applications that allocate wildly different object types and sizes, neural networks follow a simple template:</p>
<ul>
  <li><strong>Fixed model parameters</strong>: a few hundred to a few thousand tensors loaded once at startup and never freed</li>
  <li><strong>Predictable intermediates</strong>: roughly the same temporary tensors created and destroyed in the same order every forward pass</li>
  <li><strong>Large allocations</strong>: the unit of allocation is the tensor and that’s it. This means memory gets allocated in large chunks, so <a href="https://en.wikipedia.org/wiki/Fragmentation_(computing)">fragmentation</a> isn’t a major concern</li>
</ul>

<p>So, instead of asking the operating system for memory during inference, which is my hot path, I decided to do things a bit differently: allocate everything I need upfront and manage memory myself.</p>

<h3 id="step-2-leveraging-predictable-patterns">Step 2: Leveraging Predictable Patterns</h3>
<p>Instead of dynamic allocations everywhere, I switched to static allocations everywhere.
At startup I allocate three things:</p>
<ol>
  <li>A static array of tensor descriptors - the objects that store the tensor attributes like shape, stride and dtype</li>
  <li>A static array of data buffers - the objects that wrap the raw data pointer and have a reference counter</li>
  <li>A large memory pool - one for each device (CPU, GPU) for the actual tensor data</li>
</ol>

<p>This is what this looks like for the tensor descriptors:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">dsc_ctx</span> <span class="p">{</span>
    <span class="n">dsc_device</span> <span class="o">*</span><span class="n">devices</span><span class="p">[</span><span class="n">DSC_MAX_DEVICES</span><span class="p">];</span>
    <span class="n">dsc_tensor</span> <span class="o">*</span><span class="n">tensors</span><span class="p">[</span><span class="n">DSC_MAX_OBJS</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>And this one is for the data buffers and memory pool:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">dsc_device</span> <span class="p">{</span>
    <span class="n">dsc_data_buffer</span> <span class="n">nodes</span><span class="p">[</span><span class="n">DSC_MAX_OBJS</span><span class="p">];</span>
    <span class="c1">// Actual pointer to device-allocated memory (ie. via cudaMalloc on NVIDIA GPUs)</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">device_mem</span><span class="p">;</span>
    <span class="c1">// Alignment requirements</span>
    <span class="n">usize</span> <span class="n">alignment</span><span class="p">;</span>
    <span class="c1">// Extra device-specific information (ie. device name if GPU)</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">extra_info</span><span class="p">;</span>
    <span class="c1">// How much memory this device has and how much of that is being used</span>
    <span class="n">usize</span> <span class="n">mem_size</span><span class="p">,</span> <span class="n">used_mem</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Here’s the core insight: tensor descriptors are tiny objects that contain only a few attributes, and there are only a few hundred to a couple of thousand of them active at any given time, so I can simply allocate all of them in a static array.
The same goes for data buffers: they are even smaller (just one pointer and two integers for the size and the number of active references), so it makes even more sense to allocate them as an array <a href="#note3"><sup> 3 </sup></a>.</p>
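<p>Here is a sketch of what allocating a data buffer could look like under this design. The field and function names are my guesses rather than DSC’s actual code: a slot with a zero reference count is free, and the bytes come out of the per-device pool, reduced here to a bump pointer standing in for the real free-list allocator:</p>

```cpp
#include <cstddef>

// Hedged sketch of a data-buffer allocation (not the actual DSC code).
constexpr int MAX_OBJS = 1000;

struct data_buffer {
    void *data;
    size_t size;
    size_t refs; // 0 marks the slot as free
};

struct device {
    data_buffer nodes[MAX_OBJS];
    char pool[1 << 20]; // stand-in for the cudaMalloc'd device pool
    size_t used;
};

void *pool_alloc(device *dev, size_t size) {
    // Bump allocation as a placeholder for the real free-list search
    void *ptr = dev->pool + dev->used;
    dev->used += size;
    return ptr;
}

data_buffer *data_alloc(device *dev, size_t size) {
    // Linear scan over a small static array, just like the descriptors
    for (int i = 0; i < MAX_OBJS; ++i) {
        data_buffer *buf = &dev->nodes[i];
        if (buf->refs == 0) {
            buf->data = pool_alloc(dev, size);
            buf->size = size;
            buf->refs = 1;
            return buf;
        }
    }
    return nullptr; // every slot is taken
}
```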

<p>The actual data is managed through a simple memory pool per device. Instead of individual calls to <code class="language-plaintext highlighter-rouge">cudaMalloc</code>, I allocate one big chunk of GPU memory<a href="#note4"><sup> 4 </sup></a> at startup and manage it with a free-list allocator (see: <a href="#appendix-b---the-free-list-allocator">Appendix B - The Free-List Allocator</a>).</p>

<p>This also has the added benefit of securing all the resources needed upfront: if there isn’t enough memory available the program won’t even start and will point this out clearly. This automatically gets rid of random out-of-memory errors during inference.</p>

<h4 style="font-style: italic; font-weight: normal;">Allocation Becomes a Linear Scan</h4>

<p>With this setup “creating” a tensor becomes trivial:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dsc_tensor</span> <span class="o">*</span><span class="nf">find_empty_tensor</span><span class="p">(</span><span class="n">dsc_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span> <span class="p">{</span>  
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">DSC_MAX_OBJS</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">dsc_tensor</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">tensors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="n">dsc_tensor_is_free</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
        <span class="p">}</span>	
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">dsc_tensor</span> <span class="o">*</span><span class="n">dsc_new_tensor</span><span class="p">(</span><span class="n">dsc_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                           <span class="k">const</span> <span class="kt">int</span> <span class="n">n_dim</span><span class="p">,</span>
                           <span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">shape</span><span class="p">,</span>
                           <span class="k">const</span> <span class="n">dsc_dtype</span> <span class="n">dtype</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">ne</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_dim</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="n">ne</span> <span class="o">*=</span> <span class="n">shape</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
	
    <span class="c1">// Step 1: allocate the tensor descriptor </span>
    <span class="n">dsc_tensor</span> <span class="o">*</span><span class="n">new_tensor</span> <span class="o">=</span> <span class="n">find_empty_tensor</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
    <span class="n">DSC_ASSERT</span><span class="p">(</span><span class="n">new_tensor</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
	
    <span class="c1">// Step 2 &amp; 3: allocate the data buffer and the actual data</span>
    <span class="n">new_tensor</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="n">dsc_data_alloc</span><span class="p">(</span><span class="n">dsc_get_device</span><span class="p">(</span><span class="n">device</span><span class="p">),</span>
                                     <span class="n">ne</span> <span class="o">*</span> <span class="n">DSC_DTYPE_SIZE</span><span class="p">[</span><span class="n">dtype</span><span class="p">]);</span>
	
    <span class="c1">// Fill tensor descriptor with shape, stride, dtype, etc...</span>

    <span class="k">return</span> <span class="n">new_tensor</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That is: iterate through a (<strong>small</strong>) linear array, return the first tensor descriptor that is not currently in use, grab a chunk from the memory pool, and return.
Freeing a tensor is even simpler: notify the device that the memory for this tensor is no longer needed and mark its descriptor as available.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">dsc_tensor_free</span><span class="p">(</span><span class="n">dsc_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">dsc_tensor</span> <span class="o">*</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">dsc_data_free</span><span class="p">(</span><span class="n">dsc_get_device</span><span class="p">(</span><span class="n">x</span><span class="o">-&gt;</span><span class="n">device</span><span class="p">),</span> <span class="n">x</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">dsc_tensor_set_free</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The memory pool itself uses reference counting, so the actual chunk of data is freed only when the counter goes to 0. When that happens the chunk is placed back into the list of free blocks by the allocator.</p>
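<p>A simplified sketch of that gate, with names invented for illustration rather than taken from DSC: only the last release hands the chunk back to the allocator, and a counter stands in for the free-list insertion:</p>

```cpp
#include <cstddef>

// Simplified sketch (not the actual DSC code) of how the reference
// counter gates the release of a pool chunk.
struct ref_buffer {
    void *data;
    size_t size;
    size_t refs;
};

static int chunks_returned = 0; // stand-in for the free-list insertion

void pool_return_chunk(void *, size_t) { ++chunks_returned; }

void buffer_retain(ref_buffer *buf) { ++buf->refs; }

void buffer_release(ref_buffer *buf) {
    if (--buf->refs == 0) {
        // Last reference gone: hand the chunk back to the allocator
        pool_return_chunk(buf->data, buf->size);
        buf->data = nullptr; // the slot can now be reused
    }
}
```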

<p><img src="/assets/images/why-i-ditched-malloc/tensor_alloc_dsc.png" alt="How tensors are allocated in DSC" /></p>

<h3 id="the-drawbacks">The Drawbacks</h3>
<p>This approach is obviously not perfect; there are several important drawbacks to keep in mind:</p>
<ol>
  <li><strong>Higher upfront cost</strong> - all the allocations and setup need to take place during initialization, which can lead to longer startup times</li>
  <li><strong>Higher memory usage</strong> - the total amount of memory required by the system is larger than with the naive <code class="language-plaintext highlighter-rouge">malloc</code> approach and can potentially become a bottleneck in constrained systems</li>
  <li><strong>Overall increased complexity</strong> - we are trading simplicity and speed of development for more control and better performance</li>
  <li><strong>Specialization over generality</strong> - while this implementation works very well for this particular problem, it will not adapt easily to a different one. In contrast, the general approach of just using <code class="language-plaintext highlighter-rouge">malloc</code> will always work reasonably well</li>
</ol>

<p>While (1) and (2) depend on the physical constraints of your system and so need to be evaluated objectively, (3) and (4) are more subjective. For me, they are worthwhile because:</p>
<ol>
  <li>Implementing these kinds of algorithms under this very specific set of constraints is not that much more work, and the performance and control gains can be huge</li>
  <li>We’re trying to solve one specific problem, not chasing a moving target. This means we can afford a specialized solution</li>
</ol>

<p>Regardless of what you decide to do, if you’re going to implement something that requires manual memory management, I strongly suggest you wrap your <code class="language-plaintext highlighter-rouge">malloc</code> and <code class="language-plaintext highlighter-rouge">free</code> calls in a well-defined internal API. This way it will be easy to change the implementation details later (e.g. opting into something more performant like <a href="https://jemalloc.net/">jemalloc</a>). If you instead sprinkle your code with random calls to <code class="language-plaintext highlighter-rouge">malloc</code> and <code class="language-plaintext highlighter-rouge">free</code>, you’ll never even consider refactoring it because it will be too complex.</p>
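<p>Such a wrapper can be very small. This is a minimal sketch with hypothetical names (<code>xalloc</code>/<code>xfree</code> are not DSC functions): routing every allocation through one chokepoint makes it easy to later swap in a different allocator, add tracing, or centralize out-of-memory handling without touching any call site:</p>

```cpp
#include <cstdio>
#include <cstdlib>

// Minimal internal allocation API (illustrative names).
void *xalloc(size_t size) {
    void *ptr = malloc(size); // the one place to change the allocator
    if (ptr == nullptr) {
        fprintf(stderr, "xalloc: out of memory (%zu bytes)\n", size);
        abort(); // or propagate an error, depending on the application
    }
    return ptr;
}

void xfree(void *ptr) {
    free(ptr);
}
```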

<h3 id="the-results">The Results</h3>
<p>The impact of this new implementation was immediate and measurable:</p>
<ul>
  <li>Faster and more predictable allocations</li>
  <li>Simpler debugging - no memory leaks or double-free issues</li>
  <li>Overall increased reliability - each call to <code class="language-plaintext highlighter-rouge">malloc</code> or <code class="language-plaintext highlighter-rouge">free</code> can potentially fail. By limiting those calls to the startup phase of our application we increase the overall reliability of the system</li>
</ul>

<p>Here’s a side-by-side comparison of the naive <code class="language-plaintext highlighter-rouge">malloc</code> approach versus the new static approach:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Naive <code class="language-plaintext highlighter-rouge">malloc</code></th>
      <th>Static Arrays + Memory Pool</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Total inference time (1 token)</strong></td>
      <td>69ms</td>
      <td>51ms</td>
    </tr>
    <tr>
      <td><strong>Total allocation overhead</strong></td>
      <td>15.7ms</td>
      <td>862us</td>
    </tr>
    <tr>
      <td><strong>Allocation overhead</strong></td>
      <td>22.7%</td>
      <td>&lt;2%</td>
    </tr>
    <tr>
      <td><strong>Average latency (alloc)</strong></td>
      <td>2.4us</td>
      <td>288ns</td>
    </tr>
    <tr>
      <td><strong>Average latency (free)</strong></td>
      <td>4us</td>
      <td>66ns</td>
    </tr>
    <tr>
      <td><strong>Worst case latency (alloc)</strong></td>
      <td>184us</td>
      <td>6us</td>
    </tr>
    <tr>
      <td><strong>Worst case latency (free)</strong></td>
      <td>1.276ms</td>
      <td>1us</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/why-i-ditched-malloc/speedup_alloc.png" alt="New vs old approach speed comparison visualized" /></p>

<h3 id="conclusion">Conclusion</h3>
<p>In this post I showed how I removed all the <code class="language-plaintext highlighter-rouge">malloc</code> and <code class="language-plaintext highlighter-rouge">free</code> calls from DSC’s hot path by implementing a general-purpose allocator from scratch that works surprisingly well.</p>

<p>The concepts we explored are not novel and are widely used: PyTorch, for example, uses <a href="https://github.com/microsoft/mimalloc">mimalloc</a> to handle CPU memory and a custom caching allocator for the GPU. Regardless, it was fun to “re-discover” something on my own and find that the same ideas (albeit with a lot more polish) are used in state-of-the-art systems like PyTorch.</p>

<p>Finally, I’d like to thank Jon and the whole team over at <a href="https://hotaisle.xyz/">HotAisle</a> for giving me access to a 1x MI300X AMD VM to develop and test DSC. Your support has been invaluable. If you’re looking for some compute to run AI workloads or just want to try out AMD cards, definitely check them out!</p>

<p>All the code discussed in this blog post and more is publicly available on <a href="https://github.com/nirw4nna/dsc">GitHub</a>.</p>

<p><br /></p>

<hr />

<p><br /></p>

<h3 id="further-resources">Further Resources</h3>
<p>There are a lot of good resources available online on the topic of memory management. The following is an incomplete list of resources I found useful when working on the allocator for DSC:</p>
<ul>
  <li>I learned a lot about memory allocation in general from this <a href="https://www.gingerbill.org/article/2019/02/01/memory-allocation-strategies-001/">series of posts by gingerBill</a>. The free-list allocator I implemented for this post is a modified version of what Bill presents in part 5.</li>
  <li><a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">A very good article</a> on the arena allocator in C by Ryan Fleury. Though DSC doesn’t use an actual arena the core lessons still apply.</li>
  <li><a href="https://github.com/ennorehling/dlmalloc">The original code</a> of <code class="language-plaintext highlighter-rouge">dlmalloc</code> by Doug Lea in plain and readable C code. A modified version of this algorithm is used in glibc.</li>
  <li><a href="https://nullprogram.com/blog/2023/09/27/">Another very good article</a> on the arena allocator by Chris Wellons.</li>
</ul>

<p><br /></p>

<hr />

<p><br /></p>

<h3 id="appendix-a---reference-counting">Appendix A - Reference Counting</h3>
<p>The term reference counting refers to a technique used to keep track of the number of active references to a resource, such as a block of memory.
It’s a very simple and commonly used technique for memory management. For example, it’s what C++ <code class="language-plaintext highlighter-rouge">std::shared_ptr</code> does to implement “automatic memory management”; it’s the main mechanism Python uses to implement garbage collection; and it’s also used extensively to track objects in operating systems.</p>
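<p>With <code>std::shared_ptr</code> the mechanism is directly observable through <code>use_count()</code>: copying the handle bumps the shared counter, destroying a copy decreases it, and the buffer is freed together with the last reference:</p>

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Reference counting as exposed by std::shared_ptr.
void refcount_demo() {
    auto data = std::make_shared<std::vector<float>>(16);
    assert(data.use_count() == 1);
    auto view = data;  // a second handle to the same buffer
    assert(data.use_count() == 2);
    view.reset();      // drop one reference, buffer stays alive
    assert(data.use_count() == 1);
}  // last reference dropped here: the buffer is actually freed
```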

<p>To understand how it works and why it’s useful when working with tensors consider this PyTorch-like snippet:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
</code></pre></div></div>
<p>here we are creating a random tensor with some data associated to it. This is our “resource” and it has a reference count of 1. Then we want to create a new tensor that is just a flattened version of the first one, without duplicating <code class="language-plaintext highlighter-rouge">x</code>. For this we create a new tensor <code class="language-plaintext highlighter-rouge">y</code> but, instead of copying the data from <code class="language-plaintext highlighter-rouge">x</code>, we just make <code class="language-plaintext highlighter-rouge">y</code> point to the same data as <code class="language-plaintext highlighter-rouge">x</code>. Now our resource has a reference count of 2.
The Python garbage collector then keeps track of <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as if they were two separate entities; the only difference is that when it’s time to free one of them, instead of freeing the actual data buffer, the reference counter is decreased. Only when the counter reaches 0 is the data actually freed.</p>

<p>Like all things, reference counting comes with some shortcomings:</p>
<ol>
  <li>It requires every object to keep one extra field for the counter, which causes overhead, especially if the objects themselves are very small.</li>
  <li>The naive version I described above doesn’t work with <em>reference cycles</em> where an object refers directly or indirectly to itself. Such objects will always have a nonzero count and so will never be freed.</li>
  <li>In a concurrent setting, all the updates to the counter must be atomic, which could introduce unwanted delays.</li>
</ol>

<p>For our specific use-case (1) is not really an issue: we only need an integer (we could even use a <code class="language-plaintext highlighter-rouge">uint8_t</code> to save space) as a counter for each data buffer, and since we only have a limited number of (usually very large) buffers, the overhead is negligible<a href="#note3"><sup> 3 </sup></a>.
(3) is not a problem either since we are not in a concurrent setting. We do use multiple threads to speed up CPU computations, but no allocation is made in the concurrent part of DSC.</p>

<p>The only real drawback is (2). But this is actually not a problem for our simple use case because we have the following constraints:</p>
<ol>
  <li>We have only 1 kind of object, the data buffer, that is reference counted</li>
  <li>Data buffers can only reference data blocks, they can’t reference other data buffers. Also, each data block can have at most 1 data buffer pointing to it.</li>
</ol>

<p>Together, these mean that it’s not possible for us to have a circular dependency where two data buffers refer to each other.</p>

<p>In general, there are more advanced techniques that can be used to prevent this issue. For example, Python has a <a href="https://docs.python.org/3/extending/extending.html#reference-counts">built-in cycle detector</a>.</p>

<p><br /></p>

<h3 id="appendix-b---the-free-list-allocator">Appendix B - The Free-List Allocator</h3>
<p>A free-list allocator is a general-purpose allocator that doesn’t impose any restrictions: it allows allocations to happen out of order and in any size. Due to its nature, the performance of this algorithm is generally worse than that of other, more specialized ones, but it is very simple and, for our purposes, a good compromise for now.</p>

<p>The idea is simple: keep a list of free contiguous blocks in memory along with their sizes. When memory is required, search through the list until you find a block where the data can fit; if you find one, mark it as used and remove its node from the free-list.
To free data, we must know the size of the block being freed and put it back into the free-list. Finally, we try to <em>coalesce</em> contiguous blocks of memory together to form larger free blocks.</p>

<p>We’ll see a free-list implementation that uses a linked-list to store the free data blocks. More efficient implementations are possible, for example using <a href="https://en.wikipedia.org/wiki/Red%e2%80%93black_tree">red-black trees</a>, but those are more complex and generally not worth it for our simple use-case.</p>

<p><img src="/assets/images/why-i-ditched-malloc/free_list.png" alt="How a free-list allocator works" /></p>

<p>To allocate a new block of memory we have to go through the list of free nodes and find one that is large enough for our block of data. If multiple blocks satisfy this requirement we have to choose a policy: if we settle for the first block that works this is called <em>first fit</em>; if instead we choose the block that fits best (i.e. the smallest available block that fits) this is called <em>best fit</em>.
The best fit approach can be a bit slower but it has the added benefit of reducing internal fragmentation.</p>
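<p>The first-fit search can be sketched like this. It is a simplified version of the idea, not DSC’s actual implementation: <code>link</code> always points at the pointer that references the current node, so unlinking the chosen block is a single assignment:</p>

```cpp
#include <cstddef>

// First-fit search over a singly linked free-list (sketch).
struct free_node {
    size_t size;
    free_node *next;
};

free_node *first_fit(free_node **head, size_t size) {
    for (free_node **link = head; *link != nullptr; link = &(*link)->next) {
        if ((*link)->size >= size) {
            free_node *found = *link;
            *link = found->next; // unlink the chosen block
            return found;
        }
    }
    return nullptr; // no block is large enough
}
```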

<p>Regardless, the time complexity for this operation is <strong><em>O(n)</em></strong>, where <strong><em>n</em></strong> is the number of elements in the free-list.</p>

<p><img src="/assets/images/why-i-ditched-malloc/free_list_alloc.png" alt="Allocating memory using a free-list allocator" /></p>

<p>Once we allocate a free block, the free-list is updated and the now-used block is removed.</p>

<p><img src="/assets/images/why-i-ditched-malloc/free_list_alloc_done.png" alt="Updating the free-list of a free-list allocator" /></p>

<p>To free a previously allocated block of memory we just need to find its header. Then, iterate over the free-list until we get to the right position in memory order (this can be done by simply comparing the address of the block we are trying to free with those of the other blocks in the list) and insert the new node there.</p>

<p>When iterating, we have to store both the previous and next free nodes so that we are able to merge contiguous free blocks if possible.</p>

<p>As for the allocation process, the time complexity for freeing a block of memory is <strong><em>O(n)</em></strong>, where <strong><em>n</em></strong> is the number of elements in the free-list, as we need to iterate over the list in order to insert the block in the right memory order.</p>
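<p>The address-ordered insertion with coalescing can be sketched as follows. This is a simplified illustration rather than DSC’s actual code: the list is kept sorted by address so a freed block and its successor can be merged when they are contiguous; merging with the predecessor works the same way and is omitted for brevity:</p>

```cpp
#include <cstddef>

// Address-ordered free-list insertion with forward coalescing (sketch).
struct fl_node {
    size_t size; // total bytes covered by this block
    fl_node *next;
};

void free_list_insert(fl_node **head, fl_node *blk) {
    // Walk until we reach the first node at a higher address
    fl_node **link = head;
    while (*link != nullptr && *link < blk) link = &(*link)->next;
    blk->next = *link;
    *link = blk;
    // Coalesce: blk ends exactly where its successor begins
    if (blk->next != nullptr &&
        (char *) blk + blk->size == (char *) blk->next) {
        blk->size += blk->next->size;
        blk->next = blk->next->next;
    }
}
```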

<p><img src="/assets/images/why-i-ditched-malloc/free_list_free.png" alt="Freeing memory using a free-list allocator" /></p>

<p>The full source code of the actual implementation is available on <a href="https://github.com/nirw4nna/dsc/blob/main/dsc/include/dsc_device.h">GitHub</a>.</p>

<p><br /></p>

<hr />

<p><br /></p>

<p><a name="note1"><sub>1.</sub></a> This trace has been obtained by compiling DSC in release mode with tracing enabled:
<code class="language-plaintext highlighter-rouge">make shared DSC_FAST=1 DSC_GPU=1 DSC_TRACING=1</code> and then running a single inference step of the Qwen2.5 0.5B model with this command:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TRACE</span><span class="o">=</span><span class="mi">2</span> <span class="n">TRACE_ALLOC</span><span class="o">=</span><span class="mi">1</span> <span class="n">python3</span> <span class="n">examples</span><span class="o">/</span><span class="n">models</span><span class="o">/</span><span class="n">qwen2_5</span><span class="p">.</span><span class="n">py</span> \
<span class="s">"Can you write a quicksort in Python?"</span> <span class="o">-</span><span class="n">n</span> <span class="mi">1</span> <span class="o">--</span><span class="n">device</span><span class="o">=</span><span class="n">gpu</span> <span class="o">--</span><span class="n">dtype</span><span class="o">=</span><span class="n">f32</span>
</code></pre></div></div>
<p>Tracing in DSC works by capturing a CPU timestamp when a function is entered and another when it exits; in this case the two functions we are looking at are <code class="language-plaintext highlighter-rouge">dsc_new_tensor</code> and <code class="language-plaintext highlighter-rouge">dsc_tensor_free</code>.</p>
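<p>The underlying mechanism can be sketched as an RAII guard that records a timestamp on construction and another on destruction. This is a hypothetical illustration of the idea, not DSC's actual tracing code: <code>trace_guard</code> and <code>g_last_elapsed_us</code> are names invented for the example.</p>

```cpp
#include <chrono>
#include <cstdio>

// Last recorded duration in microseconds, kept for inspection.
static long long g_last_elapsed_us = -1;

// RAII guard: captures a timestamp on entry, and on scope exit computes
// and reports the elapsed time for the enclosing function.
struct trace_guard {
    const char *name;
    std::chrono::steady_clock::time_point start;

    explicit trace_guard(const char *fn_name)
        : name(fn_name), start(std::chrono::steady_clock::now()) {}

    ~trace_guard() {
        const auto end = std::chrono::steady_clock::now();
        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        g_last_elapsed_us = us;
        std::printf("%s took %lldus\n", name, us);
    }
};

void dsc_new_tensor_traced() {
    trace_guard guard{"dsc_new_tensor"};
    // ... actual allocation work would happen here ...
}
```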

<p><a name="note2"><sub>2.</sub></a> This is also referred to as “death by a thousand cuts”. One infamous example was in the Chrome browser where every keystroke triggered <em>250000</em> allocations due to an abuse of <code class="language-plaintext highlighter-rouge">std::string</code> (see: <a href="https://groups.google.com/a/chromium.org/g/chromium-dev/c/EUqoIz2iFU4/m/kPZ5ZK0K3gEJ">https://groups.google.com/a/chromium.org/g/chromium-dev/c/EUqoIz2iFU4/m/kPZ5ZK0K3gEJ</a>).</p>

<p><a name="note3"><sub>3.</sub></a> In practice, the value <code class="language-plaintext highlighter-rouge">DSC_MAX_OBJS</code> can be defined at compile-time with a default value of 1000. The actual <code class="language-plaintext highlighter-rouge">dsc_tensor</code> in DSC is exactly 56 bytes while <code class="language-plaintext highlighter-rouge">dsc_data_buffer</code> is 24 bytes. This means we are “wasting” roughly 100KB of RAM (one array of tensors, plus two arrays of data buffers, one for the CPU and one for the GPU), which is very reasonable considering the kind of hardware required to run even the smallest models like Qwen2.5 0.5B.</p>

<p><a name="note4"><sub>4.</sub></a> Currently, the size of the memory pool for each device can be configured at startup, when the <code class="language-plaintext highlighter-rouge">dsc_ctx</code> object is created. If a value is not specified DSC on GPU will take 90% of the total GPU memory.</p>]]></content><author><name></name></author><category term="programming" /><summary type="html"><![CDATA[In this post, I’ll show the impact poor memory management had on DSC, a tensor library I wrote from scratch in C++ and Python. I’ll then show how I fixed this by implementing a general purpose memory allocator from scratch. The goal here is not to implement something that is state-of-the-art or novel but rather to show what I went through when working on DSC and what I learned about memory management in the process.]]></summary></entry><entry><title type="html">Hello, World!</title><link href="https://gilli.dev/misc/2025/07/11/hello-world.html" rel="alternate" type="text/html" title="Hello, World!" /><published>2025-07-11T00:00:00+00:00</published><updated>2025-07-11T00:00:00+00:00</updated><id>https://gilli.dev/misc/2025/07/11/hello-world</id><content type="html" xml:base="https://gilli.dev/misc/2025/07/11/hello-world.html"><![CDATA[<p>Hello, internet world!</p>

<p>Welcome to my blog. Here I’ll write about stuff I’m currently working on, mostly programming-related. More specifically, at least for the foreseeable future, things related to the development of <a href="https://github.com/nirw4nna/dsc">DSC</a>, a GPU-accelerated tensor library I wrote from scratch.</p>

<p>To make things more interesting I set a <a href="https://x.com/nirw4nna/status/1937513290789454108">fun little challenge for myself</a>: to make DSC comparable, in terms of inference speed, to PyTorch on a dense LLM by the end of 2025. We’ll see how it goes; in any case, there will be a lot to learn on this journey, so if you find these kinds of things interesting, I encourage you to follow along with me!</p>

<p>I plan to post longer-form content here roughly once a month, going in-depth on specific topics, and shorter updates daily or weekly on Twitter/X.</p>

<p>By the way, DSC is completely open-source so if you like this kind of programming you can even join me for the ride :)</p>]]></content><author><name></name></author><category term="misc" /><summary type="html"><![CDATA[Hello, internet world!]]></summary></entry></feed>