VanessaSaurus dinosaurs, programming, and parsnips https://vsoch.github.io/ Sat, 03 Jan 2026 00:48:27 +0000 Jekyll v3.10.0 The Fifth Decade <p>Happy 2026! Let’s <a href="https://youtu.be/WeJEHzY-tIY" target="_blank">start the year off with some dancing</a>. 💃 This year I promise:</p> <ol class="custom-counter"> <li>To continue to live with authenticity and integrity.</li> <li>To be inspired, diving deeply into ideas and learning.</li> <li>To adventure, and look for beauty in unexpected places.</li> <li>To not make myself smaller to make others comfortable.</li> <li>To push myself physically and do things that are hard.</li> <li>To age with elegance, and appreciate myself as I am.</li> <li>To believe a person when they show me who they are.</li> <li>To dance, laugh, and be curious about the world.</li> <li>To prioritize self-care and rest.</li> </ol> <p>To little, funny-looking goblin me: you finally grew above 4 feet. You’ll feel out of place for a long time, and life is going to be hard when you enter adulthood. You’ll grow from it, and you’ll be OK. You will learn that beauty and value are within you, and sovereign. They do not require validation or being chosen. They exist because you are a person who notices the world.</p> <div style="width:100%; padding:20px"> <div style="width:400px"> <img src="/assets/images/posts/new-year-2026/vanessa-goblin.jpeg" /> </div> </div> <p>This closing year, 2025, had a <a href="https://youtube.com/playlist?list=PL7TRSgnVkOR0opaprhYCCuWCoBYfGs8Ed&amp;si=dTvYYiuTT7-uHuLC" target="_blank">lot of adventures</a>: dancing, Flux tutorials, running and biking, and beautiful places!</p> <p>Onward to the adventures of 2026, still as a goblin, but taller!
💪</p> <div style="width:100%; padding:20px"> <div style="width:400px"> <img src="/assets/images/posts/new-year-2026/vanessa-january-2026.jpg" /> </div> </div> Fri, 02 Jan 2026 00:09:00 +0000 https://vsoch.github.io/2026/fifth-decade/ Gifts <p>Presents are interesting to think about. It’s an exchange of a physical item as to say, “I value you.” And don’t get me wrong - they can be fun, and informative about a person or relationship. I’ve historically been someone who puts time into thinking about the right gifts for specific people. That goes back to sewing pajama pants and blankets in high school, to custom gifts for friends and family, to (as I’ve gotten older) niche food or experiences I know the person will enjoy. I have spent upwards of a year carefully preparing (what I deem to be) highly meaningful gifts. Experiential gifts are also a lot of fun, especially for a group (e.g., a pinata or food item to enjoy together). The exception to that is things that the person really wants or needs, which is something else you can do well when you know someone well. Receiving a gift of something you really like (and will use) that you maybe would not get yourself is the sweet spot (e.g., high end chocolate).</p> <p>Where it starts to go wrong is expectation. When birthdays or holidays roll around and you are expected to send a gift, what do you do if it isn’t in your heart? What do you do if there isn’t anything the person needs, or if you just don’t feel like it? There is too much expectation created by advertisement and a generally consumerist society. It starts to feel bad to be trapped into having to send something. It feels equally bad to receive something that you feel was forced, or that actually indicates the person doesn’t know you at all. And let’s take this a step further. When gift giving is the norm, withholding a gift turns into a pathological means to express a lack of approval.
The quality or quantity becomes a point of comparison, and I’m guilty of making this comparison and feeling like I wasn’t valued quite as much.</p> <p>As I’ve gotten older, I’ve realized something important. The most meaningful gift that someone can give me is their time. Time is limited, and precious. It reflects intellectual curiosity, openness to experience, and choice. Two people choosing to share time is a mutual desire and not an obligation. There is no need to assert value with words, because it is shown through action. There is a mutual shared value of conversation, which typically comes down to emotional and/or intellectual connection. I know I’m valued not because I receive an obligatory present in the mail, but because I just spent many hours with someone engaged with me, laughing, and my internal cup of connection is overflowing.</p> <p>I will still enjoy (and give) presents, but I no longer feel an obligation. And I’m also investing more in myself, both in terms of self-care and tangible items. My cup has overflowed this year, and I’m grateful to all the people who have been a part of that. That said, when you receive a present from me, know that it comes from my heart. When you spend time with me, you probably already knew that. ❤️</p> <blockquote> <p>You may also have noticed that I have not written as much this year. The reason is that the year has been too rich for me to want to spend the time writing. I am finding that my adventures, whether they be trips to new cities, intellectual dives into ideas with colleagues, balanced rest, or scaling a mountain with my best friend, are how I want to spend the time. I am branching out, taking care of myself, and expanding my world in ways I never did before, and possibly a lot of the ideas that would have wound up in writing here are instead being spoken in a much more engaged context. It was a year of adventure, learning, and growth. On to another in 2026!
🥳</p> </blockquote> <iframe width="560" height="315" src="https://www.youtube.com/embed/F8qoEXjTJlI?si=IVOOV6d5S7GHHNJy" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe> Mon, 22 Dec 2025 00:00:00 +0000 https://vsoch.github.io/2025/gifts/ Agentic Orchestration of an HPC Workload in Cloud <p>One of the most satisfying and learning-rich pieces of work from this year is represented in this white paper, “Agentic Orchestration of HPC Applications: A study using Google Gemini in Cloud.”</p> <p><br /></p> <embed src="/assets/posts/agentic-orchestration-hpc-workloads-cloud-sochat-milroy.pdf" type="application/pdf" width="100%" height="600px" /> <p><br /></p> <p>First, I’ll provide a little bit of backstory. We were using simple models to convert JSON job specifications or batch jobs between formats. I had a sense that the agents (specifically, Gemini) could do much more, and dove in. At first I was not sure the agent could successfully build a Docker container. It did. And then I was not sure about deployment and optimization in Kubernetes. That worked! Of course, there was a lot of nuanced detail with respect to how the orchestration was done, and how I (the human) interacted with the agents as a team. The learning from this early work is represented in this white paper.</p> <p>In summary, we used an agentic team (with Google’s Gemini) to build, deploy, optimize, and run scaling studies for HPC applications in Kubernetes. Work is underway (and most of the software done) to do similar experiments using AutoGen, LangChain, and a more formalized state machine design with Model Context Protocol (MCP). This work is immensely exciting because we have more ideas for extending agents to scheduling, topology, and job design.
We released this as a white paper since we wanted to extend it before any kind of journal submission, and (for me) I care more about sharing the work than getting it into some high-end venue.</p> Fri, 21 Nov 2025 09:00:00 +0000 https://vsoch.github.io/2025/hpc-orchestration-hpc-workload-in-cloud/ Rootless User-Space Kubernetes with GPU <p>This is a first prototype to get GPU devices working in <a href="https://github.com/rootless-containers/usernetes">User-space Kubernetes</a>, or Usernetes. I am calling it a prototype because it is a “works for the first time” and will be improved upon. For our use case, we will be testing and using it on clouds that have NVIDIA GPU devices; however, we will need to support other device types in production, and this will be future work. I want to create this write-up while everything is fresh in my mind, because I just had 2.5 days of working through the complexity of components, and learning a lot.</p> <h3 id="a-bit-of-background">A bit of background</h3> <p>We want to test the ability of User-space Kubernetes (“Usernetes”) to run a GPU workload, and compare between Kubernetes (as provided by a cloud) and the equivalent user-space setup deployed with the same resources on equivalent VMs. Google Cloud has excellent tooling for deploying GPUs and installing drivers for GKE, so I was able to get this <a href="https://github.com/converged-computing/flux-usernetes/tree/main/google/experiment/mnist-gpu/test/gke">vanilla setup</a> working and tested in under an hour.
The setup of the same stack, but on user-space Kubernetes on Compute Engine deployed with a custom VM base on Terraform, would prove to be more challenging.</p> <p>I’ve designed various driver installers for previous work, including <a href="https://github.com/converged-computing/aks-infiniband-install">infiniband on Azure Kubernetes Service</a> and more experimental ones like <a href="https://github.com/converged-computing/flux-distribute">deploying a Flux instance alongside the Kubelet</a>. NVIDIA GPU drivers are typically installed in a similar fashion, in the simplest case with the <a href="https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml">nvidia device plugin</a>, but now NVIDIA has exploded their software and Kubernetes tooling, so everything but the kitchen sink is installed with the <a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide">GPU Operator</a>. Getting this working in user-space was uncharted territory, because we had two layers to work through - first from the physical node to the rootless docker node (control plane or kubelet), and then from that node to the container in a pod deployed by containerd. Even just for the case of one layer of abstraction, I found many unsolved issues on GitHub and no single source of truth for how to do it. Needless to say, I wasn’t sure about the complexity that would be required to get this working, or if I could do it at all.</p> <h3 id="resources-and-cloud">Resources and Cloud</h3> <p>For this environment, we are working on Google Cloud, and specifically with V100 GPUs, because I can get them in very small numbers (on the order of 1-4 per node, and for a few nodes). Developing with a few GPUs on a node comes at a reasonable cost, about $12.00/hour (for reference, each GPU is $2.48/hour, and the corresponding instance is ~$1.75/hour).
This is a good example of how tiny bits of resources can go a long way if you are a developer, and (personally speaking) I like clouds best for development environments that I can control, over all other use cases. I needed these up for a long time for development, and created easily 50 different setups over a few days. When I had nodes up for most of a day, the total cost was about $150. When I realized I would need to do a lot more work, I cut down the number of GPUs per node to 2 (my pytorch workflow has a master and worker).</p> <h3 id="virtual-machine">Virtual Machine</h3> <p>When I first started, I took a strategy of using what Google provided. When you select the V100 and navigate to OS, it gives you an option to select one of their ML optimized images. These images are very old (Debian 11 is the newest, which I think dates to 2021) and they only go up to CUDA 12.3. I thought that would be OK to start, but in retrospect it made the environment more error prone. I had to remove and reinstall docker as rootless, and there wasn’t transparency about how the initial Debian base was customized. A good strategy for building these images is to start from as empty a slate as possible, to maximize transparency about what changes have been made.</p> <p>What ultimately worked was to start with an ubuntu 24.04 image and install my own drivers and CUDA, and then I could choose versions selectively (CUDA 12.8, and I seem to remember the driver version being used was 560.xxx). I was a bit nervous about this because the recommendation was lower than that for the V100 on the n1-standard family, but their provided ML image wasn’t working for me, so I kept an open mind.
You can see the driver install commands <a href="https://github.com/converged-computing/flux-usernetes/blob/b0eb6a3e611bb2e3cf63af18682a169335fe1083/google/gpu/build.sh#L23-L34">here</a>.</p> <h3 id="usernetes">Usernetes</h3> <p>The <a href="https://github.com/converged-computing/flux-usernetes/blob/b0eb6a3e611bb2e3cf63af18682a169335fe1083/google/gpu/build.sh#L47-L93">install of Usernetes</a> was typical. You need to enable several kernel modules and cgroups v2, and install a rootless container technology. I chose rootless docker, although on HPC systems you would be forced to use podman. I also set most limits (e.g., nproc, memlock, stack, nofile, etc.) to unlimited.</p> <p>One gotcha in this setup that is specific to Google Cloud is how logins to machines are handled. You will typically get OS login, or otherwise log in as your email / username. The problem with this is that you get assigned a really high id, one that isn’t present in any of the ranges in <code class="language-plaintext highlighter-rouge">/etc/subuid</code>. What happened for me on the first day is that rootlesskit was failing (somewhat silently, or at least I missed looking in places to check for it), so I was running rootful docker. The problem was not only that uidmap wasn’t installed, but that the user didn’t have a range. It was actually the debug output of <a href="https://github.com/containerd/nerdctl">nerdctl</a>, which I tested on a whim, that pointed me to the issue with the setup - shout-out to Akihiro for, again, excellent work. I decided to use the default ubuntu user, with id 1000, akin to what I did on AWS and Azure.</p> <h4 id="rooty-docker">Rooty Docker!</h4> <p>Using “ubuntu” poses a bit of an issue for ssh-ing in. The “gcloud” client is not going to easily allow ubuntu. What I needed to do was first ssh in with my os login, add a public key for my machine to <code class="language-plaintext highlighter-rouge">/home/ubuntu/.ssh/authorized_keys</code>, and then ssh in as ubuntu.
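</p> <p>Those bootstrap steps can be sketched roughly as follows. This is a hedged sketch: the key content is a placeholder, and a scratch directory stands in for <code class="language-plaintext highlighter-rouge">/home/ubuntu</code> (on the real VM you would run the append under sudo from your OS login session):</p>

```shell
# Sketch of authorizing a key for the "ubuntu" user. A scratch directory
# stands in for /home/ubuntu so this is safe to run anywhere.
home=$(mktemp -d)
mkdir -p "$home/.ssh"
chmod 700 "$home/.ssh"
# The key below is a placeholder; on the VM, append your real public key.
echo "ssh-ed25519 AAAA...placeholder vanessa@laptop" >> "$home/.ssh/authorized_keys"
chmod 600 "$home/.ssh/authorized_keys"
grep -c '^ssh-ed25519' "$home/.ssh/authorized_keys"   # prints 1
```

<p>After that, ssh-ing in as ubuntu with the matching private key should work.</p> <p>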
For the terraform setup that doesn’t expose an ephemeral IP, I needed to edit the instance with the control plane to add an ephemeral IP for ssh.</p> <p>As a side note - it wasn’t hard getting GPU devices to work with rootful user-space Kubernetes (yeah, that doesn’t make sense, does it?). I couldn’t use this setup, even as a mock, because rootful <a href="https://github.com/rootless-containers/usernetes/pull/366">breaks usernetes on a multi-node setup</a>. I was able to create a GitHub CI test that reproduces the issue, and hopefully it will be fixed soon! I’m thinking it’s probably related to rootful docker not properly working with slirp4netns, but I am not an expert there and haven’t looked into it.</p> <p>The last customizations for docker needed on the host VM were to install the <a href="https://github.com/converged-computing/flux-usernetes/blob/b0eb6a3e611bb2e3cf63af18682a169335fe1083/google/gpu/build.sh#L184-L201">nvidia container toolkit</a> “nvidia-ctk” and use it to configure the docker runtime, with CDI (the Container Device Interface). For this step I allowed it to generate devices for the development machine I was on; note that these need to be <a href="https://github.com/converged-computing/flux-usernetes/blob/b0eb6a3e611bb2e3cf63af18682a169335fe1083/google/gpu/tf/basic.tfvars#L19-L21">regenerated</a> when that VM base is used for Terraform.</p> <h3 id="docker-compose">Docker Compose</h3> <p>Most of the issues on GitHub and instructions for rootless docker and NVIDIA GPU had one indirection in mind - getting the devices on the host to show up in a single docker container. We have two indirections, because we need to map the host devices into a node on the VM running the kubelet, and then that node (a rootless docker container) has containerd that needs to further pass those devices to containers running in the user-space Kubernetes cluster.
This means we need to solve the problem twice, and essentially have every component in the stack (e.g., the nvidia runtime config file and nvidia toolkit install) duplicated. I found a lot of GitHub issues (<a href="https://github.com/NVIDIA/nvidia-container-toolkit/issues/85">here is one open since 2023</a>) that would suggest setting no-cgroups = true in the nvidia container runtime config at “/etc/nvidia-container-runtime/config.toml” and I did try that, but found that it failed after the second indirection.</p> <h4 id="a-gotcha-with-the-nvidia-runtime">A Gotcha with the Nvidia Runtime</h4> <p>Many instructions directed me to tweak the nvidia runtime to have “cdi” enabled as a feature, and then point to the nvidia-container-runtime executable for the runtime. In fact, there is a <a href="https://github.com/converged-computing/flux-usernetes/blob/b0eb6a3e611bb2e3cf63af18682a169335fe1083/google/gpu/build.sh#L196">command</a> to easily do that. What I realized was that rootless docker wasn’t picking up the default location of the daemon.json, or the location where the user-space one was expected to be. I looked into the service and found that it was running “/usr/bin/dockerd-rootless.sh”, and tweaked the entrypoint of that file to explicitly target the config, like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> exec "$dockerd" "--config-file" "/etc/docker/daemon.json" "$@" </code></pre></div></div> <p>That was a manual change I had to make on the VM (that is saved as the base image for Terraform).
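</p> <p>For reference, the config being targeted looks something like the following. This is a hedged sketch of a minimal <code class="language-plaintext highlighter-rouge">daemon.json</code>: the runtime entry and CDI feature flag reflect what the nvidia-ctk configure command is documented to write, and a generated file may carry more fields:</p>

```json
{
  "features": {
    "cdi": true
  },
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
```

<p>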
It’s important to validate that the nvidia runtime is present (detected) along with rootless before moving forward:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker info | grep -i runtimes Runtimes: io.containerd.runc.v2 nvidia runc docker info | grep root rootless </code></pre></div></div> <p>You should also test the nvidia runtime itself: you should be able to use it with a vanilla ubuntu container and have “nvidia-smi” working and seeing devices!</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-798e9725-623d-ca7f-f15d-b1908ec8bb0d) GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-be5719da-cd52-8a40-09bb-0007224e9236) </code></pre></div></div> <h4 id="docker-compose-yaml">docker-compose yaml</h4> <p>The tweaks to the default usernetes docker-compose.yaml are minimal. I had tested adding permissions (caps, for example), but ultimately just needed to specify using the nvidia runtime, and then the list of devices. You can see the setup <a href="https://github.com/converged-computing/flux-usernetes/blob/main/google/gpu/docker-compose.yaml">here</a>, and note that I think (but have not tested) that adding “devices” vs. the “deploy” directive does the same thing. Note that if you try to start the control plane (or a worker) with “make up” without using the nvidia runtime and asking for devices, and without “no-cgroups = true”, you will get an error, specifically <a href="https://github.com/NVIDIA/libnvidia-container/issues/154">this one</a> about bpf_prog_query with failed permissions. That is another issue that has been open since 2022. 🙃</p> <h4 id="usernetes-node">Usernetes node</h4> <p>A Usernetes node can be the control plane or a worker.
The general procedure for the control plane is to bring it up, run kubeadm init, install flannel, make the kubeconfig, and then prepare a join command to send to workers. The worker nodes also need to be brought up with the same setup, but then they just need to have the join-command (it’s a text file that is executed as a command for kubeadm join). The additional step I needed to add to this was a Makefile command, “make nvidia,” that would set up CDI to be used inside the node.</p> <div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nl">.PHONY</span><span class="o">:</span> <span class="nf">nvidia</span> <span class="nl">nvidia</span><span class="o">:</span> <span class="nv">$(NODE_SHELL)</span> nvidia-ctk system create-dev-char-symlinks <span class="nt">--create-all</span> <span class="nv">$(NODE_SHELL)</span> nvidia-ctk cdi generate <span class="nt">--output</span><span class="o">=</span>/etc/cdi/nvidia.yaml <span class="nt">--device-name-strategy</span><span class="o">=</span>uuid <span class="nv">$(NODE_SHELL)</span> nvidia-ctk cdi list <span class="nv">$(NODE_SHELL)</span> nvidia-ctk config <span class="nt">--in-place</span> <span class="nt">--set</span> nvidia-container-runtime.mode<span class="o">=</span>cdi <span class="nv">$(NODE_SHELL)</span> nvidia-ctk runtime configure <span class="nt">--runtime</span><span class="o">=</span>containerd <span class="nt">--cdi</span>.enabled <span class="nt">--set-as-default</span> ... <span class="nv">$(NODE_SHELL)</span> systemctl restart containerd </code></pre></div></div> <p>In the above, we create a set of symlinks that I found were needed in practice; I would see errors if they didn’t exist.
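</p> <p>When that target succeeds, the generated spec at <code class="language-plaintext highlighter-rouge">/etc/cdi/nvidia.yaml</code> has one entry per device, named by UUID because of the <code class="language-plaintext highlighter-rouge">--device-name-strategy=uuid</code> flag. Here is a minimal, hedged sketch of its shape, simulated against a scratch file so it can be run anywhere (the UUIDs are the ones from the nvidia-smi listing earlier, and a real spec carries many more containerEdits):</p>

```shell
# Simulate inspecting the CDI spec that "nvidia-ctk cdi generate" produces;
# on the node the real file is /etc/cdi/nvidia.yaml.
spec=$(mktemp)
cat > "$spec" <<'EOF'
cdiVersion: 0.6.0
kind: nvidia.com/gpu
devices:
  - name: GPU-798e9725-623d-ca7f-f15d-b1908ec8bb0d
  - name: GPU-be5719da-cd52-8a40-09bb-0007224e9236
EOF
# Count the device entries containerd can hand to pods.
grep -c '^  - name: GPU-' "$spec"   # prints 2
```

<p>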
For the GPU operator, I found that there was an <a href="https://github.com/NVIDIA/gpu-operator/blob/6171a52d2fa30001f01b728edef5558b32f66a8d/validator/main.go#L329">environment variable</a> in the validator that needed to be set to disable trying to make them, which would fail in usernetes with a permissions error. I didn’t wind up pursuing that path further (using the GPU Operator) because it was highly error prone and made changes that led to a broken state for Usernetes. We are also generating the cdi file “nvidia.yaml” in “/etc/cdi” and setting the nvidia container runtime mode to use it. We then configure the nvidia container runtime to work with containerd, and (still) with CDI enabled. The sed commands (not shown) are uncommenting and enabling different settings I found that would (at least at face value) possibly help in rootless mode. Finally, we restart containerd.</p> <p>It took me a lot of testing and learning (I have no experience with CDI or working with these tools beyond installs of the nvidia device plugin that have just worked on clusters in the past) to get to the above. You can see the full Makefile <a href="https://github.com/converged-computing/flux-usernetes/blob/main/google/gpu/Makefile">here</a>. At this point, we have the node configured also with the nvidia container toolkit, and containerd updated (and restarted) to use it.</p> <h4 id="nvidia-device-plugin-and-gpu-operator">NVIDIA Device Plugin and GPU Operator</h4> <p>It’s typically easy to apply the <a href="https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml">nvidia device plugin</a> to have devices detected and working. This gave me quite a bit of trouble, because at first (when I wasn’t using CDI) it only detected anything when I specified “tegra.” That would have the labels show up on the nodes, but then when I tried to create pods, they would fail, not knowing what tegra is (and understandably - that’s the wrong setup).
Changing it to use nvml would fail to find the library, and “auto” didn’t work at all (at least at the outset). Before I got the CDI just right, I went through half a day of going back and forth between using (and trying to tweak) this yaml file and testing the GPU operator, and found a lot of really weird states.</p> <p>Several times I could deploy the first (the device plugin), have the GPUs show up but then fail in the cluster, and then apply the GPU Operator. A few times that seemed to work, and other times (most times) it just led to more errors, not even getting far enough to get labels for the GPU. I don’t know how this worked once, but when I tried to reproduce it, I would get containerd <a href="https://github.com/rootless-containers/usernetes/issues/365#issuecomment-2676108220">operation not permitted</a> errors, along with an error about a PID. There were at least 5 times when something would work, and then I would save an image of the VM with the changes, bring up the Terraform setup to reproduce, and reach the “moment of truth” with deploying pytorch, only to be faced with another new error. That usually felt bad. 😞 My best guess based on this work is that we were having interactions between components, and slight differences in GPU operator components coming up, that led to inconsistent state.</p> <p>What I ultimately decided is that the GPU Operator was too complex to understand easily. I tried customizing the values.yaml install with helm to disable unneeded components (for example, I don’t need MIG here to split GPUs, and the V100s don’t support that anyway) to try to simplify (and make it understandable), but my intuition told me that it was too complex. I didn’t like that it seemed to break the setup and give me inconsistent states, and there were so many init containers and dependencies it wasn’t clear if there were race conditions. All I can say is that on the rare occasions something started working, it didn’t reliably reproduce in this rootless setup.
In several of those cases it worked for a first run, and then broke for subsequent runs with new errors. This is what led me to not use it, and to focus on the details of the CDI and the simple device plugin daemonset deployment. That ultimately worked like a charm, despite not being recommended for production setups.</p> <h4 id="the-application">The Application</h4> <p>The ML app wasn’t without issues. Specifically, the entrypoint for the master or a worker might look something like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> python3 /opt/pytorch-mnist/mnist.py --epochs=10 --backend=nccl --batch-size=128 </code></pre></div></div> <p>I hit several errors about not finding GPUs, or (in the case of rootful Docker with usernetes) the networking never working. I had a mangled device error that was resolved when I updated the container to the latest version (now 2 years old). Another unexpected issue was with respect to data. I had prepared data to use from the old container, and when I attempted to use it with the newer version, it wouldn’t validate and would try to download. Given that the download links weren’t working at the time, I couldn’t run anything. I had to ensure that the data matched the container. There is more on that <a href="https://github.com/converged-computing/flux-usernetes/tree/main/google/gpu/docker">here</a>. We ultimately built our own container with the data to ensure it is available, so we won’t take time during our experiments to download it. Side note - in that exercise I learned that I could convert a Python egg to a zip, unzip to explore, and then make changes and repackage into an egg! YOLO!</p> <h3 id="summary">Summary</h3> <p>The final setup uses an ubuntu 24.04 base with CUDA 12.8 and drivers 560.xxx, with a strategy of rootless docker for usernetes and the nvidia runtime exposing devices via CDI.
Setting no-cgroups to true or not using the nvidia runtime will not work, either due to needing cgroups later or the bpf permissions error I noted earlier. Once in the container, we need to again prepare CDI to be used with containerd, and ensure that we generate symlinks in advance. The GPU operator results in a broken state, and the nvidia device plugin, on its own, can best expose devices on the nodes to be available to the pods. Here are a few images for posterity that show everything working. First, the nvidia device plugin:</p> <div style="padding:20px; margin:auto"> <img src="/assets/images/posts/nvidia/nvidia-device-plugin.png" /> </div> <p>And this is when the devices show up (just one GPU) on our tiny nodes. For context, we can’t get many GPUs on Google Cloud, so we are minimizing the number of GPUs per node, since we primarily need to test Usernetes for networking.</p> <div style="padding:20px; margin:auto"> <img src="/assets/images/posts/nvidia/devices.png" /> </div> <p>And pytorch working - go epoch, go!</p> <div style="padding:20px; margin:auto"> <img src="/assets/images/posts/nvidia/epoch.png" /> </div> <h3 id="reflecting">Reflecting</h3> <p>This process from zero to working, which lasted 2.5 days (with a bit of sleep between days), was uncharted territory, and I knew it would be challenging from the get-go. It is an experience that I must strangely enjoy? I say this because I have moments of joy and anguish, and several times in that period I “decided” to give up. But somehow (even when there were many other things I should have been doing) I found myself continually returning to the setup. It meant waking up with “just one more idea,” or bringing down a setup, finally eating, and then, while my mind wandered, deciding that I wasn’t done yet.</p> <p>I don’t know if this is mental strength or just stubbornness (I think likely the latter). It’s a brute force approach, when I think about it, because I almost refuse to stop until I physically fall over.
This didn’t happen the first night, but it did the second night. It’s also the kind of problem that I know I won’t get help with. Not to get too philosophical, but I’ve realized through life experience that if I want something done, I need to do it. If I want change, I need to figure out the steps to take, and take them. It’s easy to defer responsibility or blame, and I won’t do it. Often it’s not productive, because it accomplishes nothing. That kind of approach applies in everyday life when it comes to making choices about taking care of oneself, and also for learning and solving hard problems. Knowing this was an important problem to solve for my institution, team, and community (even just the start of a solution), I felt that responsibility. I’m grateful for the inspiration and support I get from my small team, which feeds the inner fire that drives me. It makes the work challenging, and when we solve problems together, fulfilling and fun.</p> <p>All in all, I can’t say what the essential fix was, but I will say this is a complex setup. In retrospect, my advice here is to follow intuition, try to build components that you have the most transparency (and control) over, and choose simplicity over complexity, avoiding what can become software bloat for a simple use case. And on that note - I am going to leave my dinosaur cave and go outside! And then likely I really do need to come back and work on those slides, which I have been very successfully putting off for 3 days now. 😉</p> Sat, 22 Feb 2025 00:00:00 +0000 https://vsoch.github.io/2025/rootless-usernetes-gpu/ Reflections for 2024 <p>It’s Friday evening, and the holiday is just beginning. I’m relaxing in the quiet of my apartment after the work day and a good nap, and although I could delve into some things I am hungry to work on, I know I have two weeks for that. I want to take a moment to reflect.
This will likely take me a few days to write, and I’ll intersperse that writing with work.</p> <h3 id="a-rich-life-is-a-good-life">A rich life is a good life</h3> <p>I finish this year having confidence in who I am, and in what I need and desire for a rich life. I say “rich” as opposed to using the term “happy” because a complete life experience requires an entire gamut of emotional and physical experiences. Pain, sadness, loss, and <a href="https://youtu.be/quFjRXYtVIU?si=QtWQYSc2sgyoHneh" target="_blank">loneliness</a> are the complements to their opposites, and the most interesting people I’ve met know them intimately. It is a blessing to survive adversity because, although you may come out with scars, you come out with the ability and knowledge that you can heal. This ironically helps in software engineering as well, which has consistent challenge and uncertainty. You can dive into anything not having confidence of understanding, but having confidence in your ability to eventually get there.</p> <p>The embracing of <a href="https://vsoch.github.io/2023/authentic/" target="_blank">authenticity</a> and proactivity is also still important to me. We decide who we want to be, are genuine about it, and pursue it <a href="https://youtu.be/2Si4MPktn7s?si=xroDeyrc_XL9KY_a" target="_blank">without making excuses</a>. When we make mistakes, we ponder them, decide to change, and do it. A victim mentality looks at the world and finds reasons the world has been unfair to it. An empowered mentality takes responsibility for those same observations and becomes the impetus of change. This is the standard that I strive for, and I am committed to continuing to work on myself, physically, emotionally, and intellectually, to pursue growth and avoid complacency.</p> <p>This year, <a href="https://youtu.be/iWWMrCV-SBw?si=4CKSZULUK5gjvRcc" target="_blank">my heart was full</a>. I saw <a href="https://youtu.be/lhVG_BIaQow?si=xKaav1Y808oNjhde" target="_blank">beauty all around me</a>.
I embraced reflection, quiet, and imagination. I’d like to share some of these thoughts today.</p> <h3 id="i-learned-my-limits">I learned my limits</h3> <p>I have a superpower of productivity. I can’t explain it, but I can focus and get a lot done in a short amount of time. I am mentally strong in that it is hard for my environment, whether that be social or academic pressure, to touch me. I <a href="https://youtu.be/Vy7RaQUmOzE?si=smxf-0xjZ1mLwrgB&amp;t=120" target="_blank">romanticize this story sometimes</a>, imagining myself moving through chaos, time slowed down, and brushing aside the bullets. My heart is <a href="https://youtu.be/YDJoakiL1h8?si=I3nstdeAwlh5Ze-D" target="_blank">inspired by the things that fill it with love</a>. It can also make it hard to relate to other people, because I might understand and see the same stressors, but I don’t feel them as strongly. In a culture where it is trendy and common to be busy, I am not. I am <a href="https://youtu.be/xJA-5gjbJBU" target="_blank">dancing in my head</a>, and I follow my nose and pursue things that are inspiring, or bring me joy. This year, however, I would learn the physical limits of my gift.</p> <p>My team embarked on a large performance study in August. We had a very short amount of time to do a lot of work, and there were only a handful of us. I was excited and determined to have a successful outcome, so I made an explicit decision to turn on my turbo mode, and not take it off. This is an interesting quality of my productivity – that I can turn it on and off at will, and strategically direct it along with my attention. However, instead of a few hours in a day, it was turbo mode for the entirety of the day, and for weeks. We ultimately completed most of what we set out to do. I felt proud, both of myself and my team.
There was less of me by the end of the month, 12 pounds to be exact. My frame has always been long with lean muscle, and this doesn’t make me a huge person, so that was too much to lose. It felt terrible. It was the same lesson that I had learned when I was a freshman and new to running, and would push myself to collapse, either on the track or by completely blacking out in a cross country race. My mind has always been a lot stronger than my body, which feels tiny in comparison. I learned that my mind, which often wants to conquer mountains, needs to be more considerate of the bag of bones and meat that carries it around. Or flip that around, and the lesson is that I need to make my body stronger to match the demands of my mind.</p> <h3 id="i-learned-to-say-no">I learned to say no</h3> <p>The end of the study coincided with the end of a fiscal year and a shifting of work, which led to a lot of rumination. I thought about my efforts the previous year, and while the majority were fulfilling, I couldn’t ignore a creeping feeling of sadness from some of the pursuits where I was working really hard, and trying desperately to be valued. I could complete entire projects on my own, make documentation and presentations, and give talks at high-profile venues, but I still felt disconnected. I decided that I needed to introspect on this feeling, and that the experiences of the last year were not OK for me to repeat.</p> <p>For most of the year I blamed myself for not finding the right way to connect. Many of us have experienced this. We try to fit into group dynamics, and give huge amounts of energy and time. What I’ve come to realize is that often there is no maliciousness or negative intent on anyone’s part. Communication is a two-way street. It is simply the case that people vary in the degree to which they can successfully communicate, and the extent to which they try. The latter can make up for the former, but requires time that isn’t always possible to give.
I was spreading myself too thin trying for universal connection when, more realistically, I needed to prune my graph. Constantly showing up and hoping for connection when the dynamic was not there was not something I was going to do anymore.</p> <h3 id="i-decided-to-change">I decided to change</h3> <p>Pruning a connection graph means tweaking the scoring algorithm at each node. My previous algorithm was faulty, and included a variable of <a href="https://youtu.be/8ui9umU0C2g?si=I-IShyM0GQ45j2JI" target="_blank">wanting to be accepted</a>. The trade-off was placing less emphasis on consideration of my own time and value. I decided to try a new algorithm – one that would place focus on the projects that I was passionate about, and listening for connection instead of trying to force it. When the noise of expectations becomes quieter, the signal from people that value our contributions can be more strongly heard. It was akin to moments this year when I found myself in the middle of nature and heard – <i style="font-style:italic">felt</i> – the quiet. It was there all along.</p> <p>I decided to change in many ways, and this was the first: prioritizing relationships not based on expectation or idealistic desire, but on tangible evidence. And it needs to go both ways – we need to invest in the other person as well. If you need an algorithm for knowing if someone is important to you, think of how often you think of them. If you find them on your mind, or they are the person you want to brainstorm ideas with? That speaks for itself. I used to associate missing with sadness, because it meant something I treasured was not there. I’ve changed that perspective to a more positive one. Missing someone is a beautiful thing, because it means you were lucky to find connection. In many cases you will see them again, and that is something to look forward to. If you won’t?
You can be grateful to have had shared experiences with them, and <a href="https://youtu.be/DpkAxNeCM4U?si=Vl2iRYfUXTaI74zT" target="_blank">carry</a> those memories with you.</p> <p>When I refocused my energy, not only did the sadness dissipate, but my metaphorical connection tree is now thriving, growing stronger roots and fewer, healthier, more verdant leaves. As I am able to focus on signal that I think is important for our community, I am becoming a better technical leader. I don’t fit any templates for what that is supposed to look like, and I realize that I won’t, and that in and of itself is important for the ecosystem. After this subtle change I’ve found a level of joy and satisfaction in my work that is unparalleled even by the best times I had before.</p> <h3 id="i-decided-to-adventure">I decided to adventure</h3> <p>Rumination might start from one thread, but has a quality of trickling easily into other parts of life. As I thought about how I spent my time, I didn’t like the idea that life would be as it is now, forever. Routine provides consistency and safety, but we can also get stuck in it. I have the duality of liking routine, which is safe, contrasted with a heart that craves adventure. That manifests in careful reflection and a decision to pursue experiences that are novel and allow me to break free from comfort zones.</p> <p>As a result, at the end of this year I sought adventure. I found that I <a href="https://youtu.be/5obK-GuCBW0?si=Vl9844f_qJLferGO" target="_blank">needed to be brave</a>, because I had grown up learning a mindset of fear. I embarked on an over 1000 mile road trip, found myself exploring mountains, flying down steep hills on my bike, and unapologetically experiencing all the beauty that the world has to offer. In retrospect, I was able to turn a knob on my own level of risk aversion, reorienting it toward risk seeking. I embraced uncertainty.
It was not accidental, but done through careful decision, and I can attest I had moments where I would sit in quiet thought and not stand up until I felt a shift in my perspective. I didn’t know that was possible. It’s fascinating that the mind is capable of that. I learned that joy comes from immersion in new experiences and adventures, and it’s relatively easy to decide to pursue them more often. I am excited for a life where I have these adventures to look forward to.</p> <h3 id="i-learned-friendships-are-tiny-and-powerful">I learned friendships are tiny and powerful</h3> <p>At the end of last year (2023), I had an epiphany that I needed people. This was a surprise for me because during times of my life when things have been hard, I’ve learned to find strength from within myself. What I wasn’t sure about at the end of last year were the details of that. How many people? Under what context? What specifically were my needs? The last is an important question, because I’ve noticed it is common to have a need or request but not think carefully through what you are actually asking of others.</p> <p>I learned this year that I only need a few, and I value close friendships that are open, vulnerable, and supportive over any volume of people that often forces more superficial interaction. My desire is for direct communication, and interactions of sharing stories, learning, and joy. I want laughing fits, and I want psychological safety. If there is conflict, the other person metaphorically embraces you and shows you through their actions that you are going to <a href="https://youtu.be/Vc0Wq6Rn6ng?si=QEffoT5M3_fRO3K5" target="_blank">figure it out</a>. They will not abandon you or give up on you, and disagreement is tackled with thoughtful discussion and kindness. They consistently show up when you don’t ask them to, in the good times and even more in the bad, and they absolutely don’t have to.
Once I experienced (and fully realized) the power behind this kind of communication, in contrast to blaming, shaming, and feeling like you are put on the defensive? I metaphorically fell to my knees. I am tearing up as I write this. I am so grateful.</p> <h3 id="i-learned-to-set-boundaries">I learned to set boundaries</h3> <p>We can often tolerate things because we are supposed to. Because if we don’t, we are publicly shamed. Or we compare ourselves to an ideal and feel that we have failed. We might keep trying, and blame ourselves. I had an epiphany this year when I realized that this self-blame was counterproductive toward protecting myself. In the same way that some people feed our souls and we might nourish these relationships, we equivalently need to be aware of those that drain us. I decided this year that I would do a better job of protecting myself, and set boundaries. A quick feeling of dread and anxiety in specific situations was a harbinger of something that was not good for me. It is not selfish to not want these negative experiences anymore. We don’t have to engage. We can walk away.</p> <p>It would be akin to being in a room with a stove, and having to touch it. Sometimes it’s off, and it’s OK. But then it often burns you. It hurts. You know the pain is coming again, but it’s unpredictable. The stove in this metaphor might be a person or experience, and either way, may not even be aware of its influence. The stoves in our lives are there to teach us lessons. The insight I had is that it is my choice for the stove to have that power over me, or not. I decided to learn from the experience, and leave that room. This choice was empowering, and I encourage others to set boundaries for the stoves in their lives, whatever they may be.</p> <h3 id="i-learned-to-love">I learned to love</h3> <p>There is a common association of love with ownership and possession.
While I won’t say that’s wrong, what I realized this year is that truly selfless love has no quality of that. I believe that truly loving someone means deriving joy from their happiness, and not expecting anything in return. In the best of worlds it is sharing experiences and laughter, and feeling a sense of connection that feeds your soul. It is wanting to support them, and give them your time and every facet of your <a href="https://youtu.be/afxK5-MwG-I?si=r3-dErpMxwTEPRGt" target="_blank">superpower</a>. You would run mountains for them. You would wait <a href="https://youtu.be/rtOvBOTyX00?si=W9JJE6q7IMGzJT8D" target="_blank">a thousand years</a> to show up for them. Love is often portrayed as primarily romantic, but it can be found in friendships, and in friends that turn into your family. Importantly, we must be able to apply that love to ourselves. When <a href="https://vsoch.github.io/2023/things-fall-apart/" target="_blank">things fall apart</a>, it is a hug of internal strength and determination that ultimately gets us through. This was a thought I had this year that gave me a feeling of immense safety and peace.</p> <h3 id="thank-you-2024">Thank you, 2024</h3> <p>It was a beautiful year. I’ve learned how to embrace what I have instead of focusing on what I do not, approach life with curiosity and desire for growth, and how daily fulfillment and taking care of the self is more of a decision than anything else. I danced and laughed so much this year, and that is a strong testament to my joy 🥰. I think if this were it, I could look back on my life and find that I have had it all. I am convinced that it’s only going to get better.
Happy holidays, folks, and onward to new adventures and learning in the New Year.</p> Sun, 22 Dec 2024 00:00:00 +0000 https://vsoch.github.io/2024/reflections/ https://vsoch.github.io/2024/reflections/ For Coach <p>I went on a bike ride this weekend, and in passing the parking lot of a trail head I saw an older man standing next to a car, laughing with a friend, and he looked just like my old high school running coach, “Coach.” Yes, we called him exactly that. He was (at least in front of us, a rowdy group of high school students, 13-18 years of age) stoic and commanding, easily earning our respect so we would abide by his orders to run many laps around the track, or loops around a course. He led us through the cross country, winter track, and spring track seasons.</p> <p>This weekend after I saw this reflection of him in someone else, I couldn’t see through my goggles. The tears flowed, and not quietly – it was the kind of aching sobbing that emits from your soul unexpectedly when something you’ve shoved down deep forces itself out into the open. I found myself flying under a bridge, not able to breathe, and having opened up a well of sorrow that I hadn’t allowed myself to experience.</p> <p>Coach <a href="https://www.legacy.com/us/obituaries/concordmonitor/name/peter-lovejoy-obituary?id=20874970">died in 2013</a> while I was in graduate school. I received notice a few days before it happened, and even if there had been enough time to think about coming to say goodbye, I couldn’t go. I was facing a lot of health issues, and combined with the stress of graduate school, it was too much. I was ashamed at the time for him to have seen me – I was thin and sickly, and a shadow of my previous self.</p> <p>As I rode and allowed myself to experience the suppressed emotion, I remembered many things.
I remembered how he would <a href="https://youtu.be/Qq6oX5uwwwk?si=f8QXv59DRjz-Mjq1&amp;t=1975">pull me aside</a> before and after races, when I was anxious and stuck in my head, and give me a combination of tough love and logic to bring me back into the present. And side note - at that particular class championship I was stacked with events (800, 1600, and two relays) and although I didn’t do great in all of them, the <a href="https://youtu.be/Qq6oX5uwwwk?si=rz8AfANE1ktyTScf&amp;t=2149">4x400 at the end</a> was one of the fastest times of my (very short) 400 career - close to one minute. I wasn’t a 400 meter runner or a sprinter, but I did love the relay. I remembered practices. Regardless of what happened during the day, at 3pm sharp I’d report to the track and find the comfort of routine. Coach helped to create the environment at practices that made running, albeit challenging, such a joy. I carry that joy with me to this day, which is almost 25 years later.</p> <p>I realized this weekend how much I took him for granted, and the fact that he was a source of stability during my entire transition from adolescence into young adulthood – freshman to senior year. I saw him every day after school, and for many hours, for the Fall, Spring, and Winter, and often in the summer for training. He was the one that saw potential in me when I was undeniably the slowest on the team as a freshman. He was the one that pulled me aside when I grew 5-6 inches before my sophomore year, and told me how to add dried fruit and nuts to my diet to put on some weight, because I needed to be strong. I saw him early on Saturdays for meets. He designed workouts that catered to my running style – middle distance. He was a source of support if I ever came to him with a complaint. He told me he was proud of me.</p> <p>I cried today because I never properly mourned his loss, the loss of someone that had been so important to me, and cared for us so deeply.
I cried because I never got to say goodbye, and tell him that he gave me a backbone of stability during high school that I wouldn’t have had without him. To Coach, wherever you are, there are few people that touch our lives deeply, and you are one of them for me. I am so grateful for the time that I got to spend with you. It’s common to think that people value us based on financial support or gifts, or some requirement based on biological relation. But at the end of the day, the most valuable asset that someone can give us is their time. I will keep your memory with me always, and for the people that I love, I will also give them my support and time. And I will do a better job of telling them more frequently how important they are to me.</p> <div style="padding:20px"> <img src="/assets/images/posts/coach/coach.JPG" /> </div> Mon, 25 Nov 2024 00:00:00 +0000 https://vsoch.github.io/2024/for-coach/ https://vsoch.github.io/2024/for-coach/ A Future for HPC and Cloud: Collaboration Across Boundaries <p>The Developer Stories Podcast recently released an episode with Dan Reed (<a href="https://rseng.github.io/devstories/2024/hpc-dan/" target="_blank">“HPC Dan”</a>) that talked about the future of High Performance Computing. While there was ample conversation on resources and some policy, we only touched on some of the ideas about what to do about it, or more specifically, how we should be working together. In this post, I want to talk about some of the problems I see with our current academic culture that prevent us from more successfully collaborating across the space. I like to think about these possible futures that don’t exist yet. Let’s jump in.</p> <h2 id="traditional-practices-do-not-scale-to-cross-community-collaboration">Traditional practices do not scale to cross-community collaboration</h2> <p>In academia, we are accustomed to writing papers.
We are told (and expect) that publishing a paper in a highly respected venue is what will get the most attention, and thus have the most impact. And perhaps it is still true that this means of sharing information will reach the academic community, and be a sound strategy to give us “career credits” or a metric of value for career advancement. But it is problematic. And you can’t even point this out because <a href="https://vsoch.github.io/2022/code-of-conduct/" target="_blank">you’ll get in trouble for it</a>.</p> <p>Here is one problem with the above. While it works if you live in the isolation of an academic community, it doesn’t scale well beyond that. The issue is that today we need to be talking not just to the academic community, but to the larger cloud community. We are living in a present, and entering a future, where <a href="https://cacm.acm.org/research/hpc-forecast/" target="_blank">cloud is the leader</a> from an economic standpoint, and we are in somewhat of a competition for resources and talent. The two communities have been presented as a dichotomy, and at worst in an adversarial light. I hated this perspective – when someone on a panel would raise their voice and say “We cannot afford you!” or point fingers. It wasn’t productive – pointing fingers and blaming someone does not make progress. That same energy can go toward proactive action to try ideas and do something about solving the problem.</p> <h4 id="what-it-takes-to-be-influential">What it takes to be influential</h4> <p>Now we can talk about what it takes to be influential. Indeed, when you are the “little guy” in the face of an economic powerhouse, it’s easy to feel powerless. And if you are a pessimist, maybe this is how you see it. But if you are an optimist, you might recognize that while it’s out of your direct control, it is within your indirect control. You can have influence, even if you are just one person. You have a voice.</p> <p>How to have influence?
We first need to define a line between two kinds of work – the conceptual piece that might include algorithms, design, and architecture, and then an implementation, which ideally is of more production quality. In the academic system that promotes publication first, we lean heavily on the conceptual. Ideas are often presented without an implementation, or if there is an implementation, it is a weak point. It’s a mistake to design something elegant but never turn it into a product that catches attention. It’s also a mistake to implement something that looks flashy but has a poor design. The strongest work will be a balance of those two – an elegant and well-thought-out algorithm <strong>and</strong> a production-level implementation that adds further evidence that the idea works. This brings up another problem about skill. Often the researcher coming up with the algorithm isn’t a programmer. The paper might be entirely math. Even if the researcher knows how to program, they most likely have never written production-quality software. It’s a wide gap to span.</p> <p>If there are scattered people in the academic community capable of spanning this divide, they often don’t have or cannot make the time. Given a reward system based on publication, and that a publication is sufficient with the algorithm alone, the implementation is not a priority. To add challenge to that, in order to have influence, you often need to generate many of these paired ideas and implementations. From the standpoint of work, it’s a lot, and most of it won’t lead to an outcome of change. I think that’s likely why we don’t see many of these winning combinations from our community. It’s hard to do, has a high failure rate, and there is no direct reward in place for the work.
It’s much easier to fold oneself into the quicker-turnaround reward systems, writing papers that show incremental conceptual improvements and are accepted and published at frequent venues.</p> <h2 id="engagement-is-not-well-defined-so-nobody-does-it">Engagement is not well defined, so nobody does it</h2> <p>Influence often has to start with establishing a voice. Establishing a voice often means speaking up, and being persistent. It’s easy to think that the voices of the few cannot be heard and have impact, but I’ve found this to not be true – one or a few people can inspire change if they send out a consistent signal. This means showing up to working groups and (often) being the only one from the HPC community, posting on group lists to ask questions or engage, and taking time to listen to podcasts and watch talks from venues that are not traditionally in the HPC space to learn new ideas. It often means finding connections between what you know in your community and these “other” spaces, and then being forward enough to reach out to individuals in the other space to ask to talk about something. Many times, these conversations might not come to anything. But when they do? That’s where you have influence. It means leaving our silos. We are most comfortable in silos. But those that leave silos (and zones of comfort) to share ideas across boundaries will have the most impact (influence).</p> <h4 id="examples-from-the-hpc-community">Examples from the HPC community</h4> <p>I can give direct examples from our community of individuals that I think have bridged this gap and had great success. The first is <a href="https://x.com/ahcorporto" target="_blank">Ricardo Rocha</a> of CERN, who is (obviously) firmly rooted in HPC – <a href="https://home.cern/about" target="_blank">CERN</a>, the European Organization for Nuclear Research, is the largest particle physics laboratory in the world.
Ricardo has been a leader in voice and work that has spanned the cloud-native and HPC communities for years, most recently giving a Keynote at Kubecon North America about <a href="https://youtu.be/xMmskWIlktA?si=gHhfhuhoZTICD9D7" target="_blank">multi-cluster scheduling with Kueue</a>. Another example is Torsten Hoefler, head of the Scalable Parallel Computing Laboratory <a href="https://htor.inf.ethz.ch/" target="_blank">(SPCL)</a>, whom I stumbled on recently while learning about Ultra Ethernet. If you look at his lab’s YouTube <a href="https://www.youtube.com/@spcl" target="_blank">channel</a> (yes, that is notable in and of itself, how many labs do that?), Torsten very notably is not just presenting recordings from venues; he is taking talks from venues (and beyond) and recording them to share intentionally. He is adding the branding for his laboratory. They also have an active <a href="https://x.com/spcl_eth" target="_blank">presence on social media</a>, which is also notable. I’ve noticed that some academics tend to be very active on social media, and others either pretend it doesn’t exist, or turn their noses up at it. I’m not saying that social media venues are healthy for society or a good use of time (they can really steal attention in a terrible way), but they are a means to reach a wide audience of people. Just making a post when you have something important to say, which is what I try to do (often linking to my full thoughts here), is strategic for getting a message across, regardless of how you feel about the services.</p> <p>I believe that this is something we need to do more of – putting out information (and advocating for it) without having it be of direct benefit to us (publication, conference proceedings, etc), and putting out information when we have something to say.
I’ve been experimenting with this idea recently with a few talks on <a href="https://youtu.be/ZXM1gP4goP8">container pulling in Kubernetes</a> and <a href="https://youtu.be/-36DlwrSPec">scheduling to containers in Kubernetes</a>. I got tired of the “wait for a venue and ask for permission” approach to sharing ideas. Ironically, the second talk (now over 4 months old) would not have been presented until this weekend if I had submitted it to the Canopie HPC venue. I also would have been limited to a tiny bit of time, and it’s unlikely it would have been shared beyond a single room of predominantly one demographic, one community. Is that really the best outcome?</p> <h4 id="openness-and-transparency-are-a-hallmark-of-collaboration">Openness and transparency are a hallmark of collaboration</h4> <p>Another feature that must come from the venues themselves is transparency. It almost doesn’t matter if a community has annual, flashy events when they are venues of privilege – you must pay to enter and to access information, and beyond that, it’s closed. From the example above with Torsten, the first talk I watched was his recording of a talk he gave at <a href="https://salishan.ahsc-nm.org/">Salishan</a>. This (to me) comes across as one of these high-end, invite-only HPC events that I (and most) would never have the chance to attend. When presentations at these venues are not shared publicly, and yes, on places like YouTube, this is knowledge that will be forgotten. It doesn’t matter how impressive your work is if you present it to a room of 30 people and that’s the end of it. It makes me sad to read blogs of prominent people in our community that reference talks from these closed events, and know that I’ll never be able to see or learn from them. If we are championing reproducibility, transparency, and openness, we are not practicing what we preach. The argument about needing to attract attendees and keep a conference profitable doesn’t cut it.
Look at Kubecon – it’s an absolute beast in terms of attendance. They have their talks up before the conference is even over!</p> <p>Speaking of Kubecon, one of my favorite things to do is watch talks from it for weeks (and more) after they come out. I find interesting projects and reach out to people, and this is an opportunity to grow my network and thinking space. If the talks weren’t on YouTube, my portal to that world would be closed. By not sharing, we are missing out on that opportunity for others to reach us. I feel that I get to experience some of the learning of the event despite not being there. The organizers of Kubecon, I suspect, recognize that not everyone can attend, for reasons that vary between people, and they don’t want to close off knowledge. I respect and champion this perspective, and hope that the HPC community can eventually catch up.</p> <h4 id="our-actions-need-to-directly-support-the-open-sharing-of-knowledge">Our actions need to directly support the open sharing of knowledge</h4> <p>What the HPC community needs to get better at is the open sharing of knowledge. There are specific projects that do this well, but our conferences (generally) do not. The researchers and labs that are going to have impact and be successful not only do great and impressive work, but they are actively sharing it. I know about Torsten’s work and lab because I listened to him talk on a Podcast about Ultra Ethernet, and then I found his YouTube channel and Twitter feeds. My network and space for learning has grown because he has put his work out there.</p> <h2 id="routine-for-engagement-is-missing">Routine for engagement is missing</h2> <p>It’s problematic that the HPC community has no established routine for how to engage. This is often why solutions cannot be offered up for the problems at hand – it’s not clear how to act when there is an absence of instructions for the thought and engagement process to begin with, let alone solutions themselves.
Maybe that is where creativity comes in – which, broadly speaking, is generating something from nothing. But that takes time and freedom to think (I’ll talk about this later). For the first problem – “how to engage” – learning and engaging in ways that don’t fit a traditional routine for an academic are hard to do. The academic mindset is one of permission. Do others think this is a good idea? Can I get permission from my boss to work on it? At best, we submit proposals (with creative thought), but they still need to be approved. When they are not, we abandon them for the time being in favor of whatever we are given permission to do.</p> <h4 id="influence-is-deciding-to-bake-fruit-cake">Influence is deciding to bake fruit-cake</h4> <p>But much of what has to be done will never be granted permission because it’s either too risky or unknown and questionable. Much of what needs to be done just needs someone that decides to do it, and then shows people after that. You don’t ask permission to bake a new fruit cake you think will actually taste good; you bake it, and then offer others a taste. They might realize that it tastes good, but if you asked them in advance they would say “No way, fruit cake is terrible. Don’t do it.” In the second case, you’d never have made the cake. And my thinking of fruit cake comes directly from <a href="https://www.hpcdan.org/reeds_ruminations/2010/12/dreams-of-a-final-theory-the-origin-of-fruitcake.html" target="_blank">this post</a> on Dan Reed’s blog. He has strong feelings about fruit cake! 🍰</p> <p>A lot of good ideas are also accidental – you start doing one thing, and maybe it’s even just for fun and learning. You start building something, and stumble on an insight or something even cooler along the way. That goes against the academic desire to write down a plan a priori, get it stamped and approved, and then start working on it. You also need to have time to explore and play like that.
So high level:</p> <blockquote> <p>the models of thinking and working that are often needed for innovation and ideas that are different and useful to influence a larger power don’t fit with what we are expected or trained to do.</p> </blockquote> <p>They don’t fit into the time or schedule we are afforded based on our established routines.</p> <h2 id="our-reward-systems-dont-encourage-relaxed-creative-thinking">Our reward systems don’t encourage relaxed, creative thinking</h2> <p>It seems like a lot of academics are on a treadmill to meet deadlines. There is some promise that the treadmill will slow down, but in practice, I never see that it does. This makes time hard to come by, and so the things that get prioritized are those that fall into a comfortable, established routine. If something falls outside of what we deem the highest bang for the academic-credit buck, it’s not invested in. You don’t make the time.</p> <h4 id="collaboration-is-leaving-the-comfort-of-your-local-market">Collaboration is leaving the comfort of your local market</h4> <p>Let’s pretend that we are all bakers in a town. Our highest reward comes from baking our recipes, possibly with slight deviation so they are known to be tasty, and taking them to the local market to sell for profit. It would be hugely (temporally) costly to walk to neighboring towns looking for bakers working on similar recipes, and then spending time testing new, often very different combinations of ingredients. We might come back tired, broke, and not having found a great recipe. On the other hand, maybe we don’t have an immediate success, but we are invited to other markets. We taste test a much broader range of goods. We grow in so many more ways than if we stayed in our little town.</p> <p>And maybe before communication afforded it, that would be the likely outcome. 
But unbeknownst to us, the network of bakers in other towns has discovered Twitter, YouTube, and a use for other (sometimes terrible) social media services that allow them to quickly iterate on ideas and work together. Not only have they caught up to the tastiness of our recipes, they have surpassed us, and are designing robots to make the recipes for them. And we are still here, fudging around with the amount of cinnamon in our oatmeal raisin cookies. We still haven’t figured out that we could join their communication channels, and bring the story of cinnamon to the larger community to iterate on much faster.</p> <blockquote> <p>If you don’t get the metaphor, it’s about the time of payoff, and the initial cost of communication. Taking the time to engage outside of your comfort zone doesn’t have an immediate payoff but a longer term one.</p> </blockquote> <p>The other issue with this paradigm is that people want established paths of behavior. There are no established paths for interaction with cloud communities. People don’t know what to do, so the default is to do nothing.</p> <h2 id="the-future-is-large-collaborative-projects">The future is large, collaborative projects</h2> <p>I believe in our HPC community to innovate and come up with amazing ideas. I also believe in the power of numbers, and that you can start with even a mediocre idea or project, and with enough motivated contributors, turn it into something equally innovative. That is how I see the innovation space in Kubernetes. Often a feature or component comes out, and it is first a little rough around the edges. But like clay, with many contributors and common need, it transforms over time into an elegantly designed solution that solves a lot of problems. I am biased here (and recognize my bias) that I have more faith in large, collaborative efforts to solve some of the most challenging problems than say, a small group that is isolated in academia. 
Sitting in these small groups, I think we will have the most success through engagement – bringing our expertise to the table and into the conversation for these larger projects. Is it often uncomfortable? Yes. Does it often go against traditional academic norms and incentives? Yes. I think with this strategy we can solve larger problems, and in a more collaborative fashion that leads to things we champion (but often don’t practice) like reproducibility and transparency.</p> <p>I can give a quick example with respect to multi-cluster scheduling. There are huge internal projects working on the problem. And they will likely come up with interesting papers. But I believe it would be a better strategy to first collaborate with the SIG multi-cluster group, ensure there are handles for customization (for specific use cases like HPC) and then to optimize for that. I believe that a viable future for most models that are converged (general problems of compute that can sit between cloud and HPC communities) is that the powerhouse global community is going to put together some kind of skeleton, and the initial version won’t fit exactly what we need. But it will very likely be customizable, and we will customize away for our use cases. Maybe our use cases will emerge in the larger community, and they will be solved before we’ve had a chance to write papers on what we are doing. Cloud companies get a competitive advantage from standardizing things. This means they themselves need that ability to customize, and that need directly helps us.</p> <p>This goes back to the talk I shared from Ricardo – I can guarantee you he has something like that in mind. We prioritize working together, and we figure out the details for what we specifically need. This is a different strategy than what I normally see – working in silos and coming up with disparate solutions that then further separate the two communities. 
Ironically, because the underlying use cases are so similar, we usually have a loophole that the Kubernetes (and cloud-native) community eventually innovates what we need anyway. Examples include (but are not limited to) batch workflows, topology-aware scheduling, and custom scheduler policies. The scheduling space is still a bit rough, I’ll admit, but it’s getting a lot better, and really quickly. I suspect the next item to add to that list will be multi-cluster and multi-tenancy. We will see.</p> <h2 id="the-gopher-has-no-clothes">The gopher has no clothes</h2> <p>I sometimes feel like I’m pointing out that the emperor has no clothes. But it’s strange to watch these cycles repeat, year after year. The insight is that there is not a real divide in the actual technology space – the current divide results from us not working together. We have similar workloads and similar needs, and the only reason we have entirely different projects is because HPC has largely existed in a silo. A lot of the innovations that we need are ironically coming to be, not because of our input, but because they are foundational to workloads we share in common and cloud needs them too.</p> <p>I am just one person, but I will continue to express my views, and to have my voice, even if I am a bit against the grain or considered non-conformist for it. I know that my opinions are often threatening to people, and that is outside of my control. If you find my ideas threatening, it might make sense to think about why. And after that, let’s have a conversation about it. Let’s grow and learn from one another, because we very likely have similar goals in this beautiful space of work.</p> <p>On that note, I’m off to a running adventure! And this week is Supercomputing. I’ll be watching Kubecon talks, engaging remotely however I can (without having purchased a ticket) and probably enjoying a quiet week of focus on programming projects and learning. 
I do hope to go in the future for some fun social aspect. To all my friends in attendance, have an amazing week!</p> Sun, 17 Nov 2024 08:30:00 +0000 https://vsoch.github.io/2024/across-boundaries/ https://vsoch.github.io/2024/across-boundaries/ My Old Friend <p>Hello self-loathing, my old friend <br /> You’ve come to torture me again <br /> While my self-esteem is leaping, <br /> You show up forceful, inward creeping. <br /> Because the joy, that was dancing in my brain. <br /> You try to claim <br /> You give me hell, defiance! <br /></p> <p>In restless angst, I’ve been alone <br /> Through the wildness I’ll roam <br /> In the darkness of a tunnel <br /> My spinning mind an endless puzzle. <br /> I’m blinded another passing light <br /> Filled with night <br /> I need a hug, and guidance <br /></p> <p>And in my mind’s eye I did see <br /> The problem’s almost always me. <br /> Not good enough or worth keeping <br /> Not smart enough, attention seeking. <br /> People doing wrongs – they really never cared. <br /> You’re impaired. <br /> Safety is self reliance. <br /></p> <p>“Folks” said I, “You do not know <br /> This burden of internal blow. <br /> See my wrongs they might teach you <br /> Don’t repeat them I beseech you. <br /> My self-worth, like diving eagles fell <br /> No one tell <br /> Choose not love but violence. <br /></p> <p>And my folks they are dismayed. <br /> At this monster that they made. <br /> But tomorrow is a new morning. <br /> When new hopes they might be forming. <br /> And the beauty of the world is written in our falls” <br /> Our tunnel walls. <br /> And I found peace in silence. 
<br /></p> <p><br /></p> <audio controls=""> <source src="/assets/audio/hello-my-old-friend-2024.mp3" type="audio/mp3" /> </audio> <p><br /></p> <iframe width="560" height="315" src="https://www.youtube.com/embed/4fWyzwo1xg0?si=zUz-dmajVEhDc7xy" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe> Sun, 10 Nov 2024 11:30:00 +0000 https://vsoch.github.io/2024/hello-self-loathing-my-old-friend/ https://vsoch.github.io/2024/hello-self-loathing-my-old-friend/ Interactive Docker Builds <p>One of my favorite parts of my work is fun brainstorming sessions. Sometimes you are thinking about specific, actually important projects, and other times you go off on tangents and come up with really fun, often experimental ideas that have nothing to do with anything you are “supposed to be” working on. It’s this thinking space that tends to be more fun, the “What if…” and it’s this space that gets me excited about programming and designing software, even after all these decades! This post is about one of those fun ideas.</p> <h2 id="when-package-managers-make-us-cry">When package managers make us cry</h2> <p>To give another tip of the hat to <a href="https://archive.fosdem.org/2018/schedule/event/how_to_make_package_managers_cry/">one of my favorite talks</a> (although it’s debatable if the package manager software or people are crying in that reference) there is a special scenario when my package manager makes me cry immensely. Yes, it is when I’m building containers. 
When you use “<a href="https://spack.readthedocs.io/en/latest/containers.html">spack containerize</a>” to generate a Dockerfile from a spack.yaml environment, with said spack environment building in one line (<a href="https://github.com/converged-computing/performance-study/blob/f0bd89443d1080be311117254cb5d2c82e686108/docker/google/cpu/amg2023/Dockerfile#L35">here is a recent example</a>), it means building an entire tree of software in one layer. This also means your build can run for hours, and if it fails, you lose the entire thing. I can’t tell you how many times this has happened to me - it usually is for builds that have worked before, but then an update to spack breaks something, and I’m not expecting it. I kick myself later. My strategy is often to do the build interactively, meaning shelling into the base container, issuing the commands one by one, and then committing. That’s hard to remember to do every time, hence the tears. But can we do better than that? Can we create a docker builder that will, upon failure, still commit the layer and allow me to shell in to continue working on it?</p> <h2 id="customizing-mobymoby">Customizing moby/moby</h2> <p>For those not familiar, the docker main code lives at <a href="https://github.com/moby/moby">github.com/moby/moby</a>. Last night was the first time I looked at it, and it was refreshingly easy to follow after months of reading Kubernetes (and especially tracing logic in <a href="https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet">the kubelet</a>, which I’m doing for a small talk I hope to do soon about containerd and the <a href="https://vsoch.github.io/2024/container-pulling/">SOCI snapshotter</a>). 
I found instructions for setting up a development environment <a href="https://github.com/moby/moby/blob/master/docs/contributing/set-up-dev-env.md">here</a>, and TLDR, it comes down to cloning the repository, starting a development container in VSCode, and then you can build and run the daemon with one command:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hack/make.sh binary install-binary run </code></pre></div></div> <p>If you <a href="https://github.com/moby/moby/tree/master/hack/make">look in that directory</a> you’ll see build logic that mirrors the structure of the codebase, and what you likely know about Docker already. We are building a <a href="https://github.com/moby/moby/blob/master/testutil/daemon/daemon.go">daemon</a> service that will be delivered over the docker socket, along with a frontend proxy that will issue requests to it to list (ps), build, etc. I also found the <a href="https://github.com/moby/moby/tree/master/builder">builder logic</a> to be nicely organized, where the previous (older?) 
logic to build from a dockerfile was in “dockerfile” and the newer buildkit is in “builder-next.” I haven’t looked into build-kit, but you can see where the backend makes a choice <a href="https://github.com/moby/moby/blob/5aaceefe5be751d55d0a4e9212ddba04408d1a1c/api/server/backend/build/backend.go#L62-L73">here</a>, and seems to run some kind of a <a href="https://github.com/moby/moby/blob/5aaceefe5be751d55d0a4e9212ddba04408d1a1c/builder/builder-next/builder.go#L428">solve in a go routine</a> as opposed to the Dockerfile builder that is dispatching each line <a href="https://github.com/moby/moby/blob/5aaceefe5be751d55d0a4e9212ddba04408d1a1c/builder/dockerfile/builder.go#L297">here</a> through a function with a massive switch statement for the <a href="https://github.com/moby/moby/blob/5aaceefe5be751d55d0a4e9212ddba04408d1a1c/builder/dockerfile/evaluator.go#L67-L104">directive type</a> (e.g., RUN, ENV, etc). This was really easy to trace and understand the logic for - thank you to the Docker developers for that! 🙏</p> <p>What does this mean for development? What is neat is that you can write your own little scripts that interact with the client, which in turn will make calls to the daemon. It meant that I very easily could write my own <a href="https://github.com/researchapps/moby/blob/debug-interactive-builder/main.go">little script</a> that would take in custom inputs, run a build, and do other customizations that I wanted. 
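<p>As an aside, that per-directive dispatch pattern is easy to mimic. Here is a minimal sketch of my own (with hypothetical names, not moby's actual types) of how an evaluator can route each parsed Dockerfile instruction through a switch:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// instruction is a single parsed Dockerfile line (hypothetical,
// greatly simplified from moby's builder/dockerfile evaluator).
type instruction struct {
	Directive string
	Args      string
}

// dispatch mimics the big switch in evaluator.go: each directive
// type routes to its own handler, mutating the build state.
func dispatch(state map[string]string, ins instruction) error {
	switch strings.ToUpper(ins.Directive) {
	case "FROM":
		state["base"] = ins.Args
	case "ENV":
		parts := strings.SplitN(ins.Args, "=", 2)
		if len(parts) != 2 {
			return fmt.Errorf("malformed ENV: %q", ins.Args)
		}
		state["env:"+parts[0]] = parts[1]
	case "RUN":
		// the real dispatchRun creates a container, runs the
		// command, and commits the result as a new layer
		state["lastRun"] = ins.Args
	default:
		return fmt.Errorf("unknown directive %q", ins.Directive)
	}
	return nil
}

func main() {
	state := map[string]string{}
	lines := []instruction{
		{"FROM", "ubuntu"},
		{"ENV", "DEBIAN_FRONTEND=noninteractive"},
		{"RUN", "apt-get update"},
	}
	for _, ins := range lines {
		if err := dispatch(state, ins); err != nil {
			fmt.Println("build failed:", err)
			return
		}
	}
	fmt.Println(state["base"], state["env:DEBIAN_FRONTEND"], state["lastRun"])
}
```

<p>The real switch covers every directive (and full build state, not a map), but the shape is the same, which is what makes the code easy to trace.</p>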
The main logic to make the client and then issue the build request looks like this:</p> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">apiClient</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">NewClientWithOpts</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">FromEnv</span><span class="p">)</span> <span class="c">// check errors here</span> <span class="k">defer</span> <span class="n">apiClient</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span> <span class="c">// Create temporary directory for reader (context) to copy the Dockerfile to</span> <span class="n">tmp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">MkdirTemp</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="s">"docker-dinosaur-build"</span><span class="p">)</span> <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="n">log</span><span class="o">.</span><span class="n">Fatalf</span><span class="p">(</span><span class="s">"could not create temporary directory: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span> <span class="p">}</span> <span class="k">defer</span> <span class="n">os</span><span class="o">.</span><span class="n">RemoveAll</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span> <span class="n">copyFile</span><span class="p">(</span><span class="o">*</span><span class="n">dockerfile</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">tmp</span><span 
class="p">,</span> <span class="s">"Dockerfile"</span><span class="p">))</span> <span class="n">reader</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">archive</span><span class="o">.</span><span class="n">TarWithOptions</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">archive</span><span class="o">.</span><span class="n">TarOptions</span><span class="p">{})</span> <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="n">log</span><span class="o">.</span><span class="n">Fatalf</span><span class="p">(</span><span class="s">"could not create tar: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span> <span class="p">}</span> <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">apiClient</span><span class="o">.</span><span class="n">ImageBuild</span><span class="p">(</span> <span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">(),</span> <span class="n">reader</span><span class="p">,</span> <span class="n">types</span><span class="o">.</span><span class="n">ImageBuildOptions</span><span class="p">{</span> <span class="n">Remove</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span> <span class="n">ForceRemove</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span> <span class="n">Dockerfile</span><span class="o">:</span> <span class="s">"Dockerfile"</span><span class="p">,</span> <span class="n">Tags</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="o">*</span><span class="n">target</span><span class="p">},</span> <span class="n">Interactive</span><span class="o">:</span> <span 
class="o">*</span><span class="n">interactiveDebug</span><span class="p">,</span> <span class="p">},</span> <span class="p">)</span> </code></pre></div></div> <p>That is only a partial view of the entire script, so yes, the details are missing. The “InteractiveDebug” flag that is carried forward to the build is not part of traditional ImageBuildOptions, and is what I added for my little feature. To summarize (and you can look at the script to see further) I added this new argument that, when present, will allow the dispatchRun function to fail during the layer build, and still commit the layer. When it returns to the calling function, the error type is checked. Given an “InteractiveError” that I added, the build will break at that point (no further layers will be attempted), but because I’ve committed that particular layer, the build will finish, give me an image ID, and I can shell in and see that the offending line was partially run. And that’s it! Here is an example (immensely simple) Dockerfile:</p> <div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">FROM</span><span class="s"> ubuntu</span> <span class="k">RUN </span><span class="nb">touch</span> /opt/i-should-not-exist <span class="o">&amp;&amp;</span> <span class="nb">false</span> <span class="o">&amp;&amp;</span> not-a-command </code></pre></div></div> <p>Which will generate the file “/opt/i-should-not-exist” and then immediately issue false (returns 1) and another failed “not-a-command” (it won’t even get there). If you try to build this with regular Docker, it will fail and you won’t have an image id, let alone the layer. But when you use my little monster and add the “-i” flag for interactive? 
You get a different outcome!</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>This will fail <span class="go">./bin/docker-build -t fail -f Dockerfile.fail . </span><span class="gp">#</span><span class="w"> </span>But with interactive <span class="nt">-i</span>, it will work! <span class="go">./bin/docker-build -i -t works -f Dockerfile.fail . </span></code></pre></div></div> <p>Here is a direct shot of the little monster in action, showing the container complete, and the file exists when I shell inside, despite the line failing:</p> <div style="padding:20px"> <a target="_blank" href="/assets/images/posts/docker/little-monster.png"><img src="/assets/images/posts/docker/little-monster.png" /></a> </div> <p>This was fun to develop, because you can run the daemon in one terminal window (and see output) in your development environment, and follow your client in the other. Note that I commented out build kit logic for now. I will want to read that code a little more closely before I tweak it. If you’ve used build kit, you know those layers are being built in parallel, so the logic is slightly different.</p> <h2 id="an-actual-feature">An actual feature?</h2> <p>I don’t have plans at the moment to try to suggest this to be an actual feature - for the time being it was a fun experiment, and I invite others to <a href="https://github.com/researchapps/moby/pull/1">look at the PR</a> (to my own repository) if interested. That said, I do plan to test this in the future with more substantial builds, and I’m wondering why we don’t have something like this anyway? Granted, the build is no longer reproducible, but this kind of feature would help immensely with debugging. For all of the cases when you do cry because you lose hours of time when your build fails during a long layer build, this would be a nice feature to have. 
Likely you’d want to debug and get it working, and then rebuild without this flag. Let me know what you think! If there is interest we can minimally pursue adding to build kit and engaging with the moby developers about the idea.</p> Thu, 24 Oct 2024 10:00:00 +0000 https://vsoch.github.io/2024/interactive-docker/ https://vsoch.github.io/2024/interactive-docker/ Pulling Containers and the SOCI Snapshotter <p>How does one pull containers in Kubernetes, you ask? Well, if you don’t think much about it, the answer is pretty slowly. Of course it depends on the size of your containers. If you don’t want to read, this is what I found (and discuss here):</p> <ol class="custom-counter"> <li>Moving containers to a registry local to the cloud didn't have obvious impact</li> <li>Adding a local SSD improved pull times by 1.25x</li> <li>The SOCI snapshotter improved times by 15-120x (!)</li> </ol> <p>My biggest surprise was the SOCI snapshotter, which I expected to work well but not THAT well.</p> <blockquote> <p>Note that the huge variation likely has to do with the index of the archive, and the extent of what the entrypoint needs, which is retrieved on demand. The containers that had a 120x improvement in pull time weren’t real application containers - they were generated programmatically. The containers that saw a 15x improvement were spack images, and for a machine learning container I saw a 10x improvement. I still need to do more work to understand the details.</p> </blockquote> <p>Finally, I didn’t see that AWS had provided a means to install with a daemonset, which (imho) is a more flexible strategy than having to install to the AMI or node. I <a href="https://github.com/converged-computing/soci-installer">created a daemonset installer</a> this morning before going on a bike ride. 
🚲 The rest of this post will detail my brief exploration (and fun) of this space, starting with observations from a recent performance study, and finishing with the creation of a daemonset for SOCI.</p> <h2 id="an-example-from-the-wild">An example from the wild</h2> <p>For a recent performance study we had moderately sized containers (on the order of less than 10 GB) and the slowest ones took a few minutes. Here is a quick shot of that - and mind you this includes pulling partial containers, where we’ve already pulled some layers. That’s why you see some times close to 0.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/performance-study/raw/main/analysis/container-sizes/data/img/pull_times_experiment_type_aws_eks_cpu.png" /> </div> <p>This full data is <a href="https://github.com/converged-computing/performance-study/tree/main/analysis/container-sizes">available on GitHub</a> with a DOI, if you are interested <a href="https://zenodo.org/doi/10.5281/zenodo.13738495"><img src="https://zenodo.org/badge/837429553.svg" alt="DOI" /></a>. If you use our data, please cite it! I can also show you how similar these containers are. This (so far) is my favorite plot from the study. It’s so beautiful, and (if you know what you are looking at) says a lot too.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/performance-study/blob/main/analysis/container-sizes/data/similarity/cluster-container-layer-content-similarity.png?raw=true" /> </div> <p>I really like this plot because it shows (with quite neat separation) the clustered environments we used for the study. First, the plot shows similarity of containers based on layer content digests using the Jaccard coefficient, which is the set intersection of two containers over the union. Note that this image doesn’t show every label. So what are we looking at? 
👀</p> <ol class="custom-counter"> <li>Containers that aren't similar to any others (browns) in the diagonal are spack builds</li> <li>The top left green square is shared by containers for Google Cloud and Amazon (with libfabric) for GPU builds</li> <li>The next tiny square (note the image doesn't show every label) has Google GPU images</li> <li>The box toward the middle is CPU images, from both Google Cloud and AWS (with libfabric)</li> <li>The space between the next clusters is an amg2023 image built with spack</li> <li>The next little square (third from the bottom on the diagonal) is rockylinux (Compute Engine CPU)</li> <li>The last two squares are Azure, first for GPU then CPU</li> <li>The very last entries in the matrix are two more amg2023 images, unlike even each other.</li> </ol> <p>The containers (built by yours truly) were intentionally constructed to maximize redundancy of layers. This means a shared base (different depending on the environment) with added dependencies for Flux and oras, and only the application logic at the end. Now, if you build a spack environment into a container, you’ll get one big, chonky layer with the application and all dependencies. I did this for a recent experiment (a small set) just to compare, and the “spack builds” matrix was all browns.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/similarity/data/similarity/spack/cluster-container-similarity.png?raw=true" /> </div> <p>And that is exactly what I expected.</p> <h2 id="a-controlled-example">A controlled example</h2> <p>I had a week between taking a road trip and coming back where it was fairly quiet, and I decided to have some fun. I wanted to do an experiment where I could control the number of layers and range of image sizes, and then see how long they took to pull – first using no strategy aside from “whatever the cloud is doing” and then trying different (more established) ones. 
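<p>Before getting into the experiment: the similarity metric from the plots above is simple to compute yourself. Here is a minimal sketch of the Jaccard coefficient over sets of layer digests (the digest values are hypothetical placeholders):</p>

```go
package main

import "fmt"

// jaccard computes the similarity of two containers from their sets
// of layer content digests: |A ∩ B| / |A ∪ B|.
func jaccard(a, b []string) float64 {
	setA := map[string]bool{}
	for _, d := range a {
		setA[d] = true
	}
	union := map[string]bool{}
	for _, d := range b {
		union[d] = true
	}
	intersection := 0
	for d := range setA {
		if union[d] {
			intersection++
		}
		union[d] = true // fold A into the union as we go
	}
	if len(union) == 0 {
		return 0
	}
	return float64(intersection) / float64(len(union))
}

func main() {
	// two hypothetical images sharing a base layer and a flux layer,
	// differing only in the final application layer
	a := []string{"sha256:base", "sha256:flux", "sha256:lammps"}
	b := []string{"sha256:base", "sha256:flux", "sha256:amg2023"}
	fmt.Println(jaccard(a, b)) // 2 shared of 4 total = 0.5
}
```

<p>Identical layer sets score 1, disjoint sets score 0, which is why the shared-base images cluster into green squares and the one-chonky-layer spack builds stay brown.</p>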
I created a tool, the <a href="https://github.com/converged-computing/container-crafter">container-crafter</a> in Go that would take a parameter study file, and then sploot out the set of builds, where every layer in each build was guaranteed to be unique. I chose a range of image sizes based on percentiles derived from parsing 77K Dockerfiles from the scientific community, provided by the <a href="https://rseng.github.io/web">Research Software Database</a>. I wrote this into a full paper, and also did a huge analysis of the larger ecosystem, but I’ll share a few of the fun plots that are specific to the pulling parts.</p> <h3 id="does-number-of-layers-matter">Does number of layers matter?</h3> <p>I wasn’t sure what I’d find here! It’s definitely the case that if you try to use layers close to (or over) the registry limit of 10GB, you can get ImagePullBackoff and then retry. I made sure not to go over that limit (and didn’t see any of these events in my data). But what did I see? I saw that the number of layers doesn’t seem to matter at all.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/test/img/pull_times_test_duration_by_size_n1-standard-64.png?raw=true" /> </div> <p>What does matter (in that plot) is the total size. The largest size there (19GB) took about 2 minutes to pull. The variation looks random. The other ones were so tiny they were insignificant, from a pulling standpoint. But I couldn’t be absolutely sure that it never mattered, so I chose to stick with the median number of layers (9) and the max (which is actually 125, enforced by Docker).</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/raw/main/experiments/pulling/analysis/data/run1/img/pull_times_test_duration_by_size_run1.png" /> </div> <p>When I looked at that much larger range of sizes I started to see the curve that I expected. 
This (with added sizes along the slope) would be the set of sizes for my experiment.</p> <h3 id="how-does-pull-time-scale-with-size">How does pull time scale with size?</h3> <p>Logically, the first thing we want to look at is how pull time varies with size. And we see what we expect - that time increases as the images get larger. It’s hard to see with these plots (and often the plots aren’t super great with showing the quartiles) but it does appear superficially that having fewer layers leads to a larger variance in pull time. Here it is for 125 layers:</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/run1/img/pull_times_duration_by_size_run1_125_layers.png?raw=true" /> </div> <p>And 9 layers:</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/run1/img/pull_times_duration_by_size_run1_9_layers.png?raw=true" /> </div> <p>Eyeballing the means, the 125-layer builds are maybe 10 seconds slower? Could that be the time needed for extraction? I didn’t dwell on this, because the reality is that people are going to build images with the number of layers that they need, and not artificially try to put content into more. People are not building 125 layer images. Thus, moving on, I chose to use the median from the dataset, 9 layers.</p> <h3 id="how-does-using-a-local-registry-influence-pull-time">How does using a local registry influence pull time?</h3> <p>I’ve been told a few times that moving the containers to be “closer” to the cloud can make a difference. In this case, that would mean a registry hosted by the same cloud provider, and in the same region. Sure, worth a try! 
But guess what - it made no difference.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/run1/img/pull_times_duration_by_size_run1_125_layers.png?raw=true" /> </div> <h3 id="how-does-using-a-local-ssd-influence-pull-time">How does using a local SSD influence pull time?</h3> <p>The filesystem has a huge impact on pulling. After all, you are writing and extracting, so having good IOPS must be a variable. And indeed it was! There is a quota for the quantity of SSD per instance family, so I could only go up to a size 64 cluster, but I did see pull times go down a bit.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/run3/img/pull_times_duration_by_size_run3_9_layers.png?raw=true" /> </div> <p>We can see that adding a local SSD improves pull times by 1.25x. You can see more images (e.g., log times) in the <a href="https://github.com/converged-computing/container-chonks/tree/main/experiments/pulling">repository</a>. If you want a simple solution, this storage is pretty cheap, so it’s probably worth it. You will need to ask for more quota for larger clusters, however.</p> <h3 id="big-daddy-soci-snapshotter">Big daddy SOCI snapshotter!</h3> <p>I was first exposed to this idea of “image streaming” through a <a href="https://cloud.google.com/blog/products/containers-kubernetes/introducing-container-image-streaming-in-gke">flag provided by Google</a> and I have to give it to Google, they continue to be a leader in usability. I had not yet learned about the requirements for what I suspect, under the hood, is the SOCI snapshotter (or more likely an optimized derivative), but they made it work with GKE and a flag, and I just needed my images in their artifact registry. I had already tagged and pushed them there.
Dear lord, I was shook.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/run4/img/pull_times_duration_by_size_run4_9_layers.png?raw=true" /> </div> <p>I didn’t even believe the data I was seeing. Containers that had taken minutes before were now pulling in around a second. Since I was very skeptical, I gave image streaming a challenge. I built spack images (meaning ONE huge layer with all application logic and dependencies) and I would <a href="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/run-streaming-experiment.py">run an experiment</a> that would require running the applications and seeing the output. With these applications (albeit smaller containers, but with one main layer) I still saw a 15x improvement in pull times. To be more specific, this is the event recorded by the Kubelet, and it’s the time when you see the container go from creating to running. I used <a href="https://github.com/resmoio/kubernetes-event-exporter">this event exporter</a> to collect my data.</p> <div style="padding:20px"> <img src="https://github.com/converged-computing/container-chonks/blob/main/experiments/pulling/analysis/data/streaming/img/pull_times_duration_by_nodes.png?raw=true" /> </div> <p>The above is running LAMMPS, amg2023, the OSU Benchmark All Reduce, and Minife. And yes, <a href="https://github.com/converged-computing/container-chonks/tree/main/experiments/pulling/metadata/streaming">all of the output is present</a>. These experiments (unlike the first pulling experiments that used a Job) use the <a href="https://github.com/flux-framework/flux-operator">Flux Operator</a>. After observing this (on Google Cloud) I had to dig in and at least guess what was going on under the hood.
This is when I found SOCI.</p> <h2 id="the-soci-snapshotter">The SOCI Snapshotter</h2> <p>The SOCI “Seekable OCI” Snapshotter is (I think) a beautiful design that combines the work of the reference types working group in OCI (which I participated in a few years ago) with an ability to index a compressed archive. I think it started as a fork off of <a href="https://github.com/containerd/stargz-snapshotter">the stargz snapshotter</a>, which is a great (similar) tool that requires an <a href="https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md">eStargz format</a> (also pushed to a registry). But here is the insight that maybe stargz missed. People largely don’t want to do too much extra work. Using stargz, as I understand it, requires another build step. From the <a href="https://github.com/awslabs/soci-snapshotter?tab=readme-ov-file#project-origin">project README</a> it sounds like AWS did a fork (that they kept) after substantial changes that likely would have been hard to accept upstream. This isn’t new news – SOCI was announced a <a href="https://aws.amazon.com/about-aws/whats-new/2022/09/introducing-seekable-oci-lazy-loading-container-images/">few years ago</a> – it’s mostly that I’m (relatively speaking) a newer developer when it comes to Kubernetes (since the end of 2022), which is why I didn’t try it until now. Side note - why haven’t you made us an easy-to-deploy flag still, AWS? The insight, found in a <a href="https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter">paper from 2016</a>, is that:</p> <blockquote> <p>Waiting for all of the data is wasteful in cases when only a small amount of data is needed for startup.
Prior research has shown that the container image downloads account for 76% of container startup time, but on average only 6.4% of the data is needed for the container to start doing useful work.</p> </blockquote> <p>I think it’s funny that industry isn’t super paper focused, but somehow I’ve seen this paper referenced in a gazillion places to justify this (and similar) work. And thus it makes sense that while you are waiting for the rest of the container to pull, you might as well make progress with running things! Especially when GPUs are involved, these cloud clusters get expensive very fast. Waited for too many containers? There goes your retirement! Just kidding. But maybe not, depending on how big of a mistake it is… I digress!</p> <p>Without going into details, containerd runs on all the kubelet nodes to handle pulling of containers. SOCI itself is a <a href="https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#proxy-plugins">containerd plugin</a> called a snapshotter, which means that it handles creating a directory to unpack layers for an image. There is a <a href="https://dev.to/napicella/what-is-a-containerd-snapshotters-3eo2">really nice article</a> I found that illustrates this. So when you use SOCI, you create an artifact called a SOCI index that has your expected manifest, and then a set of “zTOCs” that are akin to a table of contents for the index manifest. Concretely speaking, this would be like saying “The binary is located at XX offset in the archive, and has this span (size).” There is a nice <a href="https://github.com/awslabs/soci-snapshotter/blob/main/docs/glossary.md">glossary of terms</a> if you want more details. 
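To make the zTOC idea concrete, here is a toy sketch in Python. This is not SOCI’s actual format (real zTOCs also deal with compressed gzip spans and checksums – the paths, offsets, and bytes below are invented): it just shows the offset/span lookup, and how a table of contents lets you serve one file from a layer blob on demand without touching the rest of the archive.

```python
import io

# Toy "zTOC": maps a file path to (offset, span) within a layer blob.
# SOCI stores these as artifacts associated with the image, so the
# snapshotter can fetch byte ranges from the registry lazily.
ztoc = {
    "bin/app":    (0, 6),   # offset 0, span 6 bytes
    "etc/config": (6, 5),   # offset 6, span 5 bytes
}

# Pretend layer archive: 6 bytes of "binary" then 5 bytes of "config"
layer_blob = io.BytesIO(b"\x7fELF..cfg=1")

def lazy_read(path):
    """Fetch only the byte range for one file, on demand."""
    offset, span = ztoc[path]
    layer_blob.seek(offset)
    return layer_blob.read(span)

assert lazy_read("bin/app") == b"\x7fELF.."
assert lazy_read("etc/config") == b"cfg=1"
```

In the real system the `seek`/`read` becomes an HTTP range request against the registry, which is what makes startup possible before the full image has arrived.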
Now, it won’t create these indices for ALL layers - just ones above a <a href="https://github.com/awslabs/soci-snapshotter/blob/212fe220f061413eb9f1a86556057128b25f4cab/soci/soci_index.go#L61">certain size</a> (10MiB).</p> <p>My (naive) understanding of these remote snapshotters is that instead of extracting all layers to a directory and then allowing the image to start, they mount (and lazily fetch) the image contents instead. This is why we need to have FUSE installed, and why the zTOCs need to be available as artifacts associated with the image via the referrers API! A socket path is added to the containerd config, and containerd uses that socket (along with the index for the archive manifests) to fetch additional content from a registry on demand. Here is a much more nicely articulated summary:</p> <blockquote> <p>One approach for addressing this is to eliminate the need to download the entire image before launching the container, and to instead lazily load data on demand, and also prefetch data in the background.</p> </blockquote> <p>I am only a few days into learning about SOCI and need to do my <a href="https://github.com/awslabs/soci-snapshotter/tree/main">codebase reading</a> to get a better understanding, so that’s the explanation I can give for now. I’m also interested in cases for which this works really well, and cases for which it does not. For example, what about shared libraries? I’ll need to do more experiments to see when SOCI isn’t as good, or even when it breaks. My mind is also already spinning in happy loops discovering that these plugins exist, period, and dreaming up what I might create.</p> <h2 id="a-daemonset">A daemonset</h2> <p>I’ll briefly review my strategy for creating the daemonset. I knew that I wanted to start by using nsenter to enter process 1 (init), which would also mean I’d leave the pod and be present on the node, which is where the kubelet and associated tooling are installed.
If you look in my <a href="https://github.com/converged-computing/soci-installer/blob/main/daemonset-installer.yaml">daemonset</a> you’ll also see there is a shared mount with the host, and that is there so I can copy files from the pod container onto the host. The main <a href="https://github.com/converged-computing/soci-installer/blob/main/docker/entrypoint.sh">entrypoint</a> for that pod is primarily interested in doing that copy, and running the script to install SOCI with nsenter. The main <a href="https://github.com/converged-computing/soci-installer/blob/main/docker/install-soci.sh">install script</a> can then install dependencies (toml for parsing the config.toml in Python, the aws credential helper, SOCI itself, and fuse). I have <a href="https://github.com/converged-computing/soci-installer/blob/main/docker/write_config.py">one script</a> to edit the containerd configuration file (to configure SOCI as a proxy plugin) and then (importantly) I authenticate with the registry that is serving the node pause image. This was a bit of a catch-22, because we needed to restart containerd, but in doing so, we would kill our pod. Then if we wanted to pull the pause image, we couldn’t because we didn’t have the pause image! You can see that <a href="https://github.com/awslabs/soci-snapshotter/blob/212fe220f061413eb9f1a86556057128b25f4cab/docs/kubernetes.md#limitations">AWS lists it as a limitation here</a>. Maybe my approach is dumb, but I decided to just pull it? 
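For reference, the stanza that the config-editing script needs to land in containerd’s config.toml to register SOCI as a proxy plugin looks roughly like this (the socket address is the default from the SOCI documentation – verify it against your install):

```toml
[proxy_plugins]
  [proxy_plugins.soci]
    type = "snapshotter"
    address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
```

Containerd then talks to the snapshotter over that socket, which is why the restart (and the pause-image dance that follows) is necessary.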
I retrieve the URI for the pause container (and region and registry) from the containerd configuration, authenticate with nerdctl, and then allow the script to exit with a non-zero code and restart, and on restart I pull it with nerdctl.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo</span> /usr/local/bin/nerdctl pull <span class="nt">--snapshotter</span> soci <span class="k">${</span><span class="nv">sandbox_image</span><span class="k">}</span> </code></pre></div></div> <p>It felt like a dumb approach, but it worked! The example provided by AWS goes from taking 71 seconds to pull down to 7, which is a ~10x improvement. That’s pretty good. 😊</p> <h2 id="summary">Summary</h2> <p>That’s all I have for today - I already took a bike ride but I want to go outside again soon. Some caveats! I made this in a morning. It could be hugely improved, and I welcome you to open issues for discussion or pull requests. The daemonset is currently oriented for Amazon EKS (GKE has a flag that works out of the box, and I’m not sure about Azure; I don’t have an account with credits there now), and I haven’t tested “all the authentication ways” but they seem fairly straightforward if someone wants to contribute or make a request. Also note that I already see design details that I would change, but I’m content for now. I also have not tested this at any kind of scale, mostly because we need to ask AWS permission to use credits, and I only had blessing for this small development.</p> <p>And that’s all folks! 🥕 I’ll very much be doing more exploration of this space in the future, and all of the above somehow materialized in a 9-page paper. I feel like I’m pretending to be an academic sometimes because I’m much more suited to building things, and that is how I want to have impact. But I figure while I sit where I sit, I can just rock both.
😎</p> Fri, 04 Oct 2024 10:00:00 +0000 https://vsoch.github.io/2024/container-pulling/ https://vsoch.github.io/2024/container-pulling/