Inspiration

We recently read about Cognition’s latest models, SWE-grep and SWE-grep-mini (https://cognition.ai/blog/swe-grep), and were really excited by the sheer throughput the team achieved with these fine-tuned LLMs (over 2000 tps!). It made us envision a future where inference is so fast that agentic coding is no longer bottlenecked by LLM throughput, but rather by the system and network latency overhead inherent in the current system architecture.

We determined that the modern agentic coding paradigm, where users’ codebases live locally on a client and LLM inference is served remotely on a server, concedes vast amounts of performance to the network latency overhead incurred by their frequent back-and-forth communication.

In an ideal world, every user could own their own H100s and host coding models locally. However, this is impractical and uneconomical. What we need instead is a fast, scalable framework that keeps models on servers while letting users experience most of the performance benefits of having datacenter-class GPUs locally.

We identified three key insights that inform the design of our system:

1) To eliminate network latency overhead, we must co-locate both the codebases and our compute

2) Having only a single user on a remote server is a poor utilization of compute resources. An ideal system should, like vLLM, effortlessly multiplex compute resources (memory, GPUs, etc.) across new users.

3) As the number of coding agents per machine increases, the number of tool calls made will also increase. Rather than spinning up a new process for each call, a more scalable solution is a persistent background service that receives requests from client processes and handles the functionality of common commands (grep, ls, etc.) without needing new processes.

For those familiar with operating systems concepts, making a tool call (i.e., creating a new grep, ls, or find process) is akin to making a syscall: it forces the currently running process to yield, causing a context switch to a new process, flushing the TLB, and losing a lot of performance.

Addressing #1: IPC. We perform interprocess communication (IPC) between our (potentially many) qwen-code-ipc clients and our mem_search_service in order to eliminate the process-creation overhead associated with each shell command an agent makes. We created a custom daemon (background process) that runs on the server. The service uses a custom library we developed of pseudo-shell commands that operate directly on memory-mapped files. For the purposes of the hackathon, we implemented ripgrep, one of the more time-consuming and frequently used shell commands.
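The request path can be sketched with a minimal stdlib stand-in. The message fields (`cmd`, `pattern`, `hits`) and the in-memory corpus are illustrative assumptions, not the daemon's actual wire format, and the real daemon is written in Rust; the point is just that the client talks to a long-lived service over a Unix socket instead of spawning a grep process:

```python
import json
import os
import socket
import tempfile
import threading

SOCK = os.path.join(tempfile.mkdtemp(), "mem_search.sock")

# Hypothetical in-memory corpus standing in for the daemon's memory-mapped repos.
CORPUS = {"src/main.rs": 'fn main() { println!("hello"); }\n'}

ready = threading.Event()

def serve_once():
    # Daemon side: accept one JSON request and answer it in-process --
    # no grep/ls process is ever spawned.
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK)
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    req = json.loads(conn.makefile().readline())
    hits = [path for path, text in CORPUS.items() if req["pattern"] in text]
    conn.sendall((json.dumps({"hits": hits}) + "\n").encode())
    conn.close()
    srv.close()

t = threading.Thread(target=serve_once)
t.start()
ready.wait()

# Client side: what a patched CLI tool layer would do instead of exec'ing grep.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK)
cli.sendall((json.dumps({"cmd": "ripgrep", "pattern": "println"}) + "\n").encode())
resp = json.loads(cli.makefile().readline())
cli.close()
t.join()
print(resp["hits"])
```

Because the service is persistent, the per-request cost is one socket round trip rather than a fork/exec, TLB flush, and page-cache warm-up.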

What it does

Curserve is essentially a high-performance serving engine that enables hundreds of users to run coding agents simultaneously with near-zero latency for file operations. Instead of the traditional architecture, where code lives on laptops and LLMs run remotely (causing a network round trip for every grep/ls/cat command), Curserve co-locates everything: codebases, LLM inference, and search operations all live on the same server. Users SSH in and run a single command to start an AI coding session. Behind the scenes, a memory-mapped search service keeps all active codebases in RAM, allowing instant file operations without spawning shell processes. The system transparently intercepts the coding agent's filesystem calls and routes them through IPC to this fast in-memory service.

How we built it

mem-search-service (Rust daemon)

  • Uses ripgrep's core libraries (grep-searcher, grep-regex) for proven search performance
  • memmap2 for zero-copy memory-mapped file access
  • Rayon for parallel search across CPU cores
  • notify crate for real-time file change detection and auto-reload
  • Simple API: alloc_pid(), ripgrep(), close()
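As a rough Python analogue of that API surface (the real service is Rust built on ripgrep's engine; the class and method bodies here are an illustrative stdlib sketch using `mmap` + `re`), the per-client lifecycle looks like:

```python
import mmap
import os
import re
import tempfile

class MemSearch:
    """Toy analogue of the service's alloc_pid()/ripgrep()/close() lifecycle."""

    def __init__(self):
        self.maps = {}  # client pid -> list of (path, mmap)

    def alloc_pid(self, pid, root):
        # Map every file under `root` once; later searches reuse the mappings.
        mapped = []
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                f = open(path, "rb")
                if os.fstat(f.fileno()).st_size == 0:
                    f.close()  # mmap of an empty file raises ValueError
                    continue
                mapped.append((path, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)))
                f.close()  # the mapping stays valid after the fd is closed (POSIX)
        self.maps[pid] = mapped

    def ripgrep(self, pid, pattern):
        # The regex engine scans the mapped pages directly; no read() copies.
        rx = re.compile(pattern.encode())
        return [path for path, mm in self.maps[pid] if rx.search(mm)]

    def close(self, pid):
        for _, mm in self.maps.pop(pid):
            mm.close()

root = tempfile.mkdtemp()
with open(os.path.join(root, "lib.rs"), "w") as f:
    f.write("pub fn add(a: i32, b: i32) -> i32 { a + b }\n")

svc = MemSearch()
svc.alloc_pid(42, root)
hits = svc.ripgrep(42, r"fn add")
print(hits)
svc.close(42)
```

The design point is the amortization: `alloc_pid()` pays the mapping cost once per codebase, and every subsequent `ripgrep()` call touches already-resident pages.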

IPC Layer

  • Unix domain sockets for sub-millisecond communication (~0.1ms vs network latency)
  • JSON protocol for simplicity and debuggability
  • Shared request socket + per-client response sockets
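A sketch of what one request on this protocol might look like. The field names (`client_id`, `reply_to`) and paths are assumptions for illustration, not the daemon's actual wire format; the shape reflects the shared-request-socket / per-client-response-socket split described above:

```python
import json

# Newline-delimited JSON on the shared request socket; the daemon writes
# the reply to the per-client response socket named in "reply_to".
request = {
    "cmd": "ripgrep",
    "client_id": 7,
    "reply_to": "/tmp/curserve/resp-7.sock",  # hypothetical per-client socket
    "pattern": "TODO",
    "repo": "/srv/repos/alice/myproject",     # hypothetical repo path
}
wire = json.dumps(request) + "\n"  # one request per line

# Hypothetical reply shape, tagged with client_id for demultiplexing.
response = {"client_id": 7, "hits": [{"path": "src/app.py", "line": 12}]}

# Plain-text JSON round-trips losslessly, which is what makes it debuggable.
assert json.loads(wire) == request
```

Trading a few microseconds of JSON parsing for human-readable traffic was an easy call at hackathon scale; a binary framing could be swapped in later without changing the socket topology.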

Modified Qwen-Code-CLI

  • Forked Qwen-Code-CLI and modified its tool layer to route calls to our daemon instead of spawning subprocesses
  • Integration is straightforward for any Python-based agent, and could be done with minimal effort for other tools like Gemini CLI
vLLM + Infrastructure

  • Deployed Qwen2.5-Coder-32B-Instruct via vLLM for LLM serving

  • Rented an A100 GPU from vast.ai (~$0.63/hr)
  • Co-located vLLM, mem-search-service, and user codebases on the same server, eliminating network latency entirely
  • Users SSH in and run their own Qwen-Code-CLI instances, all sharing the same daemon
  • Multi-tenant: one daemon serves many concurrent users

Challenges we ran into

We had countless challenges trying to wrangle designing and implementing this system, but here are some notable ones:

  • We had to fit the best open source coding agent we could into an affordable GPU. After settling on renting an A100, we found we could fit the 32B Qwen coder model in the GPU’s 80 GB of VRAM. However, we failed to anticipate the model’s context-window limits and had to settle for testing on medium-sized codebases within the model’s 64k tokens of context (and even then it was flaky, with the system crashing several times). Future iterations could serve better GPUs and models.
  • We had to reverse engineer Qwen Code (itself based on gemini-cli) in order to replace grep tool calls with calls to our specialized search process, and to redirect calls from its default Chinese-hosted API to our own server. We faced numerous problems with syncing and hangs, but worked through them with the help of Cursor.
  • Memory-mapped file invalidation: originally, we expected code modifications to be reflected in the memory-mapped file because we held a pointer to the shared memory. However, VSCode and other editors don’t update files in place; they write a new copy and rename it over the original, leaving our old file descriptor and memory mapping invalid and stale. We implemented a watcher that listens for filesystem changes and remaps a file whenever it is written to.
  • We initially prototyped the in-memory search in Python with mmap + regex, but for sparse queries on large repositories our implementation underperformed ripgrep. To match ripgrep's speed, we rewrote the service in Rust on top of ripgrep's internals, running its engine against in-memory files rather than going through the filesystem API, which carries unneeded overhead.
  • The wifi at the event was terrible, so we went around SF to various public libraries and cafes just to stay connected.
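The stale-mapping failure mode from the third bullet above can be reproduced in a few lines of stdlib Python: an atomic save (write a temp file, then rename it over the original) leaves the path pointing at a new inode, so an existing mapping keeps showing the old bytes until the path is reopened and remapped. This is a sketch of the symptom and the fix, not our Rust watcher itself:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "main.py")
with open(path, "w") as f:
    f.write("old contents\n")

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Editors like VSCode save atomically: write a temp file, then rename it
# over the original. The path now names a new inode, but our mapping is
# still backed by the old one.
tmp = path + ".tmp"
with open(tmp, "w") as g:
    g.write("new contents\n")
os.replace(tmp, path)

stale = bytes(mm)  # still the pre-save bytes
mm.close()
f.close()

# Fix: on a change notification (our Rust service uses the notify crate),
# reopen the path and remap it.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
fresh = bytes(mm)
mm.close()
f.close()
print(stale, fresh)
```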

Accomplishments that we're proud of

  • Our in-memory ripgrep is up to 5-30x faster than stock ripgrep (which must spin up a new process per query). The speedup varies widely with repository size and the specific query, but across all of our experiments it outperformed the original approach. The gap widens further for rollout generations with many tool calls, because the repository is memory-mapped only once and every subsequent search gets fast access.
  • We’re proud of building a system that works reliably even when the user has spotty network access (like at this hackathon). Because both the code and the LLM live in the cloud, once the user prompts the agent it has no dependencies on the user’s computer. Unlike Cursor, which needs to run commands locally, our system reduces points of failure and the security risks of running code on users’ devices.
  • We designed a flexible overall architecture that lets us minimally modify any open source coding agent framework (like Qwen-Code CLI or Gemini CLI) to work with our server-side framework.

What we learned

We learned a lot from this project. Here are a few mentions:

  • We learned how to serve open source coding models like Qwen with modern tools like vLLM (a popular LLM serving framework) and Vast.ai (a GPU cloud aggregator for spot instances).
  • We learned how to make the model deterministic for profiling, by setting temperature to 0 and disabling top-k sampling.
  • We learned how to queue tool-use tasks from hundreds of users on the server to a single specialized code-searching process, and how to use sockets for asynchronous interprocess communication.
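For the determinism point, this is roughly the request body one would send to vLLM's OpenAI-compatible endpoint; the prompt and seed are illustrative, and in vLLM's convention `top_k = -1` disables top-k sampling:

```python
# Hedged sketch of deterministic sampling settings for profiling runs.
payload = {
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [{"role": "user", "content": "grep for TODO in this repo"}],
    "temperature": 0,  # greedy decoding: always pick the argmax token
    "top_k": -1,       # vLLM convention: -1 considers all tokens (top-k off)
    "seed": 0,         # pin any remaining sampling randomness
}
```

With greedy decoding, repeated runs produce identical token streams, so timing differences between runs can be attributed to the serving stack rather than the rollout.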

What's next for Curserve

Our next step is system optimization:

  • Speed up file reading with copyless reads: the client asks the service to mmap a file (pinned in physical memory) into the client’s virtual address space, rather than copying the file contents to the client.
  • Implement codebase paging and eviction policies when RAM limits are reached, and add speculative codebase prefetching once RAM is full.
  • Optimize memory layout for common search patterns.
  • Support distributed codebases across multiple servers, with copy-on-write when a server is dedicated to many users working on the same repo (e.g., the PyTorch team at Meta could use this for fast coding on PyTorch, with separate copies existing only where git diffs diverge).
  • Track how “hot” specific files are: if a file is frequently written to, we don’t reconstruct the suffix tree; we just do a fast Boyer-Moore search.
