SomeOddCodeGuy's Ramblings

If You Have the Hardware- Use it to Learn!

SomeOddCodeGuy — Tue, 03 Mar 2026 03:51:17 GMT

If you've never messed with open source LLMs and you jumped on the ClawdBot/OpenClaw hype train: take some time to learn more about how local models work. You likely went through the trouble of getting a Mac Mini, so you now have a nice little test box to play with. Just do it. Turn off Clawdbot/OpenClaw, and make OTHER things with it. Just for a few hours, even.

For the vast majority of folks using AI to Vibe code, make agents, etc- right now they are the equivalent of people building websites using the heaviest no-code/low-code solutions, or just slapping ALL the biggest libraries in, without a care in the world for performance. You're probably wasting a ton of efficiency in your current setups because you don't understand how a lot of it works under the hood. You don't understand samplers well, or what tokenization is doing. You may not have a good feel for what small and weak models can really do, or what you absolutely have to have large models for (When I say small models- Im talking models that make Claude Sonnet 3.7 look like a genius).

Whatever efficiencies you're aiming for are probably a drop in the bucket compared to what you could be doing if you really had a feel for all that. And the only thing holding you back from that knowledge is just taking the time to learn it.

The easiest way to learn this stuff is doing. You have the hardware now, so why not? Forget the little hype-bot that LinkedIn convinced you to install. Set it aside and use that Mac Mini to learn how LLMs work at a deeper level by trying to wrangle local models to do complex work.

THAT will be worth its weight in gold.

Also, don't cheat yourself. Yes, the local ecosystem is easier now. 10 minutes + an LM Studio install and tada: all done! But what did you really learn? No no; I'm saying to do it the long way around. Grab Open WebUI. Grab llama.cpp. Get em hooked up together. Use a little model like one of the new Qwen3.5 8b models. Get the responses to be actually good; try to find ways to make the model stop repeating itself. Things like that.

Next: write a small agent. Do it with that crappy little 8b or less model, and try to get something of value out of it.

This is all possible to do, but I promise it'll be harder than accomplishing the same thing with some 2026 proprietary API model. And that's the point.

Once you've done all that, you'll later go back and revisit what you think right now is great work with LLMs, and suddenly have the same realization every developer does when they go back to their old code: "Wow, I can do a lot better than this now."

Much like developers first learning to code, and thinking that just writing 500x "if statements" is good enough- you're only just now scratching the surface of how you should properly use LLMs. Now you need to start learning the more complex stuff. Don't settle for the novice approaches you're doing so far. There's SO MUCH MORE out there.

And who knows- you may just find that local models are fun enough to be worth obsessing over a bit ;)

An Analogy to Help Understand Mixture of Experts

SomeOddCodeGuy — Thu, 26 Feb 2026 03:04:38 GMT

If you're having a hard time understanding MoE strength vs dense models, and roughly where they might land when comparing them, think about this super oversimplified analogy. I'm hoping it makes sense:

The Scenario

Imagine a paid trivia competition, but all the questions are about carpentry regulations: you're given a piece of paper, you fill out the paper and then hand it in.

There are two "teams" competing with each other, except one team just has a single dude on it. Both teams need a place to sit in the building while the competition is going on.

Team 1 (10b Dense Model)

Team 1 is just some fairly experienced carpenter with 10 years of experience. He gets the paper, works through every question himself, and turns it in.

He really likes his personal space, so he reserved 10 seats all to himself. (Bear with me...)

Total experience on the team: 10 years
Experience applied to each question: 10 years
Total Seats Needed: 10 seats

Team 2 (40b a10b MoE Model)

Team 2 is a large crew of 40 first-year apprentices. None of them know the full trade; each one has only learned a few specific things about carpentry during their year.

Each question has multiple parts to it, and for each part, 10 of the apprentices are picked based on whoever among them has the most relevant knowledge to that specific part. Once a part is answered, those ten return to the group, and the process repeats for the next part. By the time a single question is fully answered, dozens of different apprentices may have contributed.

When answering, each set of ten apprentices that get called up aren't huddling up and collaborating; they each independently write their own answer to the question part on a small piece of paper, and then all of those answers get blended together to create one combined response. The final answer written on the trivia paper for that part of the question will be a mix of what they all came up with.

Once all of the questions have been answered in this fashion, they turn it in.

Total experience on the team: 40 years
Experience applied to each question: 10 years (10 apprentices x 1 year each)
Total Seats Needed: 40 seats

Comparing the Teams

Now, technically you could say that each team is applying the same number of years of experience to each question, even though the way the teams are structured is totally different. For each question, they are bringing an aggregate total of 10 years of experience.

But beyond that: Team 2's combined aggregate knowledge and experience of 40 years is much larger.

Team 2's setup is so powerful because even though their team is full of apprentices who each only know a slice of the trade, they are hand-picking the best ten people for each question part. Depending on what all the different apprentices studied, you could end up with Team 2's total knowledge including information Team 1's carpenter doesn't know; and they may reason through things that the carpenter struggled with alone.

The downside to team 2's setup is that they need 40 seats, while Team 1 only needs 10 seats. Team 2 takes up a LOT more space than Team 1.

Socg's note: The seats are memory. In case you missed that lol. I couldn't figure out a better way to shoehorn that into the analogy.

Team 3 (40b Dense Model)

Now, imagine if there was a third team with a master carpenter that had 40 years of experience; the same number of years of experience as all of Team 2 combined. And he absolutely loves his space, so he also got 40 seats. But its 1 really really experienced and smart carpenter doing all the work.

Even though team 2 has a combined total of 40 years of experience, and the master carpenter has 40 years, and even though both teams required 40 seats: the quality difference is going to be significant. The master carpenter will likely have 'seen it all' and experienced it, too, while the apprentices are only ever applying 10 aggregate years of apprentices at a time.

This means that not only is that master carpenter likely going to make better use of their overall knowledge, but they will understand the questions much better and be able to really comprehend what is being asked at a level the apprentices likely won't.

Total experience on the team: 40 years
Experience applied to each question: 40 years
Total Seats Needed: 40 seats

The Takeaway

When comparing models, it's pretty safe to say:

All things being equal, an MoE will likely outperform a model that has the same number of parameters as the active parameters. So a 30b a3b MoE (30b model, but only 3b active) will beat out a 3b dense model.
All things being equal, an MoE will likely have worse overall comprehension than a similar size dense model of the same size as its total parameters. Even if their knowledge might be similar, the dense model will simply "get" things better than the MoE. For example, a 120b a5b MoE will likely misunderstand statements far more often than a 120b dense model, which will "read between the lines" on what you want far better and understand inferred speech better.

Anyhow, that's majorly over-simplified, but hopefully it helps paint a better picture.

I Won't Miss The Cold...

SomeOddCodeGuy — Mon, 02 Feb 2026 23:36:00 GMT

This has nothing to do with technology, but just so you know- I'm a tropical beastie, and I absolutely will not miss the 22 degree weather this pass weekend. I am no longer built for this.

That is all.

My Personal Guide for Developing Software with AI Assistance - 2026 Edition

SomeOddCodeGuy — Sun, 01 Feb 2026 19:31:00 GMT

What's Changed Since 2024

So back in May of 2024 I wrote the first version of this little guide, at a time when agents were absolute crap and Wilmer was still in a state that couldn't even be called v0.01. Back then it got a fair bit of interest in various forums since there simply weren't a lot of resources like that.

Some variation of that workflow is what I used for years. Until September 2025, to be exact.

Enter Claude Code. I decided to give it a try because my old workflow simply took too much energy for what I could handle after work, and I was really starting to hear good things about these agents.

It's safe to say: Claude Code has won me over.

How I Code Today

I don't "Vibe Code" in the sense that most people think of it. The concept of Vibe Coding is essentially handing all of the labor to the agent; your job is to describe the product specification, to give some general constraints and guidance, but otherwise you let the LLM do what it does best.

Unfortunately, so far that has had some pretty catastrophic results for many companies.

Instead, I find that it's better to treat Claude Code as if it were a junior developer. I handle all the architecting, planning and design up front. Designing every single bit of the app, end to end. Researching deeply every tool, library, design pattern, crafting the code quality gates and all the rules it has to follow, specifying all of the naming conventions to adhere to, etc

This can take days. Maybe even weeks, depending on the scope of the project. But after that? I can let Claude Code just run wild. There's no more room for it to mess up; no creative expression available.

Understanding The Process

First, coding with an agent is completely different than doing standard development. You are, essentially, acting as a team lead for a robot that is about equivalent to a competent junior dev (who can type faster than you can think lol). It is your job to direct that dev in such a way that they build something amazing.

I've built several personal apps this way, and honestly it's so much fun.

Step 1: Architect and Deep Research Everything

Claude, Gemini, ChatGPT, etc all suffer from the same two core problems when it comes to Software Architecture:

hallucinations
outdated information that they don't realize is outdated, because LLMs struggle with a sense of time.

This is where your personal knowledge, and Deep Research, come in.

First you want to lay out what you expect. What tools and languages will you be building this in? What design patterns do you want it to use? What folder structure? What constraints do you have?

Once you start to come to a consensus with the AI on these things, stop everything and ask it for a series of deep research prompts. Ask it to write the prompts looking to validate the information and designs so far, including having it check developer opinions on those solutions via blog posts, forum posts, etc.

For each prompt you're given: open a new window, select the Research/Deep Research option, and give it the prompt.

After the deep research finishes, I generally copy the result and take it back to the original chat window. I'll paste it like this:

Below is the result of the deep research prompt


// words words words


Please use this and reconsider the above recommendations.

This almost always results in it catching its own hallucinations, in it realizing something is out of date or there's a better way, etc. This means that not only do you clean up the designs, but you also get a chance to learn some new stuff.

As always: follow through to the sources if the information is new to you.

Read the Sources

When you're reading the deep research, look at how it came to some conclusions. Even with DR, I've seen it make pretty big mistakes, misinterpreting something to mean something else. You have to be careful. Make sure you understand what the Deep Research is saying before you give it to the LLM.

Step 2: Generate Architectural Documents

Once I've done the deep research on the relevant topics, and explained/had it copy down all of my architectural goals, constraints, etc, then I work with the AI to generate the actual architectural documents. These documents cover everything we've made decisions on: design patterns, security constraints, how components interact, data flow, the whole picture.

Next, I review the documents for gaps. This is where I bring in a second LLM to assist me; at this point we've done what I think is best and what my first LLM thinks is best, so lets get a third set of eyes on this problem.

I ask the second LLM to do code reviews, security reviews, and general architecture reviews. I tell it to deep dive; do multiple web searches, run deep research prompts, etc to validate findings. If there is any ambiguity, I want it researching.

When the second LLM finds issues, it comes back to me and we talk through them. Once I'm happy, we apply the changes. This back and forth continues until the architecture is solid.

This step is critical. A lot of teams skip it, which likely contributed to the "95% of AI ventures fail" stat.

You see it all the time, developers complaining about unmaintainable software that was vibe-coded, falling apart at the seams with tons of critical bugs and security issues. The reality is that the same thing would happen if you just tasked a pile of fresh entry-level junior devs to write a complex system, too. You have to think like a team lead and give them what they need to succeed in spite of that.

Step 3: Break the Architecture Into Modules

Once the architecture is solid, I break it apart into logical chunks. These modules include things like:

Security
Database
Front End (including folder structure)
Backend (including projects and folder structure)
Infrastructure
Integrations

Each module gets its own documentation. This makes it easier to reason about each piece independently, and it makes the development plan cleaner.

Step 4: Create the Development Plan

This is where I have the AI break everything up to this point into a concrete development plan. We've made the decisions; we've left little to the imagination. But getting it documented like this, as opposed to having the LLM just work off the architectural docs, gives me a chance to manage the order it works on things to ensure maximum testing and quality, and one final peek over everything to make sure none of our plans were lost in translation.

The development plan always follows the same structure, each of which gets their own documents:

Step 1 is pre-prep. This step is all me. I do anything that needs to happen before the AI can start coding. This includes creating the project in my IDE to ensure everything is set up correctly, setting up a local git init if I decide to do a separate local repo for a staggered-staging, getting directories and permissions set up, installing any apps we need, etc. Usually 1 MD file for this.
Step 2 is tool verification. The LLM tests all the tools it needs to make sure it's ready to use them. If it needs to run dotnet commands, it tests that. If it needs to run docker, it tests that. We catch issues early before they become blockers, and it gives me a chance to update the settings.json if permissions are out of whack. Usually 1 MD file for this.
Step 3 and beyond is where the coding happens. Each step after this is building the application, one logical module at a time, with code quality gates before each commit. N number of MD files here; bigger projects can hit 14+ files.

An example from a little personal project I'm working on during the weekends:

Defining the Code Quality Gates

First- I specify ALL conventions in this plan. Naming conventions, style conventions, even down to XML docs and comment conventions. Nothing is left to chance.

After that, I tell the AI exactly what quality checks must pass before any code gets committed. I don't leave this vague. I spell it out.

For example, here is what I currently have set up for a personal .NET core 10, C# backend based personal chat app I'm tinkering with:

Build -- dotnet build must produce zero warnings on production projects. This single command runs NuGet Audit (dependency vulnerabilities), all five Roslyn analyzer packages (SonarAnalyzer, StyleCop, Roslynator, Meziantou, BannedApiAnalyzers), and .editorconfig style enforcement because TreatWarningsAsErrors is on and EnforceCodeStyleInBuild is true.
Test -- dotnet test --collect:"XPlat Code Coverage" must pass all tests with at least 80% line coverage on changed code.
Mutation test -- Stryker.NET with --since main must stay above the break threshold (40). This catches tests that have coverage but don't actually assert meaningful behavior.
ReSharper CLI -- inspectcode.sh must produce zero errors and zero warnings on production code. Suggestions/hints are informational only.
SonarQube -- Scanner with /p:SonarQubeAnalysis=true on the build step. Quality gate must pass: zero new bugs, zero new vulnerabilities, zero unreviewed security hotspots, 80%+ coverage on new code, under 3% duplication.
Gitleaks -- gitleaks git --source . --staged --verbose must produce zero findings.
DRY verification -- Manual pass looking for duplicated logic, repeated string literals, copy-pasted blocks across modified files and their neighbors.
XML doc compliance -- Any file touched must have its XML docs brought into the concise style (one-sentence summaries, third person, no filler).
Comment cleanup -- No commented-out code, no what-comments, no emojis in any touched file.

Steps 1-6 are tool-enforced gates. Steps 7-9 are discipline checks that happen during the work but get verified before commit. If any step fails, fix and re-run from that step. No skipping, no deferring.

Building and Reviewing

Once I have the full plan laid out, I let the bots run wild. They build one module at a time, running the quality gates before each commit.

For me, a smaller project can take a few hours, and burn through my hourly 5x Max usage a few times. I always use Opus 4.5 for it, but it's worth the wait for that quality.

After the AI thinks a module is complete, it is required to spin up a separate agent specifically to do a code review end to end. This agent checks that nothing was missed, no gaps exist, and that everything matches the architecture. The agent leaves me a SignOff.md file confirming it checked everything.

Then I come in for my own code review.

You, like any senior dev or team lead or dev manager, are responsible for the code you commit. When a bug hits production, the agent didn't fail. You failed. This part takes time, and it can be grueling, but with a bit of patience and the help of some AI chatbots, you can get through this alright. Take your time.

Among other things, you're looking for:

Obvious failings to meet the specs of your architecture
Security flaws
Really bad or inefficient code. Did it create unnecessary loops? Tons of duplication? Did something that just doesn't make sense? Your job is to find that, and call it out.

Once I've identified the issues, I then get them documented into a new .md file, and send the agent back to work.

I generally keep looping like this until I have a good, working feature or project.

SAVE YOUR PROMPTS

I reuse prompts a lot, so I generally try to save as many as I can. That includes things like the description of what I want from the project, my requirements or constraints. My personal goals or external factors. Anything like that.

Generally, when I have a question of another LLM, that lets me do something like this:

Consider the below project description:




And here are some of the features I'm aiming to achieve:




What I'd like to do is...

Keeping those blocks, so that you can re-use them in new prompts, helps so much. I save tons of time with it.

My settings.json File

This file you'll find in .claude (on the Mac), and it's where you can set the approve, deny or ask permission, as well as special sandbox domains. I lock this bad boy down. I generally will designate a folder that the agent can do whatever it needs to in, but otherwise will block it from the rest of the computer. I'll let it do websearches or fetches, and let it curl GETs, but no PUT/PUSH/DELETE on anything except localhost. A lot of constraints around git and other stuff, too.

My Project Specific CLAUDE.md File

Rather than filling the project CLAUDE.md file with a ton of stuff that will cause the agent to churns through tokens quickly even when it doesn't need that info, I'll instead use it as an index for where the instructions actually live- other files.

These instruction files generally involve things like giving it strict rules that all code quality checks must pass before each commit, and what those quality checks are. Or specifying that it shouldn't try to do certain bash calls, because it runs afoul of my CLAUDE settings file and would trigger me to have to respond to a permission prompt.

Usually I'll end up with several files for backend coding, and several for frontend coding.

Understand What It's Telling You

If you have to deploy, take care when trusting the LLM's instructions if it helps you. It will probably be wrong in a lot of what it tells you. Just accept that. Constantly remind yourself: "What it tells me is going to be wrong". Challenge everything. Either hunt down someone that can help you (ideally), go learn how yourself (also ideally), or at the minimum do Deep Researches like your life depends on it.

But seriously: Watch tutorials. Read guides. LEARN.

Don't do things you don't understand. That's how people lose money.

The End?

That's pretty much it. It's a lot of work, and it's definitely not 10x productivity, but I get so much better results than just having unmanaged agents writing my code.

One day this might change. 2 years ago I said I'd never use agents. Here I am. 2 years from now I may say "I don't need to do this anymore; it now handles architecture as well as I do from day 1". That's fine, too.

But for right now, as developers, quality is our job and our goal. Using AI is amazing for development, and definitely speeds things up, but we have to make sure to use it responsibly and focus on quality, security and reliability above all.

I will always push back when I need to if someone is pressing me to "go faster". Gabe Newell paraphrased a decades old quote quite well: "Late is just for a little while. Suck is forever." They won't remember that I gave in to the pressure and rushed to meet their deadline; they'll remember that I delivered them unusable crap.

Clawdbot...

SomeOddCodeGuy — Mon, 26 Jan 2026 23:37:35 GMT

Everyone and their brother is talking about Clawdbot, but as several others have pointed out- an agent with that many connections could be a security nightmare if it can be prompt injected.

But since it supports OpenAI and Ollama endpoints... I wonder how well it would work if I stuck a Wilmer workflow to act as a middleware between it and the LLM, and had the workflow try to detect for prompt injection?

Fairly straight forward in terms of implementation, though whether the gating will work well is another matter. But even just using local models, I'd think GLM 4.7 Flash or Qwen3 30b should do alright for extracting most standard adversarial prompts. Sure, you'd take a speed hit, but you'd also reduce the risk of it emailing everything it knows about you to some rando.

Still not perfect though. Hmm...

M5 Max Macbook Pro Next Week?

SomeOddCodeGuy — Sat, 24 Jan 2026 00:51:00 GMT

Honestly been waiting for next week for a while. Even if the wait time is 2 months or longer on actually getting the order, having an M5 Max for the hardware matmul is going to be amazing and worth the wait.

One nice thing about getting a new machine now is that the past few updates that I've pushed on Wilmer have helped a lot to streamline my setup; I can get a new machine with full workflows put together in about 30 minutes, compared to hours before. My whole software setup is completely different now, too- combining llama.cpp's model swapping, and being able to do as many workflows as I want on one Wilmer user, I've basically reduced my setup from 30-40 Wilmer instances total and 2-4 llama.cpp instances per machine to 1 lcpp per machine, and only maybe 6 Wilmer instances total. That's a huge help.

While I expect the memory throughput limitations (400GB/s vs 800GB/s) to keep the M5 Max from really competing with the M3 Ultra I already have, I still expect it'll be a beast; especially for on the road.

Come on, Apple... don't let me down.

GLM 4.6 at UD_Q3_K_XL is surprisingly usable

SomeOddCodeGuy — Mon, 19 Jan 2026 01:34:00 GMT

So I currently run GLM 4.7 Q8 on my M3 Ultra, and after wrestling to find a solid model that would work well on the M2 Ultra 192GB, I finally decided to give the older GLM 4.6 UD_Q3_K_XL a try on it, seeing how much the quantization would affect it. (I also just wanted to mess around with 4.6 after using 4.7 for a while, to see how much I missed it lol. They have different styles for doing reviews and giving feedback)

Honestly, I've been shocked at how well it works. The coding isn't terrible and the general ability to look over docs and give feedback feels pretty comparable to full quality. I've definitely seen a general failure to handle numbers of any kind well, but not to the extent I had imagined for an MoE. In the past, these did not handle quantization well.

There was an arxiv paper that found LLMs can only cram about 3.6 bits per parameter worth of memorized stuff from training data. That's not exactly about quantization, so I'm kinda stretching it here, but my brain went "well if 3.6 bpw is some kind of limit for one thing, maybe the rest of the model has similar wiggle room?"

Could be a wholly wrong way to look at it, but after reading that, my mental baseline for "Im about to break this model if it gets any smaller" has been somewhere in that area of 3.6-4bpw. Not exactly doing science here, but it does line up a little bit with the old MMLU-Pro results from a long time back. I just didn't expect modern MOEs to hold up as well as the big dense models.

Agents are Growing On Me

SomeOddCodeGuy — Sat, 17 Jan 2026 00:05:00 GMT

Go back in time a year and a half- it's mid-2024, LinkedIn has discovered AI and now the buzzword of the year is "agentic". Everyone and their brother was trying to convert every single task to be doable by an "agent" and, to be blunt, they sucked.

If you ever looked at my online posts or my LinkedIn, you'd have found that I simply didn't like agents. I think part of my original disdain for them came from this period. Tech "thought leaders" (I hate that term) were pushing agentic everything so hard that it made me feel like the entire concept was snake oil... especially since it seemed like most of the agents were failing at their jobs. The outputs looked terrible, folks were spending tons of time fixing what the agents were doing, etc etc. Thanks to this, I leaned very heavily into the "Workflows are life. Agents suck. I'll do everything by hand" mentality, just because I got so tired of the whole push to turn everything into an agent, even if it made no sense.

Well, times change I guess... because here I am in 2026, thoroughly enjoying Claude Code.

Now, I absolutely understand a lot of the current criticism in the dev world about agentic coding, but I feel like a lot of it really boils down something that you also see in AI image and video generation: most of the "slop" is because it was created with extremely low effort, resulting in something extremely low quality. Too many people just type the equivalent of "Masterpiece, best quality, anime girl poster that I can sell for money" and then declare it the greatest thing since sliced bread. Many vibe-coders do the same thing in code: "I need an app that does A, B and C" and then they let it run wild. These are basically the AI equivalent of sketching something on a napkin and calling it a commission.

I have even less sympathy if you're actually a developer or an artist working that way... at least have some pride in your work.

Note: I will give up development and take up farming before I ever refer to what I do as 'vibe coding'. Add that alongside 'Thought Leader' in Socg's "Upsetting Words" compendium.

For me, especially doing dev in languages I'm more comfortable with (like C# .NET), it gives me a chance to really step back and do more architecture than raw coding.

Case in point: right now I'm working on a small side project that I probably won't open source- something to run inside our home network do some work for us. It's a .NET 10 app using Blazor for the front end. It's nothing special, but I need some practice with it anyhow. For the past 3 days I've been spending my free time working with the claude chat on their website, doing deep researches and talking through design patterns, libraries, plans, etc to come up with a full dev plan before ever handing it off to Claude Code.

Not a single line of code has been written yet, and in fact the solution doesn't even exist yet, but so far the entire app structure, libraries, design patterns we're using, ORM we'll be using, DB tech we'll be using, deployment strategies, cost estimations for an eventual Azure deployment, etc have been talked through, investigated, and ultimately solidified.

The driving force behind all of these decisions is a combination of 2 things:

My own knowledge and experience, plus any studying, videos, documentation, etc that I'm utilizing
13 or so individual Deep Researches, that are among 30+ individual chats with Claude, most of which result in documentation that I then collect and take to the next chat to talk through more stuff.

Yes, it can be grueling. But you know what? It's also fun. It's really fun to abstract the dev out to the architectural level and know that I can just delegate the actual efforting of making the final product to something else. As long as I've specified exactly how it all should be written, I can leave the final work of getting the code written, unit tests added, etc to Claude Code.

I've been a team lead and dev manager for most of my career- probably a combined 12 years of leading at this point. But this level of freedom over the decisions and the doing is something that I never had. Generally as a team lead or dev manager, architecture is either dictated from above you, or decided at the team level. Even if no one was dictating it to you- unless you want your devs leaving out of frustration or boredom, you don't get to dictate it yourself.

And the doing? For that you have scrum, agile, etc processes, run by PMs, to break down work and divvy it out to everyone. And that divvying process is absolutely not fun; that whole process is opposite of fun.

So this ability to be total dictator over the architecture and yet also be completely hands off from the more annoying elbow-grease parts of making the thing? Just dictating exactly how it should look, how it should work, etc and being able to micromanage the crap out of this little robot as it builds whatever I ask? It's all something I'd never do, nor want to do, with an actual team of people... but good lord is it so fun with AI.

Yea... this is enjoyable. I really like working like this. I'll admit that I'm feeling pretty nervous with some of my knowledge gaps... having other people around, on a team, really helps you feel more comfortable in knowing you're less likely to have done a dumb. But all the same- I'm enjoying this era of AI development.

The Resistance to Using Any AI...

SomeOddCodeGuy — Sun, 11 Jan 2026 16:56:55 GMT

When folks say "I can't find a use for AI", I think far too many of them are overthinking the use-cases, or expecting a much more grand difference in their lives.

Without actually relying on LLMs to do the thinking for me, I can say without a doubt that AI has fundamentally changed the way that I work over the past couple of years. No, it hasn't made me 10x more productive or anything like that, but what it has done is GREATLY reduced the number of mistakes that slip through, and added an unbelievable amount of quality of life tools to my belt. The mistake reduction alone is probably the biggest benefit, and why I just can't see myself really going back.

To give a few examples:

Almost every major AI player has some form of "Deep Research"; the short version is that it goes Google deep-diving for you to pull a source-cited response to your query. If you've ever been stuck on something, or needed confirmation on a line of thinking? This is huge. Rather than spending 30+ minutes Googling, I can now find the source I need within 5 minutes or less. It has rarely failed me at this.
When you're working alone on something, mistakes are far more prone to happen; this is especially true for developers. When you're on a bigger team, you have some ability to offset this by getting someone else to put a second set of eyes on it. But the reality is that isn't always possible- sometimes the team is busy, or you are tackling an issue alone, or they just don't have the domain knowledge to assist. Is AI perfect/does it catch everything? Absolutely not. People don't either, though. At least it's more than JUST you.
It's a veritable toolbelt of quality of life features. It used to be that if I hit a post online that was in a language I didn't know/understand, and if the browser didn't offer a translate feature for that page, I'd have to Google Translate the text manually. Now I can just quickly screenshot the page, drop it into a chatbot, and have it tell me. Is it perfect? Of course not. But I don't need perfect if I'm just trying to understand what it's saying. Similarly for sanity checking a large corpus of text for a fact- "here's what the conclusion I'm trying to verify, here's the text, help me find the information relevant to it within this text." kind of thing.

There are other benefits I get as a developer beyond these. But I think that part of my hang-up when I see folks completely write off AI is that just the minor benefits of having it on-hand pay off in spades for me.

Is it easy to quantify that value for a corporate cost benefit analysis? No. But I think that's a problem corporate leaders need to learn to deal with in their own way. You know that it has value, and if you aren't able to estimate exactly what that value is up front, then you really need to decide if you're willing to let that roadblock you from such value, or if you're going to go out on a limb, let folks try it, and calculate the return after the fact.

Though, on the last note- if you do that, please train your folks on it. There is a particular way to handle AI that makes it more valuable, and if you aren't using it that way then it could become more of a hindrance than a help.

I haven't disappeared...

SomeOddCodeGuy — Sat, 15 Nov 2025 22:47:10 GMT

It's been a few weeks since I've posted or made any changes on Wilmer; I haven't stopped or lost interest, but rather I'm about to change jobs and I've been heads down on transition stuff before I leave my current position.

Part of being a dev manager is doing what you can to help your team be fine when you're gone. So to that end, I've basically done nothing but work for the last 4 weeks, and will continue that for the next 2. Nights. Weekends. Even pulled an all-nighter the other week. I do my normal work during the day, and then write documentation after hours. So far, I've written close to 150 pages of detailed docs, amounting to close to 40,000 words.

Amusingly, despite my hobby, 0 words of that documentation are AI generated. All hand written. I would have loved to generate some, but I don't have the same level of AI tooling available to me at work that I do in my homelab, so it's really not an option.

So yea... here soon I'll be back to doing normal socg stuff, but for right now I basically just wake up, work, sleep, and start again, so there's just no time.

See y'all in December!

Understanding MoE Offloading

SomeOddCodeGuy — Mon, 13 Oct 2025 03:36:38 GMT

So last night/early this morning, I decided to go down a "how does this work?" rabbit hole. I was trying to answer someone's question about how Llama.cpp handles offloading with Mixture of Experts models on a regular gaming PC with a 24GB GPU, and ended up spending a few hours in a deep dive. I figured I'd write up a description here, too, in case anyone ever stumbled here and was curious how it worked.

So most folks have seen how the models label themselves as N number of parameters, but Y number of active parameters. Like Qwen3 30b a3b- 30b model, 3b active parameters. As the name implies- when you send inference, it uses 3b worth of parameters.

The trick is that the model is built with a "router" that, for each token, selects a small subset of "experts" to process it. Again, old news for most of you, but just covering the bases.

Now that the obvious is out of the way, lets go into what does trip people up.

An expert is essentially a self-contained "feed-forward network" with its own set of parameters. The model has a whole library of them to choose from. Let's use gpt-oss-120b as a concrete example. It's listed as a ~120B parameter model, but only activates 5.1B parameters per token. The active parameter count is made up of two parts:

The "dense" or shared parts (~1.5B parameters): These are components like the self-attention mechanism and the router itself. They're always on and are used for every single token processed.
The active expert parameters (~3.5B parameters): This is the "on-demand" part. This model has 36 layers, and each layer contains 128 distinct experts, but only 4 are used per token, per layer. (this is all in their doc above). That's a pool of 4,608 experts total, and as the token works its way through each of the 36 layers one by one, the router in those layers picks 4 of those experts to use. Since the total expert parameter pool is around 114.7B, that means each expert is about 24M parameters in size. The calculation for the active portion is: 36 layers * 4 experts/layer * 24M params/expert ≈ 3.5B parameters.

Add the dense part (~1.5B) to the active expert part (~3.5B), and you get your ~5.1B active parameters. So for any given token, the computational load is that of a 5.1B model, not a 120B one.

So that, in a nutshell, is what an MoE is doing. But that all just explains the computation, not the memory.

Even quantized, the full 120B parameter set often won't fit in 24GB of VRAM. You'd have to go down to a ~1bpw to do that usually ((1 ÷ 8) * 120) == 0.125 * 120 == 15GB for a 1bpw model). Of course, this model is a little more fun in that it's MXFP4 trained model, meaning its about half the size, so you can probably get away with q2 ((2 ÷ 8) * 60) == 0.25 * 60 == 15GB for a 2bpw model). Of course MXFP4 is a little bigger than just plan 4bpw, but close enough.

The point is- you don't even want to try usually. Quantizing an MoE sucks bad enough without dipping into the really crappy quants. This is where the Llama.cpp's offloading comes in.

Llama.cpp has the ability to chose some of the layers to be run in system RAM/CPU, and some to be run in GPU. That's what the -ngl flag is doing; stating how many layers to offload into the GPU. If you do a value of "99", since it's higher than the number of layers most of the models have, it pretty much means you'll offload the whole model into the GPU.

Anyhow, it has a special offload just for MoEs; you do it by combining -ngl 99 with --n-cpu-moe N.

-ngl 99 tells Llama.cpp to try and load all layers of the model into the GPU's VRAM. Since the model likely has fewer than 99 layers (36 for gpt-oss-120b), this is effectively an "offload everything to the GPU" command.
--n-cpu-moe 20 (as an example value) then acts as an exception. It tells the engine: "For the first 20 layers of the model, take the expert components and move them to the CPU's system RAM."

NOTE: I incorrectly thought in my answer to the person who asked the question that it was the last 20 layers, but looking at the code, it appears to be the first 20 instead

for (int i = 0; i < value; ++i) {
    // ...
    buft_overrides.push_back(string_format("blk\\.%d\\.ffn_(up|down|gate)_exps", i));
    // ...
}

Anyhow, using these two flags: now the model is split. The dense, always-on parts and the experts from layers 21-36 are in VRAM. The experts for layers 1-20 are in slower system RAM.

When a prompt comes in and each token bounces through each layer like a pachinko ball, the router for that layer determines which experts to use for that token.

If it picks an expert that's in VRAM, the computation is extremely fast on the GPU. If it picks an expert that was offloaded to system RAM, that's the bottleneck.

Here's what happens under the hood: Llama.cpp doesn't move the entire 24M parameter expert from RAM into VRAM—that would be way too slow. Instead, it sends the token's small activation vector from VRAM across the PCIe bus to system RAM. The CPU then performs the math using the expert weights residing in RAM, and the result is sent back across the PCIe bus to the VRAM for the GPU to continue its work. Even though the CPU is slower, it's still plenty for what is essentially a 24M model.

This round-trip happens for every single token that gets routed to an offloaded expert. The actual bottleneck isn't the CPU's processing speed, but the latency of the PCIe bus.

So in the end, you're not really processing a 120B model. You're processing a ~5B model, where some of the work takes a slower path through the CPU and system RAM. It's slower than a native 5B model that fits entirely in VRAM, but vastly faster than trying to run a dense 120B model.

As far as I know, finding the right --n-cpu-moe value for your specific hardware is just a matter of trial and error to find the best performance sweet spot. There are also other commands, like tensor-override, which give you even more fine-grained control... but at that point you really gotta understand what's going on lol.

My Next Steps with Wilmer

SomeOddCodeGuy — Fri, 10 Oct 2025 04:38:46 GMT

When I first started Wilmer, it was for a very specific reason: I wanted a semantic router, and one didn't yet exist. The routers that were available were all specifically designed to take the last message, categorize that, and route you that way. I needed more, though; what if the last message was "ok"? How do you route that?

At the time, Llama 2 had been out almost half a year, and there were so many finetunes- everything from coding to medical to language specific. Compared to the big proprietary models, these little models didn't stand a ghost of a chance; but when you got into their specific domain? They at least fared better. This made me think "What if a bunch of these finetuned models were to compete, together, against the proprietary giants like ChatGPT 4?" Combine that with managing speed by sending less important requests to smaller and weaker models, and you get the perfect combination to take on Goliath.

Somewhere in the first couple of months, workflows got added in. They were the result of a suggestion my wife made: let folks modify the prompts. Somehow that turned into chaining prompts, and before I knew it Wilmer was becoming as much workflow oriented as router oriented.

Perception is Everything

Early on, folks really struggled to see the value in the workflow side of it. n8n existed, and folks used it for targeted tasks, but using workflows to improve chatbot functionality? They were far more interested in either speed (meaning no overhead), or trying every agent they could get their hands on.

One of the most common things I remember hearing, when I'd say I had to wait 3-5 minutes for a response, but the response was great, was "I'm not waiting that long". Fair enough.

Eventually, that tune changed. Reasoning models came out and trained everyone to wait for their responses; suddenly waiting minutes for a really strong response, like Gemini's Deep Think, became a worthwhile endeavor.

I still think n8n's explosion of popularity specifically came from the introduction of reasoning models, and folks being willing to wait longer for quality.

And Then Proprietary Catches Up

When ChatGPT 5 came out, the big talking point around it was that it used intelligent semantic routing to send the prompts to the appropriate model, allowing them to mix fast and slow/powerful models for the perfect experience.

... why does that sound so familiar? =D

3 days ago, ChatGPT unveiled yet another amazing feature: workflows =D Their AgentKit, which people say apparently killed off a few thousand startups the day it released lol.

With that, proprietary models have pretty much consumed the existing purpose of Wilmer; and as you can imagine, a multi-billion dollar mega-corporation full of the best and brightest this country has to offer will produce a far more impressive product than a dev manager tinkering on the weekends =D

But you know what? That's ok.

Wilmer's Purpose

Wilmer was always about local. It will always be about local. Wilmer is about you being able to take a computer completely off of the internet, have 1 or many different LLMs on your network, and continue to do work and get the best quality you can out of that setup. Wilmer was never about trying to compete, on their own home turf, with proprietary models and tooling.

For as long as you can't load up ChatGPT's AgentKit or run ChatGPT 5 on a laptop in airplane mode, Wilmer has a purpose.

Plus, at the end of the day, this bad boy is a passion project. I make it because I use it. I improve it because I want to do more things with it. I won't stop because, realistically- it's just fun.

The Roadmap

There's still a lot I want to do. The offline wiki api was never meant to be the only offline search available to users. There are amazing datasets on huggingface for everything from coding questions to medical questions, etc etc. I want to add them all in. I spent the past 6 months doing a massive refactor of the codebase to make it easier to work in, cleaner, and to add high code coverage unit testing (it's up to 91%!)

I've added some proprietary model support (Claude support incoming) and Im working to swap out Flask for something better. I plan to make this thing more friendly for multi-user use, keep heading down the direction of weird stuff like recursive workflows and the like.

I've got some cool ideas for use-cases around it too that I want to try out, and make more videos on... but if we're being honest, I suck at making videos. lol.

So Yea...

I'm not stopping yet. Not by a longshot. Wilmer, and the future projects I'll create, aren't being designed to be hyper-popular or anything like that. But they have a niche; a specific goal, and I'll keep working within that niche for as long as I can.

Microsoft's New User Role Model

SomeOddCodeGuy — Fri, 10 Oct 2025 03:37:40 GMT

So this looks like it could actually be a really fun model

https://huggingface.co/microsoft/UserLM-8b

I like this little specific purpose LLMs the most because it opens up some neat doors. They likely made this to act as the user-proxy in autogen, and they point out on their model card that it would be great for testing assistant LLMs, but this could also be amazing for long-form testing and benchmarking on workflows.

This could actually end up saving me a lot of time. Right now, when I make a new workflow, there's a lot of me involved in the testing of it, even though the outputs can easily be assessed by an LLM. And sure, I could create a set of tailored prompts to test with, but then I'd basically just be building against a happy-path. There's no telling what this little guy will come up with... which is pretty realistic for a user, too lol

Honestly, this is a model I never thought about wanting before and I don't even have a clear picture in my mind of exactly how I'll use it yet, but for some reason I'm really excited about it lol

Reddit woes; but a light at the end of the tunnel

SomeOddCodeGuy — Mon, 06 Oct 2025 02:10:00 GMT

After 3 months, /u/reddit finally messaged me to tell me the account was permanently banned. However, the section that should contain the reason for the ban is empty. It just says

Your account has been permanently banned for breaking the rules.



This account has been permanently closed. To continue using reddit, please log out and create a new account (the username /u/someoddcodeguy cannot be reused).

I'm guessing they usually give a reason there, but since the ban occurred as a likely unintended side effect of me getting auto-security-locked for using a VPN, it may have resulted as the reason being null or empty in their db.

The latter part of that email I didn't expect, since the general consensus is that making new accounts is considered ban evasion. Seeing them encourage me to return to reddit despite that feels kind of odd, but I won't complain.

Guess I'll be a fresh-faced newbie on LocalLlama again sometime this week, assuming I don't get auto-shadowbanned because the system doesn't agree with the "make a new account" instruction =D

Agentic Coding Pt 2...

SomeOddCodeGuy — Mon, 06 Oct 2025 01:09:07 GMT

Every weekend for a while I've put out a release to Wilmer on Sunday; generally a few features I was able to knock out on Saturday and test on Sunday. Almost always using either some combination of local models with Wilmer via Open WebUI, or using Gemini 2.5 Pro.

This weekend I challenged myself to primarily use Claude Code; I had 3 simple features I wanted to implement. I've been spending the past couple of weeks doing every tutorial I could find on best practices, I made sure to set up my project and global settings, I've got fantastic documentation, which each iteration plan is set to utilize...

I managed to get 1 out of 3. I spent almost 30 hours on it this weekend, and hit my $100 max plan limits twice. I'm frustrated.

The past few weeks have been a good learning experience, though. I feel like I'm starting to understand why there is a wider disconnect between how favorably I view AI development (using chat windows) vs the general consensus I seem to see on AI development (using agents). Especially on larger, more complex, applications like Wilmer.

Around 9pm or so I gave up and decided to get Wilmer and Gemini 2.5 Pro to help me code review what was going on, and it turns out that a lot of my troubles are due to little decisions the agent made that I was missing.

For example, my SSE Event stream was getting botched because the agent quietly enabled logging in library instantiation (which I completely missed), causing it to fill the buffer and dump it all at once, breaking the generator. Or the decision to enforce a version of a package to a mid-2024 build, when the fix I needed came in early 2025. When code reviewing hundreds of lines of changes, they were small items that blended right in.

What's annoying is that Wilmer (running GLM 4.6 + qwen3-30b-a3b + magistral 24b) and Gemini Pro 2.5 noticed immediately, despite Claude not noticing at all. Not that I can say anything; I didn't notice either lol.

I feel like a big pain point is the context window. Like it's losing track of important things.

Agentic coding is definitely a different beast. It could be that I'm still just used to using the chat AI for coding; I'm lightning fast with it, and so I'm likely feeling a tendency to want to reject something new and different that is currently slowing me down BECAUSE it's new and different. But it is frustrating...