The Living Deadline
Alex Guglielmone Nemi’s blog
https://alexhans.github.io/

From scribble to searchable: building a sketch-to-text Agent Skill
https://alexhans.github.io/posts/series/evals/sketch-to-text-skill.html

I like sketching on an Onyx Boox. Diagrams, flowcharts and rough system designs go there as freehand ink.

(Image: Onyx Boox Air 3 C with a fake diagram)

What I do not like is transcribing or re-drawing in a diagramming tool. It’s wasteful and, as I often say, if something is annoying and you do it often, there’s probably a better way.

Text is something agents and humans can both read, and is easy to store in source control. So I built an Agent Skill to do the conversion.

Note

I’m showing this not because you should necessarily use it, but so you get a sense of how easy it is to apply the same methodology to your own pain points. Michael Kennedy from Talk Python To Me called it hyper-personal software.

The Skill

sketch-to-text takes a handwritten PDF or image and converts it into a Quarto .qmd file with Mermaid diagrams. The output renders, links, and lives with the rest of my writing.

Here’s one of my sketches converted (original PDF):

flowchart TD
    start("I want to share a game to teach")
    describe{"Can I describe the game? Intent"}
    tell_ai("Tell the AI to create game")
    explain("Explain it to the AI and play up scenarios that you had")
    iterate["Iterate asking for changes"]
    client_only{"Only client side?"}
    firebase[/Firebase/]
    publish("Publish to easy app")

    start --> describe
    describe -- yes --> tell_ai
    describe -- no --> explain
    explain --> describe
    tell_ai --> iterate
    iterate --> iterate
    iterate --> client_only
    client_only -- yes --> publish
    client_only -- no --> firebase
    firebase --> publish

That came from telling Claude: “convert diagram-1.pdf to quarto”.

How I built it

Rubber duck first

I didn’t start by building anything; I started by complaining to an LLM about what was annoying and thinking through what a good solution would look like.1

I went through the Boox sync options: BOOXDrop, WebDAV, Obsidian, and the various export formats (vector PDF, bitmap PDF, .note) that appear in the UI. Most had friction or lock-in I didn’t want. Then, as a test, I drew a quick diagram and asked the LLM to convert it to Mermaid.

It worked well enough to make the skill idea feel viable. I decided Quarto as a destination and any agent CLI as a runner were both fine, and that my only real work would be building good enough ground truth to test against.2

Talk, then build

Once I knew what I wanted, I described it to my agent and let it write the skill. The skill’s flow is: read → classify → extract structure → generate Mermaid → self-check → write. I didn’t have to think about it much; by then I was already drawing a few more diagrams to build the ground truth.

Polish with evals

After I had 6 diagrams (with rushed handwriting that proved hard to read, even for myself), I converted them into baseline Quarto files and manually compared actual against expected output for each.

This exposed issues fast and forced me to decide details that were hard to foresee upfront. Cloud shapes, for example, aren’t implemented in Quarto’s bundled Mermaid renderer. When I hit that in diagram 4, I had to decide what to do with @{ shape: cloud }, which is newer Mermaid syntax and unsupported. I opted for ((("text"))): it’s supported, the double circle is visually impactful, and it also represents a “Stop Point”.

flowchart TD
    bus(("Bus data"))
    tram(("Tram data"))
    prices(("Prices"))
    pt["Public Transport"]
    sim["Simulation"]
    qt["Quality Table"]
    dash((("exposure: Dashboard")))
    result((("exposure: result_table")))

    bus --> pt
    tram --> pt
    prices --> sim
    pt --> sim
    sim --> qt
    sim --> dash
    sim --> result

For each issue I found, I iterated with the agent and it updated the skill. The loop is simple: try on real input, see what breaks, make the decision explicit, validate.

Note

When building ground truth you’re allowed to change the data, the expected output, and/or the skill logic. Whatever works to get a baseline that’s honest and easy to iterate against.

With ground truth in place, I ran promptfoo3 evals: each diagram PDF went through the skill and the output was checked against the reference, with deterministic checks wherever I could (icontains assertions) and LLM-as-a-judge with a local deepseek-r1:14b elsewhere. The exact tool used is really not the important part.
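To make the deterministic side concrete, here’s a minimal sketch of what icontains-style checks look like without any framework; the file names and expected substrings are illustrative, not my actual ground truth:

from pathlib import Path

# Deterministic "icontains"-style checks, no eval framework required.
# File names and expected substrings below are illustrative placeholders.
CASES = {
    "diagram-1.qmd": ["flowchart TD", "-- yes -->", "-- no -->"],
    "diagram-4.qmd": ['((("', "Public Transport"],
}

def icontains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring check, like promptfoo's icontains assertion."""
    return needle.lower() in haystack.lower()

failures = []
for filename, needles in CASES.items():
    text = (Path("out") / filename).read_text()  # what the skill produced
    failures += [f"{filename}: missing {needle!r}"
                 for needle in needles if not icontains(text, needle)]

print("PASS" if not failures else "\n".join(failures))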

One run came back 5/6: diagram 5 had failed. The skill dropped the edge labels on the three branches.

flowchart TD
    banner["Design, De-Risk & Jumpstart"]
    intent["Intent"]
    goals["`Reduce effort
      Increase ROI
      Fail fast`"]
    guardrails["Guardrails"]

    banner -- Design --> intent
    banner -- De-Risk --> goals
    banner -- Jumpstart --> guardrails

The fix was one line in SKILL.md: preserve every label written on or beside an arrow; a labelled arrow must use -- label --> not bare -->. I re-ran and got 6/6. I didn’t edit the skill manually. It was all conversation.

A failing eval tells you exactly what the skill missed. One quick instruction to the agent, a re-run, and you know it holds.

Why the evals matter more than the skill

The skill is useful immediately, but the evals are what make it safe and easy to change weeks later when you don’t remember any of the details.

sketch-to-text handles flowcharts well for what I tested and is harmless enough to use whenever. If I notice inaccuracies, I can figure out whether the problem is in the source file, the skill logic, or the model. Fix it, add the new case to the evals, and catch future regressions. If I want to expand it, I’ll know I haven’t broken anything for my existing cases.

The investment is a bit of ground truth up front, in exchange for confidence on every future change.

Note

The eval inputs and ground truth files are co-located with the skill for now while the right distribution format for Agent Skill evals is still being worked out. See the evals README.

Footnotes

  1. Rubber duck debugging is the practice of explaining a problem out loud (originally to a rubber duck) to clarify your own thinking before asking for help. Popularised by Hunt and Thomas in The Pragmatic Programmer.↩︎

  2. Ground truth is a manually verified reference output used to measure whether a system is working correctly. In evals, it’s the answer you’d accept if a human did the task well.↩︎

  3. Promptfoo is one of several open-source eval frameworks. See ai-evals.io/tools/compare for a comparison of the main options.↩︎

Building Agent Skills: Intent, Determinism, and Stability
https://alexhans.github.io/posts/series/evals/building-agent-skills-incrementally.html

I want to offer a mental model and decision tree for building Agent Skills incrementally. It’s meant for anyone experimenting with them - not just for software developers - and focuses on staying in control as complexity grows or you start thinking about sharing or collaborating with others.

Note

It’s awesome to see increased adoption of Agent Skills to package workflows 1 and I attribute a lot of their success to standardized contracts for central use and to managing context through progressive disclosure 2 (more mature tools/MCPs and models definitely lowered friction as well).

Mental Model

You can think of Agents as assistants that can take load off you, and Agent Skills as the high-level instructions you might leave them, in a standardized format.

You need to know what you want from them and the tradeoffs between micromanagement and agency - Intent. You want to offload mechanical work so they’re not reasoning about things that a tool like a calculator or a spreadsheet could handle - Determinism. And you want the whole thing to hold up even if you swap one assistant for another - Stability.

Looking at the shape of an Agent Skill, you can mentally map it as:

  • Intent -> Markdown instructions
  • Determinism -> tools and scripts
  • Stability -> Tests and AI Evals

If your intent is clear enough, the built-in skill-builder in many agent CLIs may be sufficient, especially if you still validate outputs manually or stay in the loop for approvals. The more you want to change the skill without manually validating every run, or the more you worry about undesired/rogue behavior, the more determinism helps: scripts reduce error compounding, and unit tests plus AI evals reduce drift risk and make iteration safer and faster, especially when sharing or collaborating.
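As a rough sketch of how that mapping can land on disk (the layout below is illustrative; the only piece the Agent Skills format itself requires is SKILL.md):

my-skill/
  SKILL.md            # Intent: instructions, inputs/outputs, examples, what "good enough" means
  scripts/convert.py  # Determinism: mechanical steps frozen as code
  tests/              # Stability: unit tests for the scripts
  evals/              # Stability: real inputs plus expected outputs (ground truth)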

Decision Tree

flowchart TD
  A([Start: I have or am developing an Agent Skill]) --> Q1{Only for you<br/>and you're happy manually reviewing outputs?}

  Q1 -- Yes --> L0["Level 0: Intent (Markdown only might be enough for you)<br/>- Clear inputs/outputs help<br/>- Examples help<br/>- Define good enough"]
  Q1 -- No --> Q2{Need repeatable structure<br/>or mechanical consistency?}

  Q2 -- No --> L0
  Q2 -- Yes --> L1["Level 1: Determinism (Tools) <br/>- Move mechanical steps into tools/scripts<br/>- Use structured output<br/>- If possible, log tool inputs/outputs"]

  L1 --> Q3{Will others use it<br/>or will you modify it often<br/>without manual re-checking?}

  Q3 -- No --> L1
  Q3 -- Yes --> L2["Level 2: Stability (Tests + Evals)<br/>- Unit tests for tools/scripts<br/>- AI evals for behavior/user stories<br/>- Minimal Golden cases + some edge cases"]

  L2 --> Q4{Can it access sensitive data<br/>or take impactful actions<br/>or run unattended?}

  Q4 -- No --> L2
  Q4 -- Yes --> L3["Level 3: Safety/Scale<br/>- Guardrails + least privilege<br/>- Human approval for high impact<br/>- Security-focused evals"]

Note

Observability: This is a key point I omitted from the levels here. It lets you monitor cost (tokens spent), latency, tool selection, and more. You should add it as soon as you feel that you are missing that information. Investing in this space is important to answer questions about what happened in a particular “agent instruction following loop”.

Go deeper on each level:

  • Level 0 - Intent: Skills Spec
  • Level 1 - Determinism: Error compounding + determinism
  • Level 2 - Stability: Evals primer
  • Level 3 - Security: If you’re here, least privilege and human approval for high-impact actions are usually a good baseline. For each integration point, ask what the worst-case outcome is. It’s also worth understanding Prompt Injection.

A note on experience

The levels are illustrative and exist to prevent overwhelming yourself and avoid paralysis - either from fear of breaking things or from too many choices at the start.

After you build a few skills, recognizing patterns becomes easier and you can decide where to invest based on your own pain points. You may start thinking about intent, determinism, stability, and safety from the beginning.

That does not mean implementing everything at once. It means being aware of more tradeoffs earlier.

Build only what you need, and keep it as simple as possible.

Practical Takeaway

Use skills to clarify intent. When a step stabilizes, move it into code or tools. Not because the model can’t do it, but because you don’t want to rediscover the same approach on every run. That lowers cost, reduces drift risk, and keeps room for directed experiments.

You can build skills with your agent CLI of choice (Claude/Codex/OpenCode), or use frameworks that support the pattern, like Doug Trajano’s Agent Skills implementation for PydanticAI (docs).


Call to action

If you have a skill you want to take beyond Markdown with determinism or AI Evals, share it. We can discuss which steps are missing to move from one level to the next. It may be simpler than it looks, and we could use it as a public-facing example to help others see specific ways to improve.

Footnotes

  1. Turning a sequence of steps you’d otherwise repeat manually into a single, reusable instruction an agent can follow.↩︎

  2. Progressive disclosure means your agent doesn’t need every instruction in context at once. It can load what’s relevant when needed. See Doug Trajano’s PydanticAI Agent Skills implementation and docs.↩︎

Measure First, Optimize Last: My Approach to AI Evals
https://alexhans.github.io/posts/series/evals/measure-first-optimize-last.html

If you can’t measure it, you’re guessing. Here’s how I think about evals; practical examples are at ai-evals.io.

Start with pain, not tooling

My eval approach is pain-point driven:

  • I can’t compare what I can’t measure.
  • I can’t trust an AI system to run on its own if I can’t quantify failure.

If those two are unresolved, I am not “doing evals.” I am guessing.

Treat it like automation engineering

I frame agent work the same way as any automation effort. Unlike traditional ML experimentation, automation engineering demands that you define acceptable behaviour upfront, not after deployment:

  • Can I describe exactly what I want? (intentionality)
  • What is the worst-case blast radius1 if this fails?

That framing forces clarity early and keeps risk visible.

Build the smallest useful test loop

I treat early eval work like integration testing plus TDD2 habits:

  • Skip big infra at the beginning.
  • Put the agent where users already work so it behaves like an extra set of hands.
  • Recreate real user stories and questions.
  • Use deterministic checks wherever possible; don’t default to LLM-as-a-judge for everything.

The goal is not “more tests.” The goal is tests that maximize iteration speed and control.

Optimize late

This space moves fast. Over-optimizing too early is usually waste.

I prefer to keep things:

  • Minimal
  • Easy to Change (ETC)

In practice that means:

  • Parameterized experiments
  • Easy comparison across runs, configs, and components

Once benchmarks are stable, then optimize cost and latency:

  • Pick the cheapest model/config that still meets the bar.
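In practice that selection can be as mundane as a table of runs and a threshold; a minimal sketch with invented numbers:

# Pick the cheapest config that still clears the quality bar.
# The runs below are invented numbers, not real benchmark results.
runs = [
    {"model": "small-local", "pass_rate": 0.78, "cost_per_run": 0.000},
    {"model": "mid-cloud",   "pass_rate": 0.93, "cost_per_run": 0.004},
    {"model": "large-cloud", "pass_rate": 0.97, "cost_per_run": 0.020},
]

BAR = 0.90  # minimum acceptable pass rate on the eval suite

eligible = [r for r in runs if r["pass_rate"] >= BAR]
best = min(eligible, key=lambda r: r["cost_per_run"])
print(best["model"])  # -> "mid-cloud": cheapest config that still meets the bar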

That’s it: measure first, constrain risk, iterate fast, optimize last.

Footnotes

  1. Blast Radius: refers to the extent of system, data, or user impact caused by a component failure, security breach, or faulty code deployment. See an extreme example in the Falcon Sensor 2024 crash.↩︎

  2. TDD (Test-Driven Development): the practice of writing code while thinking intentionally about testability and about which tests expose the behaviour you want. Applied dogmatically it can seem slow and unhelpful, but thinking about testability and adding tests that keep bugs from recurring are very useful habits. Otherwise, overbuilding is very easy.↩︎

Get the Value of a High-Quality Audit, All the Time
https://alexhans.github.io/posts/series/evals/automate-audits.html

We keep asking questions we should already know the answer to. And we usually ask them when a decision depends on it.

Sometimes we guess. Sometimes we do a one-off investigation. Sometimes we shrug and move on.

What if you could get the value of a high-quality audit all the time?


The idea

Instead of running audits occasionally, automate the audit itself.

An audit is just a set of questions.

If you make those questions explicit, and make them answerable repeatedly, the audit stops being a one-off activity and becomes something you can run continuously.


The method

Start from pain points you already feel, or decisions you struggle to make.

From each pain point, write the questions that would help you address it.

Those questions imply dimensions (what you want coverage over) and, over time, a set of entities that describe your world.

Don’t try to be complete. Just describe enough of your world to support the questions you care about.

For example, “we don’t know if we can reproduce the data of our projects” is a pain point that may prompt the questions:

  • Which projects have code packages?
  • Which code packages have a README with reproducibility steps?
  • Which README file instructions actually work?

The first two should be relatively trivial to check (in this GenAI world), and you can decide how much the third one is worth. The important point is that you’ve made explicit what you know and surfaced what you can currently answer, in case it’s worth it later.

The dimensions in our example are Code Packages and Projects.

These are actionable and describe different states of knowledge and reproducibility readiness.
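For the first two questions, a first pass can be a plain filesystem scan. A minimal sketch, assuming projects live as sibling directories and that a “code package” just means a pyproject.toml or setup.py (adapt the markers to your world):

from pathlib import Path

PROJECTS_ROOT = Path("projects")                  # hypothetical root folder
PACKAGE_MARKERS = {"pyproject.toml", "setup.py"}  # what counts as a "code package"

report = {}
for project in sorted(p for p in PROJECTS_ROOT.iterdir() if p.is_dir()):
    has_package = any((project / marker).exists() for marker in PACKAGE_MARKERS)
    readme = next(project.glob("README*"), None)
    mentions_repro = bool(readme) and "reproduc" in readme.read_text(errors="ignore").lower()
    report[project.name] = {
        "has_code_package": has_package,
        "readme_with_repro_steps": mentions_repro,
    }

for name, answers in report.items():
    print(name, answers)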

Note

Other questions will likely arise from these such as “Which projects are ongoing? Which code packages belong to which projects?”


The entities in your world

After writing a few questions, you can jump to a whiteboard and try to describe a lot of the entities in your world to brainstorm about what you own, what you interact with and how things cluster together. The entities are the targets of the questions, and dimensions allow you to define the coverage. Example: “We have a total of 12 projects: 5 have READMEs with reproducibility steps, 2 don’t have READMEs, 5 have no packages.”

It’s up to you whether you want to immediately define actions to take or treat it as a helpful data point for others to decide. The point is you stop spending time re-answering ad hoc questions.

The forcing function

Once you have questions, you write evals, checks that verify you can answer them. (See promptfoo’s documentation for one way to implement them.)

An eval is simply:

If I ask this question, I expect to get this kind of answer.

Writing the eval is where the value appears.

The moment you write it, you’re forced to confront whether you can actually answer the question at all.

  • If you can, the eval is straightforward.
  • If you can’t, you’ve discovered something you thought you knew but didn’t.
  • If the answer is subjective, that gap becomes explicit.

You don’t need perfect answers. You need to know which evals pass and which ones reveal that you never had a real answer.

That alone is valuable.

Note

While writing evals in a form that doesn’t break as things change requires some practice, it’s not your primary concern when starting. It’s fine to be explicit and baseline.


What you get

Once questions have evals, answering them becomes an implementation detail. With MCPs (Model Context Protocol servers), Playwright, and similar tools, programmatic access is easier than ever; the hard part isn’t answering questions, it’s knowing which ones to ask and systematizing what “good” looks like.

Over time, you start to see where your questions apply, which are too vague, which are easier than expected, and where your description of the world is still thin.

You don’t need to go all in. Start with a few questions and a small report that grows over time.

At that point, you’re no longer redoing one-off audits.

You’re running a continuous audit whose scope is defined by the questions you care about.


Why this is low-risk

  • You don’t need to know everything upfront.
  • You don’t need to define “good” everywhere.
  • You don’t need a complete model of your world.

You just need to acknowledge that you may not know the answer to questions you’ve never explicitly tried to answer (quite similar to the pain of writing unit tests while being unsure of what you want).

Even unanswered questions are useful. They tell you what’s unclear, subjective, or not worth investing in.


Closing

You don’t need to audit everything.

You just need to stop rediscovering the same answers, and the same unknowns, over and over again.

Start with a few questions. Let the audit grow.

If it doesn’t help, stop.

The cost is small. The insight compounds.

Reducing Error Compounding in GenAI Systems
https://alexhans.github.io/posts/series/evals/error-compounding-genai-systems-approach.html

GenAI is non-deterministic and can fail or produce different results for the same input.

A typical prompt-to-action flow involves many LLM calls. Each call is a chance for the model to misinterpret, hallucinate, or produce an unusable output.

The question isn’t if errors happen. It’s what happens when they do, and how many opportunities you give them to cascade.


How bad does it get?

Consider a simple case with just 4 steps (note: this is illustrative, consider how your system might have 10, 20, or more LLM calls):

  • Step 1: 95% chance of correct
  • Step 2: 95% chance of correct
  • Step 3: 95% chance of correct
  • Step 4: 95% chance of correct

End-to-end: ~81% chance everything is correct.

Now compare:

  • Step 1 (LLM): user intent → structured call (95%)
  • Step 2 (deterministic tool): execute (98%)
  • Step 3 (deterministic validation): parse + check (97%)
  • Step 4 (LLM): result → response (96%)

End-to-end: ~87%

Same model. Different architecture. 6+ point improvement.
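The arithmetic behind those numbers is just the product of per-step success rates; a quick check:

# End-to-end success is the product of per-step success rates.
from math import prod

all_llm = [0.95, 0.95, 0.95, 0.95]  # four LLM steps
mixed   = [0.95, 0.98, 0.97, 0.96]  # LLM, deterministic tool, validation, LLM

print(round(prod(all_llm), 3))  # ~0.815 -> ~81% end-to-end
print(round(prod(mixed), 3))    # ~0.867 -> ~87% end-to-end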


Two high-leverage approaches

  1. Remove one or more GenAI steps entirely: fewer chances to fail
  2. Replace GenAI steps with deterministic ones: lower error rate per step
Note

These aren’t the only ways to reduce error (e.g. consensus systems, retries, etc.), but the fundamentals here apply everywhere, whether it’s one agent or a swarm.


What makes deterministic steps different

Deterministic steps still fail, but the failure characteristics differ from LLM failures:

  • Bounded: failures come from a finite set of causes (parse error, timeout, missing field), not open-ended misinterpretation
  • Repeatable: same input, same failure: you can reproduce and fix it
  • Non-semantic: a crashed process doesn’t convince the next step that “actually the user meant X”
Note

This doesn’t mean deterministic = reliable. It means when things break, they break in less subtle ways and there’s a lot of software engineering history behind their robustness.


The design pattern

LLMs do many of the hard parts (interpreting intent, choosing tools, dealing with syntax, reasoning through results, deciding what comes next).

In a simplified flow, the model might:

  1. Receive user intent (natural language)
  2. Decide which tool to call and with what parameters
  3. Receive structured output from the tool
  4. Decide: done, or call another tool?
  5. Repeat until ready to respond
  6. Translate the final result back to the user

A lot of reasoning and orchestration happens there and the point isn’t to limit that but to give it good building blocks.

A human user is more effective with better building blocks (e.g. well designed libraries or cli tools) and so is an LLM.

What you want are building blocks that are:

  • reusable
  • composable
  • well tested
  • easy to change and maintain
  • cost effective

And a model that acts as a translation layer, not the tool running all the logic.


Practical recommendations

Identify the deterministic core

If you are writing a Claude skill and a step can be expressed as code, ask yourself why you’re not expressing it as code.

The tradeoff is real:

Leaving logic in prose means:

  • Higher error rate at runtime
  • Relying on evals instead of unit tests (if you don’t know what either of these is, then you’re definitely safer in the frozen-code world)
  • Paying the cost and error on every execution
  • Yes, it might improve as models improve, but you’re paying for that uncertainty every time

Moving logic to code means:

  • Lower error rate (deterministic execution)
  • Unit testable
  • Cheaper to run
  • Still easy to write with LLMs, have the model generate the code once instead of regenerating the logic from prose every time
  • You can still ask LLMs to review or improve the code later if you want

The second option gives you confidence that things actually work. The first option defers that confidence in exchange for alleged convenience.

If the LLM can write code for you, why have it translate markdown to logic on every run? Make the translation once, freeze it as code, and test it properly.

Force structure at boundaries

Don’t pass prose between steps. Use formats that are easy to serialize and deserialize, like JSON/YAML with schemas you can validate against.

Structure lets you validate, detect errors and attempt course correction or fail fast, diff, log, and evaluate deterministically.
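A minimal sketch of what a validated boundary can look like, using JSON Schema; the schema itself is illustrative:

# Validate the structured output of one step before handing it to the next.
# The schema is illustrative; the point is failing fast on malformed output.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool", "arguments"],
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "additionalProperties": False,
}

def parse_tool_call(raw: str) -> dict:
    data = json.loads(raw)  # fails fast on non-JSON prose
    validate(instance=data, schema=TOOL_CALL_SCHEMA)  # fails fast on the wrong shape
    return data

try:
    call = parse_tool_call('{"tool": "convert_pdf", "arguments": {"path": "diagram-1.pdf"}}')
except (json.JSONDecodeError, ValidationError) as err:
    call = None  # log it, retry, or course-correct deterministically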

Note

This will even save you money, time and compute resources by not having to use LLM-as-a-judge for assertions in your evals.

Test the building blocks

Write unit tests and integration tests for the core building blocks — same as you would’ve done before LLMs.
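Those tests look exactly like they always have. A sketch, where slugify stands in for whatever mechanical step you froze into code:

# test_building_blocks.py - ordinary pytest tests for a deterministic helper.
# slugify is a stand-in for whatever mechanical step you froze into code.
import pytest

def slugify(title: str) -> str:
    """Turn a free-form title into a filesystem-safe slug."""
    return "-".join(title.lower().split()) or "untitled"

@pytest.mark.parametrize("title,expected", [
    ("Quality Table", "quality-table"),
    ("  Public   Transport ", "public-transport"),
    ("", "untitled"),
])
def test_slugify(title, expected):
    assert slugify(title) == expected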


Closing

This isn’t about distrusting models, it’s about giving them good building blocks to use.

Use GenAI to translate intent. Use the building blocks to execute. Keep errors where you can measure them.

That is how automation becomes something you can trust instead of manually testing once and being hopeful.

Stop Reformatting Markdown When Pasting into Slack
https://alexhans.github.io/posts/slack/stop-reformatting-markdown-when-pasting-into-slack.html

Pain Point

Slack only pastes rich formatting when the clipboard advertises text/html; otherwise it treats everything as plain text.

If my file sample.md looks like this:

:robot_face: Tech updates :robot_face:

# Some Title

- **Project**: Did x in [google](https://google.com).
  - aaa
  - bbb
      - ccc 
  - ddd

# Another title

- Another launch
  - details

Compare pasting directly on the left and what we want on the right.

(Image: side-by-side comparison of the markdown pasted as plain text and the Slack rich-formatted version)

Solution

Put HTML onto the clipboard the same way a browser would, so Slack pastes it as rich content instead of plain text.

This requires a recent xclip build that supports advertising text/html correctly.

  1. Build xclip from source to get the latest features around html
  2. pip install beautifulsoup4 lxml
  3. Run pandoc sample.md -t html
  4. Optionally modify the HTML to fix things like lists.
  5. Pipe the result to xclip -selection clipboard -t text/html

What it looks like:

pandoc -f gfm -t html sample.md \
| python -c '
import sys
from bs4 import BeautifulSoup as BS, Tag

s = BS(sys.stdin.read(), "lxml")

def inline_html(tag: Tag) -> str:
    return "".join(str(x) for x in tag.contents).strip()

out_lines = []

def emit_block(tag: Tag):
    name = tag.name.lower()

    if name in ("h1","h2","h3","h4","h5","h6"):
        txt = tag.get_text(strip=True)
        if txt:
            out_lines.append(f"<strong>{txt}</strong>")
            out_lines.append("<br/>")
        return

    if name in ("p",):
        txt = inline_html(tag)
        if txt:
            out_lines.append(txt)
            out_lines.append("<br/>")
        return

    if name in ("ul","ol"):
        walk_list(tag, 0)
        out_lines.append("<br/>")
        return

def walk_list(lst: Tag, level: int):
    for li in lst.find_all("li", recursive=False):
        # separate nested lists
        nested = [c for c in li.find_all(["ul","ol"], recursive=False)]
        for n in nested:
            n.extract()

        text = inline_html(li)
        if text:
            indent = "&nbsp;" * (4 * level)  # 4 = indent width per level
            bullet = "&#8226;"               # •
            out_lines.append(f"{indent}{bullet} {text}<br/>")

        for n in nested:
            walk_list(n, level + 1)

body = s.body if s.body else s
for child in list(body.children):
    if isinstance(child, Tag):
        emit_block(child)

print("".join(out_lines), end="")
' \
| xclip -sel clipboard -t text/html -alt-text "Updates"

Note: The Python step is only needed if you want to fix Slack’s broken handling of nested lists. For simple formatting (bold, links, headings), Pandoc -> xclip alone is enough.

Inspiration

Authoring Markdown externally and pasting the ‘pretty’ output into Slack (on Linux) does the same thing without the extra formatting to fix the nested lists.

Annex

How to build latest xclip from source in Ubuntu

sudo apt install autoconf automake libtool libxmu-dev
git clone https://github.com/astrand/xclip
cd xclip
autoreconf -fi
./configure --prefix=/usr/local
make
sudo make install

WSL and powershell are different beasts

This won’t work on WSL and Slack as-is. You likely need to do it from PowerShell using a third-party program.

WSL cannot directly populate the Windows clipboard with rich HTML in a way Slack accepts; an intermediate Windows application re-copies the content with additional clipboard formats.

  1. PowerShell 5:
Get-Content out.html -Raw | Set-Clipboard -AsHtml
  2. Open LibreOffice Writer or any GUI and paste.
  3. Select that and copy.
  4. Paste into Slack.
Fix: pip hangs in WSL (IPv6 / gai.conf)
https://alexhans.github.io/posts/wsl-pip-hangs-ipv6.html

Pain Point

pip install hangs in WSL with no useful error, often after it starts fetching from files.pythonhosted.org.

The Rule

If DNS/connection to files.pythonhosted.org hangs but pypi.org works, suspect IPv6 preference + broken IPv6 routing.

Minimal Diagnosis

python -c "import urllib.request; print(urllib.request.urlopen('https://pypi.org/simple/').status)"
# expected: 200
getent hosts pypi.org
# returns quickly
getent hosts files.pythonhosted.org
# may hang

If files.pythonhosted.org hangs, pip will hang. That host is where wheels and sdists are served from.

Fix

Prefer IPv4 for address selection using gai.conf:

sudo tee /etc/gai.conf >/dev/null <<'EOF'
precedence ::ffff:0:0/96  100
EOF

This does not disable IPv6. It changes the precedence so IPv4 is tried first.

Verify

getent hosts files.pythonhosted.org
# should return immediately

Then retry:

pip install ipython

Revert

sudo tee /etc/gai.conf >/dev/null <<'EOF'
# empty override: use glibc defaults
EOF

Notes

If you see multiple stuck installs, clear them before retrying:

pkill -f "python -u -m pip install" || true

Other sources

Talks: Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem
https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html

Abstract

As LLM tools and agents emerge in the Airflow community, whether as plugins, MCP servers, or embedded agents, we lack a consistent way to benchmark across implementations and across versions of the same solution. This lightning talk highlights the need for an agreed-upon evaluation mechanism that enables us to measure, compare, and reproduce results when working with GenAI solutions in relation to Airflow. I’ll share what such a mechanism could look like in practice. If you care about building trustworthy, testable GenAI systems (that could eventually fit into CI/CD workflows) and want to be able to have grounded discussions when developing in this space, let’s lay the groundwork to test and compare our tools meaningfully.

Slides and Transcript

Using Data Build Tool (dbt) to Accelerate & Scale Science
https://alexhans.github.io/posts/using-dbt-to-accelerate-science.html

This post is part of the “Factory of Domain Experts” series.

What problem are we solving?

“When can we launch this?” is a recurring question in cross-functional teams, and the answer is often “ask the engineers”. But is that really necessary? Do scientists need to build and then hand off to engineers to rewrite code for scalability or reliability?

I challenged this pattern since I wanted to scale without having to grow engineering headcount, and empower our scientists to deliver more impact independently. The original handover approach introduced delays, estimation misses, integration surprises, and iteration overhead.

What we wanted to achieve:

  1. Scale our team: Grow scientific output independent of engineering capacity.
  2. Iterate in parallel, not sequentially: Team members collaborate building and integrating simultaneously, without waiting periods or handovers.
  3. Share easily reproducible code: Produce reproducible code and data that makes cross-team collaboration easy and transparent.

Instead of building custom solutions, we used a small set of industry-standard tools (dbt, the Data Build Tool, with SQL; Apache Airflow via astronomer-cosmos; and Git) to create a simple system. Scientists now develop close to the domain, and their work is automatically orchestrated and deployed without engineers needing to rewrite or manage the code. There’s no custom Graphical User Interface (GUI) or platform, just clear conventions, smart defaults, and infrastructure-as-code. Engineers focus on building reusable capabilities while scientists focus on science and business logic.
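To give a feel for the amount of glue involved, here is roughly what exposing a dbt project as an Airflow DAG through astronomer-cosmos can look like. Treat it as a sketch: the paths, profile, and schedule are placeholders, and the exact class names and arguments depend on the cosmos and Airflow versions you run.

# Rough sketch of scheduling a dbt project in Airflow via astronomer-cosmos.
# Paths, profile names, and the schedule are placeholders; check the cosmos
# version you install for the exact API (e.g. schedule vs schedule_interval).
from datetime import datetime
from cosmos import DbtDag, ProjectConfig, ProfileConfig

science_models = DbtDag(
    dag_id="science_models",
    project_config=ProjectConfig("/usr/local/airflow/dbt/science_project"),
    profile_config=ProfileConfig(
        profile_name="science_project",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)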

How dbt solves these problems

Data Build Tool (dbt) enables engineers and scientists alike to transform data using software engineering best practices. Crucially, there are no tradeoffs between scrappy exploration and production-ready code; the same code serves both purposes:

  • Production-ready from day one: The code scientists write IS the production code. No handovers, no rewrites, no “let me translate this for production.” Your development SQL becomes the scheduled pipeline automatically.
  • Collaboration and early integration: Since both engineers and scientists can run the same dbt code, collaboration happens naturally from day one, fostering cross-domain learning and surfacing integration or reproducibility issues early, reducing project risk.
  • Simple workflows that scale: A simple dbt run -s "model_name+" runs your model and all dependencies. The same code that works for individual data exploration works for production scheduling.
  • Modularity without orchestration headaches: dbt forces you to break apart monolithic SQL into focused models, but handles all the dependency management automatically, so you get the benefits of clean, debuggable code without the cognitive overhead of managing execution order.
  • Automatic lineage and documentation: dbt generates interactive dependency graphs showing how your models connect. Schema documentation automatically appears in the warehouse tables.
  • Built-in quality controls: Define data tests that run automatically.
  • Built for integration and extensibility: dbt integrates seamlessly with our existing AWS stack (Athena, Glue, Iceberg), internal services and datalakes and industry standard tools.
  • Compliance and governance: Data policies can be built into packages, ensuring compliance and empowering your users to make the right tradeoffs around data handling.

Impact

Our approach enabled delivery of multiple high-impact, scientist-led projects that would otherwise have been delayed or blocked by engineering constraints. Peer teams adopted it, or expressed a desire to, once they had a chance to work with us and experience the productivity speed-ups across different dimensions.

Use aider for free with your local LLMs or cheaply with OpenRouter
https://alexhans.github.io/posts/aider-with-open-router.html

Many people use LLM (Large Language Model) services to code at work but don’t necessarily see a path to using them at home on a budget.

Here are two quick recipes: one for a fully local, privacy-focused setup, and another using OpenRouter.

Local LLMs

  1. Make sure you have ollama installed and running.
  2. Note down which model(s) you have installed and plan to use. We’ll use deepseek-r1 and qwen2.5-coder as example models. Deepseek is general purpose and a good candidate for reasoning, while qwen2.5-coder is specialized for coding tasks.
$ ollama list

NAME                                        ID              SIZE      MODIFIED
deepseek-r1:14b                             ea35dfe18182    9.0 GB    2 hours ago
qwen2.5-coder:14b                           9ec8897f747e    9.0 GB    2 hours ago

I’m using the 14-B distilled models based on my hardware. You can experiment with different ones and find what speed vs quality tradeoff you’re comfortable with. The Ollama models site is very handy to get information about models and their distilled versions.

  3. Follow the guide, which tells you to run:
aider --model ollama_chat/<model>

So in our case, that becomes:

aider --model "ollama_chat/deepseek-r1:14b" --editor-model "ollama_chat/qwen2.5-coder:14b"

We could simply use one model for everything but this “plan vs execution” pattern works really well both locally and remotely.

Use aider --help or visit the options page on aider’s site to understand the differences between --model (main model), --editor-model (editor tasks), and --weak-model (commit messages and history summarization).

Cheaply with OpenRouter

If you’re not satisfied with using your hardware for everything and are ok with sending data to an LLM in the cloud, you can use OpenRouter.

The advantage of using OpenRouter over a specific LLM service like Claude, ChatGPT API or others is that you can have a cloud independent approach and mix and match APIs paying in only one place, while also setting specific budgets that you can’t go over.

Reddit user u/Baldur-Norddahl on r/LocalLLaMA shared a snippet of what it looks like. You’ll notice it’s very similar to our local example, with the addition of the OpenRouter API key as an environment variable, and that we use Claude 3.7 and the full version of Deepseek R1:

export OPENROUTER_API_KEY=sk-or-v1-xxxx
aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/anthropic/claude-3.7-sonnet --watch-files

You can easily monitor your activity and estimate what your coding sessions are actually like. This may lead you to switch from Claude 3.7 to something cheaper. Again, it’s all about personal experience and quality tradeoffs.

In Closing

Both patterns are very useful and allow you a great degree of flexibility. There’s a lot of power in customization and avoiding vendor lock-in. You’ll be able to experiment with cline/aider or whatever the next tool is. As hardware becomes more powerful, you could have a very productive experience on a plane, even without internet access.

Shoutout to Georgi Gerganov’s llama.cpp which is the core that allows ollama to work.

Merge and Forget
https://alexhans.github.io/posts/series/zeroops/merge-and-forget.html

The Rule

After your change is approved and enters the delivery pipeline, you should be able to forget it.

No following it through pipelines. No watching for when it lands. No manual checking for failure states.

Pain Point

Tracking deployments “just in case” creates unnecessary cognitive load.

It turns delivery into a background worry: tabs left open, dashboards checked, attention fragmented.

If something requires attention, you should be told.

Do

Treat delivery as a system property:

  • Push “surprises” left: run fast, automated cross-system checks (e.g. integration tests) before merge, at code-review time, to minimise post-merge failures.
  • You should get a signal if something is blocked or broken.
  • You should not need manual reassurance.
  • When the system is healthy, silence is expected.

(see No News Is Good News).

Do Not

  • Follow a deploy through the pipeline to feel safe.
  • Keep checking “did it land yet?”
  • Watch logs/dashboards after merge for reassurance.

Scope

This is for routine, continuous delivery of typical changes.

This does not cover:

  • major migrations
  • one-way / high-blast-radius changes
  • communicating expected delivery times (ETAs)
No News Is Good News
https://alexhans.github.io/posts/series/zeroops/no-news-is-good-news.html

The Rule

Do not check whether things are fine.

Pain Point

Manually checking systems to confirm they are “okay” creates unnecessary cognitive load.

Dashboards and queues become reassurance rituals: they consume time and attention without changing outcomes.

If something is broken, you should be told.

Do

For anything that might require action, ensure there is an automated signal.

  • The signal should reach the people who can meaningfully act on it.
  • It does not need to prescribe the action.
  • Over time, actions may be formalised (runbooks, automation), but that is secondary.

If something requires attention, it should create noise. If it does not, it should remain silent.

Do Not

  • Create mechanisms to confirm system health.
  • Regularly inspect dashboards “just to be sure”.
  • Rely on manual checks for reassurance.

Silence is expected when coverage is adequate.

Scope

This applies to operational, actionable failures: things that are broken now and require attention.

This does not cover:

  • slow degradation
  • trend monitoring
  • preemptive or exploratory analysis

Analogy

Think of a perfect assistant: they interrupt you only when there is something you can act on. If they do not interrupt you, you can assume everything is fine.

Set a Meeting Budget
https://alexhans.github.io/posts/meeting-budget.html

Pain Point

Too many recurring meetings drain your week’s productivity.

The Rule

  • Set a hard budget for fixed meetings.
  • Example: 40 h week, 6 h meeting budget.
Total - Budget = Free -> 40 - 6 = 34
  • If you go over budget, cut or shrink the least important meetings.
  • You can adjust the budget, but do so rarely, otherwise it loses meaning.
  • Ad-hoc syncs are fine. It’s the recurring ones that eat up your time.
    • Consider doing a similar thing for ad-hoc meetings, if they become a problem.
  • Like code, less is better. Always look for ways to reduce, even if you’re under budget.

Analogy

This is similar to the U.S. Senate’s PAYGO rule:

if Congress wants to add $N to a program, they must “find room” by reducing $N somewhere else or by increasing taxes to cover it.

Get notifications in ubuntu when command line tasks end
https://alexhans.github.io/posts/notifications-for-command-line-tasks.html

Intro

Often, when working in the terminal, you’ll find yourself running a command that takes a non-trivial amount of time and you don’t want to just stare at the screen until it finishes.

So you switch tabs/windows and do something else in the meantime. Problem is, when is the other task finished? You don’t want to waste time checking too often nor too late…

So what you want is a notification. One that lets you carry on merrily until the original command is actually finished.

It turns out that many .bashrc files come with an alias called alert, and some SO answers even improve upon it.

Desktop notifications with notify-send

Here’s the one I’m using lately and has served me well:

# Add an "alert" alias for long running commands.  Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

As the comment says, using it is just a matter of writing the command you want, a semicolon, and the alias alert. (Remember that the semicolon ; means “execute after the previous command finishes, no matter the return code”, unlike && which only executes the next command if the return code is 0, i.e. success.)

So if you’re compiling and running tests in a project you could just do:

make test; alert

and you’ll get notified whenever make test ends.

But what if you decided running that lengthy task is a good moment to step away from your computer and take a coffee break or talk with a coworker? How will you know when it’s done if you’re not in front of the computer to see the desktop notification?

Email notifications

That’s when email comes in handy. You just gotta take your phone with you and have access to an SMTP server.

$ sudo apt install mailutils

The logic is the same as before: once the command is done, execute the “alert”.

If you want to do it in python, here’s a simple way to go about it.
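A minimal sketch of that idea using only the standard library; the SMTP host and addresses are placeholders, and a real provider will usually want TLS and a login:

# notify.py - email yourself when a long task finishes.
# Host and addresses are placeholders; real providers usually need
# smtplib.SMTP_SSL / starttls() plus login credentials.
import smtplib
import sys
from email.message import EmailMessage

def send_alert(subject: str, body: str = "") -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "notifier@example.com"
    msg["To"] = "you@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as server:  # your SMTP server here
        server.send_message(msg)

if __name__ == "__main__":
    # Usage: make test; python3 notify.py "make test finished"
    send_alert(sys.argv[1] if len(sys.argv) > 1 else "Task finished")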

Just make sure it doesn’t go to SPAM.

Cheers


Was this helpful? Do you do it another way? All comments are welcome!

Accept a self-signed certificate with git
https://alexhans.github.io/posts/accept-self-signed-cert-git-https.html

Intro

Some time ago I ran into an issue where people served git repositories on a local network using Apache, but with a self-signed certificate for the server.

Everyone was already trained to add the exception in their browsers to access HTML content but what happened when it came to source code control?

The Problem

It turns out Subversion (SVN) presented no issue, since it prompted the user to accept the new server key just once and then didn’t pester them again, but git was another story. Git tried to verify that the cert was signed by a proper authority and couldn’t.

user@user-linux:git$ git clone https://user@dev-server-01/git/repo_name.git 
Cloning into 'repo_name'...
fatal: unable to access 'https://user@dev-server-01/git/repo_name.git/': server certificate verification failed. 
CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none

The Solution

After some googling I came across suggestions to disable SSL verification with git config http.sslVerify "false" but that looked like it could induce some bad habits and it actually wouldn’t prevent tampering if, for instance, the user was pointed elsewhere instead of the proper original server.

That’s when Stack Overflow came into play and I found out about this neat solution where you associate a hostname with a given certificate that you store locally.

Steps:

1- Download the self-signed certificate from the server and store it somewhere like /etc/ssl/certs:

/etc/ssl/certs/ssl-cert-dev-01.pem
/etc/ssl/certs/ssl-cert-dev-02.pem

2- Modify your git config (globally or per-repository) to associate hosts with certs:

(From git config --help)

http.sslCAInfo
    File containing the certificates to verify the peer with when fetching or pushing over HTTPS. 
    Can be overridden by the GIT_SSL_CAINFO environment variable.

In this case we’re going to do it globally by modifying ~/.gitconfig

[http "https://dev-server-01/"]
    sslCAInfo = /etc/ssl/certs/ssl-cert-dev-01.pem

[http "https://dev-server-02"]
    sslCAInfo = /etc/ssl/certs/ssl-cert-dev-02.pem

Or you can do it with the command line:

$ git config --global http."https://dev-server-01/".sslCAInfo /etc/ssl/certs/ssl-cert-dev-01.pem
$ git config --global http."https://dev-server-02/".sslCAInfo /etc/ssl/certs/ssl-cert-dev-02.pem

Of course, this breaks the flow for those who were using HTTP and the IP address directly, since you need the same name that appears in the certificate. That’s the one con I can think of and, if your users were not in the habit of doing so, you’d better start getting them used to it.

Cheers


Was this helpful? Do you do it another way? All comments are welcome!
