<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Analytics Engineering Roundup]]></title><description><![CDATA[The internet's most useful articles on analytics engineering and its adjacent ecosystem. Curated with ❤️ by Tristan Handy.]]></description><link>https://roundup.getdbt.com</link><image><url>https://substackcdn.com/image/fetch/$s_!9uGH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b4e3170-43ea-4f13-8662-f4b4e18cfe12_256x256.png</url><title>The Analytics Engineering Roundup</title><link>https://roundup.getdbt.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 13 Apr 2026 10:16:10 GMT</lastBuildDate><atom:link href="https://roundup.getdbt.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[dbt Labs Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[analyticsengineeringroundup@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[analyticsengineeringroundup@substack.com]]></itunes:email><itunes:name><![CDATA[Tristan Handy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tristan Handy]]></itunes:author><googleplay:owner><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:owner><googleplay:email><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tristan Handy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to Actually Move Up the Stack]]></title><description><![CDATA[A handbook for interesting times]]></description><link>https://roundup.getdbt.com/p/how-to-actually-move-up-the-stack</link><guid 
isPermaLink="false">https://roundup.getdbt.com/p/how-to-actually-move-up-the-stack</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 12 Apr 2026 12:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c85f0f37-874c-404a-b9b3-ff1f2a621583_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://roundup.getdbt.com/p/moving-up-the-stack-analytics-engineering">Last week</a> I wrote about why analytics engineers are being called, again, to move up the stack. The piece tried to make the case that the work many of us have been doing for the past several years was, in retrospect, preparation for exactly this moment. The primary response was: OK, that sounds great. What do I actually do, right now, to set myself up for the coming wave?</p><p>Before I get into any of the practical stuff, the caveat I want to repeat throughout this piece: <em>nobody knows how this is all going to play out </em>- and in fact the smartest people have very wide bars on the possibility space for the next few years. We know this train is going somewhere, but exactly where is unclear.</p><p>The following is not a perfect plan, but it is what I have seen work, mostly from watching people I trust, partly from things I have stumbled into myself. The best set of moves I currently know how to make, offered in that spirit. So let&#8217;s get into it.</p><h2>Timing matters</h2><p>These transitions tend to have a sweet spot. </p><p>If you try to push too early, you have to drag everyone else with you, and that is exhausting and often unrewarding (hopefully we&#8217;re past this bit in most orgs). </p><p>If you wait until everyone else has done it, the easy wins are gone. The window where it is just slightly early is the window where it feels like surfing a wave instead of getting overtaken by it. 
(My team has gotten a little sick of me scheduling meetings called &#8220;ride the wave&#8221; where we prototype AI demos, but you know what, it works.)</p><p>We are in that window right now. And it is moving faster than any of the previous ones, which means the window is also narrower than the ones before it.</p><p>What that means in practice is a mindset shift, and the mindset shift comes before any of the tactical stuff. It also means making the <em>emotional </em>transition: while this is fun and exciting, it is also scary and highly uncertain. Take the time to sit with that and reflect on it, then decide your action plan.</p><p>Your job, starting now, is to move up the stack and figure out how to be impactful in the coming paradigm. </p><p>Not as a side project or a 10% time experiment after you finish your ticket queue. As the actual point of what you are doing. If you are lucky, your organization will recognize this and give you space to do it. </p><p>The honest truth is that, in most cases, nobody is going to walk over to your desk and tap you on the shoulder and say <em>we have decided you should spend a quarter learning how to build agentic systems on top of our data</em>. It mostly does not work like that. It almost always requires some amount of courage, or initiative, or both. It means carving out time. It means making this <em>your </em>priority even when it is not the official priority. </p><h2>Finding Signal in All the Noise</h2><p>Just as important as committing to doing this is <em>finding good information inputs that will actually help you accomplish this transition.</em></p><p>The signal-to-noise ratio out there right now is not great. If you go looking for &#8220;best practices for AI in data,&#8221; you will find a thousand posts of varying quality. The patterns and best practices for actual data work in the agentic era are still being built; they are very much still being figured out and up for debate. 
The signal is real but it is sparse, and you have to work to find it.</p><p>Some of what I would point you at: the <a href="https://openai.com/index/inside-our-in-house-data-agent/">OpenAI in-house data agent post</a>, the <a href="https://ramp.com/blog/ramp-agents-announcement">Ramp data agent writeup</a>, and a small handful of others. Read them carefully. Do not skim them. Actually read them, and ask yourself what the people who wrote them did differently from what your team is doing right now. Consume everything they put out publicly. Be prolific in how you absorb their work. And then build your version of what they did, scaled to whatever your situation allows.</p><p>Depending on the size of your organization, the resources you have, and the political surface area you control, &#8220;your version&#8221; might look very different from theirs. That is fine. The point is not to copy. The point is to internalize the pattern of how good work in this space gets made, and then to apply it locally.</p><h2>The One Thing You Cannot Skip</h2><p>At the end of the day, nothing beats hands-on experience, which is why <strong>you need access to real production data, with best-in-class agent tooling on top of it, and you need it now.</strong></p><p>If you cannot get an agent pointed at production data in your current role, your first priority is either to fix that internally or to find a role where it is possible. I am not saying that lightly. I know that internal security and access controls exist for very good reasons, and I am not suggesting anyone try to route around them. But I am saying that the experience of working with a real data agent on real data is so different from reading about it, or watching demos of it, or playing with toy datasets, that until you have done it you are essentially flying blind on the most important question of the next few years.</p><p>Push to get the access. Make the case. Find the security-approved path. Get the budget for the tools. 
If after a lot of effort you still cannot get there, take that as serious data about the environment you are in.</p><h2>Case studies in finding projects with an edge</h2><p>This all sounds a little abstract, so I want to give you some examples of how I&#8217;ve applied this exact formula over the past few years to move myself, my team and all dbt users up the stack and prepare us for this moment.</p><p>I am not a product manager, but it is my job to make sure that you all have the best tools you can have in order to ride the AI wave. So I&#8217;ve been keeping my eyes and ears open for things that we at dbt can do to help dbt users adopt AI: by making it my job to do so, even when it wasn&#8217;t obvious, by making sure I had the right information inputs, and then by acting locally when the information made it clear it was time to move.</p><p>First, it was proving that dbt is useful for connecting language models to your data to answer business questions. The semantic-layer versus text-to-SQL work we have been prototyping at dbt got a much sharper external reference point when <a href="https://arxiv.org/pdf/2311.07509">a paper got published</a> putting numbers around the same intuitions our team had been chasing, and <a href="https://www.getdbt.com/blog/semantic-layer-as-the-data-interface-for-llms">we suddenly had something concrete to benchmark against</a>.</p><p>Next, it was reading Ethan Mollick&#8217;s blog post <a href="https://www.oneusefulthing.org/p/now-is-the-time-for-grimoires">Now is the Time for Grimoires</a>, which changed how I thought about prompt-as-artifact and led me to work on building dbt Assist, the first official copilot for dbt.</p><p>Next was the first prototype of the <a href="https://www.getdbt.com/blog/dbt-agents-remote-dbt-mcp-server-trusted-ai-for-analytics">dbt MCP server</a>. 
I vibe-coded it on a weekend because I had been tracking MCP for a while, and then I saw <a href="https://www.youtube.com/watch?v=kQmXtrmQ5Zg&amp;t=919s">a talk at Swyx&#8217;s AI Engineer conference</a> about MCP, and somewhere in the middle of that talk I realized: <em>we need this, and we need it now</em>. There was no notification. There was no Slack message from leadership saying it was time. The signal came from being in the room, paying attention, and trusting the prickle on the back of my neck when I felt it. And then being fortunate that talented engineers across the company picked up my half-baked prototype and turned it into a real product that is growing exponentially to this day.</p><p>The same thing happened with skills. I saw <a href="https://www.youtube.com/watch?v=CEvIs9y1uog">a talk on Claude&#8217;s skills feature</a>, also at one of <a href="https://www.ai.engineer/">Swyx&#8217;s events</a>, and a month later we launched dbt agent skills.</p><p>The pattern, if there is one, is this: immerse yourself in the work, in the community, in the writing, in the talks, in the experiments other people are running. And then when something clicks, when you feel that <em>we need this and we need it now</em> feeling, trust it and act on it. Get involved in the conversation. Build the thing internally. Post about it publicly. Submit conference talks. Submit meetup talks. Write the LinkedIn post.</p><p>The flywheel of finding interesting ideas or patterns, doing the work to apply them to where you are as best you can, and then sharing the work is, I think, one of the most important career moves available to anyone during periods of high change.</p><p>There are a huge number of specific things you could be working on. The dbt MCP server. Agent skills. Building an analyst agent. Building a dbt-native harness on top of Codex. Building automated CI workflows. There are many, many more, and the list is growing every week. 
The specific project matters less than the fact that you are doing one.</p><p>Once you get in the game everything feels different.</p><h2>Of course it&#8217;s not that simple</h2><p>I want to bring in something that Salim <a href="https://roundup.getdbt.com/p/moving-up-the-stack-analytics-engineering/comment/239488126">wrote</a> in response to last week&#8217;s piece, because it gets at one of the important differentiators between the last phase change and this one. Quoting from the comment directly:</p><p><em>If you were a passionate analytics engineer before the existence of coding agents with the boring work included, the future should be just as exciting. At least, I feel that way. But this time, the transition has one difference in my opinion, which I think makes it harder: it requires a mindset change across the entire company, not just the data team.</em></p><p><em>When dbt came along, you could largely adapt your own workspace in isolation, and the external environment in the company did not need to move with you to a large extent. Moving up the stack with agents is different. If the goal is to democratize data across the organization, make everyone a data person, and free the analytics engineer for high value work, then the whole company has to rethink how it operates internally. For example, a business update shared by the head of marketing in an all hands, previously held in an analyst&#8217;s head, now needs to be captured in a format an agent can consume. This organizational knowledge management problem is everyone&#8217;s job in the company. Similarly, it will not be enough to deploy a data agent to Slack, but make sure that every stakeholder has a base understanding of how to ask a question to the agent. 
These problems are not actually related to any context engineering problems that we have mostly been talking about in the data community.</em></p><p><em>So change management will be the biggest barrier to capturing the value of the agentic era for data, and a visionary data team will not move an organization alone.</em></p><p>This is a very fair point. Analytics engineering has always involved organizational change, of course. It created a whole new job title, a career ladder, an org structure. That was hard! When dbt came along, you could adapt your own corner of the world in relative isolation. The marketing team did not have to change anything about how it operated for you to start using version control on your transformations.</p><p>The agentic transition is different. If the goal is to truly democratize data, to make every employee a data person, and to free analytics engineers for higher-value work, then the whole company has to rethink how it operates.</p><p>These are not problems that look much like the context engineering work the data community has been talking about. They look like culture work. They look more like organizational design and change management. And change management may very well end up being the biggest barrier to capturing the value of all of this. A visionary data team does not move an organization on its own.</p><p>I do not have a clean answer for this. I am not sure anyone does yet. But I think the people who figure out how to work this dimension, the people who can build the agentic data systems <em>and</em> help their organizations metabolize the change at the same time, are going to be doing some of the most important work in the industry.</p><h2>Charting the future, together</h2><p>Before I close I want to come back to the thing I said at the top.</p><p>The level of uncertainty in this moment is extraordinarily high. Higher than it has ever been in my career, by a margin that makes my head spin a little when I sit with it. 
I do not want any of this piece to read as <em>I figured out how to do this in the dbt transition, so now you can do the same thing here</em>. I am not saying that. I do not think anyone knows how this plays out.</p><p>What I am saying is something more limited. If your goal is to set yourself up as well as possible for a world in which analytics engineers are deeply integrated with agentic workflows, then in aggregate, from what I have seen, this is the highest-leverage set of moves I currently know how to recommend. It is not a guarantee. It is the best bet I can offer. </p><p>And if you find yourself having to choose between clearing the ticket queue and spending a morning building something with agents on real data, I think the second one compounds in ways the first one does not. </p><p>It&#8217;s the bet I&#8217;m making myself.</p><p>We have been here before, in a smaller way. We moved up the stack once already, and the things we learned along the way, the modeling instincts, the systems thinking, the hard-won understanding of how organizations actually use data, all of that came with us and made the next thing possible. I think that is going to be true again.</p><p>The wave is here. It will not wait for us. And the work of figuring out what comes next is some of the most interesting work I have ever gotten to do. I hope you find a way to do some of it too.</p><p> Jason</p><p><strong>Appendix - what I&#8217;m reading to stay up to date on AI</strong></p><p>It&#8217;s important to have a lot of input, to recognize the strengths and weaknesses of various commentators and learn over time how to sensemake across them. 
Here are my sources, ordered from most measured to most speculative, but all sources I consider high quality <em>for the niche they occupy.</em></p><p>Ethan Mollick - One useful thing - for high level, accessible overviews of the AI landscape </p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:1180644,&quot;name&quot;:&quot;One Useful Thing&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!hyZZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2ee4f7-3e71-42f0-92eb-4d3018127e08_1024x1024.png&quot;,&quot;base_url&quot;:&quot;https://www.oneusefulthing.org&quot;,&quot;hero_text&quot;:&quot;Trying to understand the implications of AI for work, education, and life. By Prof. Ethan Mollick&quot;,&quot;author_name&quot;:&quot;Ethan Mollick&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:null,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://www.oneusefulthing.org?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" src="https://substackcdn.com/image/fetch/$s_!hyZZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2ee4f7-3e71-42f0-92eb-4d3018127e08_1024x1024.png" width="56" height="56"><span class="embedded-publication-name">One Useful Thing</span><div class="embedded-publication-hero-text">Trying to understand the implications of AI for work, education, and life. By Prof. 
Ethan Mollick</div><div class="embedded-publication-author-name">By Ethan Mollick</div></a></div></div><p><a href="https://aidailybrief.ai/">The AI Daily Brief</a> - for solid daily analysis of the latest AI news from an enterprise perspective</p><p>Anything from the <a href="https://www.ai.engineer/">AI Engineer</a> conference or associated properties</p><p><a href="https://metr.org/">METR</a> and <a href="https://www.redwoodresearch.org/">Redwood Research</a> - for technical research on AI capabilities, including the personal writings of their team members, such as <a href="https://www.planned-obsolescence.org/">Ajeya Cotra</a></p><p><a href="https://www.cognitiverevolution.ai/">The Cognitive Revolution podcast</a> - practical conversations with people across the AI industry, with a focus on people working at the forefront of interesting problems</p><p><a href="https://www.hyperdimensional.co/">Hyperdimensional</a> by Dean Ball - for reflections on AI progress and what it means for policy and the longer arc of history. I particularly recommend the most recent post on <a href="https://www.hyperdimensional.co/p/new-sages-unrivalled">Mythos</a></p><p><strong><a href="https://thezvi.substack.com/">Don&#8217;t Worry About the Vase</a></strong> - a good collection of relatively high-signal information from across the internet - a lot of content, but still more manageable than trying to track it all yourself</p><p><a href="https://x.com/AndrewCurran_">Andrew Curran on X</a> - breaking news and theorization from industry insiders. 
Shares rumors and speculation but as far as I can tell one of the more accurate accounts to do so</p>]]></content:encoded></item><item><title><![CDATA[Moving Up the Stack: Analytics Engineering in the Age of Agents]]></title><description><![CDATA[The time dbt automated my job away]]></description><link>https://roundup.getdbt.com/p/moving-up-the-stack-analytics-engineering</link><guid isPermaLink="false">https://roundup.getdbt.com/p/moving-up-the-stack-analytics-engineering</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 05 Apr 2026 13:05:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c86ce74f-6f8b-4ee6-b9f1-8e33904afcdf_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>How would you feel if you were looking at the website of a potential new employer and you read this?</p><p>&#8220;We believe that all team members should seek to replace themselves on an ongoing basis by building processes, technology, and documentation that obviate their existing work. We have an abundance mindset: there is always more, and more valuable, work to do. Moving up the stack presents growth opportunities for both the individual and the team.&#8221;</p><p>In today&#8217;s climate, defined by AI anxiety, it strikes a very specific chord. Maybe even ominous.</p><p>And you might be surprised to hear that has been one of the core values of dbt <a href="https://github.com/dbt-labs/handbook/commit/8bf2f0829f63549df6af5bae954e5f5b8df48c5e">since 2016</a>. We call it &#8220;moving up the stack&#8221;. I think it&#8217;s really easy to read the first half of the value, about attempting to obviate your existing work and miss that <em>the second half is just as important </em>- that doing so creates growth opportunities for the individual and the team.</p><p>The goal of moving up the stack is not ominous, it&#8217;s not to make humans irrelevant. 
It&#8217;s to <strong>empower them, </strong>to allow them to solve creative problems in new ways. It&#8217;s based on a fundamental belief that people are smart, the world is complicated and we all have so much more good work we could be doing, given the right support.</p><p>Moving up the stack is most relevant when a role is hitting a phase change, a threshold point during which the new version of the role is fundamentally different from before, usually precipitated by a new technology. We&#8217;re at such a point right now. But before we talk about that, I want to tell the story of the last time we moved up the stack as a profession.</p><p><strong>Analytics Engineering Everywhere</strong></p><p>Almost exactly five years ago, I wrote a blog called <a href="https://jasnonaz.medium.com/analytics-engineering-everywhere-d56f363da625">Analytics Engineering Everywhere</a>. It was a post about how I was pretty confident that within 5 years, the principles behind dbt would be the worldwide standard for how data work is done. I&#8217;m generally nervous about making big sweeping predictions but:</p><ol><li><p>I was very confident that this was correct</p></li><li><p>Five years felt impossibly far away, so if anything went wrong it would be future Jason&#8217;s problem to deal with</p></li><li><p>But mostly it was that I believed in dbt</p></li></ol><p>I had such strong conviction because I&#8217;d seen firsthand the way that analytics engineering entirely reshaped my job and it was just obvious that this was a better way to do things.</p><p>And now five years later, I&#8217;d say that most of the article held up pretty darn well. The massive transition in data work that the early analytics engineers were seeing, pretty much happened. 
You can see it in the <a href="https://clickpy.clickhouse.com/dashboard/dbt-core">adoption numbers</a>: dbt now has over three million daily downloads and will have been downloaded for the billionth time at some point this month.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hjNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hjNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 424w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 848w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hjNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png" width="1374" height="1016" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1016,&quot;width&quot;:1374,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hjNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 424w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 848w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!hjNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeda71a1-b326-4369-8cfd-964dbbaf99bd_1374x1016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And you can see it in the community. Literally millions of humans have improved their impact, increased their salary and helped the world get better at using data.</p><p>So in one sense, I feel pretty good about my prediction for five years out. But&#8212;and this is a big but!&#8212;I missed an even more important point.</p><p>It turned out that there was an even bigger wave following directly behind as <a href="https://roundup.getdbt.com/p/analytics-intelligence-everywhere">LLMs emerged to totally reframe how every knowledge worker</a><em><a href="https://roundup.getdbt.com/p/analytics-intelligence-everywhere"> </a></em><a href="https://roundup.getdbt.com/p/analytics-intelligence-everywhere">is thinking about the future of their work</a>. 
<br><br>Now is the time to make sure that data practitioners are best positioned to ride this wave and be set up for success in the agentic era, even as the tasks, skills and value drivers in data work <em>will fundamentally change in the near future</em>.</p><p>It&#8217;s time to move up the stack again.</p><p><strong>Data work has already changed unrecognizably in the past decade</strong></p><p>I know because I lived it once.</p><p>Pre-dbt, I was working at a tech startup as a data analyst. My job, more or less, was to handwrite a whole lot of SQL queries. We had a set number of weekly and monthly reports, and we mostly did those, with some occasional larger investigative or experimental projects. Just a few years out of school, it was a great way to get a deep, hands-on understanding of how a business operates and how all the pieces fit together. It was nice, if a bit comfortable.</p><p>Until one day - it got <strong>very </strong>uncomfortable. Our CEO asked me to pull metrics for a board meeting. Not the usual handful of charts. Way, way more than we&#8217;d ever had to pull before - essentially every data point you could imagine. Our system simply wasn&#8217;t set up to operate at that scale. So I went home and spent two weeks writing SQL queries fourteen hours a day. Hundreds of queries, each one meticulously hand-built, each one requiring me to hold an enormous amount of context in my head simultaneously. It was the most intense, highest-drudgery period of my professional life. Honestly, it was kind of fun in a masochistic way. And when it was over, I came back and said one thing: <em>we can never, ever do that again by hand.</em></p><p>That search for a better way led us to dbt. I read <a href="https://docs.getdbt.com/community/resources/viewpoint">the viewpoint</a> and I was <strong>hooked. 
</strong>Within a couple of months of adopting dbt, I had more or less automated my entire job up to that point.</p><p>Years of accumulated skills (knowing our databases inside out, knowing how to pull every report, knowing the quirks and workarounds) - all of it, automated. That intense flurry of work for the board meeting turned out to be the last time in my career I would ever use that particular skillset.</p><p>But here&#8217;s the thing: I hadn&#8217;t become useless. I had become more valuable. I had moved up the stack. Because now, instead of spending my days crafting artisanal SQL scripts, I was building and maintaining our dbt project. I was able to spend my time working on data-driven experiments and process changes to <em>improve </em>the business, because I wasn&#8217;t spending all day writing queries.</p><p>And with the rise of agents, we&#8217;re once again being asked to move up the stack. I&#8217;m not claiming this is a perfect or even neat parallel; I, like many of us, have real fears about the labor market and indeed the basic social contract as we&#8217;ve known it so far continuing to hold under such dramatically changing winds. But the thing that I <em>am </em>very confident in is that the best way to navigate this, for individuals and for our industry, is to focus on moving up the stack: automating what we do now and finding bigger, bolder things to take on.</p><p>This is the right thing to do in that it&#8217;ll make us more effective in our roles, but I also believe that the best thing each of us can do to help smooth this transition is to determine how to effectively navigate the new technological landscape we find ourselves living in. Who knows, it might even be <a href="https://benn.substack.com/p/something-good">fun</a>.</p><h2>The new world is already here</h2><p>This is not about some distant future state. AI usage is exploding. From time to time I pull this graph up and just stare at it. 
This is, to put it lightly, not a normal growth trajectory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lx9S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lx9S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 424w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 848w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lx9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lx9S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 424w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 848w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!lx9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8514137-a357-4bf8-ac0a-a1189aa3267a_2048x1474.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>This is from <a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation">February</a>. It&#8217;s now much higher.</em></p><p>A big part of the reason for Anthropic&#8217;s explosive growth has been the meteoric rise of Claude Code <a href="https://davegriffith.substack.com/p/why-software-development-fell-to">for software engineering</a>. Regular readers of this newsletter will recognize the ongoing theme here - tooling and best practices come first to software engineering and then to data. </p><p>Getting this working for data requires the ability to interact with data systems at scale, which in turn requires an additional layer of capabilities and tooling.</p><p>That being said, there&#8217;s already strong movement on agent adoption in the data world.</p><p>Hex has stated that more than 50% of their new cells are created by agents. Think about that for a second. 
The tool that analysts live in, the environment where the actual analytical work happens, is already half agent-driven. The <a href="https://www.getdbt.com/blog/dbt-agents-remote-dbt-mcp-server-trusted-ai-for-analytics">dbt MCP server</a> has seen usage grow 40% month-over-month and is starting to become a central piece of data infrastructure, with agents consuming dbt projects as context across a remarkable range of use cases. <a href="https://roundup.getdbt.com/p/agent-skills-disseminating-expertise">dbt Agent skills</a> let you package up expertise. Forward-looking companies like Ramp are deploying <a href="https://engineering.ramp.com/post/meet-ramp-research">agentic analysts</a> to exponentially increase the value of their data.</p><p>These systems work. They work well today and they&#8217;re getting better fast. There are real questions left to answer (does anyone have opinions about the role of a Semantic Layer in all of this?) but you need to do a lot of mental gymnastics not to believe that these systems are going to fundamentally reshape data work.</p><p>Five years ago, it was obvious it was a matter of time until everyone was using dbt. Today, it&#8217;s obvious that it&#8217;s a matter of time until everyone has agents at the heart of their data work. But unlike last time, this transition will not take five years.</p><p>The question then - if we&#8217;re moving up the stack, what are we moving to?</p><p>But that&#8217;s not actually one question. It&#8217;s about 100. Questions like:</p><ul><li><p>Why does knowledge even need curation in a world of agents?</p></li><li><p>What does an analytics engineer actually <em>do</em> in a world where AI writes SQL?</p></li><li><p>How do we maintain institutional knowledge about our data models if AI is generating them?</p></li></ul><p>There&#8217;s a lot to say about each of these, as well as a lot to build. 
Now is the time to start deeply thinking about the future of data work and to start building the systems, processes and teams that can support it. I&#8217;ve talked to many of you already doing it and it&#8217;s incredible to see the things that this Community is building. </p><p>Check back soon for more dispatches from myself, Tristan and others as we chart the brave new world together.</p>]]></content:encoded></item><item><title><![CDATA[Agent Skills: Disseminating Expertise]]></title><description><![CDATA[I'll never look at a repo full of markdown files the same way.]]></description><link>https://roundup.getdbt.com/p/agent-skills-disseminating-expertise</link><guid isPermaLink="false">https://roundup.getdbt.com/p/agent-skills-disseminating-expertise</guid><dc:creator><![CDATA[Tristan Handy]]></dc:creator><pubDate>Mon, 30 Mar 2026 13:58:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JMKk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JMKk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JMKk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!JMKk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!JMKk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!JMKk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JMKk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3192252,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/192129292?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!JMKk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!JMKk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!JMKk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!JMKk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f22a66-9ebd-42ce-be42-6e42c30488a2_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A few weeks ago I pointed Claude, equipped with our new <a href="https://github.com/dbt-labs/dbt-agent-skills/blob/main/skills/dbt-migration/skills/migrating-dbt-core-to-fusion/SKILL.md">migrate-to-fusion skill</a>, at a real, decently-sized dbt Core project running 1.10 and told it to do its thing.</p><p>It performed the entire migration with zero help from me; Fusion compiled and ran flawlessly.</p><p>I sat there for a second after it finished. That skill encodes hundreds, maybe thousands of hours of collective human experience across our team and the community: the edge cases, the config quirks that trip everyone up, the judgment calls about what to deprecate and what to preserve. Things you&#8217;d only know if you&#8217;d done multiple migrations. All that, now in 12kb of markdown, callable by any agent that supports skills.</p><p>And that&#8217;s just a drop in the bucket. The rest of the <a href="https://github.com/dbt-labs/dbt-agent-skills">skills we shipped</a> encode the best-practices expertise that built up across the entire dbt community over the past decade.</p><p>I recognize that this is not a well-formed question, but &#8230; <em>what does that mean? </em>It feels big, important. We&#8217;ve built hundreds of hours of training and certification content, written hundreds or thousands of pages of documentation, all for humans. And certainly, we haven&#8217;t replicated the expertise of a human analytics engineer&#8230;yet. But it&#8217;s a lot more than nothing, too.</p><p>I&#8217;ve been sitting with that question ever since that migration, and I don&#8217;t have a complete answer. 
But I have some thoughts.</p><h2>What We Built</h2><p>Agent skills are bundles of prompts and procedural guidance that AI agents &#8212; Claude Code, Cursor, Copilot, Codex, etc. &#8212; load dynamically when you ask them to do relevant work. They&#8217;re not documentation. They&#8217;re not MCP tools. They&#8217;re something in between: encoded expertise that an agent can load and apply without you having to explain how to do a task every time you open a new session.</p><p>We&#8217;ve <a href="https://docs.getdbt.com/blog/dbt-agent-skills">shipped 8 dbt-related skills so far</a>. I haven&#8217;t used them all, but our team has&#8212;from solutions architects to resident architects to the DX team that did most of the work to build them&#8212;and the overall feedback is that my experience is typical. They work, often shockingly well.</p><p>I think that at least part of that is that skills are optimized for a different reader than anything we have written for before. When you&#8217;re writing for agents, you can be significantly more declarative (&#8220;do this&#8221;) whereas when writing for humans you have to preserve a lot more space for individual opinions and tastes. The former, combined with current models&#8217; performance, just produces really excellent results.</p><h2>What Doesn&#8217;t Exist Yet</h2><p>Eight skills is a start. It&#8217;s great, and I&#8217;m pumped. But there is certainly a lot more to do. Here are some things we haven&#8217;t even scratched the surface on yet:</p><ul><li><p><strong>Development workflow<br></strong>Code review, dbt Mesh, exposures, metadata</p></li><li><p><strong>Data modeling, deeper technicals<br></strong>Snapshots, Python models, warehouse optimization, open table formats</p></li><li><p><strong>Data modeling best practices</strong><br>Auditing for consistency, detecting duplication</p></li></ul><p>These are just the incredibly obvious ones and I&#8217;m sure you can think of many more. 
If any of these is something you&#8217;ve spent real time on and developed opinions about, the repo is <a href="https://github.com/dbt-labs/dbt-agent-skills">open for contributions</a>.</p><h2>Skills + MCP vs. Skills + CLI</h2><p>If you&#8217;ve already set up the dbt MCP server, you&#8217;re probably wondering how skills relate. Same? Different? Complementary?</p><p>Short answer: MCP and skills are different things; they&#8217;re both useful; the relationship between them is pretty interesting. The original narrative was &#8220;they&#8217;re complementary: MCP helps with tool calling and skills help with expertise.&#8221; And that&#8217;s not wrong, but it&#8217;s insufficient.</p><p>The problem with that perspective is that it underemphasizes a real tradeoff. Simon Willison put it more bluntly, titling his October 2025 post on the subject: &#8220;<a href="https://simonwillison.net/2025/Oct/16/claude-skills/">Claude Skills are awesome, maybe a bigger deal than MCP.</a>&#8221; The drawback of MCP he pointed to was token consumption, as it injects full tool schemas whether or not they&#8217;re relevant, making every interaction less efficient.</p><p>The alternative approach, for developer-oriented products, is skills + CLI. <a href="https://www.scalekit.com/blog/mcp-vs-cli-use">Benchmarks across 75 runs</a> show CLI agents completing tasks in 1,365 tokens versus 44,026 for MCP agents, almost entirely because the GitHub Copilot MCP server injected all 43 of its tool schemas into every conversation regardless of whether they were used. The CLI approach won on cost by 10&#8211;32x for these tasks and hit 100% task completion versus MCP&#8217;s 72%; adding an 800-token skill file to the CLI agent reduced tool calls by a third and latency by a third on top of that.</p><p>Of course, that&#8217;s all in a single fairly constrained study, and there are plenty of reasons why that may or may not apply in other contexts. 
The point is that the right way to do tool-calling is currently a bit up in the air; it will take some time to figure out best practices more definitively.</p><h2>The Skills-Package-Manager Problem</h2><p>The skills distribution layer is, let&#8217;s say, <em>nascent</em>. Lots of folks see the opportunity and are building similar products simultaneously, and there just hasn&#8217;t been convergence on requirements yet. This is fun to watch: the infrastructure for a new category is getting built in real time.</p><p>There are a bunch of &#8220;skills package managers&#8221; out there, but from what I can tell there are three in the lead: <strong><a href="http://skills.sh">Vercel / skills.sh</a></strong>, <strong><a href="https://tessl.io">Tessl</a></strong>, and <strong><a href="https://skillsmp.com">SkillsMP</a></strong>.</p><p>My read: this infra is primarily being built <em>outside</em> the model providers (the multi-platform benefits are real), and there doesn&#8217;t necessarily need to be convergence. The pre-AI analogy is npm, Homebrew, apt, PyPI: there has never been package manager convergence and I don&#8217;t think that needs to change.</p><p>The more interesting question to me is whether dbt&#8217;s package manager should build in native skills support. There&#8217;s an <a href="https://github.com/dbt-labs/dbt-core/discussions/12521">active discussion in the dbt-core repo</a> right now proposing exactly that: essentially, <strong>dbt deps</strong> would install both packages and skills in one command. I kinda love the idea of dbt packages bundling their own skills&#8212;install <strong>dbt_utils</strong> and get the skills that teach your agent how to use those macros correctly. Zero-friction onboarding, skills as a first-class part of the project dependency graph.</p><p>At first glance, that feels neat. But the longer I think about it, the more it feels &#8230; pretty effing transformative. 
Imagine referencing dbt-datavault and not only getting a bunch of macros but also <em>an entire set of best practices </em>that your agent can automatically deploy.</p><p>I find this compelling, and I imagine we&#8217;ll likely move in this direction, though (standard disclaimer) <em>this isn&#8217;t a commitment</em>. We&#8217;ll share more as we think it through, and please feel free to weigh in on the above discussion.</p><p>What I am confident about: native dbt skills package management and shared registries like Tessl and skills.sh aren&#8217;t competing. We list dbt-agent-skills on both and I don&#8217;t expect that to change.</p><h2>Technical Knowledge vs. Best Practices Knowledge</h2><p>If all of the above was this big download of background info on where we&#8217;re at with skills, this is the part that I&#8217;m genuinely curious about. What&#8217;s the role of &#8220;traditional&#8221; product documentation moving forwards? Training? Certification? We have invested a ton of time / energy / resources into building the expertise of an entire ecosystem of analytics engineers; will companies like us still do that in the future? Should they?</p><p>Here&#8217;s an interesting indicator: Microsoft recently built a <a href="https://learn.microsoft.com/en-us/training/support/agent-skills">pipeline that automatically converts Azure product documentation into agent skills</a>, continuously updated when the docs change. This is neat and serves a real need. But I think there is something missing in this approach.</p><p>Documentation typically answers one question: <em>how does this product work?</em> It tells you the syntax, the parameters, the valid inputs. That&#8217;s important but it&#8217;s not all that skills can do.</p><p>What documentation typically <em>doesn&#8217;t</em> tell you: <em>how should you use this product?</em> When should you reach for this feature versus that one? What does a well-structured project look like three years in? 
What are the patterns that seem fine today but create tech debt? What are the traps that experienced practitioners warn each other about in Slack but that never make it into the reference docs?</p><p>That second kind of knowledge&#8212;let&#8217;s call it <em>best practices knowledge</em>&#8212;is part of what we&#8217;ve tried to encode in dbt-agent-skills. Not just &#8220;here is the syntax for a unit test&#8221; but &#8220;here is how you should think about when to write a unit test, what assertions are worth making, and how to structure tests so they give you signal without slowing your CI down.&#8221;</p><p>Microsoft may not see best practices as their responsibility. That&#8217;s probably fair: they&#8217;re in a fundamentally different position than most software vendors. Auto-generating skills from docs may make sense for them, although over time I wonder if it doesn&#8217;t start going the other way around. In that world, skills, authored for agents and more empirically testable, get written first.</p><h2>What This Is Really About</h2><p>Here&#8217;s the thought I keep coming back to: just as the dbt community came together over the past decade to figure out the best practices of analytics engineering&#8212;what a good model looks like, how to structure a project, when to use a snapshot, how to write a test that&#8217;s actually worth running&#8212;I think it will come together over the coming year(s) to distill that knowledge into agent skills.</p><p>And IMO this skill-ification represents meaningful progress for us as a community. Best practices encoded in a skill propagate faster than best practices in a blog post. Disseminating knowledge in a blog post involves a tremendous amount of friction: every single human reader has to do the work of reading, updating their mental model, and practicing the new skill. Distributing skills to an agent is frictionless.</p><p>They&#8217;re also forkable. 
There doesn&#8217;t have to be one right answer, and each divergent perspective is one that can potentially &#8220;win&#8221; in the open marketplace of ideas. It&#8217;s open source, but <strong>instead of OSS software, it&#8217;s OSS expertise</strong>.</p><p>dbt Labs has always had a value we call &#8220;moving up the stack.&#8221; The exact text: <em>&#8220;We believe that all team members should seek to replace themselves on an ongoing basis by building processes, technology, and documentation that obviate their existing work. We have an abundance mindset: there is always more, and more valuable, work to do. Moving up the stack presents growth opportunities for both the individual and the team.&#8221;</em></p><p>Agent skills are one of the most direct expressions of this value I&#8217;ve ever seen. They push expertise&#8212;syntax, design, experience&#8212;down into the agent layer. That frees the human practitioner to operate at the top of their license: asking the questions that matter, interpreting results, making the judgment calls that can&#8217;t (yet) be &#8220;skill-ed&#8221;.</p><p>As always, I welcome your thoughts. 
And if you build dbt-specific skills, please send them my way.</p><p>- Tristan</p>]]></content:encoded></item><item><title><![CDATA[SQL, Typescript, and Agents]]></title><description><![CDATA[How SQL&#8217;s Typescript moment will help agents just as much as it&#8217;ll help humans.]]></description><link>https://roundup.getdbt.com/p/sql-typescript-and-agents</link><guid isPermaLink="false">https://roundup.getdbt.com/p/sql-typescript-and-agents</guid><dc:creator><![CDATA[Tristan Handy]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:38:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9uGH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b4e3170-43ea-4f13-8662-f4b4e18cfe12_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sorry for the radio silence here. It&#8217;s been <em>a good little while</em> since I wrote. The fact that <em>there may have been a <a href="https://www.getdbt.com/blog/dbt-labs-and-fivetran-merge-announcement">few things going on</a></em> notwithstanding, I did miss the regular routine and discipline. Excited to be back, and I&#8217;m going to try to keep a more consistent pace moving forwards. There is so much going on, so much to say.</p><p>Onwards.</p><p>- Tristan</p><p>===</p><p>I&#8217;ll be honest&#8230;a few weeks ago I didn&#8217;t know that much about TypeScript. I&#8217;m not a big front end guy, and now with vibe coding I don&#8217;t imagine I&#8217;ll ever learn to write JavaScript. But someone on our team made a comment to me recently&#8212;essentially, &#8220;dbt&#8217;s Fusion engine is doing for SQL what TypeScript did for JavaScript,&#8221; which wormed its way into my brain.</p><p>It turns out&#8230;it&#8217;s a really good comparison. And today I wanted to spend some time exploring this thought. 
Because I think it matters a lot to the dbt developer ecosystem for humans, and even more so for agents.</p><h2>The TypeScript Story</h2><p>In October 2012, <a href="https://en.wikipedia.org/wiki/TypeScript">Microsoft released TypeScript 0.8</a> after two years of internal development. Anders Hejlsberg, the architect, was modest about what success would look like. His stated goal: &#8220;Maybe we&#8217;ll get 25% of the JavaScript community to take an interest &#8212; that would be success.&#8221;</p><p>The JavaScript community&#8217;s reaction was, roughly: &#8220;Why would I want this? JS works fine. I don&#8217;t need types. Types are what you use in Java, and nobody is having fun in Java.&#8221;</p><p>They weren&#8217;t wrong. JavaScript is flexible, fast, forgiving; you prototype quickly, throw things together, ship. That dynamic nature isn&#8217;t a flaw to be corrected. Hejlsberg&#8217;s team knew this, which is why TypeScript was a superset: you could opt in gradually, one file at a time, without torching the ecosystem you&#8217;d built.</p><h3>Why types matter (even if you&#8217;ve never declared one)</h3><p>If you&#8217;ve written mostly SQL and Jinja, you may never have declared a type. SQL doesn&#8217;t ask you to (except, of course, when you create columns!). You write <code>select revenue from orders</code> and the database figures out the rest at runtime. This feels like a feature (and it is!). But it also means no tool can look at your code before execution and know much about it.</p><p>That&#8217;s what types buy you: pre-runtime knowledge. When a language knows that <code>orders.revenue</code> is a decimal and <code>orders.customer_id</code> is an integer, tools can tell you at write time whether you&#8217;re doing something nonsensical: averaging a customer ID, joining on a column that doesn&#8217;t exist in the downstream model, passing the wrong type to a function. 
More importantly, it means IDEs can offer autocomplete that actually understands your schema, catch errors the moment you type them, and perform refactoring without guessing at what each reference means.</p><p>Coming back to JavaScript: developers have adopted types, and TypeScript, not because of some I-should-have-included-types-from-the-beginning mea culpa from <a href="https://en.wikipedia.org/wiki/JavaScript">Brendan Eich</a>. Rather, the TypeScript tooling ecosystem just outstripped that of JavaScript, and developers migrated as a result. Types made large codebases refactorable without fear, developers moved faster, stayed in flow state, and spent less time on stupid crap. The magic, as Hejlsberg put it, was making TypeScript &#8220;feel like JavaScript, but with superpowers.&#8221;</p><p>So: <strong>the types weren&#8217;t the point</strong>; they were just what made the tooling possible. Today TypeScript is the most-used language on GitHub, with <a href="https://www.jetbrains.com/lp/devecosystem-2024/">adoption that went from 12% of developers in 2017 to 37% in 2024</a>.</p><h2>SQL Is Living the Same Story</h2><p>SQL has been around since the 1970s. It&#8217;s the most widely used language in data by a significant margin. And like JavaScript before TypeScript, it has thrived on flexibility: you write queries without defining schemas upfront, without a local compiler. Just write the SQL, hit run, see what comes back.</p><p>That super-simple DX is a big reason SQL is everywhere.</p><p>It&#8217;s also exactly why SQL tooling has historically stayed thin. Decades in, most SQL editors still offer little more than syntax highlighting and basic autocomplete, while JavaScript developers got full IntelliSense. Without type information, tools can&#8217;t know what a column reference means, whether a join is valid, or whether a function exists in the target dialect. 
This is why AI agents working with SQL today hit the same ceiling JavaScript developers hit as their codebases scaled: at a certain level of complexity, you need language features that make your development loop both safer and, as a result, faster.</p><p>SQL has historically had none of this. Write a transformation, run it against the warehouse, get rows back or an error. The loop is slow, expensive, and blind to structural problems before runtime. An agent generating SQL has no way to know if what it wrote is correct until it hits production data.</p><h2>Fusion: Mature Language Features for SQL &amp; dbt</h2><p><a href="https://docs.getdbt.com/blog/dbt-fusion-engine">dbt&#8217;s Fusion engine</a> is, at its core, the TypeScript transition applied to SQL. A real SQL compiler that parses, understands, and type-checks SQL across multiple warehouse dialects before anything runs. It uses the Arrow type system from drivers through adapters into the compiler and runtime, producing a logical plan via static analysis for every rendered query in a project.</p><p>(Just the idea of layering a single type system across all existing data platforms is a fascinating problem that I have every confidence we don&#8217;t yet fully understand the significance of! But I digress&#8230;)</p><p>And Fusion ships with a <a href="https://www.getdbt.com/blog/language-server-protocol">language server</a>. Real-time error detection. Autocomplete that understands your models and columns. Hover insights. Inline lineage. Refactoring that propagates changes to downstream automatically. Everything that made TypeScript-in-VS-Code feel categorically different from JavaScript-in-a-text-editor.</p><blockquote><p><strong>Author&#8217;s note!</strong> Fusion is officially going GA in the next ~2 months. Public Preview has been highly productive and we&#8217;re now in the final process of pre-GA refinement. 
It&#8217;s ready for production deployments today (over 3k projects running it in prod) but if you&#8217;re waiting for GA, that&#8217;ll come soon.</p></blockquote><h2>The Agentic Development Loop</h2><p>Ok, cool. That is great. But I don&#8217;t actually write my own code anymore since Opus 4.6. So: do I actually care about language DX?</p><p>I do. And you should too.</p><p>AI coding agents work in loops. Write something, check whether it&#8217;s right, fix it, check again. The quality of those loops is what determines whether agents produce reliable output or garbage.</p><p>Spotify&#8217;s engineering team <a href="https://engineering.atspotify.com/2025/12/feedback-loops-background-coding-agents-part-3">documented this directly</a>: without feedback mechanisms, &#8220;the agents often produce code that simply doesn&#8217;t work.&#8221; What makes their agents reliable is a verification loop &#8212; compilers, formatters, tests &#8212; running after every change. Agents can confirm they&#8217;re on the right track before committing.</p><p>Anthropic agrees: in December 2025, <a href="https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md">Claude Code shipped native LSP support</a>. When Claude Code modifies a file, it can request <a href="https://tech-talk.the-experts.nl/give-your-ai-coding-agent-eyes-how-lsp-integration-transform-coding-agents-4ccae8444929">diagnostics from the language server</a> in milliseconds, with type errors, undefined references, and structural problems flagged before anything runs. The tight feedback loop lets agents move faster, with higher trust, while using fewer tokens. Same model, better infra &gt;&gt; better performance.</p><p>A language server and type system give SQL agents the first part of that loop: fast, structural feedback on whether what they wrote is valid. That&#8217;s a meaningful improvement over the current state. But structural soundness is necessary, not sufficient. 
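The write&#8211;check&#8211;fix loop above can be sketched in a few lines of TypeScript. This is a toy model, not any real agent or LSP API: `Diagnostic`, `check`, and `fix` stand in for what a real language server and agent would provide.

```typescript
type Diagnostic = { message: string };

// A toy verification loop: draft -> check -> fix -> re-check.
// In a real setup, `check` would be compiler/LSP diagnostics and
// `fix` would be the agent revising its output from that feedback.
function verifyLoop(
  draft: string,
  check: (code: string) => Diagnostic[],
  fix: (code: string, diags: Diagnostic[]) => string,
  maxIters = 5,
): { code: string; clean: boolean } {
  let code = draft;
  for (let i = 0; i < maxIters; i++) {
    const diags = check(code);
    if (diags.length === 0) return { code, clean: true }; // verified
    code = fix(code, diags); // revise using structured feedback
  }
  return { code, clean: false }; // still failing: escalate to a human
}

// Fake checker: flags a misspelled keyword; fake fixer repairs it.
const check = (c: string): Diagnostic[] =>
  c.includes("SELCT") ? [{ message: "unknown keyword SELCT" }] : [];
const fix = (c: string): string => c.replace("SELCT", "SELECT");

console.log(verifyLoop("SELCT 1", check, fix));
```

The point of the sketch: without a fast `check`, the agent's only feedback is running against the warehouse; with one, each iteration costs milliseconds instead of a production query.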
Full correctness requires not only structural validity, <em>but also semantic correctness</em>.</p><h2>Agents Need Tests!</h2><p>dbt tests come in two types: <a href="https://docs.getdbt.com/docs/build/data-tests">data tests</a> (designed to test underlying data) and <a href="https://docs.getdbt.com/docs/build/unit-tests">unit tests</a> (designed to test code). And they can be written with two distinct intents:</p><ul><li><p><strong>structural tests</strong>: designed to verify that data adheres to technical specifications, like uniqueness and referential integrity</p></li><li><p><strong>semantic tests</strong>: designed to verify business logic, like making sure debits equal credits</p></li></ul><p>Most dbt projects focus heavily on structural data tests. We analytics engineers love testing to make sure that our models generate data that follows obvious structural rules. And this is good as far as it goes. But it&#8217;s not enough for agents. Wes McKinney talked about this on the recent <a href="https://hugobowne.substack.com/p/python-is-dead-long-live-python-with">Python is Dead. Long Live Python!</a> podcast: without semantic tests, agents have no signal for whether what they wrote is actually correct. You get structurally valid code that is logically wrong.</p><p>Unfortunately, <em>most dbt projects have very few unit tests defined</em>. This is not shocking; unit tests (pre-agent) took a long time to build, and they weren&#8217;t something that analytics engineers were in the habit of writing (unit tests were first introduced in dbt 1.8, in May of 2024). But this is, IMO, one of the biggest friction points to getting agents that can safely operate inside of large, complex data repos. Tests asserting semantic correctness would make long-running data coding agents exciting rather than stressful.</p><p>Of course, for a great agentic development loop the tests also have to run really damn fast. And we&#8217;ll have a lot more to say on that soon! 
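For illustration, a semantic unit test in dbt's unit-test YAML might look like the following. The model, input, and column names here are hypothetical, not from any real project; only the `unit_tests` syntax itself comes from dbt:

```yaml
unit_tests:
  - name: debits_equal_credits
    description: "Semantic check: a balanced ledger day nets to zero."
    model: fct_ledger_daily            # hypothetical model
    given:
      - input: ref('stg_transactions') # hypothetical staging model
        rows:
          - {entry_type: "debit", amount_cents: 100}
          - {entry_type: "credit", amount_cents: 100}
    expect:
      rows:
        - {net_balance_cents: 0}
```

The semantic assertion lives in the `expect` block: it encodes a business rule, not a structural one, and it runs against fixed inputs rather than production data.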
&#129514;&#129514;</p><p>For now, we believe that Fusion&#8217;s compiler and LSP unlock an absolutely next-level set of capabilities for both humans and agents. And we&#8217;re pushing hard to improve the entire development loop with tests that are easier to author and run <em>fast</em>.</p><p>This is the most fun I&#8217;ve had writing dbt code in &#8230; I don&#8217;t know. A long time. I hope you&#8217;re having as much fun as I am.</p><p>- Tristan</p>]]></content:encoded></item><item><title><![CDATA[The Iceberg ecosystem today (Anders Swanson)]]></title><description><![CDATA[What can data teams realistically expect when attempting to run on top of Iceberg in production?]]></description><link>https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Mar 2026 13:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/K7PvwU5ulrA" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The data industry is moving toward open standards. The migration is happening rapidly across the data ecosystem, even as the rapid progress of AI and agents sucks most of the oxygen out of the room. </p><p>The dbt Labs data team is moving to an all-Iceberg lake with a mix of compute engines to power transformation, analytics, and agentic experiences. The team has been able to move quickly toward this architecture because the entire ecosystem has been laying the groundwork for years. It&#8217;s all coming together to make this new open world a reality, fast. </p><p>On this episode, Tristan discusses the reality on the ground for data practitioners. Where&#8217;s the Iceberg ecosystem today? 
What can practitioners realistically expect when attempting to run on top of Iceberg in production?</p><p>Tristan is joined by Anders Swanson, a developer experience advocate at dbt Labs. Anders has spent a lot of time over the years navigating open-source data ecosystems and tracking their progress. </p><p>They unpack the open standards shift, define the core building blocks (query engines, object stores, catalogs), and dig into why external catalogs have become a fourth namespace tier across platforms. Anders outlines a pragmatic, phased adoption model for Iceberg integrations, explains why metadata performance and resiliency are hard requirements, and clarifies why vended credentials exist and what they solve.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>The call for papers is open for dbt Summit 2026.</strong> We invite data practitioners, platform leaders, and executives to share real stories of how data gets done at the world&#8217;s largest gathering of dbt community members. If you ship fast, reduce costs, improve trust, or bring governed AI to life, the dbt community wants to hear from you.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;text&quot;:&quot;Submit a talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__"><span>Submit a talk</span></a></p><p>Coalesce is now dbt Summit. 
Join the world&#8217;s largest gathering of dbt users, where data leaders and practitioners come together to shape the future of data analytics and AI. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:906976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/190147989?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" 
allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-K7PvwU5ulrA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;K7PvwU5ulrA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/K7PvwU5ulrA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: I wanted to have you on because of work you&#8217;ve been doing internally to summarize the state of the Iceberg ecosystem. We&#8217;ve talked about Iceberg a bunch lately with folks deep in specific parts. Your work is more of an overview: where we&#8217;re at with platform integrations, what&#8217;s easier now than a year ago, and what&#8217;s still hard. Before we dive in, I want to define a few terms. When you say &#8220;query engine,&#8221; what do you mean?</h3><p><strong>Anders Swanson:</strong> It&#8217;s the thing that does your work. 
When you issue a CREATE TABLE or a SELECT statement, it&#8217;s what returns data or stores it somewhere for later.</p><h3>Object store.</h3><p>It&#8217;s the cloud service where you can store an object. An object is anything: a blob.</p><h3>Catalog.</h3><p>In this context, a catalog knows what tables and views exist and where they are, and how you can fetch or write to them.</p><h3>Let&#8217;s talk internal versus external catalogs.</h3><p>An internal catalog is what you get by default in a system like Snowflake or SQL Server. An external catalog is more like another directory, often managed by a different system. As you connect more disparate platforms, you can&#8217;t assume one system controls everything.</p><h3>The complexity comes from duplication. How do you make namespaces unique? Can you plug in many external catalogs?</h3><p>Abstraction matters. A common pattern emerging is one&#8209;to&#8209;one mapping of an external catalog into a database. That pushes a move to a four&#8209;part namespace: catalog, database, schema, identifier. Spark moved toward this; Databricks Unity Catalog and Snowflake&#8209;style catalog link approaches are in this family.</p><h3>So the downside?</h3><p>The devil is in the details, especially metadata performance and resiliency. For example, information schema listing. Users expect listing tables to be fast and reliable. In a federated world, if listing tables takes five seconds, users blame the vendor they&#8217;re using&#8212;even if the external system is slow. DuckDB draws a line by not mixing external catalog tables into information schema listing today. Snowflake&#8217;s catalog link databases appear to cache or mirror metadata so it feels as performant as native tables.</p><h3>With catalog link databases, Snowflake is doing mirroring.</h3><p>Yes. Mirroring exists in different flavors across platforms. 
Delta is sometimes seen as &#8220;simpler&#8221; because metadata can live in object store, but as soon as you want multiple engines writing, you still need a real catalog.</p><h3>Sharing across multiple platforms adds another layer. What&#8217;s the state of platforms reading and writing to the same Iceberg catalog?</h3><p>There are phases of integration.</p><p>Phase one is the naive approach: you have Parquet and JSON in object storage, and an engine reads it. Reading is easier than writing. You can get a toy example working.</p><p>Then you run into versioning and &#8220;what&#8217;s latest.&#8221; The next phase is connecting to an Iceberg REST catalog so engines can ask for the latest table version without users thinking about paths.</p><p>Phase three is schema&#8209;scale: it&#8217;s never just one table. You need discovery of new tables, keeping schemas up to date, and eventually things like multi&#8209;table transactions.</p><h3>This maps to dbt Mesh and cross&#8209;platform mesh. Producer vs consumer.</h3><p>A consumer&#8209;led model requires the downstream team to create pointers (DDL) to external tables. It&#8217;s operationally messy. Producer&#8209;led is cleaner: the producer writes to the catalog and it&#8217;s just there, immediately queryable downstream.</p><h3>Are platforms there yet?</h3><p>Some support writing directly to external catalogs. When it works, it&#8217;s great, but there are still kinks. We&#8217;re retrofitting race cars designed for isolation to be interoperable without losing performance.</p><h3>Identity is one of the hairiest issues. Vended credentials.</h3><p>Vended credentials solve the &#8220;two keys&#8221; problem. You authenticate to the catalog, the catalog tells you where data lives, but then you need separate object store credentials to read files. 
With vended credentials, the catalog issues short&#8209;lived credentials so you can access the object store location without managing separate keys.</p><h3>That doesn&#8217;t solve user identity and grants.</h3><p>Correct. Vended credentials aren&#8217;t global authorization. Identity and access across platforms are still hard. Ideally you grant access once and it works everywhere, but enterprises have different identity providers and platforms have different permission models. Today, admins often have to configure grants separately in each platform.</p><h3>Is this mission creep?</h3><p>The goal is to reduce how many people have to think about storage details. Big tech had whole data platform teams solving reliability problems in Hive&#8209;era lakes. Iceberg reduces that toil dramatically, but the long tail is still auth, mirroring, and cross&#8209;platform governance.</p><h3>How does this reshape data teams?</h3><p>Analytics engineering abstracted a lot of work. Data engineering has also been simplified by replication/orchestration vendors. What remains is the open ecosystem complexity: identity, object store policies, and cross&#8209;platform connections. Many enterprises already have teams with these skills (infra as code, Terraform, Snowflake management), but others will need to grow into them.</p><h3>Are vendors embracing Iceberg in good faith?</h3><p>The goodwill and collaboration of the past 18 months feel unprecedented. We&#8217;re getting &#8220;more problems&#8221; because we solved prior ones. The industry aligning on standards feels like F1 teams standardizing components so they can innovate elsewhere.</p><h3>In your internal writeup about Iceberg, you quoted Wolf Hall: &#8220;The making of a treaty is the treaty. It doesn&#8217;t matter what the terms are, just that there are terms, it&#8217;s the goodwill that matters. When that runs out, the treaty is broken, whatever the terms say.&#8221; Explain the relevance here. 
</h3><p>When I joined dbt, it was taboo to mention one partner to another. Now vendors openly acknowledge mutual customers and invest in interoperability. On the Iceberg repo you see competitors collaborating on proposals. The goodwill is the standard.</p><h3>Wrap us up with three things you&#8217;re excited for next year.</h3><p>Push&#8209;based catalog updates so platforms can subscribe to changes rather than repeatedly listing and polling. Progress on the small files problem so Iceberg works better for smaller data too. And more platforms supporting writing directly to external catalogs, unlocking producer&#8209;led sharing and cross&#8209;platform mesh.</p><h2>Chapters</h2><p>00:00:00 &#8212; Intro: why open standards are accelerating</p><p>00:01:20 &#8212; What practitioners can expect from Iceberg in production</p><p>00:05:00 &#8212; Lightning round: query engine, object store, catalog</p><p>00:06:20 &#8212; Internal vs external catalogs</p><p>00:09:30 &#8212; The &#8220;four-part namespace&#8221; and catalog-link style abstractions</p><p>00:11:30 &#8212; The downside: metadata performance, resiliency, and caching</p><p>00:17:10 &#8212; Sharing across multiple platforms: reality and tradeoffs</p><p>00:19:10 &#8212; Iceberg integration phases (1: naive table, 2: REST catalog, 3: schema-scale)</p><p>00:24:10 &#8212; Producer vs consumer model and cross-platform mesh</p><p>00:29:10 &#8212; Identity and &#8220;vended credentials&#8221;: what it is and what it isn&#8217;t</p><p>00:33:30 &#8212; The hard unsolved part: grants and global identity across platforms</p><p>00:37:00 &#8212; Is this mission creep? What Iceberg is optimizing for</p><p>00:39:50 &#8212; How roles on data teams evolve in an open ecosystem</p><p>00:43:40 &#8212; Are vendors genuinely aligned? 
Why Anders is optimistic</p><p>00:46:50 &#8212; &#8220;The making of a treaty is the treaty&#8221;: goodwill as the standard</p><p>00:51:50 &#8212; Three things Anders is excited for next year</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 80,000 data teams use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Iceberg and the catalog layer (w/ Russell Spitzer)]]></title><description><![CDATA[Everything you ever wanted to know about open table formats with a member of Apache Iceberg and Apache Polaris]]></description><link>https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</link><guid isPermaLink="false">https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 Jan 2026 13:59:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/wLH-vADSwaw" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this episode of The Analytics Engineering Podcast, Tristan talks with Russell Spitzer, a PMC member of Apache Iceberg and Apache Polaris and principal engineer at Snowflake. They discuss the evolution of open table formats and the catalog layer. They dig into how the Apache Software Foundation operates. And they explore where Iceberg and Polaris are headed. If you want to go deep on the tech behind open table formats, this is the conversation for you.</p><div><hr></div><p>A lot has changed in how data teams work over the past year. We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. 
If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a 
href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-wLH-vADSwaw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wLH-vADSwaw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wLH-vADSwaw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You spend a lot of your time thinking about Iceberg and Polaris. Give the audience background on how you found yourself in this niche of high&#8209;volume analytic data file formats.</h3><p><strong>Russell Spitzer:</strong> It&#8217;s a bit random. I started at DataStax on Apache Cassandra as a test engineer and quickly got drawn into analytics. I saw big compute clusters and wanted to be involved. A coworker, Piotr, noticed Spark 0.9 and began a Spark&#8211;Cassandra connector. That got me into Spark. Over six to seven years I focused on moving data between Cassandra and Spark and into other systems. The interoperability problem across distributed compute frameworks was compelling.</p><p>This was pre&#8209;Apache Arrow and pre&#8209;table formats. We were just putting Parquet files everywhere and no one quite knew what they were doing. 
Pre&#8209;Spark, people explored DSLs like Apache Pig. Eventually the industry converged on SQL for end&#8209;user interfaces.</p><p>I later applied to Apple for the Spark team.</p><h3>Helping build Apple&#8217;s Spark infra, or working directly on Spark?</h3><p>Apple has an open-source Spark team and a Spark&#8209;as&#8209;infra team. I was trying to join the open source team, pushing Apple&#8217;s priorities into the project and supporting Spark as a service. During interviews, Anton&#8212;another Iceberg PMC&#8212;convinced the hiring manager I should join the data tables team, essentially Apple&#8217;s Apache Iceberg team.</p><p>They ambitiously planned to replace lots of internal systems with Iceberg. Iceberg existed but was early (Netflix started it around 2018/2019; I joined Apple in 2020). At Apple it was Iceberg all the time; convincing teams to move off older stacks, adopting open&#8209;source&#8209;as&#8209;a&#8209;service to save money, and getting onto ACID&#8209;capable foundations. We were successful.</p><h3>Migrations are hard. How did you make it accessible?</h3><p>We replaced complicated bespoke reliability fixes with Iceberg. In Hive/HDFS, small&#8209;file problems lead teams to write custom compaction and locking. Removing that toil is a big win. For big orgs, migration is a long&#8209;term investment with ongoing engineering cost. For smaller companies, the key is offloading runtime responsibilities&#8212;ideally to SaaS&#8212;so engineers aren&#8217;t in the loop. Open source limits lock&#8209;in so you can move between systems. Most companies are paid to deliver business value, not to build data infra. dbt is a great example of avoiding hand&#8209;rolled pipeline code. Same logic applies to table/file formats.</p><h3>Let&#8217;s talk Apache governance. What&#8217;s a PMC? How do projects run?</h3><p>Apache projects aren&#8217;t owned by one company. Influence is earned by contributing to the community. 
The PMC governs merges, releases, membership. People move companies; the project stays with them. The goal is to make the project broadly useful. There&#8217;s no CEO dictating roadmap and no company can change the license.</p><p>Most big projects&#8212;Spark, Kafka, Iceberg, Flink&#8212;are maintained by employees of companies with vested interests, but governance is consensus&#8209;driven. Vetoes are for technical issues (security, future&#8209;limiting design), not ideology.</p><h3>Is Iceberg for the top 20 tech companies or for everyone?</h3><p>Not everyone needs Iceberg. OLTP belongs elsewhere. But for analytics, we should move past raw Parquet partition trees with folder&#8209;name partitioning. In the Hadoop era, lakes were dumping grounds; schema evolution was painful. Many are still moving from CSV to Parquet. Over time, better encodings and table formats become default.</p><p>Decoupling compute and storage changes everything versus co&#8209;located HDFS. Defaults tuned for HDFS (like 128MB Parquet files) don&#8217;t always hold for S3. We want elastic storage and compute; no one wants to pay for compute because storage grew.</p><h3>Walk us through Iceberg versions.</h3><p>v1: transactional analytics&#8212;ACID commits instead of fragile Hive/HDFS patterns. v2: row&#8209;level operations&#8212;logical deletes via delete files so you don&#8217;t rewrite 10M&#8209;row data files to remove one row; later compaction physically purges (key for GDPR). v3: expanded types&#8212;geospatial and variant for semi&#8209;structured data; Variant was standardized across vendors and Parquet so everyone can write/read consistently.</p><p>v4: two thrusts&#8212;streaming and AI. Reduce commit latency, make retries faster under contention. Historically writes took 10&#8211;20 minutes, so commit latency didn&#8217;t matter. For streaming (writes every minute/five), it does. 
We&#8217;re evolving commit and REST catalog protocols so clients can specify intent (add these files, ensure these exist, then delete those) and let the catalog resolve conflicts server&#8209;side.</p><p>On AI: Iceberg doesn&#8217;t yet serve some vector/image&#8209;heavy patterns well. We&#8217;re exploring changes in Iceberg, Parquet, or both, without breaking existing tables.</p><h3>Talk about Polaris and the catalog layer.</h3><p>Polaris is an Apache incubator project (PPMC). Incubation proves we operate like an Apache project (community&#8209;driven, trademarks donated). Iceberg defines the REST catalog spec/client; Polaris implements a catalog that speaks that spec. Many of us work across projects (Parquet, Iceberg, Polaris), which helps align boundaries.</p><h3>Horizon, Polaris, external catalogs&#8212;what&#8217;s the story?</h3><p>We&#8217;re simplifying: Snowflake can act as an Iceberg REST catalog, or you can use an external REST catalog. External can be Polaris (managed by Snowflake or self&#8209;hosted) or another REST implementation. Interoperability means everything talks the same REST.</p><h3>What is Polaris trying to be best at?</h3><p>A broad, interoperable lakehouse catalog. It can act as a generic Spark catalog (HMS replacement) and aims to support multiple table/file formats. Architectural choices differ (KV vs. relational storage, where transactions live, policy enforcement vs. recording, identity integration). Polaris aims for base implementations that are pluggable&#8212;e.g., AWS/GCP/Microsoft identity.</p><h3>Identity and scope&#8212;where does the catalog stop?</h3><p>There&#8217;s a &#8220;business catalog&#8221; for discovery/listing versus a &#8220;system catalog&#8221; that must know table layout to govern access. Polaris can vend short&#8209;lived credentials for the exact directory of a table&#8217;s files for a load operation; that requires understanding layout. 
Purely relational metadata often needs to delegate that decision.</p><h3>Will identity/grants slow broad adoption?</h3><p>Possibly. But many once&#8209;complex things become default&#8212;compressed files, columnar formats, soon encryption. With collaboration (like Variant), we&#8217;ll land broadly accepted patterns.</p><h2>Chapters</h2><p>00:01:30 &#8212; Guest welcome and interview start</p><p>00:02:00 &#8212; Russell&#8217;s path: DataStax Cassandra, Spark connector, interoperability</p><p>00:05:20 &#8212; Joining Apple&#8217;s Iceberg team and early Iceberg momentum</p><p>00:06:20 &#8212; Why migrations resonated: replacing bespoke Hive/HDFS compaction/locking</p><p>00:09:10 &#8212; Apache governance 101: PMCs, consensus, and corporate influence</p><p>00:15:40 &#8212; How decisions land without votes; when vetoes apply</p><p>00:18:30 &#8212; Who needs Iceberg and where it fits</p><p>00:22:20 &#8212; Lake &#8594; lakehouse and warehouse &#8594; lakehouse in the cloud era</p><p>00:25:20 &#8212; Iceberg versions: v1 transactions, v2 row&#8209;level ops (GDPR), v3 types</p><p>00:28:10 &#8212; Standardizing Variant across vendors and Parquet</p><p>00:31:10 &#8212; Iceberg v4 goals: streaming commit/retry improvements and AI use cases</p><p>00:33:40 &#8212; Commit latency and server&#8209;side conflict resolution</p><p>00:37:20 &#8212; Polaris as an Apache incubating project (PPMC)</p><p>00:39:30 &#8212; Iceberg REST catalog spec and Polaris implementation</p><p>00:42:30 &#8212; Clarifying Snowflake Horizon, Polaris, and external REST catalogs</p><p>00:45:10 &#8212; What Polaris aims to be best at; pluggable identity providers</p><p>00:48:00 &#8212; Identity scope: business vs. system catalogs and credential vending</p><p>00:51:00 &#8212; Will identity/grants slow mass adoption?</p><p>00:52:50 &#8212; Wrap&#8209;up</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI agents and the data lake (w/ Lauren Anderson)]]></title><description><![CDATA[The head of Okta's enterprise data platform on why central governance and the semantic layer are so essential]]></description><link>https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</link><guid isPermaLink="false">https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 11 Jan 2026 14:03:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/sa-BJkM75TQ" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the interesting commonalities of AI and the data lake is that they both require new thinking around how we manage identity. For AI, the big question is how do agents interact with underlying data? For the data lake, the big question is how do we make open data stored outside the purview of any given data platform act like you&#8217;d expect?</p><p>In this episode of The Analytics Engineering Podcast, Tristan talks with Lauren Anderson, who leads the enterprise data platform at identity company Okta. Lauren discusses how identity sits at the center of two seismic shifts in data&#8212;AI agents and the open data lake&#8212;and why central governance and a shared semantic layer are critical. She lays out how analytics engineers and data engineers should divide responsibilities as agents begin to write a growing share of analytical queries. </p><div><hr></div><p>A lot has changed in how data teams work over the past year. 
We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" 
allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-sa-BJkM75TQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;sa-BJkM75TQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/sa-BJkM75TQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Before we dive into the current day, can you share a little bit about your background and how you came to the role that you&#8217;re in today.</h3><p><strong>Lauren Anderson:</strong> I&#8217;ve had a 20&#8209;something year career at this point. I have basically spent my entire career in analytics some way, but my first data job was at a big bank. I won&#8217;t name it. There&#8217;s only a few big banks you could probably guess. I worked for the finance org and I did compensation planning and administration, with a side of sales tracking and analytics. 
I was part database analyst, part customer support for people who made a lot more money than I did.</p><p>I was there for seven, seven and a half, eight years. Towards the end of it, I became the owner and creator and almost business architect for our brand&#8209;new sales tracking data warehouse. At a very young age, I got to think about how relational databases should come together for the outcome of both analytics and reporting&#8212;dashboards and whatnot&#8212;but also operations, which was paying compensation every month. It got me super excited about this world of data and being able to architect pipelines and the end&#8209;to&#8209;end flow for real&#8209;world outcomes.</p><h3>What do you think allowed you to be successful in that era? I often think the things that enabled success then aren&#8217;t the same as what make data folks successful today.</h3><p>When I took it over, we ran compensation out of an Access database. I was new, the person who designed it left, and there wasn&#8217;t much documentation. It worked the first month, then broke the second&#8212;right before a payroll deadline. I rebuilt it as a long series of SQL queries with inline comments and step&#8209;by&#8209;step checks that produced a clean file. That willingness to throw away the brittle thing and rebuild with clarity and documentation gave me early success. The meta&#8209;skills&#8212;the ability to learn, take chances, and figure out the best path&#8212;still apply, but the technology is completely different now.</p><h3>You&#8217;ve split time at Okta into two stints. How would you characterize the work?</h3><p>Okta was my first truly B2B company. I realized quickly B2B data is my sweet spot. I love thinking about customers as businesses and how business users interact with our products and features. Okta data is complex&#8212;many products, features, and highly configurable use cases&#8212;especially with large customers. That variety is exciting. 
In simpler retail flows you see a lot of the same patterns; in B2B, the variety is the appeal.</p><h3>What&#8217;s your current role?</h3><p>I lead our enterprise data platform, engineering, and architecture function. For enterprise data used to make business decisions, we own ingestion into the warehouse, transformations, and delivery&#8212;dashboards, reverse ETL to third&#8209;party applications, other data stores, and internal apps.</p><h3>How big is the central function and how do you engage with the business?</h3><p>We&#8217;re about 50 people across data engineering and analytics/data science in a company south of 7,000 employees. We support every business unit. Engagement spans a maturity curve. One end is platform self&#8209;service: teams land data via approved connectors, build transformations in dbt on our implementation, and build dashboards in Tableau we administer. Governance and roles are defined centrally, and teams assign people to those roles. The other end is a white&#8209;glove model where we partner through the full lifecycle&#8212;question, discover existing assets, requirements, data work, build, interpretation, validation, and end&#8209;of&#8209;life of the data product. Our sweet spot is the middle: we own enterprise &#8220;gold&#8221; pipelines for company&#8209;level metrics&#8212;monitored and governed&#8212;while domains build and later graduate via a path&#8209;to&#8209;production under stronger governance.</p><h3>Okta is known for identity and security. How does security&#8209;first actually work in practice?</h3><p>Reinventing controls every time slows you down. We invest in repeatable frameworks. Any new source goes through third&#8209;party risk review, classification, and decisions on masking or exclusions. We help teams through that; after a couple times, they can engage directly with risk while we stay in the loop and monitor. As our classifications and expectations got clearer, review cycles shrank from weeks to days. 
It&#8217;s not all roses&#8212;it takes time&#8212;but we all operate as security practitioners. That shared mindset builds trust and reduces corner&#8209;cutting.</p><h3>How much do users need to know?</h3><p>We don&#8217;t expect everyone to know everything. We provide dbt frameworks and minimum testing standards, plus SMEs to guide teams. The culture is to ask when unsure.</p><h3>Will agents write more analytical queries than humans in the next 12&#8211;24 months?</h3><p>Macro, yes. For us, more like 24&#8211;36 months because we&#8217;re careful. The key is safe, ethical AI consistent with being a security company.</p><h3>How are you thinking about agent access?</h3><p>Central governance. Ideally, agents query centralized, agent&#8209;ready stores. Run governance once: policies, roles for users and for data, tracking and logging on a central plane. The semantic layer is essential. Creating semantic views must get easier and more automated, and semantics should inform policy application.</p><h3>Why are agents different from humans in access patterns?</h3><p>Row&#8209;level security to the extreme. Conversational intelligence data should be limited to what the requesting user can access. Aggregations could be broadly accessible with anonymization, but detailed content should remain constrained. You might also limit allowed functions on large unstructured objects. Identity for agents matters&#8212;Okta Secures AI looks at distinct identity patterns to secure agents across applications.</p><h3>Where are you with MCP and agent building?</h3><p>Early, building support and insight use cases. Progress is fast, but nothing broad in production yet.</p><h3>How should analytics engineers and data engineers participate?</h3><p>Analytics engineers should own semantics&#8212;tooling, vendor choices, onboarding use cases, and the shared business language. 
Data engineers should optimize for consistency and scale, notice overlap across agents, and provide a platform others can build on with confidence in governance and security.</p><h3>Will you standardize an agent development platform?</h3><p>Yes, in partnership with engineering and shared services. Our current pull skews to the business, so we&#8217;re leaning toward accessible, governed platforms that serve both business and engineering with central governance.</p><h3>Any assumptions you&#8217;re rethinking?</h3><p>Treating everything like a relational model. Many initial agent questions are intentionally simple, where speed and reasonable accuracy trump perfect sophistication. The important thing is to start, observe, and mature.</p><h2>Chapters</h2><p>00:02:28 &#8212; From bank analytics to owning a sales DW</p><p>00:05:00 &#8212; Rebuilding brittle Access &#8594; SQL with documented checks</p><p>00:08:30 &#8212; Ops accountability then vs. optimization today</p><p>00:11:00 &#8212; TripIt, marketing analytics, and moving into tech</p><p>00:13:14 &#8212; Why B2B data became Lauren&#8217;s sweet spot</p><p>00:16:00 &#8212; Current role: ingestion &#8594; transform &#8594; delivery at Okta</p><p>00:18:10 &#8212; Operating models across business units and the path to production</p><p>00:22:20 &#8212; Security-first in practice: repeatable frameworks over friction</p><p>00:24:23 &#8212; Third&#8209;party risk, classification, and shrinking review cycles</p><p>00:28:00 &#8212; Policies, masking, and the need for a central governance plane</p><p>00:30:20 &#8212; Frameworks for dbt, testing, and SME guidance</p><p>00:32:11 &#8212; Will agents outwrite humans? 
Macro yes; Okta timeline nuance</p><p>00:33:48 &#8212; Central governance and agent access patterns</p><p>00:37:19 &#8212; Semantic layer as bridge and policy carrier</p><p>00:41:00 &#8212; Function limits on unstructured data and Okta Secures AI</p><p>00:42:35 &#8212; Early MCP experimentation and support use cases</p><p>00:43:03 &#8212; Roles: analytics engineers (semantics) and data engineers (scale)</p><p>00:46:10 &#8212; Enabling an org-wide agent platform with shared governance</p><p>00:47:43 &#8212; Solve governance once, serve business and engineering</p><p>00:49:30 &#8212; Simpler questions first; rethinking relational assumptions</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div></div></div>]]></content:encoded></item><item><title><![CDATA[Inside Snowflake’s AI roadmap (w/ Chris Child)]]></title><description><![CDATA[Snowflake's VP of Product Management on the vision for open table formats, governed agents, and the future of the data engineer]]></description><link>https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</link><guid isPermaLink="false">https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 14 Dec 2025 14:06:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/5Yo0chBWt2c" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This season of The Analytics Engineering Podcast is focused on how the current data landscape is impacting the developer experience. Snowflake plays a major role in what that developer experience looks like. </p><p>In this episode, Snowflake VP of Product Management Chris Child joins Tristan to unpack Snowflake&#8217;s AI roadmap and what it means for data teams. 
They discuss the evolution from Snowpark to <a href="https://docs.getdbt.com/blog/semantic-layer-cortex">Cortex</a> and <a href="https://www.getdbt.com/blog/what-is-snowflake-intelligence-anyway">Snowflake Intelligence</a>, how to <a href="https://www.getdbt.com/blog/bring-structured-context-to-agentic-data-development-with-dbt">govern agents </a>with row- and column-level controls, and why Snowflake is investing in <a href="https://www.getdbt.com/blog/iceberg-give-it-a-rest">Apache Iceberg</a> and the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/">Open Semantic Interchange initiative</a>. dbt Labs recently open sourced <a href="https://www.getdbt.com/blog/open-source-metricflow-governed-metrics">MetricFlow</a>, the technology that powers the dbt Semantic Layer, to align with the goals of OSI. </p><p>Chris also shares a vision for the next five years of data engineering: fewer bespoke pipelines, more standardization and semantics, and a bigger focus on business context and data products.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, 
Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-5Yo0chBWt2c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;5Yo0chBWt2c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/5Yo0chBWt2c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Where have you spent your time professionally?</h3><p><strong>Chris Child:</strong> I didn&#8217;t end up in data on purpose. I found myself here through a series of hops. I was working at Redpoint Ventures and got excited by a company we invested in, RelateIQ. I left to join RelateIQ, building an intelligent CRM. 
We captured emails and meetings and built profiles of everyone you interacted with. We were acquired by Salesforce. Looking at what sales teams needed, I realized they also needed product usage data, marketing data, and campaign data, with a platform to pull it all together. That led me to Segment. I joined when it was about 50 people. Segment was mostly analytics.js then, loading different JavaScript on your webpage for tracking. We had just built the first warehouse connector to Redshift and got huge usage sending click and user data to Redshift.</p><h3>The original Redshift connector was a nightmare to work with.</h3><p>Like many startup things, one engineer built it in a week. Suddenly a ton of people used it, and enterprise customers depended on it. We had to rebuild it several times. You could see the future there. Folks I worked with went on to start companies like Census and Hightouch, thinking the CDP should be built on top of the warehouse, which Segment evolved toward. We also built a Snowflake connector because customers demanded it in addition to Redshift.</p><h3>It&#8217;s funny to think back a decade to how small Snowflake was.</h3><p>A couple customers demanded it; we built it, and we were sending a ton of data. That led to the realization that a customer data platform is one instance of a data warehouse, and there are others you need. Seeing how fast Snowflake was growing, I wanted to build the next layer of infrastructure. </p><p>I joined Snowflake seven and a half years ago. I&#8217;ve had three key roles. First, I built areas of the product: the UI, billing, product-led growth engines and free trial infrastructure, and application capabilities for connecting into and building on Snowflake. After Sridhar became CEO, he asked me to reconnect product and sales by leading solutions engineering, reporting to the CRO. Leading a global technical seller org was very different for a product person, but it helped align teams at scale. 
</p><p>About eight months ago, I returned to lead data engineering: how people bring data into Snowflake, how they transform it&#8212;spending a lot of time with dbt&#8212;and work around Iceberg and interoperability for worlds where not all data sits in Snowflake.</p><h3>I didn&#8217;t realize the path started in investing. Are you a finance person way back?</h3><p>My undergrad is in computer science. I started programming in fifth grade on an Apple IIe, learned C before high school, and followed that thread. In college I noticed business folks often made the decisions. I wanted to learn that side. After college I joined a consulting firm, then private equity, then an MBA. I realized I didn&#8217;t want to be a finance person. I moved to venture as a bridge to building products, but I wanted to build, so I jumped into operating roles.</p><h3>Tell the story of Snowflake and AI. In the 2010s there was huge demand for easier, scalable, cloud-oriented data solutions. Then 2022 happened, ChatGPT launched, and the world changed. How did Snowflake respond, and where are you today?</h3><p>Even pre&#8209;2022 we saw customers putting their most important business data into Snowflake, then pulling data out for things they couldn&#8217;t do inside: training ML models and other analyses that SQL wasn&#8217;t a great fit for. Customers told us they didn&#8217;t like losing governance and lineage when data left. We invested in ways to bring more of that work to Snowflake. </p><p>Snowpark was the first big step: a runtime for non&#8209;SQL code (Python, Java, Scala) with APIs inspired by Spark, plus capabilities like forecasting. It&#8217;s great for some workloads, but most customers don&#8217;t train most ML models inside Snowflake yet. We also acquired Applica for document extraction using early LLM techniques, and Neeva for web search based on LLM approaches. </p><p>When ChatGPT arrived, we saw two major influences. 
First, people wanted to chat with data they&#8217;d brought into Snowflake and transformed with dbt. That&#8217;s hard because LLMs are great with unstructured data and less great at turning business questions into correct SQL. Second, LLMs are very good at writing code, including Python and even dbt code. They&#8217;re not perfect for data engineering code yet, but they help. </p><p>Our goal is to help customers activate important enterprise data safely in AI models, deploy agents at scale under existing governance, and keep up with exploding data volumes without 10x headcount.</p><h3>What are the key product pieces&#8212;Cortex, Snowflake Intelligence, etc.&#8212;in the Snowflake AI stack?</h3><p>First, you need a great data foundation. That isn&#8217;t new: get the data in one place, apply good governance and permissions, know your data, tag PII, and raise the standard of care. </p><p>AI raises the bar because agents can expose sensitive data faster than dashboards. OSI (Open Semantic Interchange) work is part of this; LLMs need explicit semantics and cataloging they can consume, not tacit knowledge hidden in downstream tools. </p><p>Companies with strong hygiene move faster with AI. Roles matter; if a product manager role has access to certain rows and columns, an agent acting within that role can safely answer questions. Agents can run inside or outside Snowflake, but should assume appropriate roles when querying.</p><p>On the AI stack, after the data foundation, Cortex provides higher&#8209;level APIs for unstructured processing, RAG, and structured processing. You can choose models (OpenAI, Anthropic, Mistral, Gemini, Llama, etc.), but most folks don&#8217;t want to manage prompts and GPUs. Cortex AI SQL lets you express intent like sentiment filters or fuzzy joins. It&#8217;s powerful for exploration but non&#8209;deterministic, so you need care in production. 
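The care Chris calls for around non-deterministic AI SQL can be made concrete with a sketch. This is illustrative Python, not the Snowflake Cortex API: a stand-in scorer plays the role of the model call, and the filter keeps an explicit threshold rather than trusting a bare yes/no from an LLM.

```python
# Sketch of an intent-based filter in the spirit of Cortex AI SQL's
# sentiment filters. All names here are illustrative assumptions --
# this is NOT the Snowflake API; the scorer is a deterministic stub
# standing in for a (non-deterministic) model call.

from typing import Callable

def stub_sentiment(text: str) -> float:
    """Stand-in for a model call; returns a score in [-1, 1]."""
    positive = {"love", "great", "fast"}
    negative = {"slow", "broken", "hate"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, score / max(len(words), 1) * 5))

def ai_filter(rows: list[dict], column: str,
              scorer: Callable[[str], float],
              threshold: float = 0.0) -> list[dict]:
    """Keep rows whose scored intent clears the threshold.
    In production the scorer is non-deterministic, so pin thresholds
    and log scores instead of trusting an opaque boolean answer."""
    return [r for r in rows if scorer(r[column]) > threshold]

reviews = [
    {"id": 1, "text": "Love it, support was great"},
    {"id": 2, "text": "Slow and broken on day one"},
]
kept = ai_filter(reviews, "text", stub_sentiment)  # keeps only row 1
```

The design point is the explicit `threshold` parameter: when the scorer is a real model, the same row can score differently run to run, so production pipelines need guardrails around the cut-off rather than inline model verdicts.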
Costs map to tokens at higher abstractions, with budgets and guardrails similar to variable compute in the cloud.</p><p>At the top, Snowflake Intelligence is a UI and agent framework. You define agents with access to specific datasets and semantic models, plus gold queries and usage guidance. It looks like a chat interface over your governed data. Inside Snowflake, we&#8217;ve deployed a GTM assistant that blends product usage, Salesforce, notes, docs, and content&#8212;structured and unstructured&#8212;respecting row&#8209;level security for every seller while giving leaders broader access.</p><h3>Let&#8217;s talk open formats and Iceberg. Why lean in when it opens up the data?</h3><p>Our aim isn&#8217;t to lock up data, it&#8217;s to help customers get value. Snowflake began as a reaction to Hadoop&#8212;betting on SQL at cloud scale with our own formats and catalog because they didn&#8217;t exist then. Those proprietary pieces let us evolve quickly. Iceberg is now almost as good, and we&#8217;re contributing to make it better. </p><p>Openness is a win for customers and expands the universe of data Snowflake can query, run Cortex on, and power Intelligence with. The tradeoff is standards move slower. Variant type support is a good example&#8212;we contributed our approach and shepherded it into the v3 spec. </p><p>Next up, the community is wrestling with fine&#8209;grained access control beyond table&#8209;level policies. It&#8217;s hard and will take time, but the outcome should be better for everyone.</p><h3>Give us your view on the future of data engineering.</h3><p>Data volume is exploding, including unstructured data that&#8217;s now usable. You can&#8217;t hand&#8209;build every pipeline. Demand is also exploding as agents query more things in more ways. Teams must operate at a higher level: automate, standardize, and reduce bespoke pipelines. </p><p>Expect more shared semantic models across consumers and packaged semantics coming from systems like SAP. 
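The shared-semantic-model idea above can be sketched as a toy: metric definitions live in one governed structure, and every consumer (BI tool, agent) compiles SQL from it instead of re-deriving the logic. The shape and names below are hypothetical, not the dbt Semantic Layer or OSI schema.

```python
# Toy shared semantic model: one governed definition of metrics and
# dimensions, compiled to SQL on demand. Structure is illustrative only.

SEMANTIC_MODEL = {
    "table": "analytics.orders",
    "metrics": {
        "revenue": {"agg": "sum", "column": "amount"},
        "order_count": {"agg": "count", "column": "order_id"},
    },
    "dimensions": ["order_date", "region"],
}

def compile_metric(model: dict, metric: str, group_by: str) -> str:
    """Turn a metric request into SQL; reject ungoverned dimensions."""
    if group_by not in model["dimensions"]:
        raise ValueError(f"unknown dimension: {group_by}")
    m = model["metrics"][metric]
    return (f"select {group_by}, {m['agg']}({m['column']}) as {metric} "
            f"from {model['table']} group by {group_by}")

sql = compile_metric(SEMANTIC_MODEL, "revenue", "region")
```

Because the definition is consumed rather than copied, a dashboard, a notebook, and an agent all get the same `sum(amount)` for revenue — which is the standardization Chris predicts will replace bespoke per-consumer pipelines.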
You&#8217;ll also build data&#8209;engineering agents to do work and monitor pipelines. The role looks more like architect and manager, allocating budgets, deduplicating work, and&#8212;most importantly&#8212;deeply understanding the business. The best data engineers shift from code output to data products, with clear semantics and context.</p><h3>Talk more about context.</h3><p>The day&#8209;to&#8209;day activity shifts, but the output is still data products. Great data products come with instructions, definitions, lineage, quality expectations, and how to get correct answers to common questions. </p><p>We need that context captured where work happens&#8212;models, visualization, quality systems&#8212;and made available everywhere: catalogs, agents, and UIs. As you build, you should also document, and those semantics should flow consistently into tools like Snowflake Intelligence so agents can reason correctly. </p><p>A big part of the challenge is selecting just&#8209;enough context per question.</p><h2>Chapters</h2><ul><li><p>00:01:50 &#8212; Chris&#8217;s path: RelateIQ, Segment, Snowflake</p></li><li><p>00:05:40 &#8212; Roles at Snowflake: product, solutions engineering, data engineering</p></li><li><p>00:09:00 &#8212; Snowflake and AI: foundations before ChatGPT</p></li><li><p>00:11:40 &#8212; Why keep ML and non-SQL work closer to governed data</p></li><li><p>00:13:40 &#8212; Applica and Neeva acquisitions, enterprise search context</p></li><li><p>00:14:50 &#8212; Two big AI influences: chat with data and code generation</p></li><li><p>00:16:50 &#8212; Scaling agents while preserving governance and cost controls</p></li><li><p>00:18:40 &#8212; Why governance must live at the data layer (roles, rows, columns)</p></li><li><p>00:22:00 &#8212; Inside vs. 
outside Snowflake: how agents assume roles</p></li><li><p>00:23:02 &#8212; Cortex: higher-level APIs over many LLMs</p></li><li><p>00:24:06 &#8212; AI SQL: joins/where by intent and the non-determinism tradeoff</p></li><li><p>00:27:40 &#8212; Cost models, tokens, and guardrails</p></li><li><p>00:29:10 &#8212; Snowflake Intelligence: agents over a governed foundation</p></li><li><p>00:32:10 &#8212; Open formats and Iceberg: Why Snowflake leaned in</p></li><li><p>00:36:00 &#8212; Standards tradeoffs: variant type and community progress</p></li><li><p>00:38:40 &#8212; Fine-grained access control for Iceberg: thorny but necessary</p></li><li><p>00:40:40 &#8212; The future of data engineering: scale, unstructured data, agents</p></li><li><p>00:43:20 &#8212; No more bespoke pipelines; standardized models, and semantics</p></li><li><p>00:44:50 &#8212; Data engineers as architects and business partners</p></li><li><p>00:50:00 &#8212; Code vs. context: data products and shared semantics</p></li><li><p>00:53:10 &#8212; Capturing context where work happens (models, viz, quality)</p></li><li><p>00:55:00 &#8212; Selecting just enough context for agent reasoning</p></li><li><p>00:56:30 &#8212; Closing</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a multimodal lakehouse for AI (w/ Chang She)]]></title><description><![CDATA[The CEO of LanceDB and Tristan go deep into the bridge between analytics and AI engineering]]></description><link>https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</link><guid isPermaLink="false">https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 23 Nov 2025 14:03:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/R5RW3LZIAO8" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back to The Analytics Engineering Podcast! Last season, we explored a host of topics on the developer experience (<a href="https://www.youtube.com/watch?v=WidQLYon2_I&amp;t=5s">something the dbt Labs crew has been pretty vocal on recently</a>). This season, we&#8217;re expanding that theme to look at how the current data landscape is impacting the developer experience. 
<a href="https://www.getdbt.com/blog/what-is-open-data-infrastructure">Open data infrastructure</a> is on the rise; AI is pushing teams to rethink how data is modeled, governed, and scaled; and the developer experience is evolving.</p><p>In this episode, Tristan Handy sits down with Chang She&#8212;a co-creator of pandas and now CEO of LanceDB&#8212;to explore the convergence of analytics and AI engineering.</p><p>The team at LanceDB is rebuilding the data lake from the ground up with AI as a first principle, starting with a new AI-native file format called Lance and building upward from there.</p><p>Tristan traces Chang&#8217;s journey from one of the original contributors to the pandas library to building a new infrastructure layer for AI-native data. Learn why vector databases alone aren&#8217;t enough, why agents require new architecture, and how LanceDB is building an AI lakehouse for the future.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" 
src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-R5RW3LZIAO8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;R5RW3LZIAO8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/R5RW3LZIAO8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You&#8217;re the founder and creator of the Lance file format and LanceDB. Before diving into vector search and vector databases, tell us about your background. </h3><p><strong>Chang She:</strong> I love talking to analytics engineers because that&#8217;s my background. I started about 20 years ago in quantitative finance. As a junior analyst, you do a lot of data engineering and analytics, which got me into open-source Python. 
I became one of the co-authors of the pandas library&#8212;initially to solve my own problem of not wanting to do analytics engineering in Java or VBScript.</p><h3>You worked for a hedge fund?</h3><p>Yes, AQR.</p><h3>Did they know you were contributing to pandas? Hedge funds aren&#8217;t known for open source.</h3><p>My roommate and colleague at the time was Wes McKinney. He showed me a proprietary Python library he was working on. It was life-changing. I started using and contributing. He spent about six months convincing the fund to open-source it. This was around 2010, and they were ahead of the industry in that respect.</p><h3> I didn&#8217;t know pandas started at AQR. That&#8217;s fascinating. So much of your circa-2010 analytics work was done in early pandas?</h3><p>Exactly. We went through several iterations, even debated the name. Because it was a hedge fund, there was a lot of econometrics and &#8220;panel data,&#8221; so Wes named it &#8220;pandas&#8221; for panel data analysis.</p><h3>That origin story isn&#8217;t widely known. You then founded two companies, sold one to Cloudera, and were there during an interesting time.</h3><p>Wes and I created DataPad&#8212;cloud BI before cloud BI really took off&#8212;and sold it to Cloudera. I spent about four and a half years in the Hadoop &#8220;big data&#8221; world, where I met my co-founder. He worked on HDFS at Cloudera, and several ex-Cloudera folks are at LanceDB today. After that I moved into machine learning at Tubi TV, working on recommender systems, ML serving, and experimentation/AB testing. That exposed me to embeddings. We dealt with videos, poster art images, and synopses&#8212;data that doesn&#8217;t fit neatly into pandas or even Spark data frames. That inspired me to build better infrastructure for these data types&#8212;what we now call &#8220;classical&#8221; machine learning&#8212;which led to LanceDB.</p><h3>So that&#8217;s our bridge to vectors. 
You experienced these problems at Tubi, then founded the company. And Tubi used dbt?</h3><p>Heavily. Thank you for creating it&#8212;it was critical to our stack.</p><h3>Give us a non-technical intro: what are vectors used for?</h3><p>Many people focus on the latest models and techniques. My perspective: everyone has access to similar models&#8212;your differentiation comes from your data and how effectively you connect data to AI. Vectors are a way to represent any kind of data in a form models understand: high-dimensional arrays of floating-point numbers&#8212;1,500, 3,000 dimensions, etc. Early statistical models might have a few interpretable dimensions; now you can have thousands where individual dimensions aren&#8217;t necessarily interpretable, but the space captures semantics.</p><p>Beyond RAG, vectors power internal model representations, recommender systems, and personalization&#8212;the original mainstream use case.</p><h3>Search is also a good use case. How is vector search different from full-text search or Command-F?</h3><p>Full-text search (e.g., Elasticsearch) returns documents containing the exact terms you searched. If you search for &#8220;customer,&#8221; it finds &#8220;customer/customers,&#8221; but might miss &#8220;user,&#8221; &#8220;adopter,&#8221; &#8220;organization,&#8221; etc. Vector search uses dense representations where semantically similar words and documents live near each other in high-dimensional space. Search for &#8220;customer,&#8221; and you get results that include semantically related terms.</p><h3>Would you combine vector and full-text search?</h3><p>Yes&#8212;hybrid search. Early RAG demos often used pure vector search for speed. Now enterprises need production-grade relevance. Many combine keyword and vector search with a re-ranking step to reach higher precision/recall.</p><h3>Early RAG pipelines often chunk text, embed, and call it done. 
But more thoughtful pipelines do something closer to feature engineering, right?</h3><p>Absolutely. Thought goes into what you feed the embedding model. For example: add a document- or section-level summary alongside each chunk before embedding; include multimodal features&#8212;artistic descriptions, literal captions, tags; create multiple embedding columns (e.g., different prompts/modalities) and search across them with re-ranking. High-quality retrieval requires feature-engineering-like decisions before embedding.</p><h3>Let&#8217;s talk vector file formats (Lance) and vector databases (LanceDB). My crude belief: a vector database is a standard database with additional indexes. True?</h3><p>Not wrong, but my hot take: with Lance and LanceDB, we&#8217;re building a lakehouse for multimodal data that includes vectors. Many &#8220;vector databases&#8221; are optimized only for vectors and struggle with other data types and workloads. The category needs to evolve&#8212;either toward new-generation search engines or new-generation lakehouses. We set out from day one to build the broader lakehouse, not just a vector index.</p><h3>Outline your AI-enabled data lake vision. I&#8217;m familiar with Snowflake and Databricks&#8217; lakehouse. How do you see the world differently?</h3><p>We assumed everyone would use Parquet and tried for months to support AI workloads&#8212;search, training, preprocessing&#8212;on it. We couldn&#8217;t make it work well. Talking to computer-vision and ML practitioners, no one had something effective. That gave us confidence to build a new format.</p><p>In AI you manage vectors, long documents, images, and videos. The first problem is storage. With Parquet, mixing wide blob columns with narrow metadata columns leads to out-of-memory issues due to row-group design. If you shrink row groups to fit blobs, read performance tanks.</p><p>Even once data is in Parquet, AI needs random access and secondary indexes. 
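The row-group tradeoff Chang describes can be put in back-of-envelope terms: fetching a handful of scattered rows forces reading whole row groups, so read amplification tracks the row-group size regardless of how little data you actually need. The numbers below are illustrative assumptions, not benchmarks.

```python
# Back-of-envelope model of random-access cost in a row-group format
# like Parquet. Figures are illustrative, not measured.

def read_amplification(row_bytes: int, rows_per_group: int, k: int) -> float:
    """Bytes read / bytes needed when k random rows land in k distinct
    row groups (the worst case, and the common one when k << total rows)."""
    needed = k * row_bytes
    read = k * rows_per_group * row_bytes  # a whole group per row hit
    return read / needed

# 1 KB metadata rows in 100k-row groups: ~100,000x amplification.
small = read_amplification(row_bytes=1_024, rows_per_group=100_000, k=10)

# 5 MB media rows: shrinking groups to 100 rows tames random access
# (100x amplification) but, per the text, tanks sequential scan speed --
# the bind that motivated a new format rather than smaller groups.
blob = read_amplification(row_bytes=5_000_000, rows_per_group=100, k=10)
```

Note the amplification factor reduces to `rows_per_group` itself — which is exactly why tuning row-group size only trades the random-access problem for a scan-throughput problem instead of solving it.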
Parquet doesn&#8217;t support efficient random row access: retrieving scattered rows forces reading entire row groups. With media, that&#8217;s prohibitively expensive&#8212;both for search and for training (e.g., global shuffle). Data evolution is also hard: with table formats like Iceberg, backfills often mean copying entire datasets. Copying petabytes of media is a non-starter. These issues motivated Lance.</p><h3>I have a good mental model of Parquet with structured data. With images or video, do you put them in blob columns?</h3><p>Yes. We use Apache Arrow types. Images/audio/video are large binary columns. Vectors are fixed-width list columns (e.g., 1,536-dimensional). But Parquet&#8217;s row-group mechanics and lack of random access make these workloads painful.</p><h3>So Lance was the first thing you built. It has solid traction on GitHub. Who uses a file format&#8212;users or vendors?</h3><p>Both. Frontier labs use Lance to store training data&#8212;e.g., for image/video generation&#8212;replacing stacks like TFRecords, WebDataset, Parquet, and BigQuery. Large tech companies and vendors also build on Lance: Databricks, Tencent, Alibaba, Netflix, NVIDIA, Uber, among others.</p><h3>Databricks uses Lance?</h3><p>For parts of their AI-specific offerings.</p><h3>You&#8217;ve raised several rounds&#8212;the format is Apache-2 licensed. How do you commercialize?</h3><p>Our commercial offering is a data platform for large-scale AI production: vector search, data preprocessing, training/serving cache, and an analytics engine for curation and exploration. It supports ML training workflows and AI application development, solving the hard distributed-systems problems along the path. We partner closely with big vendors; we&#8217;re generally not competitive because goals and customer bases differ. 
Cloud providers seek platform consumption; we focus on an AI-optimized data platform for specific workloads and users.</p><h3>The commercial product is called LanceDB, but you prefer to position it not just as a database.</h3><p>Right&#8212;we&#8217;re an AI-native data platform/lakehouse for multimodal data, with Lance as the common format.</p><h3>How does this space play out over the next two to three years?</h3><p>Two big predictions. First, multimodal will be 100&#215; bigger&#8212;more usage and more data. Audio is exploding; video generation is resurging; robotics is next. Second, our data infrastructure isn&#8217;t ready for agents driving search and retrieval.</p><h3>Let&#8217;s unpack both. On multimodal: unlike structured analytics, where every company needs it, multimodal workloads seem concentrated. Do all enterprises really need this?</h3><p>I think every enterprise becomes multimodal. Take insurance: tons of documents to digitize, extract, search, and analyze; drones capturing images/video to assess risk and improvements over time. Existing businesses become more efficient; AI-native entrants gain structural advantages. Multimodal data underpins both.</p><h3>It&#8217;s a heavy lift. Will every Fortune 500 insurer build these capabilities in-house, or will vendors package them?</h3><p>Likely both&#8212;just like analytics engineering emerged as a role, with adjacent talent re-skilling. We see the same with AI engineering.</p><h3>What titles are hands-on with your product?</h3><p>AI researchers and AI engineers. Many app developers building AI features now carry the &#8220;AI engineer&#8221; title.</p><h3>On agents: how do their access patterns change platform requirements?</h3><p>RAG was one-shot: ask, retrieve, answer. Agents iterate: they decompose problems into sub-questions, refine queries and results, and run many steps in parallel. Load skyrockets&#8212;humans type slowly; agents can issue hundreds of queries simultaneously. 
Queries are more varied and selective, and agents are creative in combining modalities and sources: schemas, SQL over structured data, prior analyses and charts, document stores, image/video metadata, etc.</p><p>Traditional vector databases aren&#8217;t designed for this breadth and scale. If you bolt together multiple specialized systems, your &#8220;agent stack&#8221; balloons into a maintenance nightmare. Our approach: put all data in one place with a single system that supports vector search, keyword search, filters, key-value lookups, re-ranking, analytics, and efficient random access&#8212;on top of an AI-native file format (Lance).</p><h3>For listeners whose curiosity is piqued, any resources you recommend?</h3><p><strong>Chang She:</strong> Yes&#8212;our blog series by Weston Pace, the tech lead for Lance format. It dives into encodings, I/O, and has great reads for analytics engineers: <a href="http://lancedb.com/blog">lancedb.com/blog</a> .</p><h2>Chapters</h2><ul><li><p>00:00 &#8211; Intro: Analytics meets AI</p></li><li><p>03:20 &#8211; Chang&#8217;s background and how Pandas began</p></li><li><p>06:40 &#8211; Lessons from Cloudera and metadata</p></li><li><p>08:30 &#8211; Multimodal data and LanceDB&#8217;s origin story</p></li><li><p>10:00 &#8211; Why vector search matters (beyond RAG)</p></li><li><p>12:00 &#8211; What are vectors and why do we use them?</p></li><li><p>15:00 &#8211; Full-text vs vector search</p></li><li><p>18:00 &#8211; Feature engineering in AI use cases</p></li><li><p>21:15 &#8211; Lance format</p></li><li><p>28:00 &#8211; Storage, scale, and the problem with Parquet</p></li><li><p>35:30 &#8211; Building a business on open source</p></li><li><p>41:00 &#8211; Two big bets: multimodal data and agents</p></li><li><p>46:00 &#8211; Every company will become multimodal</p></li><li><p>50:00 &#8211; Agent access patterns will redefine data</p></li><li><p>54:00 &#8211; Why dbt-style workflows matter now more than 
ever</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Agentic coding in analytics engineering (w/ Mikkel Dengsøe)]]></title><description><![CDATA[The cofounder of SYNQ discusses his tests (and tips) with agentic coding tools]]></description><link>https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</link><guid isPermaLink="false">https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 07 Sep 2025 12:01:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/555761fd-daa8-47e7-a907-79541a9e3860_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>What does agentic coding look like in analytics engineering? Mikkel Dengs&#248;e, co-founder at SYNQ, recently <a href="https://medium.com/@mikldd/using-ai-for-data-modeling-in-dbt-975838054cb1">wrote</a> a <a href="https://medium.com/@mikldd/using-ai-to-build-a-robust-testing-framework-4e034dfd014f">series</a> of <a href="https://medium.com/@mikldd/using-omnis-ai-assistant-on-the-semantic-layer-0572f997451d">posts</a> on his experiences as an analytics engineer with agentic coding tools. In this episode of The Analytics Engineering Podcast, he walks through a hands-on project using Cursor, the <a href="https://www.getdbt.com/product/fusion">dbt Fusion engine</a>, the <a href="https://www.getdbt.com/blog/mcp">dbt MCP server</a>, Omni&#8217;s AI assistant, and Snowflake.</p><p>Tristan and Mikkel cover where agents shine (staging, unit tests, lineage-aware checks), where they&#8217;re risky (BI chat for non-experts), and how observability is shifting from dashboards to root-cause explanations delivered to the right person at the right time. 
Along the way: practical prompts, why &#8220;one model at a time&#8221; keeps you in control, and a testing philosophy that avoids alert fatigue while catching what matters.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To see real-world use cases of agentic coding and to learn directly from data and AI leaders, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Can 
you talk a little bit about your background?</h3><p><strong>Mikkel Dengs&#248;e:</strong> Yeah, so I can start from the beginning. I've been in data for, I think it's coming up to 15 years now, and started my career at a Danish shipping company, which was very much zero to one. When I came in, there was no data warehouse, and the only way we could know how many containers were shipped was by an IT guy pulling that out of the system every six months. I then spent two years there building up their data warehouse on SQL Server, which was super fun. After that, I spent five years at Google, which was a very different gear.</p><h3>That's a natural transition. Just global shipping company straight to Google.</h3><p>Exactly. And that was very much a hundred-to-n where, in my case, I worked with the ads data and you get a perfectly curated data table that you can work with and everything kind of works. Then after that I joined a company called Monzo. For those who are not familiar, it's a scaling fintech out of the UK and that was very much the one to a hundred. When I joined we were 30 data people, and we scaled to a hundred over two years. We had 10,000 dbt models and we built every internal tool under the sun for dbt. Super interesting. And then three and a half years ago I went on to found SYNQ alongside Peter and Steve, which is a data observability platform.</p><h3>Tell us a little bit more about SYNQ.</h3><p>We are a data platform that primarily works with companies that already use tools like dbt but have issues going from important data to business-critical data. That might be customer-facing dashboards, machine learning models, or something else. They want better monitoring&#8212;we often deploy anomaly monitors&#8212;and they also want workflows such as incident management for when things go wrong. We were founded in 2022, so we're still early: we started with scale-ups and startups, and are now also onboarding enterprises and larger companies. 
It's been a fun journey.</p><h3>In your series of blog posts, you went through the modern data stack and said, &#8220;What's the most current version of this tool and how effectively can I AI-ify that?&#8221; Whether that's using Cursor to build dbt models or using the agent experience inside of Omni&#8212;what made you decide to get into this and write about it?</h3><p>The first part of it is just: it's super fun to tinker with these tools and try them out. It's magic. And we were also building an MCP server at SYNQ, so I had a lot of interest in seeing how it works with others and what we can learn. It was also driven by conversations with our customers: when they ask about it, I can speak from the point of view of having actually tried this and seen what works and what doesn't.</p><h3>The early days of using Redshift were such a visceral experience relative to what came before. If I hadn't interacted with it directly, I wouldn't have understood how big a step change cloud data was. This feels like another one of those moments: if you don't have hands-on experience, you're not going to really get it. Fair?</h3><p>Spot on. And I think pretty much every data team should be doing this unless they have a very good reason not to. The risk and the stakes can be pretty low if you use it for internal workflows like data modeling and writing tests. You're still in control. I recommend everybody do it.</p><h3>What tasks did you try to accomplish?</h3><p>It's three different blog posts: the data modeling part, the testing part, and then exposing it in Omni's AI agent where people can ask questions about the data. There's a fourth post: once the data is live, how can you use the SYNQ MCP to do things like root-cause analysis and planning changes. I started with data modeling. 
I had raw data from different JSON sources, some XMLs, some profiles&#8212;extracted and put into Snowflake&#8212;and then did the data model.</p><h3>So the data was already loaded into Snowflake?</h3><p>Yeah, exactly. For the data modeling, I started from the sources and then worked through staging, marts, and finally metrics using the semantic layer. Each step looks a little different when you use AI tools because the behavior differs. In terms of tooling, I used Cursor with the dbt-MCP plugged in. If you're not familiar, dbt-MCP lets you, via prompt, interact with dbt tools&#8212;execute <code>dbt build</code>, get models, or get everything upstream of a given model&#8212;so you can chain work without explicitly doing it.</p><h3>Cursor + dbt-MCP. What model did you use?</h3><p>I just used the default in Cursor, which I believe is Claude. There's an important distinction: Cursor is really good at writing code, but it can't execute queries on your behalf. If you want to extract raw data and query Snowflake to get rows out, you have to do that in Claude Desktop. That became key. Early on, as I built models, the first thing I did was get a snapshot of sample data from Snowflake&#8212;10,000 rows of a source. I fed that into Cursor and said, &#8220;These are examples of what this data looks like.&#8221; Using that data, Cursor could model in a clever way. For example, a column called <code>quarter</code> like &#8220;2025 Q1&#8221;&#8212;Cursor understood to translate it into a datetime and do the transformations.</p><h3>I've used the dbt MCP server a decent amount&#8212;less in Cursor, more in Claude Desktop. Your stack was Cursor + Claude models + Claude Desktop. And Cursor cannot directly execute queries in Snowflake, but Claude Desktop can. Is that because there&#8217;s tool use Claude has that Cursor doesn't?</h3><p>I believe so. In Claude Desktop, if you write queries against dbt-MCP, Claude can visualize a graph, show outputs of a SQL statement, etc. 
Cursor, as far as I know, couldn't. My middle ground was to take sample data out of Snowflake, put it into a CSV, and feed that back into Cursor so it could look at raw data.</p><h3>As part of its own context window?</h3><p>Exactly. That was key for my workflow. Then when I wanted to write unit tests, I could use real data examples from the sample. Or when automatically documenting the data, I asked Cursor to specify examples in the docs based on the most common occurrences within a column. Letting Cursor peek at raw data was a core pillar.</p><h3>It's a little hacky, right? Cursor should really be able to interact directly with Snowflake or Databricks to investigate the shape of the data. Agents should be empowered to do that.</h3><p>I would say so. There might be a way I didn&#8217;t know about, but I patched the gaps by uploading into the context window.</p><h3>So that's the state of the art today.</h3><p>Seems so. To be clear, I think the limitation is IDE differences&#8212;Cursor vs. Claude Desktop&#8212;rather than dbt-MCP itself.</p><h3>Once you had sample data in context, did you have to suggest conversions, or did it naturally do them?</h3><p>It got the defaults pretty right, but I guided it on what I wanted from the source data. I wanted control over everything, so I asked it to do one model at a time rather than auto-generate a whole stack. That way I could review each step and stay in control.</p><h3>Your prompt workflow was &#8220;Build me a model with this name that stages the data from this table,&#8221; basically?</h3><p>Yeah. When it proposed code I didn't like, upstream it was usually simple (regex to parse dates, etc.). Downstream, in marts and metrics, I started describing my ideal data product: user jobs-to-be-done and the final output. 
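</p><p><em>As a concrete illustration of those simpler upstream changes (the regex and date parsing mentioned above), a staging model handling a &#8220;2025 Q1&#8221;-style <code>quarter</code> column might look like the sketch below. This is Snowflake-style SQL with invented source and column names, not code from the project discussed here.</em></p><pre><code>-- models/staging/stg_listings.sql (hypothetical names)
with source as (
    select * from {{ source('raw', 'listings') }}
)

select
    id as listing_id,
    price::number as price,
    -- turn '2025 Q1' into the first day of that quarter (2025-01-01)
    date_from_parts(
        split_part(quarter, ' ', 1)::int,
        (replace(split_part(quarter, ' ', 2), 'Q', '')::int - 1) * 3 + 1,
        1
    ) as quarter_start_date
from source</code></pre><p>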
That&#8217;s when Cursor got creative and invented metrics I hadn&#8217;t anticipated&#8212;like &#8220;apartment price relative to time on market.&#8221; I pruned ones I didn&#8217;t want, but some were good surprises.</p><h3>Which layer did it help most?</h3><p>Testing. Modeling was good&#8212;especially staging&#8212;but testing accelerated significantly. SQL is a bit like English; for simple datasets you can express intent easily. Testing can be much harder and more verbose.</p><h3>Roughly how much more effective did you feel?</h3><p>Modeling: multiples faster. It nailed the tedious parts&#8212;regex, casting, pass-throughs&#8212;so staging/intermediate layers flew. In marts/semantic metrics, the benefit was brainstorming. It helped me think of metrics I wouldn't have.</p><h3>Did the dbt Fusion engine help?</h3><p>Yes. Fusion shows lineage and whether a column is pass-through. For example, if a column is pass-through with no transforms, don't add another <code>not_null</code> or <code>unique</code> if there's one upstream. I bounced between the IDE to check this and codified it as a testing strategy. That's already top-10% testing hygiene.</p><h3>Any MCP feature requests surface?</h3><p>The more context and tools the agent has, the more it can do. In the fourth post, for root cause analysis, we used the SYNQ MCP. We collect all your Git commits and have history, so the agent could correlate recent code changes with incidents. Requests depend on the job at hand.</p><h3>Let's move to testing&#8212;why was it the most additive?</h3><p>Testing is hard; many teams don't know how to do it and alert fatigue is common. A huge share of tests we see are <code>not_null</code>/<code>unique</code>, which doesn't reflect real data risks. First thing I did in Cursor for testing was provide our internal testing philosophy as guidelines: test heavily at the source, don't retest pass-through columns, focus on business and metric anomalies in marts. That worked really well. 
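</p><p><em>To make that concrete: one way to express a metric-level check in a mart is a dbt singular test, a SQL file that selects the rows violating an assumption and fails if any come back. The table, column, and bounds below are purely illustrative:</em></p><pre><code>-- tests/assert_sqm_price_in_range.sql (hypothetical)
-- Fails if any listing's price per square meter falls outside
-- the range we currently believe is plausible.
select
    listing_id,
    price_per_sqm
from {{ ref('fct_listings') }}
where price_per_sqm not between 500 and 20000</code></pre><p>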
For sources and staging, it generated relevant tests. Then for marts, I asked for unit tests and gave it a thousand sample rows from Snowflake. It wrote very relevant unit tests I&#8217;d otherwise spend a lot of time on.</p><h3>Examples?</h3><p>Simple ones like: when you pass a string value in the date column, does it transform correctly to datetime and match the expected format? These just worked. Then at the metric level, it looked at raw data and proposed assumptions&#8212;like square-meter price should be between X and Y&#8212;sometimes segmenting by postcode. Very thoughtful, though I'd replace static thresholds with anomaly monitors so they don't go stale as prices move.</p><h3>So at least 5&#215; on testing?</h3><p>At least. Apart from swapping static thresholds for anomaly detection, it nailed testing and did so in a lineage-aware, layer-appropriate way.</p><h3>Tell me about the BI layer.</h3><p>Many teams start at the BI layer with a chat interface. I think that's risky because it's used by business users and you only get so many chances before trust drops. I moved into Omni. You create a &#8220;topic&#8221; (a data model you can join with others) and then specify an AI context: instructions for how the LLM should behave. For example: if a user asks about price, always return square-meter price; never make up fields not present in the mart; if asked about provenance, mention the source. Writing AI context is a new skill for our industry.</p><h3>Were you using Omni&#8217;s AI assistant to create assets faster, or to let users self-serve?</h3><p>The latter&#8212;so users could ask questions instead of going to a dashboard. It could have been any BI tool with similar functionality; we just use Omni internally.</p><h3>And how was the experience as a consumer?</h3><p>Amazing when it works, but I'd hesitate to give my VP of Marketing access. It gets things wrong maybe one in five times, and it's not obvious why if you're not a data person. 
For analysts doing exploratory work, it's great&#8212;they can inspect and dig in. I wouldn't replace company-wide dashboards with a chat bot yet. Omni does log freeform queries and feedback, so there's a path to iterate the AI context over time.</p><h3>The last thing you did was use AI plus SYNQ to monitor production infrastructure. What does observability look like in the future? Historically it's looked like dashboards&#8212;Datadog for data pipelines. Is it just more effective monitors, or fundamentally different?</h3><p>Fundamentally different. We&#8217;re heading to a place where observability tools can tell you what's wrong at the right time, with just the right context, delivered to the right person&#8212;inside or outside the data team. Done well, there may be few dashboards; instead you get an LLM-summarized root cause delivered from a monitor that might be auto-created. Less &#8220;active tool you poke at,&#8221; more &#8220;proactive explanation.&#8221;</p><h3>Still technical observability (pipelines/data issues), or business observability?</h3><p>More the former. Teams at the edges&#8212;Sales Ops managing Salesforce, engineering teams creating web events&#8212;often need to be notified about data issues. Business KPI movements require a different experience for marketers, etc.</p><h3>Automated remediation?</h3><p>Gradual. You can imagine an issue occurs without a dedicated test; the system proposes a new test. But 80% of issues come from root systems elsewhere (someone typing in Salesforce), and closing that loop is still hard. In the article&#8217;s fourth part, we had a data issue and I asked the SYNQ MCP through Claude Desktop to do root cause analysis. It walked the same steps a data person would: inspect the model, check errors, examine lineage and upstreams, review recent commits, and documented each step to the root cause. That works now.</p><h3>At the beginning you said there&#8217;s no good reason not to use these tools today. 
What reasons do you hear for not trying?</h3><p>People are busy. But if you look at a risk curve, lowest risk is modeling and testing&#8212;you're in the driver's seat and it boosts productivity. Higher risk is replacing your BI tool with a chat bot; higher still is customer-facing experiences. The first two are hard to argue against.</p><h3>Enterprise IT approvals might be one blocker&#8212;approved models, data access, etc.</h3><p>True. For example, our MCP can query raw data to detect if an issue happens in a segment, and enterprises might hesitate there. Also, &#8220;MCP&#8221; as a term can be confusing. But it's actually simple and explainable, not a black box. Setting up dbt-MCP can still feel hacky in enterprises; if it lived natively in cloud environments, it&#8217;d be easier to adopt.</p><h3>You can set it up locally&#8212;no permissions/procurement&#8212;and just play. We also shipped the MCP server as a remote MCP in cloud, though that introduces auth/permissions considerations.</h3><p>If I had to pick a persona, it's the analyst. Analysts have had a tough decade: more tools, harder workflows, less time to tinker. MCPs and AI workflows are a turning point. At Monzo, we had a philosophy that you should be able to have an idea on your commute and have it implemented by midday. As we grew to 10,000 dbt models and long CI checks, that faded. I can see a world where this returns. MCPs can help. I'm excited.</p><h3>I love that. Analytics engineers think &#8220;infrastructure, correctness.&#8221; Analysts think &#8220;idea to validation fast.&#8221; Excel was always the analyst&#8217;s best friend because it's fast and flexible. MCPs make it easy to plug tools together and get answers quickly again.</h3><p>One company we work with&#8212;Voi, a scooter company out of Sweden&#8212;has a strong data leader, Magnus, who is very bought into metrics. Their data team doesn't produce dashboards; they produce metrics. 
In an AI world with MCPs, flows, and curves, that's a clear decision.</p><h3>I believe there's no such thing as the wrong BI tool&#8212;different tools have different trade-offs. Probably true for models/IDEs too: Claude Desktop vs. Claude Code vs. Cursor&#8212;no single &#8220;right answer&#8221; as long as the underlying context and metric definitions are shared.</h3><p>Agreed. What really matters across workflows: consistent metric definitions, documentation for columns and fields, and high-quality data. Those foundations matter even more when an LLM is in the loop; you may not have a human sanity-checking every result.</p><h2>Chapters</h2><ul><li><p><strong>00:00</strong> &#8212; Tristan&#8217;s intro</p></li><li><p><strong>01:10</strong> &#8212; Mikkel&#8217;s background: shipping &#8594; Google &#8594; Monzo &#8594; SYNQ</p></li><li><p><strong>03:08</strong> &#8212; What SYNQ does (data observability for business-critical data)</p></li><li><p><strong>04:15</strong> &#8212; Running the experiment</p></li><li><p><strong>06:23</strong> &#8212; Scope: modeling, testing, BI agent, observability</p></li><li><p><strong>07:17</strong> &#8212; Tooling: Cursor + dbt MCP server + Snowflake + Omni</p></li><li><p><strong>09:38</strong> &#8212; Sampling real data into the agent&#8217;s context</p></li><li><p><strong>13:14</strong> &#8212; Modeling workflow: one model at a time</p></li><li><p><strong>15:14</strong> &#8212; Where agents help most: testing &gt; modeling</p></li><li><p><strong>18:10</strong> &#8212; dbt Fusion engine: lineage-aware checks, fewer redundant tests</p></li><li><p><strong>19:50</strong> &#8212; Feature requests and root-cause via commit history</p></li><li><p><strong>20:57</strong> &#8212; Testing philosophy: source-heavy, pass-through aware, metric-level</p></li><li><p><strong>22:49</strong> &#8212; Unit tests from samples; thresholds vs anomaly monitors</p></li><li><p><strong>25:10</strong> &#8212; BI agents: great for analysts, risky for broad 
rollout</p></li><li><p><strong>31:54</strong> &#8212; The future of observability: explain first, dashboards second</p></li><li><p><strong>36:10</strong> &#8212; Adoption curve: safe places to start</p></li><li><p><strong>40:49</strong> &#8212; Analyst superpowers return</p></li><li><p><strong>42:04</strong> &#8212; Metrics over dashboards</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Under the hood of Apache Iceberg (w/ Christian Thiel)]]></title><description><![CDATA[The cofounder of Lakekeeper walks Tristan through the state of the Iceberg ecosystem]]></description><link>https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</link><guid isPermaLink="false">https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 24 Aug 2025 13:03:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/431e88b6-61e9-4287-8806-61a6027eb357_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you're a data practitioner, you likely understand Iceberg as a user, why it's important, and how it's changing the way we build data systems. But you may not know a lot about what's going on beneath the surface.</p><p>There are multiple ways to interface with Iceberg catalogs and multiple versions of the Iceberg REST spec. There are several leading catalogs that implement that spec. All this in an ecosystem that includes companies of all sizes, in proprietary and open-source code, and in academic and commercial contexts.</p><p>In a few years, all this ambiguity will be behind us, but right now it's very much evolving in real time. To get an update on the status of the Iceberg ecosystem and to walk through all the developments, Tristan talks with Christian Thiel. 
Christian is one of the lead architects of Lakekeeper, one of the most widely used Iceberg catalogs.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To learn more from some of the leaders in the Iceberg ecosystem, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Walk us through your background</h3><p><strong>Christian Thiel:</strong> I started in natural language 
processing, then moved into machine learning applications in manufacturing. Like many people, I realized that the biggest barrier wasn&#8217;t the algorithms but the data&#8212;its availability, quality, and accessibility. That led me deeper into data architecture and engineering, eventually to building Lakekeeper.</p><h2>What is Lakekeeper, and what are you building now?</h2><p>Lakekeeper is an Iceberg catalog implementation&#8212;a technical requirement for building distributed, composable analytic systems based on Apache Iceberg. But our vision goes beyond that. We see the future in data collaboration and reliable sharing of data, supported by clear contracts.</p><h2>For listeners new to Iceberg, what makes it so important?</h2><p>Iceberg allows organizations to store data once, in an open format, and then use the compute engine best suited for each workload. It&#8217;s a foundation for building modern, composable data platforms while avoiding vendor lock-in. If there&#8217;s one thing that should be open, it&#8217;s the data at the center of your platform.</p><h2>Some folks might say this sounds like Hadoop all over again&#8212;lots of open standards that are hard to integrate. Why is this time different?</h2><p>The ecosystem has matured. Even big vendors like Snowflake and Databricks are embracing Iceberg, which shows there&#8217;s a strong shift toward openness. Plus, the tooling and infrastructure are much easier to deploy today. A modern Iceberg setup is far less complex than a Hadoop environment used to be.</p><h2>Let&#8217;s talk about what&#8217;s happening under the hood. How does Iceberg work?</h2><p>Iceberg organizes data using a metadata hierarchy. At the top, there&#8217;s a JSON file that stores high-level table information: snapshots, schema, and locations. Below that are manifests and other layers that keep track of files. 
This hierarchy is what makes things like time travel, atomic transactions, and schema evolution possible.</p><h2>What about ongoing maintenance?</h2><p>There are two key tasks. First, expiring old snapshots so you don&#8217;t accumulate unnecessary files. Second, compaction&#8212;combining many small files into larger ones.</p><h2>Catalogs are another critical piece. What role do they play?</h2><p>Catalogs manage the top layer of metadata and coordinate transactions. They make atomic updates possible, allow multiple writers, and handle governance&#8212;things like access control and multi-table transactions.</p><h2>How enterprise-ready is Iceberg today?</h2><p>Very ready. A year ago, there were still gaps, but today, performance and feature parity with native tables on platforms like Snowflake and BigQuery are strong. Governance and authorization models are still evolving, and different catalogs implement them differently, but the core functionality is there.</p><h2>Speaking of catalogs, how should someone pick between options like Lakekeeper, Polaris, Unity, AWS Glue, or Gravitino?</h2><p><strong>Christian Thiel:</strong> It depends on priorities. Lakekeeper focuses on performance, extensibility, and ease of use. Polaris is developer-focused but less user-friendly. Unity is tightly integrated into Databricks. Glue now supports the Iceberg REST spec, which makes it more interoperable than before. Gravitino is another option aimed at enterprise-scale environments.</p><h2>Recently, DuckDB announced DuckLake. What&#8217;s your take on that?</h2><p>It&#8217;s interesting, but there are two concerns. First, it uses a database schema directly for the catalog, which creates interoperability issues&#8212;similar to the early JDBC catalog in Iceberg that the community eventually moved away from.
Second, it was built without community involvement, and openness without adoption isn&#8217;t really openness.</p><p>That said, for heavy DuckDB users, it could offer optimizations that make queries extremely fast, and if the broader ecosystem adopts it, it could become a viable open format.</p><h2>What&#8217;s next for Lakekeeper?</h2><p>We&#8217;re continuing to invest in table optimization, enterprise features, and data collaboration tools. Our vision is what we call the &#8220;unbreakable lakehouse,&#8221; where contracts and collaboration guardrails make shared data more reliable. Long-term, we see Lakekeeper as enabling truly collaborative, open data ecosystems.</p><h2>Chapters</h2><ul><li><p><strong>00:00 &#8211; Introduction</strong></p><p>Tristan Handy introduces the episode and the focus on Apache Iceberg.</p></li><li><p><strong>01:40 &#8211; Christian Thiel&#8217;s background</strong></p><p>From natural language processing to data engineering.</p></li><li><p><strong>04:30 &#8211; Introduction to Lakekeeper</strong></p><p>What Lakekeeper is and its role in the Iceberg ecosystem.</p></li><li><p><strong>06:00 &#8211; Why Iceberg matters</strong></p><p>How open table formats enable flexibility and reduce vendor lock-in.</p></li><li><p><strong>11:40 &#8211; How Iceberg works under the hood</strong></p><p>Metadata hierarchy, catalogs, and how state is managed.</p></li><li><p><strong>21:30 &#8211; Maintenance and optimization</strong></p><p>Snapshot expiration, compaction, and keeping tables performant.</p></li><li><p><strong>24:20 &#8211; Catalogs and governance</strong></p><p>Access control, multi-table transactions, and security.</p></li><li><p><strong>31:40 &#8211; Enterprise readiness</strong></p><p>How Iceberg is evolving for production use in large organizations.</p></li><li><p><strong>42:10 &#8211; Choosing the right catalog</strong></p><p>Overview of Lakekeeper, Polaris, Unity, Glue, and Gravitino.</p></li><li><p><strong>47:20 &#8211; DuckLake
discussion</strong></p><p>Pros, cons, and ecosystem adoption challenges.</p></li><li><p><strong>52:00 &#8211; The future of Lakekeeper</strong></p><p>Data contracts, collaboration, and building the &#8220;unbreakable lakehouse.&#8221;</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The pragmatic guide to AI agents in the enterprise (w/ Sean Falconer) ]]></title><description><![CDATA[Demystifying AI agents with Confluent's senior director of AI strategy]]></description><link>https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 03 Aug 2025 13:02:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/16b80e19-0489-465c-8dba-64088edba31f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What does it mean to be agentic? Is there a spectrum of agency? </p><p>In this episode of The Analytics Engineering Podcast, Tristan Handy talks to Sean Falconer, senior director of AI strategy at Confluent, about AI agents. They discuss what truly makes software "agentic," where agents are successfully being deployed, and how to conceptualize and build agents within enterprise infrastructure. </p><p>Sean shares practical ideas about the changing trends in AI, the role of basic models, and why agents may be better for businesses than for consumers. 
This episode will give you a clear, practical idea of how AI agents can change businesses, instead of being a vague marketing buzzword.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3><strong>Sean, can you give us the TLDR on your career and what you're working on today?</strong></h3><p><strong>Sean Falconer: </strong>I've always worked at the intersection of data, engineering, and AI. From academia studying computer science, into industry as a founder, then to Google, I worked on conversational systems and privacy/security in AI. 
Currently, at Confluent, I'm leading our AI product strategy, balancing both technical and go-to-market roles.</p><h3><strong>You moved from being deeply technical into marketing and sales. What drove that transition?</strong></h3><p>I was forced into it as a founder. Initially uncomfortable, but it taught me huge respect for marketing and sales. I had to learn by making many mistakes, eventually building out entire marketing and sales functions. I realized how challenging and critical these roles are.</p><h3><strong>You were at Google before ChatGPT launched. Did you foresee the transformative nature of these technologies?</strong></h3><p>Honestly, no. Having seen earlier disappointments in conversational AI (like Microsoft's Alice), I was skeptical initially, even as ChatGPT emerged. It wasn&#8217;t obvious we'd soon experience this revolution.</p><h3><strong>You&#8217;ve written about three waves of AI. Can you describe these?</strong></h3><p>Yes. Wave one was predictive AI, traditional ML models trained for specific tasks like fraud or spam detection&#8212;effective but rigid. Wave two introduced generative AI, or foundation models, trained on vast general datasets, flexible but lacking specific business context. The third wave, agentic AI, involves AI systems that can reason, dynamically choose tasks, gather information, and perform actions as a more complete software system.</p><h3><strong>Do foundation models replace traditional ML methods?</strong></h3><p>Sometimes they can, but it doesn&#8217;t always make sense. An LLM might do sentiment analysis well enough, but a traditional model may be more efficient and cheaper. Think of using an LLM as cutting steak with a chainsaw&#8212;possible, but unnecessary.</p><h3><strong>Let's clarify "agents." What makes software truly agentic?</strong></h3><p>It&#8217;s software that can dynamically decide its own control flow: choosing tasks, workflows, and gathering context as needed. 
Realistically, current enterprise agents have limited agency to ensure reliability. They're mostly workflow automations rather than fully autonomous systems.</p><h3><strong>You mentioned a spectrum of agency. Is this similar to autonomy in self-driving cars?</strong></h3><p>Exactly. Highly autonomous agents are appealing but not practical yet. Most enterprise success stories involve structured workflows with clearly defined boundaries.</p><h3><strong>Why have agents taken off more in enterprises than consumer apps?</strong></h3><p>Enterprises have many well-defined, high-value tasks perfect for automation. Consumer scenarios demanding high agency&#8212;like planning complex trips&#8212;are still too unreliable. Enterprises can benefit significantly even from limited agentic capability.</p><h3><strong>Is an agent just a microservice?</strong></h3><p>In many ways, yes. An agent functions like a microservice with extra capabilities (using LLMs for decisions). Deployment considerations like state management and long-running tasks differ slightly, but fundamentally it&#8217;s similar.</p><h3><strong>What tools and frameworks help build effective agents?</strong></h3><p>Start with frontier models like GPT-4 or Claude. Frameworks include LangChain, Microsoft Autogen, and CrewAI. But for real-world deployment, treat it as rigorous software engineering with observability, scalability, and robustness in mind.</p><h3><strong>Are organizational barriers bigger than technical challenges?</strong></h3><p>Yes. AI efforts are often mistakenly tasked to data science teams rather than cross-functional software teams. Successful companies create dedicated teams blending software engineering skills and data expertise to build reliable agentic systems.</p><h3><strong>What pitfalls should teams avoid?</strong></h3><p>Avoid building monolithic agents. Break systems into smaller, well-defined units in a multi-agent architecture. 
Use event-driven frameworks to avoid rigid, hard-to-maintain dependencies.</p><h2>Chapters</h2><ul><li><p>[00:00] Introduction: What's all the hype about agents?</p></li><li><p>[01:10] Meet Sean Falconer: A journey from engineer to AI strategist</p></li><li><p>[04:10] Learning marketing as an engineer-founder</p></li><li><p>[05:50] Inside Google's AI efforts before ChatGPT</p></li><li><p>[09:00] What does it mean to run AI strategy?</p></li><li><p>[10:45] Three waves of AI: Predictive, Generative, and Agentic</p></li><li><p>[16:30] Will foundation models replace traditional ML?</p></li><li><p>[18:30] Defining agents clearly: Beyond the buzzword</p></li><li><p>[22:00] The spectrum of agency: From controlled workflows to open-ended tasks</p></li><li><p>[25:30] Why agents fit better in enterprises than consumer apps</p></li><li><p>[28:00] Agents as microservices: A practical view</p></li><li><p>[35:00] What tech stack is needed to build effective agents?</p></li><li><p>[37:50] Organizational challenges in adopting agents</p></li><li><p>[39:30] Models that are favorites for developers</p></li><li><p>[43:30] Why software engineers are best placed to build agents</p></li><li><p>[46:00] The technical stumbling blocks in building agents</p></li><li><p>[48:00] Concluding thoughts: Beyond POCs to production agents</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How Amazon S3 works (w/ Andy Warfield)]]></title><description><![CDATA[Go under the hood of Amazon S3 with AWS engineering leader Andy Warfield&#8212;from virtualization to Iceberg]]></description><link>https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</link><guid isPermaLink="false">https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 20 Jul 2025 12:02:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92b37acf-08ac-4dac-b59f-123b21df7011_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its Blob Storage siblings at Microsoft and Google. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that unlocked all of the progress in cloud data over the last decade. </p><p>In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks.
They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Operating systems, garage sales, and Xen</h3><p><strong>Tristan Handy: You&#8217;ve done a lot over the last 20 years. Before we get into specifics, can you just share a little about your journey as a software engineer?</strong></p><p><strong>Andy Warfield:</strong> I just like playing with computers.  I studied computer science in Ontario for undergrad, then moved to Vancouver for grad school, then to the UK for a PhD. 
I worked on operating systems, low-level stuff. I got to work on a hypervisor called Xen, which ended up being used by a lot of cloud providers, including Amazon.</p><p>After that, I did a couple of startups, one around Xen. Then I became a professor at UBC, teaching operating systems, networking, and security. Later, I did another startup in storage, and eventually I joined Amazon.</p><p>Now I have this highfalutin role&#8212;VP and engineer&#8212;working across S3, other storage services, and now a bunch of analytics services too. I get to cause trouble in lots of different parts of the cloud.</p><p><strong>VP slash distinguished engineer&#8212;does that mean you just get to march around telling people how to improve their stuff?</strong></p><p>People love that! I&#8217;d say about half the time I&#8217;m causing trouble&#8212;starting things and encouraging new ideas&#8212;and the other half I&#8217;m helping teams dig out from those ideas. Sometimes I take over a team if we&#8217;re doing something especially interesting or innovative, just so I can be closer to the action.</p><p><strong>That sounds like a pretty good gig if you can get it.</strong></p><p>It&#8217;s amazing. I&#8217;ve been here nearly eight years, and I still love this job.</p><div><hr></div><h3>The rise of virtualization and the origin of Xen</h3><p><strong>I want to talk about Xen. You said you were always interested in operating systems, which is kind of a niche fascination. What drew you in?</strong></p><p>When I was a kid, we didn&#8217;t have much money, so I built computers from garage sale parts in Ottawa. In high school, I found this federal government warehouse that sold off old equipment. I started a little business buying pallets of hardware for cheap, fixing them up, and reselling.</p><p>It was chaotic&#8212;but I learned a lot. I dealt with machines like IBM DisplayWriters with 8-inch floppy disks and massive dot-matrix printers. 
Getting them working meant diving into their software and systems.</p><p>Eventually I played with Linux, hacked on the kernel, and that all led me into OS research and development.</p><p><strong>Tristan: So what is a hypervisor, and why did virtualization become so important in the 2000s?</strong></p><p><strong>Andy:</strong> There were two big drivers: server utilization and isolation.</p><p>Companies had racks full of 1U servers, most of which sat idle most of the time. But they couldn&#8217;t share workloads because apps weren&#8217;t isolated well&#8212;config conflicts, shared resources, etc.</p><p>Virtualization allowed multiple operating systems to run on the same hardware, with isolation. It also let you consolidate servers, which had big cost and efficiency benefits.</p><p>There was also a technical challenge: x86 processors weren&#8217;t designed to be virtualized. That made it a really interesting research problem. We wanted to see if it could even be done&#8212;and done efficiently.</p><p><strong>Tristan: And Intel eventually started building virtualization support into the hardware?</strong></p><p><strong>Andy:</strong> Exactly. Our work on Xen and similar projects showed it was possible. That pushed Intel and AMD to add features like VT-x, which made it easier and more performant to run hypervisors.</p><p><strong>Tristan: How did AWS end up using Xen?</strong></p><p><strong>Andy:</strong> I wasn&#8217;t part of those internal conversations, but the story goes that a small startup in Cape Town, South Africa, was building a control plane for Xen. That team got picked up by AWS and became the basis for EC2.</p><div><hr></div><h3>Understanding Amazon S3</h3><p><strong>Tristan: Let&#8217;s switch to S3. I think a common mental model is that S3 is just a big pool of SSDs. But that&#8217;s clearly not the whole story. 
How do you explain what S3 actually is?</strong></p><p><strong>Andy:</strong> That&#8217;s one of my favorite questions.</p><p>Early on, S3 was like a storage locker. You&#8217;d rent space to stash things you didn&#8217;t need right away&#8212;backups, static files, CDN origins. Latency wasn&#8217;t great, but durability and availability were.</p><p>Things really changed when the Hadoop community built S3A&#8212;an adapter to let Hadoop use S3 instead of HDFS. Suddenly, we had people doing real analytics on S3. The system had enough drives to support massive parallel reads.</p><p>Today, workloads are way more demanding. Performance, consistency, and latency matter. We&#8217;ve been evolving the system constantly to meet those needs.</p><p><strong>Tristan: Are we talking about billions of hard drives?</strong></p><p><strong>Andy:</strong> I can&#8217;t share exact numbers, but yes&#8212;it's a lot of hard drives. Some of our largest customers have data spread across <em>millions</em> of drives. And most drives are shared across multiple customers.</p><p><strong>Tristan: And these aren&#8217;t SSDs?</strong></p><p><strong>Andy:</strong> Mostly spinning disks, actually. Hard drives are terrible at latency, but they&#8217;re cheap and good for bursty workloads. Spreading your data across many disks lets you take advantage of parallelism.</p><div><hr></div><h3>S3&#8217;s durability, performance, and scale</h3><p><strong>Tristan: Let&#8217;s talk about S3&#8217;s durability promise: 11 nines. How do you achieve that?</strong></p><p><strong>Andy:</strong> We use erasure coding&#8212;a form of RAID-like redundancy that lets you split data into parts and parity blocks. Then we store those shards across different availability zones.</p><p>We constantly monitor for failures. Disks die all the time, so we have fleets of processes repairing and maintaining durability. It&#8217;s not static. 
It&#8217;s a living system.</p><p><strong>Tristan: You must have incredibly precise failure models.</strong></p><p><strong>Andy:</strong> We do. We track failure rates, temperature sensitivity, vendor behavior&#8212;everything. That allows us to be proactive and surgical in how we manage risk.</p><div><hr></div><h3>From Parquet to Iceberg to S3 table buckets</h3><p><strong>Tristan: I want to talk about table formats. Parquet is everywhere now. And then we got Hive Metastore, then Iceberg. Why did S3 launch table buckets?</strong></p><p>Parquet is great, but it&#8217;s just files. Customers kept asking for more structured semantics: schema evolution, upserts, ACID transactions.</p><p>We saw Iceberg adoption grow rapidly&#8212;especially among our largest analytics customers. But they were struggling with operational complexity: too many small files, custom compactors, brittle catalogs.</p><p>So we launched S3 table buckets to bring native Iceberg support to S3. That includes:</p><ul><li><p>Automatic compaction</p></li><li><p>A REST catalog</p></li><li><p>High-performance access</p></li></ul><p>We wanted to make it easier to treat Iceberg as a storage primitive, not just an analytics backend.</p><p><strong>So this is a shift in philosophy&#8212;S3 isn&#8217;t just object storage, it&#8217;s now table-aware?</strong></p><p>Exactly. Historically, S3 was just where you stored objects. Now, we&#8217;re thinking more about what those objects <em>mean</em>.</p><p>We also launched S3 object metadata tables&#8212;a way to semantically describe and query your object store, especially useful for AI workloads using retrieval-augmented generation (RAG).</p><div><hr></div><h3>The future of open data and S3</h3><p><strong>What does the future of S3 look like? Where&#8217;s this going?</strong></p><p>We&#8217;re headed toward more structure, more semantics, and more performance.</p><p>Inference workloads are scaling fast. 
AI models are hitting S3 hundreds of thousands of times per second to do vector lookups. That&#8217;s changing how we think about indexing, metadata, and latency.</p><p>We want to make S3 the best place to do open, flexible, high-scale data work&#8212;from tables to training data to retrieval.</p><h2>Chapters</h2><p><strong>[01:42] Meet Andy Warfield</strong></p><p>Andy shares his background, including startups, professorship, and his current role as VP &amp; Senior Principal Engineer at AWS.</p><p><strong>[05:10] From garage sales to hypervisors</strong></p><p>Andy describes his early passion for hardware, OS development, and the origin story behind the Xen hypervisor.</p><p><strong>[08:50] Why virtualization took off in the 2000s</strong></p><p>Exploring why isolation, utilization, and technical curiosity fueled the rise of hypervisors.</p><p><strong>[14:30] Xen vs. VMware and the road to AWS</strong></p><p>How Xen became the default for EC2 and the technical differences between virtualization approaches.</p><p><strong>[17:35] The origin of EC2 and S3</strong></p><p>How a team from Cape Town helped launch AWS compute&#8212;and the early days of cloud services.</p><p><strong>[20:00] What is S3, really?</strong></p><p>Andy breaks down the mental model behind S3: not just object storage, but a scalable data platform.</p><p><strong>[22:49] How many drives? 
More than you think</strong></p><p>Why S3 storage spans millions of drives&#8212;and how AWS uses scale to deliver performance.</p><p><strong>[28:10] The 11 nines durability model</strong></p><p>Inside S3&#8217;s approach to reliability, failure tolerance, and background repairs using erasure coding.</p><p><strong>[32:00] Tail latency and engineering for bursty workloads</strong></p><p>Why slow requests matter, and how S3 teams optimize for streaming, AI, and analytics use cases.</p><p><strong>[35:20] Iceberg, metadata, and table buckets</strong></p><p>The emergence of Apache Iceberg as a table format&#8212;and AWS&#8217;s new structured storage approach.</p><p><strong>[38:00] Why S3 added a REST catalog and compaction</strong></p><p>How AWS is simplifying the operational burden of working with Iceberg at scale.</p><p><strong>[40:00] A new mental model for object storage</strong></p><p>S3 is no longer just about storing files&#8212;it&#8217;s about managing semantics, lineage, and trust.</p><p><strong>[44:00] Looking ahead: S3, RAG, and semantic metadata</strong></p><p>How S3 is preparing for the next wave of AI, inference, and context-aware applications.</p><p><strong>[47:20] Is Iceberg ready for enterprise?</strong></p><p>Andy shares thoughts on enterprise readiness, performance tradeoffs, and real-world adoption of table formats.</p><p><strong>[49:05] Wrap-up and reflections</strong></p><p>Tristan and Andy reflect on the conversation and where data infrastructure is headed next.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[It is time to take agentic workflows for data work seriously]]></title><description><![CDATA[Mission: 0 to semantic layer in two hours]]></description><link>https://roundup.getdbt.com/p/it-is-time-to-take-agentic-workflows</link><guid isPermaLink="false">https://roundup.getdbt.com/p/it-is-time-to-take-agentic-workflows</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 29 Jun 2025 12:53:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/43df1d45-822b-4677-acfe-e16a64853ca1_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week, I cleared two hours on my calendar to do a deep dive into the current state of agentic development for data work.</p><p>Specifically, I gave myself a challenge - could I go from a never-before-seen dataset to a production-ready Semantic Layer using a combination of tools:</p><ul><li><p>An agentic coding CLI (I used <a href="https://www.anthropic.com/claude-code">Claude Code</a> for this experiment)</p></li><li><p>The <a href="https://github.com/dbt-labs/dbt-mcp">dbt MCP server</a></p></li><li><p>A terminal interface (in this case <a href="https://www.warp.dev/">Warp</a>)</p></li></ul><p>Before we go any further, if this is at all interesting to you, I suggest that instead of reading my findings here that you sit down and try this yourself. I'm quite confident you'll find it both illuminating and worth your time.</p><p>We'll get to my findings in a bit. 
Long story short - it was successful enough that it  shifted my thinking about the near-term trajectory of data work. </p><p>But first, let's talk about why experiments like this matter so much right now.</p><p><strong>Sensemaking in the age of AI</strong></p><p>You've probably been hearing some variant of these takes multiple times a day:</p><p><em>"An agent is just an LLM run in a loop"</em></p><p><em>"AI agents are coming to replace white collar work"</em></p><p><em>"I don't even know what an AI agent is, this is just marketing hype"</em></p><p>And about a billion more. All of these represent our collective attempts at sensemaking in this unique technological moment. But honestly, the noise can be so overwhelming that it's tempting to just tune it all out and wait for the dust to settle.</p><p>I don't think that's an option for data practitioners. Instead, we need to develop our own internal compass for sensemaking - and that means getting our hands dirty.</p><p>To do great data work is to be a great sensemaker. My theory of sensemaking requires holding two paradoxical skills in tension:</p><ul><li><p>Build strong mental models about the world and use them to take decisive action</p></li><li><p>Constantly scan for misalignments between your models and reality, then adjust accordingly</p></li></ul><p>Organizations and institutions need time to metabolize change and adjust their mental models. There's a physics to it. And <a href="https://roundup.getdbt.com/p/a-new-kind-of-weird">that physics takes time</a>.</p><p>But when the underlying reality is changing rapidly, the best thing you can do is go make direct contact with that reality. 
Don't wait for the consensus to form - go see for yourself.</p><p>Because things are not the same as they were even 6 months ago:</p><ul><li><p>We've gotten the first wave of models optimized for agentic work (OpenAI&#8217;s o3, Claude 4, and Gemini 2.5)</p></li><li><p>We've started building real infrastructure to connect these models to our systems (MCP and other emerging protocols)</p></li><li><p>LLM-based coding has shifted from autocomplete to actual agents (something <a href="https://roundup.getdbt.com/p/should-we-even-care-about-using-llms">longtime Roundup readers saw coming</a>)</p></li></ul><p>That&#8217;s a bunch of big changes! It can sometimes feel like keeping up with everything here is a full-time job. And with my last couple months being pretty tied up with <a href="https://docs.getdbt.com/blog/dbt-fusion-engine">other things</a>, I felt like I owed it to myself to set aside some time and go deep here.</p><p><strong>The experiment: Two hours from zero to Semantic Layer</strong></p><p>I chose the <a href="https://app.snowflake.com/marketplace/providers/GZSOZ1LLBU/Weather%20Source%2C%20LLC">Weather Source</a> dataset on the Snowflake marketplace precisely because it was both interesting and completely unfamiliar to me. I booted up Warp (with the dbt MCP server already configured - setting that up might add some time for you) and got started.</p><p>In two hours, I went from raw data to a <a href="https://github.com/dbt-labs/weather-climate-dbt/tree/main">working dbt project</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> with:</p><ul><li><p>Documented source definitions</p></li><li><p>Tested data models</p></li><li><p>A functional Semantic Layer with queryable metrics</p></li></ul><p>It felt incredible. A bit unbelievable. 
Of course this was just a simple project and nothing in here would be particularly difficult for an experienced analytics engineer - but it would have taken a whole lot of time and effort.</p><p>Some observations from the process:</p><p><strong>The experience was exhilarating.</strong> Watching an abstract goal decompose into concrete tasks, then seeing those tasks execute in real time feels like witnessing something totally new. It was also addictive - this interface has the &#8220;just one more level&#8221; feeling of a great video game.</p><p><strong>The cognitive load is different.</strong> It was cognitively demanding, but not in the same way that coding is cognitively demanding - I have a sense that I&#8217;d be able to sustain longer blocks of &#8220;pairing&#8221; with Claude Code before getting mentally depleted than I could with normal coding.</p><p><strong>The tools aren't optimized for data work yet.</strong> </p><ul><li><p>It first attempted to build out a bunch of models that depended on each other, but didn&#8217;t check if the first model actually <em>ran</em>. Then there was an error partway through its dependency chain and we had to do a bunch of unthreading.</p></li><li><p>It&#8217;s competent at writing SQL (and dbt-style SQL). I don&#8217;t expect this to be the bottleneck for AI-augmented development.</p></li><li><p>It is not very good at understanding what columns or models it has access to at a given time - I expect this to be an area where the models will be most useful when assisted by deterministic tooling.</p></li></ul><p><strong>What this proved (and didn't prove)</strong></p><p>This experiment convinced me that agentic workflows have moved beyond &#8220;pure speculation&#8221; and into &#8220;definitely worth exploring and net useful for many teams today&#8221;. It feels pretty similar to the earlish days of coding assistants like Copilot. 
Not yet for every team, but definitely for some, and on a steep acceleration curve.</p><p>This was just a simple experiment, and I walked away thinking just as much about what I don&#8217;t know as what I learned.</p><ul><li><p>I still don't know if my models are logically sound (validation would take as long as building)</p></li><li><p>Enterprise-scale datasets might break this approach entirely</p></li><li><p>The actual utility of what I built remains untested</p></li><li><p>And even with all of this, there are just as many organizational bottlenecks that data teams face as technical ones. What implications does this have there (if any)?</p></li></ul><p>But here's the thing: in two hours, I accomplished what would have taken me at least a full day manually - not just the modeling, but documentation, testing, and more. That is worth paying attention to.</p><p><strong>Your move</strong></p><p>When facing a question as vast as "How will AI reshape data work?", it's easy to get paralyzed. But the answer isn't in think pieces or Twitter debates - it's in running experiments.</p><p>My mental model shifted because I made contact with reality. Right now, data teams not using agentic workflows are doing just fine. But things are moving fast. It&#8217;s worth it, at the very least, to get a sense of what the state of the world is here and think about how you might adapt to it. </p><p>So here's my challenge: Block two hours next week. Pick a dataset you don't know. Try to build something real with these tools. Report back and let me know.</p><p>The future of data work is being written right now, in thousands of small experiments by practitioners who refuse to wait for the dust to settle. 
If you're reading this, you have the expertise to contribute to our collective sensemaking.</p><p>What will you discover when you stop reading about AI and start building with it?</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>There&#8217;s a lot that I&#8217;d improve here for a production project - making this public to show a checkpoint for where I got in a timeboxed experiment</p></div></div>]]></content:encoded></item><item><title><![CDATA[From Docker to Dagger (w/ Solomon Hykes)]]></title><description><![CDATA[The creator of Docker on how containers changed everything]]></description><link>https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</link><guid isPermaLink="false">https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 22 Jun 2025 13:00:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18dacb78-748c-463d-9553-ed6186da36e1_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is digging deep into the world of developer tools and databases. There are few more widely used developer tools than Docker. From its launch back in 2013, Docker has completely changed how developers ship applications. </p><p>In this episode, Tristan talks to Solomon Hykes, the founder and creator of <a href="https://www.docker.com/">Docker</a>. They trace Docker&#8217;s rise from startup obscurity to becoming foundational infrastructure in modern software development. Solomon explains the technical underpinnings of containerization, the pivotal shift from platform-as-a-service to open-source engine, and why Docker&#8217;s developer experience was so revolutionary. 
</p><p>The conversation also dives into his next venture <a href="https://dagger.io/">Dagger</a>, and how it aims to solve the messy, overlooked workflows of software delivery. Bonus: Solomon shares how AI agents are reshaping how CI/CD gets done and why the next revolution in DevOps might already be here.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><p><strong>Tristan Handy: I want to get you to give a little background on yourself, where you've been, what you've been up to for the last couple decades. 
I think many people will know you as the person who kicked off an avalanche that changed how we interact with compute environments by inventing Docker?</strong></p><p><strong>Solomon Hykes: </strong>Docker is the thing I'm known for. Pre-Docker, I grew up in France. I studied programming in a French school called EpiTech. It was a brand-new, unconventional school where you learned through nonstop programming, which I loved.</p><p>Eventually, I got exposed to startups, despite being a complete outsider. I met someone who told me about them, and it stuck in my mind. Still in France at the time, I moved into my mom's house in the suburbs of Paris and worked out of the basement.</p><p>By complete luck, I got into an early version of Y Combinator in 2010. That got us on the path to what would become Docker three years later. In 2013, we pivoted to Docker from our previous company, dotCloud.</p><p><strong>Tristan Handy: The original thing was called dotCloud, right?</strong></p><p><strong>Solomon Hykes: </strong>Yep. It was about container technology and its potential, but we didn't quite know how to take it to market. DotCloud was about deploying and hosting people's apps&#8212;platform as a service&#8212;competing with Heroku and many clones.</p><p><strong>Tristan Handy: When did Heroku become a thing?</strong></p><p><strong>Solomon Hykes: </strong>I became aware of it in 2009. Just as I was struggling in France with container tech. When we joined YC in 2010, we packaged that tech into dotCloud, our hosting platform. Our differentiator was using containers under the hood when others didn&#8217;t. That let us support many language stacks and even run databases in containers&#8212;which was unheard of at the time.</p><p>Platform as a service was a tough business. Most startups went out of business or got acquired early. 
Eventually, we pivoted from selling the car to building an ecosystem around the engine&#8212;that became Docker.</p><p><strong>Tristan Handy: Did you pivot because selling the car wasn't working? Or because people kept pointing at the engine saying, &#8220;Give me that&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Both. It was hard to market platforms. Developers expected free hosting, and hosting costs money. Margins were tight because of AWS. It always felt like pushing a boulder uphill. Meanwhile, people wanted to run things locally. There was no good ecosystem for that. Docker provided transparency, flexibility, and portability.</p><p><strong>Tristan Handy: Can you define Docker and containerization, and how it differs from virtualization?</strong></p><p><strong>Solomon Hykes: </strong>Sure. Virtualization splits a physical machine into virtual ones using VMs&#8212;each with its own memory, compute, and storage. It gives flexibility, but with overhead.</p><p>Containerization does something similar but at the operating system level. Instead of virtualizing the machine, you split the OS itself. It&#8217;s mostly done with Linux, which can subdivide itself into isolated units. Containers are more lightweight, letting you run hundreds or thousands, unlike VMs where you might manage a handful before hitting limits.</p><p>Docker didn&#8217;t invent this, but we solved new problems with it.</p><p><strong>Tristan Handy: I remember creating my first Docker container around 2015. I expected a slow boot-up like a VM, but it was instantaneous. Where is the OS in that setup?</strong></p><p><strong>Solomon Hykes: </strong>Great question. Docker relies on Linux. When you're on a Mac, it runs Linux behind the scenes&#8212;today via virtualization. Back then, we used lots of early, rough tools and kernel patches to make Linux containers work. 
Docker put all the pieces together in a coherent way.</p><p><strong>Tristan Handy: So containerization wasn&#8217;t new, but Docker made it accessible?</strong></p><p><strong>Solomon Hykes:</strong><br>Exactly. The Linux kernel had features like namespaces and cgroups&#8212;building blocks for containers. But they weren&#8217;t user-friendly. We made a developer-centric abstraction on top of those tools.</p><p>And Linux provided a massive compatibility layer. Unlike Java, which required writing your app in Java, Docker containers could wrap apps written in any language, as long as they ran on Linux.</p><p><strong>Tristan Handy: So Docker is like infrastructure as code&#8212;a primitive that enables the whole concept?</strong></p><p><strong>Solomon Hykes: </strong>Yes! And because we wanted ubiquity, we avoided pushing too many opinions. We let developers build on top of it in many different ways. That&#8217;s what helped Docker become a de facto standard.</p><p><strong>Tristan Handy: How fragmented is the Linux world under the hood? Did you have to do much abstraction work?</strong></p><p><strong>Solomon Hykes: </strong>We were lucky. The Linux kernel is extremely stable and consistent. But everything above it&#8212;distros, package managers, tooling&#8212;was chaotic. That chaos created the opportunity for Docker to provide a consistent experience.</p><p><strong>Tristan Handy: Were there any drawbacks? Like &#8220;Docker sprawl&#8221; the way VMware saw VM sprawl?</strong></p><p><strong>Solomon Hykes: </strong>Definitely. With power comes chaos. Teams would run dozens of Docker containers, each configured differently. Docker doesn&#8217;t enforce opinions&#8212;by design.</p><p><strong>Tristan Handy: And what happened when you left Docker in 2018?</strong></p><p><strong>Solomon Hykes: </strong>I took time off, became a full-time dad. But I also realized how many unsolved problems remained. 
Especially around CI/CD pipelines and software delivery&#8212;what we now call the software factory.</p><p>That led me to start Dagger.</p><p><strong>Tristan Handy: So Dagger is like &#8220;containers for pipelines&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Yes. Just as Docker standardized app deployment, Dagger aims to standardize and containerize software delivery. CI/CD pipelines today are often duct-taped together with YAML and bash scripts. We&#8217;re bringing consistency and modularity to that space.</p><p><strong>Tristan Handy: Will there be a &#8220;Daggerfile&#8221; like there&#8217;s a Dockerfile?</strong></p><p><strong>Solomon Hykes: </strong>Sort of. But this time, we&#8217;re opinionated. Dagger is narrowly focused on CI/CD. That lets us provide APIs, SDKs, and a deeper abstraction stack. We give platform engineers a DAG-based system to define repeatable, containerized steps.</p><p><strong>Tristan Handy: And what&#8217;s the role of AI and agents in all this?</strong></p><p><strong>Solomon Hykes: </strong>Great question. We didn&#8217;t plan for it, but our community showed us the way. People started building AI agents that run in Dagger pipelines&#8212;automating things like writing tests, submitting PRs, and optimizing builds.</p><p>That blew our minds. Agents blur the line between development and delivery. They need programmable environments. Dagger is becoming an ideal platform for that.</p><h2>Chapters</h2><p><strong>01:30 &#8211; Early Days: From France to dotCloud</strong></p><p>Solomon shares how his early programming experience and startup journey led to the creation of dotCloud.</p><p><strong>04:00 &#8211; The PaaS Struggle and Birth of Docker</strong></p><p>The team pivots from platform-as-a-service to focusing on the container engine itself&#8212;what would become Docker.</p><p><strong>07:00 &#8211; What Is a Container, Really?</strong></p><p>Solomon explains containerization vs. 
virtualization in plain terms and why it changed the game for developers.</p><p><strong>11:00 &#8211; The Developer Experience That Won the World</strong></p><p>The magic of fast, lightweight Docker containers&#8212;and how that first &#8220;wow&#8221; moment felt.</p><p><strong>14:00 &#8211; Building a Ubiquitous Standard</strong></p><p>Why Docker stayed narrow by design, resisting feature bloat to maximize compatibility.</p><p><strong>18:00 &#8211; DevOps Before DevOps</strong></p><p>How Docker avoided language tribalism and achieved mass developer adoption by choosing Go and CLI-first tooling.</p><p><strong>21:00 &#8211; Complexity and Container Sprawl</strong></p><p>Docker made infrastructure easy&#8212;but created new operational challenges at scale.</p><p><strong>24:30 &#8211; Why CI/CD Pipelines Are Still Broken</strong></p><p>Solomon outlines the gap Docker never got to fix: modern software delivery remains brittle and ad hoc.</p><p><strong>27:00 &#8211; Enter Dagger: DevOps for the Modern Age</strong></p><p>How Solomon&#8217;s new company is treating pipelines as composable software, not brittle scripts.</p><p><strong>30:00 &#8211; Building an OS for the Software Factory</strong></p><p>Dagger helps platform teams manage the complexity of software delivery with reusable, testable components.</p><p><strong>33:00 &#8211; Agent-Native Workflows: A Surprise Use Case</strong></p><p>AI agents begin using Dagger to reason about pipelines, generate code, and submit pull requests autonomously.</p><p><strong>37:00 &#8211; Reimagining the Dev Loop with AI</strong></p><p>Why the boundary between development and CI/CD is collapsing&#8212;and how Dagger fits the agent-powered future.</p><p><strong>41:00 &#8211; Scaling Trust in Delivery</strong></p><p>Tristan and Solomon reflect on how developer tooling evolves and what a stable, fast delivery layer enables.</p><p><strong>45:00 &#8211; Final Thoughts: What&#8217;s Next for DevOps</strong></p><p>The conversation closes 
with predictions on intelligent automation, composability, and the future of platform engineering.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The history and future of the data ecosystem (w/ Lonne Jaffe)]]></title><description><![CDATA[Mainframes, relational databases, ETL, Hadoop, the cloud, and all of it]]></description><link>https://roundup.getdbt.com/p/the-history-and-future-of-the-data</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-history-and-future-of-the-data</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Jun 2025 13:02:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a174d40-0d03-4fc7-a541-830573130b6e_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this decades-spanning episode, Tristan talks with Lonne Jaffe, Managing Director at Insight Partners and former CEO of Syncsort (now Precisely), to trace the history of the data ecosystem&#8212;from its mainframe origins to its AI-infused future.</p><p>Lonne reflects on the evolution of ETL, the unexpected staying power of legacy tech, and why AI may finally erode the switching costs that have long protected incumbents. The future of the AI and standards era is bright. 
</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Episode chapters</h2><p><strong>00:46 &#8211; Meet Lonne Jaffe: background &amp; career journey</strong></p><p>Lonne shares his career highlights from Insight Partners, Syncsort/Precisely, and IBM, including major acquisitions and tech focus areas.</p><p><strong>04:20 &#8211; The origins of Syncsort &amp; sorting in mainframes</strong></p><p>Discussion on why sorting was a critical early problem in hierarchical databases and how early systems like IMS worked.</p><p><strong>07:00 &#8211; M&amp;A as innovation strategy</strong></p><p>How Syncsort used inorganic growth to modernize its 
platform, including an early example of migrating data from IMS to DB2 without rewriting apps.</p><p><strong>09:35 &#8211; Technical vs. strategic experience</strong></p><p>Tristan probes Lonne&#8217;s technical depth despite his business titles; Lonne shares his background in programming and a fun fact about juggling.</p><p><strong>11:55 &#8211; Why this history matters</strong></p><p>Tristan sets up the key question: what lessons from 1970s-2000s ETL tooling still shape the modern data stack?</p><p><strong>13:00 &#8211; Proto-ETL: The real OGs</strong></p><p>Lonne traces the origins of ETL to 1970s CDC, JCL, and early IBM tools. Prism Solutions in 1988 gets credit as the first real ETL startup.</p><p><strong>15:40 &#8211; Rise of the ETL market (1990s)</strong></p><p>From Prism to Informatica and DataStage&#8212;early 90s vendors brought visual development to what was once COBOL-heavy backend work.</p><p><strong>18:00 &#8211; Why people offloaded Teradata to Hadoop</strong></p><p>Exploring how cost, contention, and capacity drove ETL out of the warehouse and into Hadoop in the 2000s.</p><p><strong>20:00 &#8211; Performance vs. 
price: Jevons Paradox in ETL</strong></p><p>Why lower compute and storage costs led to <em>more</em> ETL, not less&#8212;and how parallelization changed the game.</p><p><strong>22:30 &#8211; Evolution of data management suites</strong></p><p>How ETL expanded into app-to-app integration, catalogs, metadata management, and why these bundles got bloated.</p><p><strong>25:00 &#8211; Rise of data prep &amp; self-service analytics</strong></p><p>Tools like Kettle, Pentaho, and Tableau mirrored ETL for business users&#8212;spawning a whole &#8220;data prep&#8221; category.</p><p><strong>27:30 &#8211; Clickstream, logs &amp; big data chaos</strong></p><p>How clickstream and log data changed the ETL landscape, and the hope (and letdown) of zero-copy analytics.</p><p><strong>29:10 &#8211; Why is old software so sticky?</strong></p><p>Tristan and Lonne explore the economics of switching costs, the illusion of freedom, and whether GenAI could break the lock-in.</p><p><strong>33:30 &#8211; Are old tools actually&#8230; good?</strong></p><p>Defending mainframes and 30-year-old databases like Cache. Sometimes the mature option is better&#8212;just not sexy.</p><p><strong>36:00 &#8211; The new vs. the durable</strong></p><p>Modern tools must prove themselves against decades of reliability and robustness in finance, healthcare, and compliance.</p><p><strong>38:20 &#8211; GenAI in data: The early movers</strong></p><p>Lonne highlights why companies like Atlan and dbt Labs are in the best position to win&#8212;distribution, trust, and product maturity.</p><p><strong>41:00 &#8211; TAM and the Jevons Paradox, again</strong></p><p>Revisiting how price drops expand TAM. 
Some categories vanish, others explode&#8212;depending on elasticity of demand.</p><p><strong>43:15 &#8211; Unlocking new personas with LLMs</strong></p><p>Structured data access for non-technical users is finally viable, but &#8220;it has to be right&#8221;&#8212;trust and quality remain the barrier.</p><p><strong>46:00 &#8211; Real-world examples: dbt&#8217;s MCP server win</strong></p><p>Tristan shares how dbt&#8217;s Metadata API became a catalog replacement for a traditional financial institution&#8212;an unplanned AI GTM success.</p><p><strong>48:30 &#8211; Agents, not interfaces</strong></p><p>New pattern: LLMs as agents interacting directly with infrastructure via APIs. Tool use is becoming table stakes for AI integration.</p><p><strong>50:30 &#8211; Are LLMs birthright tools yet?</strong></p><p>Discussion around adoption of ChatGPT Enterprise, Claude, etc. Lonne suggests adoption is accelerating fast&#8212;and the usage model matters.</p><p><strong>52:00 &#8211; Looking ahead</strong></p><p>The conversation ends with a reflection on GenAI&#8217;s near future in data workflows, TAM expansion, and what the next episode might tackle.</p><div><hr></div><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: You've had a long career in tech. Maybe start by giving us the 30,000-foot view of what you've been up to over the last couple decades?</strong></p><p><strong>Lonne Jaffe:</strong> I&#8217;ve been at Insight Partners for about eight years now, working mostly on deep tech investments&#8212;AI infrastructure companies like Run AI and <a href="http://Deci.ai">deci.ai</a>, both acquired by Nvidia. I&#8217;ve also done work with data infrastructure companies like SingleStore. Before Insight, I was CEO of a portfolio company called Syncsort, now Precisely. It was founded in 1968.</p><p>Prior to that, I was at IBM for 13 years, working in middleware and mainframe technologies. 
Products like WebSphere, CICS, and TPF&#8212;foundational systems for enterprise computing.</p><p><strong>Tristan Handy: And Syncsort's origin was in sorting, right? Literally sorting files?</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. In the early days of computing, sorting was a huge part of what you did. Much of the data was hierarchical&#8212;stored in IMS&#8212;and had to be flattened into files to process. The algorithms were optimized to run in extremely resource-constrained environments.</p><p><strong>Tristan Handy: Fascinating. And I assume as compute and storage improved, the data integration landscape evolved?</strong></p><p><strong>Lonne Jaffe:</strong> Yes. We saw a move from hierarchical to relational databases, then toward ETL tools in the 80s and 90s. The first real ETL startup was probably Prism Solutions in 1988. Informatica and DataStage showed up in the early 90s, followed by Talend and others.</p><p><strong>Tristan Handy: It seems like we got a whole bundle of tools over time&#8212;ETL, CDC, app integration, metadata, and so on.</strong></p><p><strong>Lonne Jaffe:</strong> Yes, often bundled together, even though data prep and app integration were treated separately. That persisted for longer than you'd expect. At Syncsort, we acquired a company with a "transparency" solution that allowed IMS applications to use data stored in DB2 without rewriting code&#8212;a clever way to manage switching costs.</p><p><strong>Tristan Handy: Speaking of switching costs&#8212;why are these legacy tools so sticky?</strong></p><p><strong>Lonne Jaffe:</strong> Great question. In many cases, no customer loves the product. They&#8217;d switch in a heartbeat&#8212;if it were easy. But rewriting jobs and ensuring reliability is a heavy lift. The best outcome is a new system that replicates old functionality. 
And for many organizations, that&#8217;s not worth the risk.</p><p><strong>Tristan Handy: But if generative AI could reduce those switching costs?</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s the potential. Code generation, agents that explore and iterate&#8212;those could erode the moat that&#8217;s protected these incumbents for decades. Not tomorrow, but it&#8217;s a real possibility.</p><p><strong>Tristan Handy: It also seems like some of these systems are more robust than people give them credit for.</strong></p><p><strong>Lonne Jaffe:</strong> Absolutely. Mainframes are IO supercomputers. Products like InterSystems Cache, used by Epic, are incredibly performant. But new systems must match or exceed those capabilities in reliability and scale, which is a high bar.</p><p><strong>Tristan Handy: As you look at the evolution of the modern data stack, how do you think about its impact on the market?</strong></p><p><strong>Lonne Jaffe:</strong> In the 2010s, we saw disaggregation&#8212;tools like Fivetran, dbt, and Snowflake each tackled a slice of the old enterprise bundle. But the TAM isn&#8217;t infinite. Some categories may compress or vanish entirely if price drops aren&#8217;t offset by new demand.</p><p><strong>Tristan Handy: Do you think AI expands or compresses the data stack?</strong></p><p><strong>Lonne Jaffe:</strong> It depends. High elasticity of demand&#8212;like with dashboards or analytics&#8212;can drive massive TAM expansion. But some categories, like logo redesign or simple data movement, might get commoditized. For more complex workflows, AI agents accessing platforms like dbt or Atlan could dramatically increase value by automating common tasks and enabling new personas.</p><p><strong>Tristan Handy: We&#8217;ve seen an example already&#8212;a customer replaced their data catalog with our dbt Cloud metadata server and AI interface.</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s telling. 
If AI interfaces can connect to tools like dbt and generate value&#8212;self-service, documentation, lineage&#8212;it changes the game. Especially for organizations already standardized on those platforms.</p><p><strong>Tristan Handy: What&#8217;s your view on how these AI interfaces get distributed?</strong></p><p><strong>Lonne Jaffe:</strong> ChatGPT Enterprise, Claude, and others are spreading fast. Eventually, you&#8217;ll want those tools to search files, access internal metadata, and interact with your data stack&#8212;not just answer questions from the open web.</p><p><strong>Tristan Handy: It makes a lot of sense. If AI is going to serve enterprise users, it needs access to the real data. Otherwise, it&#8217;s just a toy.</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. A model that can&#8217;t query or verify against your actual environment won&#8217;t be reliable. And data quality and observability&#8212;something dbt Cloud is already good at&#8212;become foundational.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Everything terminals (w/ Zach Lloyd)]]></title><description><![CDATA[The universal integration layer...the command line? 
Tristan talks terminals with Zach Lloyd, the founder of Warp]]></description><link>https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</link><guid isPermaLink="false">https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 May 2025 13:01:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d276a840-287d-4ae3-882b-42115f46cfc5_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this episode, Tristan talks with Zach Lloyd, founder of <a href="https://www.warp.dev/">Warp</a>&#8212;a terminal built for the modern era, including for AI agents. They explore the history of terminals, differences between terminals and shells, and what the future might look like. In a world driven by generative AI, the terminal could once again be the control center of computer usage.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p><strong>01:00 &#8211; Introducing Warp and Zach Lloyd</strong></p><ul><li><p>Zach Lloyd explains Warp's origin, mission, and initial vision.</p></li></ul></li><li><p><strong>02:40 &#8211; Why redesign the terminal?</strong></p><ul><li><p>Zach describes why traditional terminal UX was ripe for reinvention.</p></li></ul></li><li><p><strong>04:43 &#8211; Enter LLMs: A new direction for Warp</strong></p><ul><li><p>Warp evolves into a natural language interface for developer workflows.</p></li></ul></li><li><p><strong>06:34 &#8211; What is a shell?</strong></p><ul><li><p>Zach defines shells, how they process 
text, and their role in the CLI ecosystem.</p></li></ul></li><li><p><strong>07:58 &#8211; Shells vs programs vs built-ins</strong></p><ul><li><p>Distinguishing between shell commands and standalone programs.</p></li></ul></li><li><p><strong>10:00 &#8211; Why do developers debate shells?</strong></p><ul><li><p>Features, syntax, and licensing behind the Bash vs Z Shell discussion.</p></li></ul></li><li><p><strong>12:17 &#8211; Why terminals still matter</strong></p><ul><li><p>The enduring power of text-based computing and scripting.</p></li></ul></li><li><p><strong>16:40 &#8211; What is a terminal, really?</strong></p><ul><li><p>Clarifying the difference between terminal hardware, emulators, and modern terminal apps.</p></li></ul></li><li><p><strong>20:13 &#8211; The Warp interface</strong></p><ul><li><p>Zach breaks down Warp&#8217;s UI: input editor, output blocks, and mouse support.</p></li></ul></li><li><p><strong>22:48 &#8211; Will Warp replace your IDE?</strong></p><ul><li><p>The vision of AI-driven development and the convergence of terminal, editor, and chat.</p></li></ul></li><li><p><strong>27:20 &#8211; Rethinking development interfaces</strong></p><ul><li><p>Finding the ideal hub for AI-native software development.</p></li></ul></li><li><p><strong>35:00 &#8211; Why the terminal has an edge</strong></p><ul><li><p>Advantages of the terminal for cross-project, full-lifecycle developer tasks.</p></li></ul></li><li><p><strong>37:10 &#8211; Bottom-up adoption strategy</strong></p><ul><li><p>How Warp approaches growth: focus on individual developers, not top-down mandates.</p></li></ul></li><li><p><strong>39:50 &#8211; Is Warp redefining the terminal?</strong></p><ul><li><p>The challenges of innovating in a legacy-dominated space and creating a new category.</p></li></ul></li><li><p><strong>42:45 &#8211; Developer control &amp; context in Warp</strong></p><ul><li><p>Customization, context-awareness, and MCP integration in Warp&#8217;s AI 
tooling.</p></li></ul></li><li><p><strong>46:32 &#8211; Closing reflections</strong></p><ul><li><p>Zach and Tristan wrap up their thoughts on the future of terminals, AI, and developer tools.</p></li></ul></li></ul><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: Can you tell us about Warp, where the idea came from, and where you&#8217;re at today?</strong></p><p><strong>Zach Lloyd:</strong> Warp reimagines the command line to make it more approachable, powerful, and useful for developers. I've been a software engineer for over 20 years and always used the terminal, but never understood why it worked the way it did. I used to learn the minimum I needed and rely on team members when I ran into issues.</p><p>After my last startup, I looked at tools I used frequently that could have a big impact if improved. The terminal stood out. I realized better UX&#8212;like being able to use a mouse to position the cursor or select output for copy-paste&#8212;could unlock a lot of productivity. That was the initial idea about five years ago.</p><p>We spent the first couple of years redesigning the interface. Today, Warp is more than a terminal&#8212;it's a natural language interface to the command line, powered by large language models (LLMs). You can use it to set up projects, write code, debug production, and more.</p><p><strong>Tristan: I want to dig into fundamentals. Can you define what a shell is?</strong></p><p><strong>Zach:</strong> A shell is a program that parses text input, runs commands, and returns text output. You can run it interactively or through scripts. Terminals, by contrast, are the graphical layer that displays text and captures keyboard input. Shells like Bash, Z Shell, and Fish offer different features, syntaxes, and configurations. 
Some commands, like <code>cd</code>, are shell built-ins, which don&#8217;t require forking new processes.</p><p><strong>Tristan: Why do terminals persist in a GUI-dominated world?</strong></p><p><strong>Zach:</strong> A few reasons. First, it&#8217;s easier to write command-line apps than GUI apps. Second, the interface is infinitely flexible&#8212;you can pass endless flags and parameters. Third, command-line programs interoperate cleanly via text streams. And lastly, they&#8217;re scriptable. Developers can automate repetitive workflows easily, which is powerful.</p><p><strong>Tristan: So a terminal just runs a shell. But I never think of terminals as having features. What makes a terminal more than a simple interface?</strong></p><p><strong>Zach:</strong> Terminals emulate old hardware&#8212;keyboards and text displays. Today&#8217;s terminal apps are GUI shells that simulate this behavior. Most are "dumb terminals," just rendering characters. But they can support features like theming, control characters for advanced UI (e.g., in Vim), and even bitmap rendering.</p><p><strong>Tristan: Warp looks very different. Can you describe it?</strong></p><p><strong>Zach:</strong> Warp looks more like a chat or notebook interface. Each command's output is grouped in a logical block instead of being dumped in a scroll. The input area behaves more like a code editor, with syntax highlighting and first-class mouse support. We're aiming for modern UX.</p><p><strong>Tristan: So you're blending terminal, editor, and chat. Will people eventually write all their code in Warp?</strong></p><p><strong>Zach:</strong> My vision is that developers will increasingly describe what they want in natural language, and agents will do the work. Developers supervise the results. That interface needs to support managing many tasks at once. That&#8217;s what we&#8217;re building towards. 
It won&#8217;t even be called a terminal&#8212;it&#8217;s a new category of software.</p><p><strong>Tristan: The boundaries between these tools are blurring. And maybe the best interface for AI-assisted development isn't an IDE or chat app&#8212;it could be the terminal.</strong></p><p><strong>Zach:</strong> The terminal spans all phases of development&#8212;from setup to deployment and debugging. It also supports cross-project work, which IDEs don&#8217;t. That&#8217;s a huge strength.</p><p><strong>Tristan: But terminals are a personal choice. How do you think about adoption and your business model?</strong></p><p><strong>Zach:</strong> Like editors, terminals are developer-choice tools. We don&#8217;t go top-down. Our motion is bottoms-up: get individuals to love Warp, then expand into teams and enterprises for security, privacy, and data controls.</p><p><strong>Tristan: Are you trying to reset the baseline for what a terminal is?</strong></p><p><strong>Zach:</strong> We're not open source, though we&#8217;ve considered it. It&#8217;s risky. But our focus isn&#8217;t on redefining "the terminal." It&#8217;s on building the best tool for developers to ship software. That might require a new category name.</p><p><strong>Tristan: What&#8217;s the dev experience in Warp like? Is it customizable?</strong></p><p><strong>Zach:</strong> We support theming and shortcuts. But the most important part is AI context. Warp can use any CLI tool to gather context&#8212;GitHub CLI, GCloud, etc. We&#8217;re also implementing the Model Context Protocol (MCP) and plan to better support custom/internal tools as well.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why compilers matter (w/ Lukas Schulte)]]></title><description><![CDATA[We continue our season on developer experience by looking at compilers with the SDF Labs cofounder.]]></description><link>https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</link><guid isPermaLink="false">https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 12 May 2025 12:02:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fab3c2ea-0b19-4b35-a887-c779cff0e8d3_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tristan Handy dives deep into the world of compilers in this episode of The Analytics Engineering Podcast with Lukas Schulte, cofounder of SDF Labs (not to be confused with <a href="https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram">last episode&#8217;s guest&#8212;Lukas&#8217; dad and fellow SDF cofounder Wolfram Schulte</a>). Tristan and Lukas discuss what compilers are, how they work, and what they mean for the data ecosystem. 
SDF, which was <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">recently acquired by dbt Labs</a>, builds a world-class SQL compiler aimed at abstracting away the complexity of warehouse-specific SQL.</p><p>The conversation covers the evolution of compiler technology, what software engineering has gotten right over the past several decades, and <a href="https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering">why the data ecosystem is poised for similar transformation</a>. Lukas and Tristan explore why SQL has lagged behind other programming ecosystems, and how new compiler infrastructure could lead to package management, interoperability, and greater innovation across data platforms. It&#8217;s a fascinating (and timely) episode: <a href="https://www.getdbt.com/blog/how-to-get-ready-for-the-new-dbt-engine">Get ready for the new dbt engine</a>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong>2025 dbt Launch Showcase</strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>02:40 The vision behind SDF Labs</p></li><li><p>04:00 What is a compiler?</p></li><li><p>05:00 Components of a compiler: frontend, IR, backend</p></li><li><p>08:00 Syntax vs. semantics and the role of parsing</p></li><li><p>10:00 Logical vs. 
physical plans in SQL compilers</p></li><li><p>13:00 Historical context: mainframes to LLVM</p></li><li><p>16:00 Cross-architecture portability in Rust &amp; other compilers</p></li><li><p>18:00 What is LLVM and why it matters</p></li><li><p>20:00 Bootstrapping and the self-recursive nature of compilers</p></li><li><p>21:00 Compilers in Java, TypeScript, and dbt</p></li><li><p>23:00 Why compilers are foundational to software ecosystems</p></li><li><p>26:00 The SQL dialect problem in data warehouses</p></li><li><p>29:00 Can SQL get its own LLVM?</p></li><li><p>31:00 How Substrate and DataFusion aim to standardize SQL</p></li><li><p>35:00 Package management and the path toward SQL abstractions</p></li><li><p>38:00 The future of the data ecosystem with a common SQL compiler</p></li></ul><h2>Key takeaways from this episode</h2><h3>What is a compiler?</h3><p><strong>Tristan Handy:</strong> What is a compiler?</p><p><strong>Lukas Schulte:</strong> It's something that takes higher-level human-readable code and translates, compiles, rewrites it into lower-level machine code that is much harder for humans to understand and much easier for machines to understand.</p><p>Compilers typically have phases. They have a frontend that deals with the language you're working with, a middle component&#8212;usually called an IR or intermediate representation&#8212;and a backend that takes that IR and compiles it into machine code.</p><h3>Compiler phases: frontend, IR, backend</h3><p><strong>Tristan Handy:</strong> How does it all come together?</p><p><strong>Lukas Schulte:</strong> There&#8217;s a preprocessor that handles macros, removes comments, and prepares the text. Then a lexer converts it into tokens. These tokens get assembled into a tree that the compiler can understand. That&#8217;s where syntax validation and semantic analysis happen.</p><p>From there, we build a logical representation of the operations we want to perform. 
That transitions to a physical plan, which starts considering the hardware: how many cores, how much memory, which files we&#8217;re accessing. After that, optimizations are applied and it compiles to actual machine code using a toolchain like LLVM.</p><h3>Syntax vs. semantics</h3><p><strong>Lukas Schulte:</strong> Let&#8217;s break down syntax vs. semantics.</p><p>Imagine the code <code>x = x + 1</code>. That has valid syntax. Its meaning&#8212;its semantics&#8212;is that we&#8217;re incrementing <code>x</code> by 1.</p><p>Now, you could also write <code>x += 1</code>. Different syntax, same semantics. So syntax defines structure, and semantics define meaning. That distinction is important when you&#8217;re analyzing or transforming code.</p><h3>LLVM and portability</h3><p><strong>Tristan Handy:</strong> Have we been building abstraction layers like this for decades?</p><p><strong>Lukas Schulte:</strong> Absolutely. That&#8217;s what LLVM does. It provides a consistent intermediate representation that compilers can use to target multiple backends&#8212;Intel, ARM, different OSes. Apple invested early in LLVM to support custom chips.</p><p>With Rust, for example, LLVM is what lets us build binaries that behave the same on macOS, Windows, and Linux with relatively little effort.</p><h3>Bootstrapping compilers</h3><p><strong>Tristan Handy:</strong> So there&#8217;s this recursive loop&#8212;compilers being built with other compilers?</p><p><strong>Lukas Schulte:</strong> Exactly. Rust wasn&#8217;t always written in Rust&#8212;its first compiler was written in OCaml. Eventually, the compiler was rewritten in Rust itself. Now, Rust compiles Rust. It&#8217;s fully self-hosted. That&#8217;s common with mature languages&#8212;it shows the compiler ecosystem is stable and powerful enough to sustain itself.</p><h3>Why compilers matter</h3><p><strong>Tristan Handy:</strong> You said once that compilers are the foundation of every software ecosystem. 
What did you mean?</p><p><strong>Lukas Schulte:</strong> There are two big drivers in software: abstractions and standards. You want one way to interface with a USB device&#8212;not ten. Same for software. You want one standard way to express a Python program, a JavaScript app, etc.</p><p>Compilers enforce those standards and make sure the same code works across platforms. That consistency powers things like package managers, shared libraries, and open ecosystems.</p><h3>SQL dialects and fragmentation</h3><p><strong>Tristan Handy:</strong> Are there ecosystems that are doing worse than others?</p><p><strong>Lukas Schulte:</strong> SQL does a particularly bad job. Anyone who's used more than one data warehouse knows you can't take the same SQL statement and expect it to work the same way. Casting, case sensitivity, functions&#8212;every engine handles these things differently.</p><h3>Toward a universal SQL compiler</h3><p><strong>Tristan Handy:</strong> Can you convince me this problem is solvable?</p><p><strong>Lukas Schulte:</strong> Yes. That's what we're working on with SDF&#8212;creating a shared intermediate representation for SQL. If we can express SQL logic in a unified form, we can compile it to any dialect&#8212;BigQuery, Snowflake, Redshift, and so on.</p><p>That allows developers to build reusable libraries, just like in other languages. It also makes governance, validation, and testing easier.</p><h3>Future of data ecosystems</h3><p><strong>Tristan Handy:</strong> What would that future look like for practitioners?</p><p><strong>Lukas Schulte:</strong> One major change would be the emergence of robust SQL libraries. Today, there&#8217;s no <code>import</code> system for SQL. 
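There is no import system, and every engine speaks its own dialect. To make that fragmentation concrete, here is a toy sketch of compiling one engine-neutral cast into two dialects. The mini lookup table, logical type names, and function are invented for illustration&#8212;this is not SDF's actual IR or API:

```python
# Toy sketch of the dialect problem: one logical operation, rendered
# differently per engine. The mapping below is an illustrative assumption.
TYPE_NAMES = {
    "snowflake": {"string": "VARCHAR", "float64": "DOUBLE"},
    "bigquery":  {"string": "STRING",  "float64": "FLOAT64"},
}

def render_cast(dialect: str, column: str, logical_type: str) -> str:
    """Compile an engine-neutral cast into a dialect-specific SQL fragment."""
    return f"CAST({column} AS {TYPE_NAMES[dialect][logical_type]})"

# Same semantics, two different surface syntaxes:
print(render_cast("snowflake", "order_id", "string"))  # CAST(order_id AS VARCHAR)
print(render_cast("bigquery", "order_id", "string"))   # CAST(order_id AS STRING)
```

A shared intermediate representation generalizes this idea: express the logic once, then let per-dialect backends handle casting rules, case sensitivity, and function names.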
Everyone writes similar logic over and over.</p><p>A shared compiler abstraction would let us reuse components, collaborate across companies, and build an ecosystem of packages for transformations, metrics, and validations&#8212;similar to how we use NPM or PyPI.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The evolution of databases (w/ Wolfram Schulte)]]></title><description><![CDATA[In the first episode of our season on developer experience, the cofounder and CTO of SDF Labs, now a part of dbt Labs, discusses databases, compilers, and dev tools.]]></description><link>https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 28 Apr 2025 12:02:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e4839270-d17c-40d0-94d3-06ac3a969b0f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Summary</h3><p>Welcome to our new season of The Analytics Engineering Podcast. This season, we&#8217;re focusing on developer experience. We&#8217;ll explore the developer experience by tracing the lineage of foundational software tools, platforms, and frameworks. From compilers to modern cloud infrastructure and data systems, we&#8217;ll unpack how each layer of the stack shapes the way developers build, collaborate, and innovate today. It&#8217;s a theme that lends itself to a lot of great conversations on where we&#8217;ve come from and where we&#8217;re headed.</p><p>In our first episode of the season, Tristan talks with Wolfram Schulte. Wolfram is a distinguished engineer at dbt Labs. 
He joined the company via the <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">acquisition of SDF Labs</a>, where he was <a href="https://www.getdbt.com/blog/building-the-next-gen-dbt-engine">co-founder and CTO</a>. He spent close to two decades at Microsoft Research and several years at Meta building their data platform.</p><p>One of the amazing things about Wolfram is his love of teaching others the things that he's passionate about. In this episode, he discusses the internal workings of data systems. He and Tristan talk about <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">SQL parsers</a>, <a href="https://roundup.getdbt.com/p/the-power-of-a-plan-how-logical-plans">compilers</a>, <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">execution engines</a>, <a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle">composability</a>, and the world of heterogeneous compute that we're all headed towards. While some of this might seem a little sci-fi, it&#8217;s likely right around the corner. And Wolfram is inventing some of the tech that's going to get us there.</p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics.
We'll see you there.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase&quot;,&quot;text&quot;:&quot;Register now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase"><span>Register now</span></a></p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>01:35 Introduction to dbt Labs and SDF Labs collaboration </p></li><li><p>04:42 Wolfram's journey from monastery to tech 
innovator </p></li><li><p>07:55 The role of compilers in database technology </p></li><li><p>11:05 Building efficient engineering systems at Microsoft </p></li><li><p>14:13 Navigating data complexity at Facebook </p></li><li><p>18:51 Understanding database components and their importance </p></li><li><p>24:44 The shift from row-based to column-based storage </p></li><li><p>27:40 Emergence of modular databases </p></li><li><p>28:44 The rise of multimodal databases </p></li><li><p>30:45 The role of standards in data management </p></li><li><p>35:04 Balancing optimization and interoperability </p></li><li><p>36:38 Conceptual buckets for database engines </p></li><li><p>38:46 DataFusion compared to DuckDB</p></li><li><p>40:44 ClickHouse </p></li><li><p>44:20 Bridging the gap between SQL and new technologies </p></li><li><p>50:55 The future of developer experience</p></li></ul><h2>Key takeaways from this episode</h2><h3>From monastery to Microsoft: Wolfram&#8217;s journey</h3><p><strong>Tristan Handy: Can you walk us through the Wolfram Schulte origin story?</strong></p><p><strong>Wolfram Schulte: </strong>I was born in rural Germany&#8212;Sauerland&#8212;and ended up in a monastery boarding school after my father passed away. Their goal was to train monks and priests, but that didn&#8217;t stick for me.</p><p>Later I went to Berlin&#8212;back then you had to cross East Germany to get there&#8212;and began studying physics. But I realized everyone else understood physics better than I did! One day I walked past a lecture on data structures and algorithms, and I was hooked.
I hadn&#8217;t written a line of code at that point, but I switched to computer science immediately.</p><p>After my PhD in compiler construction, I joined a startup, then landed at Microsoft Research in 1999 thanks to a chance encounter with the logician Yuri Gurevich.</p><h3>Inside Microsoft Research and Cloud Build</h3><p>At Microsoft Research, we were like Switzerland&#8212;neutral across teams like Office, Windows, and Bing. We&#8217;d invent tools and ideas, but often the business units didn&#8217;t trust them. That changed when I was asked to build an engineering org.</p><p>We created <strong>Cloud Build</strong>, a distributed build system like Google&#8217;s Bazel. It reduced build times from hours to minutes and had a huge impact on iteration speed, productivity, and even morale. People stayed in flow. Builds were faster, cheaper, and smarter&#8212;running mostly on spare capacity.</p><h3>Janitorial work at Meta: cleaning up big data</h3><p><strong>You later joined Facebook (Meta). What was that like?</strong></p><p>A different world. No titles for engineers. Egalitarian, fast-moving. I joined to clean up the data warehouse&#8212;what they called &#8220;janitorial work.&#8221; At Meta, each type of workload had its own engine: time-series, batch, streaming, etc. This made understanding lineage and dependencies across systems extremely hard.</p><p>We responded by building UPM, a SQL pre-processor that stitched metadata across engines. It became part of Meta&#8217;s privacy infrastructure and compliance tooling, especially after the fallout from Cambridge Analytica.</p><h3>Databases as compilers</h3><p><strong>Let&#8217;s shift gears. Can you walk us through how analytical databases actually work&#8212;like a professor at a whiteboard?</strong></p><p>Sure. Think of a database like a compiler:</p><ol><li><p><strong>Parsing &amp; analysis:</strong> Is the SQL valid? 
Are the types correct?</p></li><li><p><strong>Optimization:</strong> SQL is declarative, so you can reorder joins, push down filters&#8212;based on algebraic laws like associativity.</p></li><li><p><strong>Execution:</strong> Often done in parallel, especially in modern warehouses.</p></li><li><p><strong>Storage:</strong> Columnar vs. row-based; optimized formats like Parquet or ClickHouse&#8217;s custom format.</p></li></ol><p>Historically, storage and compute were bundled. Now they&#8217;re decoupled. But when the engine understands the format deeply, performance is much better.</p><h3>The rise of modular and composable data platforms</h3><p><strong>How did we get from monolithic systems to the composable database architectures we have today?</strong></p><p>It started with the rise of big data&#8212;Hadoop, HDFS, MapReduce. That decoupled compute from storage. Columnar formats like Parquet enabled analytical workloads. Then came Iceberg, Delta Lake, and similar standards that enabled multiple engines to share data.</p><p>Modern databases are modular. For example, Postgres is transactional, but you can bolt on an OLAP engine for analytical queries. You can mix and match based on your workload. The result is a data ecosystem that&#8217;s far more flexible&#8212;but also more complex.</p><h3>Engine families: Snowflake, DuckDB, ClickHouse</h3><p><strong>Can you help us bucket the different kinds of engines out there?</strong></p><p>Totally. Here are three buckets:</p><ul><li><p><strong>Cloud-native engines:</strong> Snowflake, BigQuery. They&#8217;re optimized for massive scale, often with their own proprietary storage.</p></li><li><p><strong>Embedded/single-node engines:</strong> DuckDB, DataFusion. Great for local dev or embedded analytics. DuckDB is for users; DataFusion is for database builders.</p></li><li><p><strong>Real-time/high-throughput engines:</strong> ClickHouse, Druid. 
Tuned for streaming and extremely fast aggregations.</p></li></ul><p>Each has its trade-offs. Increasingly, projects are combining these. For example, you can plug DuckDB or DataFusion into Spark to speed up leaf-node execution. The whole engine space is getting more composable&#8212;and more interchangeable.</p><h3>The role of SDF in dbt&#8217;s future</h3><p><strong>If you think about the future where SDF is fully integrated into dbt Cloud, what does that enable?</strong></p><p>Initially, it might feel the same&#8212;but faster, smarter. Longer-term, we can give developers superpowers.</p><p>Imagine your dev environment proactively surfaces:</p><ul><li><p>&#8220;This data looks different than yesterday&#8212;want to investigate?&#8221;</p></li><li><p>&#8220;You&#8217;re missing a metric that&#8217;s often used alongside this one.&#8221;</p></li><li><p>&#8220;This join will behave differently on engine X&#8212;here&#8217;s what to change.&#8221;</p></li></ul><p>That&#8217;s the kind of intelligent, predictive developer experience we&#8217;re building. We&#8217;re catching SQL up to what IDEs have done for code. And if we can make logical plans portable across engines, dbt becomes the consistent interface across heterogeneous compute.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A New Kind of Weird]]></title><description><![CDATA[Reflections on Data Council 2025]]></description><link>https://roundup.getdbt.com/p/a-new-kind-of-weird</link><guid isPermaLink="false">https://roundup.getdbt.com/p/a-new-kind-of-weird</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 27 Apr 2025 11:52:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7b3c7b00-7933-40bf-8f72-d7af7c575fd0_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I did something wrong.</p><p>I try really hard and go into every conference with an open mind about what I&#8217;m going to learn. <em>Tabula rasa. Blank Slate. Beginner&#8217;s Mind.</em> This is actually a really important part of being able to continually grow and develop your analysis of the industry rather than getting stuck in familiar mental grooves.</p><p>But for this year&#8217;s Data Council, I have to admit I went in with a preconceived take on the newsletter I wanted to be sending out today.</p><p><em>&#8220;I&#8217;ve been to a whole lot of data conferences that talk about the intersection of data and generative AI&#8221;</em>, I&#8217;d write triumphantly, <em>&#8220;but this was the first one I&#8217;ve been to where data and AI felt <strong>truly</strong> integrated, where the worlds <strong>finally</strong> converged</em>&#8221;.</p><p>And you know what? It was true.
You couldn&#8217;t throw a stone in the convention hall without hitting a booth for AI-assisted data development or for using your data in agent systems.</p><p>GenAI applications, after all, aren&#8217;t just running on models trained on massive datasets built and maintained with many of the tools and open source libraries created by the people and organizations at Data Council. Their usage and utility also depend on strong infrastructure, as Martin has told us.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aKSl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aKSl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 424w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 848w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:322377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aKSl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 424w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 848w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We saw a lot of very cool data + AI infrastructure at Data Council!</p><ul><li><p><a href="https://www.bauplanlabs.com/">Bauplan</a>, fresh off their recent fundraise, walked us through the minimum viable data platform</p></li><li><p>The Snowflake booth showed how <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-agents">Cortex Agents</a> can sit in your database and perform useful work</p></li><li><p>Lloyd Tabb gave a great walkthrough of <a href="https://www.malloydata.dev/">Malloy</a> and repeatedly emphasized the benefits of writing LLM-based
analytics queries with a Semantic Layer as opposed to going straight to SQL</p></li><li><p>Jacob ran a session on <a href="https://x.com/matsonj/status/1898504109193613667">vibe-coding your data engineering workflows</a></p></li><li><p>MCP was the talk of the town, with notable MCP servers being discussed by <a href="https://clickhouse.com/blog/agenthouse-demo-clickhouse-llm-mcp">ClickHouse</a>, <a href="https://github.com/motherduckdb/mcp-server-motherduck">MotherDuck</a> and <a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server">yours truly</a>.</p></li></ul><p>And then of course we had <a href="https://www.linkedin.com/in/eliasdefaria/">Elias</a> discussing SDF + dbt and walking through a new bit of data infrastructure that I believe is going to play a significant role in the story of how data + Gen AI fit together: the development of the <a href="https://www.getdbt.com/blog/building-the-next-gen-dbt-engine">new dbt engine</a> - Rust-based, type-aware, and ready to validate that your SQL queries are dialect-accurate and governed, whether they are written by a human or a machine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SHnt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SHnt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 424w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 848w,
https://substackcdn.com/image/fetch/$s_!SHnt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1272w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png" width="3024" height="1816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1816,&quot;width&quot;:3024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8022998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6e56ae-0e62-4e20-b69e-9cf799f6733f_3024x4032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SHnt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 424w, 
https://substackcdn.com/image/fetch/$s_!SHnt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 848w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1272w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So in a certain sense, I <em>am</em> walking away from this Data Council feeling like the worlds of generative AI and traditional data infra are closer together than ever.</p><p>But in another, deeper sense, I&#8217;m not.</p><h3>A familiar kind of weird and a new kind of weird</h3><p>Three years ago, in his reflections on Data Council, Drew had one request: &#8220;<a href="https://roundup.getdbt.com/p/keep-data-council-weird">Keep Data Council Weird</a>&#8221;. At the time, we were wondering if the ecosystem was becoming too vendor+VC driven and hoping that we&#8217;d still maintain our spunky outsider energy.</p><p>Well, I have to be honest with you, this Data Council felt pretty darn weird.</p><p>Partly, it felt weird in a familiar way. I asked Drew if this year felt weird and here&#8217;s what he told me:</p><blockquote><p>The venue - a masonic temple - was gorgeous and unlike any conference venue I&#8217;ve been to before. My legs hurt from walking up and down 4 flights of carpeted stairs. I watched Elias&#8217;s talk from a parapet (is that even the word?)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> in a column adorned theater. I think I saw a crucifix. The bathrooms had couches in them. Scott B and I talked about our skincare routines. I saw a lot of old friends and former coworkers. I befriended [redacted]. My beef with [redacted] grew even deeper. I had a <a href="https://trueburgeroakland.com/">top 3 all-time cheeseburger</a> and a bottom 3 all-time dessert (Mango Piggy). Pete and the whole Data Council team put on one hell of an event this year!</p></blockquote><p>If you&#8217;ve been around the block enough times, this is a familiar kind of weirdness. 
Comforting.</p><p>It also felt weird in a different way though:</p><p>Because fundamentally, even though data infra + AI are moving ever closer together, there are <em>big</em> differences in how each side moves and progresses.</p><p>The reason boils down to this:</p><p><strong>Data Infra is heavily engineered, based on building well-understood systems and standards</strong>.</p><p>It <em>moves</em> at the speed of ecosystems and standards. Three years ago at Data Council I&#8217;m sure there were people talking about Apache Iceberg and wondering whether it would become adopted across the industry. We&#8217;re big believers in Iceberg at dbt Labs and I expect to see strong and meaningful adoption of Iceberg over the next three years. I think an 80th percentile good outcome for Iceberg adoption looks like a world where organizations are not meaningfully constrained by their choice of data platform and are able to use Iceberg to avoid vendor lock-in and have true cross-platform control of how they operate on their data.</p><p><strong>Generative AI is built differently, and it moves at a different speed.</strong></p><p>The folks at Anthropic like to say that LLMs are <a href="https://www.youtube.com/watch?v=TxhhMTOTMDg">grown, not built.</a> Three years ago when Drew said that we should keep Data Council Weird, we were about 9 months out from the release of ChatGPT, and a year away from GPT-4. </p><p>Since then, the price of a query to GPT-4 has fallen by somewhere around 100x. OpenAI is <a href="https://www.theinformation.com/articles/openai-forecasts-revenue-topping-125-billion-2029-agents-new-products-gain">projecting $125 billion in revenue by 2029</a>. The latest paradigm shift, reasoning models, are around six months old. 
</p><p>I don&#8217;t know what an 80th percentile &#8220;good&#8221; (meaning fast) outcome looks like here, but there are people a lot closer to this than I am who are saying we&#8217;re going to be <a href="https://ai-2027.com/">deploying bio-engineered algae nanobots to fuel the data centers</a> doing recursively self-improving AI by the time Data Council rolls around three years from now.</p><p>That, to me, is pretty weird. </p><p>The weirdness of two worlds, closer than ever before but apparently moving at blindingly different speeds. </p><p>The weirdness of sitting in a talk and getting legitimately excited by the idea that we as an ecosystem can robustly adopt the nearly-decade-old <a href="https://arrow.apache.org/">Apache Arrow</a>, then going into the hall to talk to someone who had just walked out of a talk on <a href="https://x.com/BEBischof">Bryan&#8217;s</a> Foundation Models track and was wondering to what extent two-year-old LLM-based coding workflows are going to change whether any of these questions are still relevant.</p><p>So what do we do with this?</p><p>Look, maybe one day soon, we&#8217;ll pinch ourselves, bolt awake and think &#8220;man, that whole AI thing was crazy&#8221;. I&#8217;ll look back on this newsletter, cringe a bit about my prognostication, and sheepishly admit that maybe I got carried away by drawing out lines on a curve. God knows it&#8217;s happened before.</p><p>But &#8230; maybe not. And in that world, what relevance does data infra have?</p><p>I think it means that all of this matters a lot - even more so in this world. It means that pretty soon, the data systems and data infrastructure we build are going to be powering a whole lot of systems that interface more directly with the world than we are used to.</p><p>Because my prewritten take about data systems and AI workflows becoming increasingly intertwined and dependent on each other <em>was right</em>. 
And now we need to figure out how to make engineered data infrastructure that moves at human speed support LLMs that look like they are moving much faster and are still <a href="https://www.darioamodei.com/post/the-urgency-of-interpretability">fundamentally mysterious to us</a>.</p><p>The real world and the data we represent it with have a lot of complexity. And if we&#8217;re about to have AI systems that are 100x cheaper and 100x more powerful than what we have today operating on the tools, systems and standards we build, then they&#8217;d better be really good.</p><p>I don&#8217;t have an exact answer to how we should approach this. I don&#8217;t think anyone does.</p><p>I do know that I&#8217;m looking forward to next year&#8217;s Data Council, and the one after that, and the one after that too. I&#8217;m hoping that alongside the new weirdness, we keep the familiar weirdness, and that we all continue to share our knowledge, our expertise and, perhaps most importantly, our Mango Piggies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lk6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 424w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 848w, 
https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png" width="1250" height="614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1482330,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 424w, 
https://substackcdn.com/image/fetch/$s_!Lk6_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 848w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Anders eating a well deserved Mango Piggy</figcaption></figure></div><h3>Appendix</h3><p>As I was writing this, the ever thoughtful Benn Stancil released a post touching heavily on MCP and the dbt MCP.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:162134831,&quot;url&quot;:&quot;https://benn.substack.com/p/a-new-invisible-hand&quot;,&quot;publication_id&quot;:23588,&quot;publication_name&quot;:&quot;benn.substack&quot;,&quot;publication_logo_url&quot;:null,&quot;title&quot;:&quot;A new invisible hand&quot;,&quot;truncated_body_text&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-04-25T16:26:33.799Z&quot;,&quot;like_count&quot;:16,&quot;comment_count&quot;:4,&quot;bylines&quot;:[{&quot;id&quot;:5667744,&quot;name&quot;:&quot;Benn Stancil&quot;,&quot;handle&quot;:&quot;benn&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a317e60a-9bd1-4c75-bb54-66d517f735dc_1100x1100.jpeg&quot;,&quot;bio&quot;:&quot;Working at benn.company. Tweeting at benn.chat. Posting pictures at benn.photos. Networking with professionals at benn.work.&quot;,&quot;profile_set_up_at&quot;:&quot;2021-04-27T23:00:23.729Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-10-21T19:27:33.368Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:254785,&quot;user_id&quot;:5667744,&quot;publication_id&quot;:23588,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:23588,&quot;name&quot;:&quot;benn.substack&quot;,&quot;subdomain&quot;:&quot;benn&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;A weekly Substack on data and technology, with some occasional conversations about culture, sports, and politics. 
&quot;,&quot;logo_url&quot;:null,&quot;author_id&quot;:5667744,&quot;primary_user_id&quot;:5667744,&quot;theme_var_background_pop&quot;:&quot;#FF6B00&quot;,&quot;created_at&quot;:&quot;2019-12-15T21:00:48.339Z&quot;,&quot;email_from_name&quot;:&quot;Benn Stancil&quot;,&quot;copyright&quot;:&quot;Benn Stancil&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;bennstancil&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://benn.substack.com/p/a-new-invisible-hand?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><span></span><span class="embedded-post-publication-name">benn.substack</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">A new invisible hand</div></div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 16 likes &#183; 4 comments &#183; Benn Stancil</div></a></div><p>As with basically everything Benn writes - it&#8217;s worth your time. 
The post probably deserves a full response, so I&#8217;ll save commentary for another day, but I recommend you check it out.</p><p><em>The Analytics Engineering Roundup is sponsored by dbt Labs.</em></p><p><em>If you want to see what the big kerfuffle about dbt + SDF is all about, plus a whole lot more, join Elias and the dbt team for our Cloud Launch Showcase on 5/28 (parapet not included).</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase&quot;,&quot;text&quot;:&quot;Sign up for the Cloud Launch Showcase&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase"><span>Sign up for the Cloud Launch Showcase</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Editor&#8217;s note: That is not the word</p></div></div>]]></content:encoded></item></channel></rss>