<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Data Jargon Newsletter]]></title><description><![CDATA[A newsletter about data, business and the business of data. ]]></description><link>https://datajargon.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!rNNS!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b14ac3f-a096-45c9-903d-80cb4ba1191a_400x400.jpeg</url><title>The Data Jargon Newsletter</title><link>https://datajargon.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 18:10:28 GMT</lastBuildDate><atom:link href="https://datajargon.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Joe Naso]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datajargon@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datajargon@substack.com]]></itunes:email><itunes:name><![CDATA[Joe Naso]]></itunes:name></itunes:owner><itunes:author><![CDATA[Joe Naso]]></itunes:author><googleplay:owner><![CDATA[datajargon@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datajargon@substack.com]]></googleplay:email><googleplay:author><![CDATA[Joe Naso]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Numbers: Wrapping up Year Three]]></title><description><![CDATA[A no-nonsense summary of another year running a small data consultancy]]></description><link>https://datajargon.substack.com/p/the-numbers-wrapping-up-year-three</link><guid 
isPermaLink="false">https://datajargon.substack.com/p/the-numbers-wrapping-up-year-three</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Wed, 04 Mar 2026 14:05:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Nq7F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My last annual recap came 5 months into the new year. </p><p>Unintentionally, I&#8217;ve been mostly quiet on here since then. In some ways 2025 felt like a year of growing up, both for the business and for me. </p><p>But before I start waxing poetic about running Purview Labs, we can run through the numbers. I know that&#8217;s what everyone wants to know. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! Subscribe below.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>Money, Clients &amp; Projects: </strong></h2><p>Profit Margin is going down, and according to my adjusted calculations, clocks in at ~65%. This is notably different from my last review at a respectable ~85%. </p><p>Revenue went up. In fact, this was another year where the top line was at least 1.5x the previous year. This is not exponential growth. 
But it is sustained.</p><p>I wrapped up long-standing engagements with two clients. One engagement lasted nearly 2 years. The other lasted almost 10 months. At a certain point, you just know when to move on. </p><p>I worked on projects for 12 different clients in 2025. 4 of those clients were repeat customers. </p><h2>Technologies</h2><p>I&#8217;ve seen a lot of the usual suspects over the last year: </p><ul><li><p>AWS, GCP</p></li><li><p>Snowflake, BigQuery, Postgres</p></li><li><p>Dagster, Airflow</p></li><li><p>FastAPI, Flask, Django</p></li><li><p>dbt, Dataform</p></li></ul><p>Only 2 projects did not involve writing code. </p><p>There are plenty of other tools and technologies to add to that list, but those details are really not that important. </p><h2>Can You Deliver?</h2><p>I still don&#8217;t have a logo and the Purview website is little more than text on a screen.</p><p>The only thing that matters is whether your client believes you can deliver what you say you can deliver. </p><p>Substitute client with stakeholder, boss, CEO, manager, interviewer, whoever. That is the only thing that matters. </p><p>And the way to instill that belief is to:</p><ol><li><p>Know what you&#8217;re talking about </p></li><li><p>Have proof you&#8217;ve done it before</p></li></ol><p>It&#8217;s a simple formula. But simple is not easy, as I always say. </p><h2>Where Is The Constraint? </h2><p>Business is filled with constraints. And only one matters at any given time. </p><p>If I get all Alex Hormozi on you, that constraint is either a demand constraint or a supply constraint. </p><p>It could be time, mental capacity, cost, staffing, sickness, AWS outages, who freaking knows. There are a million things that affect your ability to deliver but only one of them matters at any given time. </p><p>As a mostly one-person business, I started out the year as the constraint. And I remained the constraint for most of the year. 
Unfortunately, the business in its current form would not survive without me. </p><p>But I am working to unwind that. </p><p>Getting contractors to work alongside me has been the first step in that direction. I have 3 of them right now. This has been transformative for the business. But there is still A LOT that can be improved. </p><p>I think of myself as a fairly intentional and purposeful decision maker. But I still hem and haw about mundane details. So, I talked to some coaches. I mostly wanted to gut-check that what I am doing is not foolish. It&#8217;s easy to look at a post like this and think that things are trending in the right direction, but on a daily basis, that is not obvious.</p><p>It&#8217;s a big game of trial and error. You try some things. They either work or they don&#8217;t. Then you try some other things. The cycle continues.</p><p>And as it turns out, sometimes that &#8220;other thing&#8221; is just doing more of what you&#8217;re already doing. It&#8217;s not about doing something <em>new; </em>it&#8217;s just about doing <em>more </em>of what works<em>. </em></p><h2>It Doesn&#8217;t Have To Be Hard</h2><p>My network has grown significantly since I started this business. Some of the people I talk to want to go out on their own. </p><p>I tell most of them some version of the same thing: if you actually know what you&#8217;re talking about, it really doesn&#8217;t have to be that hard. </p><p>I used to feel guilty when saying that out loud. Concerned I would come across as arrogant. But I don&#8217;t feel that way anymore. </p><p>Doing hard things is interesting and cool. It tells a good story. </p><p>But you don&#8217;t have to manufacture a challenge.</p><p>The challenges already exist. And by virtue of trying to do something a little different from everyone else (i.e., working on your own business), you will be challenged. </p><p>But, that doesn&#8217;t mean it has to be <em>hard.</em></p><p>There are endless ways to make money. 
There are limitless clients looking for help. And they all have countless problems that need to be solved.</p><p>If you know your stuff as well as you think you do, the &#8220;hard&#8221; part isn&#8217;t doing the work. </p><p>This defies everything we&#8217;re seeing from proponents of &#8220;9-9-6&#8221; and all those hustle-culture warriors online. A past version of me could&#8217;ve gotten behind that mindset to some degree; but today? No thanks.</p><p>Effort and value are not the same thing. </p><h2>The Inevitable Outcome </h2><p>This business has a lifespan.</p><p>Ironically, the more the business grows, the more I recognize this to be true. </p><p>If AI automates away all value-add tech work, so be it. At that point, money won&#8217;t mean the same thing anyway. </p><p>More realistically, however, this business will run its course long before that happens. </p><p>In fact, I know it will. I tell people all the time that I am going to capitalize on this opportunity as best I can, while I can. </p><p>I have no idea what that timeline is, but I assume I am smack in the middle of it right now.</p><p>I believe one of two things will occur: </p><ul><li><p>The company grows to have full-time employees. We ride the wave as long as makes sense</p></li></ul><ul><li><p>Purview in its current form no longer exists and I join forces with a client (future or current, who knows) to build something else</p></li></ul><p>Either would be a welcome next professional &#8220;season&#8221;. </p><h2>A More Personal Note</h2><p>I don&#8217;t share much personal news online these days. </p><p>But two big things happened in 2025 that I need to mention. </p><p>First, I got married in February. </p><p>And second, we bought a house towards the end of the year. </p><p>2025 was a year of upgrades. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nq7F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nq7F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nq7F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg" width="424" height="469.30190114068444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2911,&quot;width&quot;:2630,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:2277948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datajargon.substack.com/i/183827767?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad2d8d1-d6a3-4d5e-9d8d-bee86a707c44_3024x4032.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nq7F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Nq7F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fe493f-058a-4206-9842-c4fe349a8682_2630x2911.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">&#8220;Woof&#8221; - Phil</figcaption></figure></div><div><hr></div><p>If you want to talk data, engineering, entrepreneurship, or anything related to those topics, give me a shout.</p><p>You can find me on <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a> and (very) occasionally on <a href="https://x.com/itsjoenaso">X</a>.</p><p>Joe </p><p></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! Subscribe below. 
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Knowing What Good Looks Like]]></title><description><![CDATA[If you always do what you've always done, you'll always get what you've always got]]></description><link>https://datajargon.substack.com/p/knowing-what-good-looks-like</link><guid isPermaLink="false">https://datajargon.substack.com/p/knowing-what-good-looks-like</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Tue, 01 Jul 2025 11:45:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rNNS!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b14ac3f-a096-45c9-903d-80cb4ba1191a_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The biggest blocker to progress is not knowing what good looks like. </p><p>It applies to everything in work and life. </p><p>If you are <em>here </em>and you want to get <em>there, </em>how do you make that happen? </p><p>In personal cases, it&#8217;s relatively easy to figure that out. You search online. Read some blogs. Talk to some friends or acquaintances. Maybe you go to therapy. </p><p>Identifying the how is straightforward, but the doing is not. </p><p>In professional cases, it&#8217;s different. You&#8217;re limited by the breadth and depth of the knowledge of your immediate team, of your current system, of the established processes. You need to support the old way while pushing forward into the new. </p><p>Getting from <em>here </em>to <em>there </em>in a professional or business context needs to be incremental. It needs to be iterative. 
</p><p>You&#8217;ll likely need to overcome the sunk cost fallacy, do a cost-benefit analysis, and produce a number of other one-off justifications. Those initiatives are worthwhile and necessary within reason. But they are incomplete. </p><p>When you break the established frame of <em>good</em>, you change the equation. Suddenly the old option is no longer sufficient. </p><p>Sticking with what is familiar can work, but it can also stifle progress. Especially as new requirements get introduced into the mix. </p><p><em>Good </em>in this context may look markedly different from what <em>good </em>looked like in the past. </p><p>This holds true for evaluating new tools and systems, designing systems, and vetting consultants. </p><p>What&#8217;s familiar is not necessarily what is good. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Drop your email below to receive occasional emails from yours truly.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[The Numbers: Two Years into a Solo Consulting Business]]></title><description><![CDATA[More like 2.5 years, actually.]]></description><link>https://datajargon.substack.com/p/the-numbers-two-years-into-a-solo</link><guid isPermaLink="false">https://datajargon.substack.com/p/the-numbers-two-years-into-a-solo</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Mon, 05 May 2025 16:02:48 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5120" height="2880" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2880,&quot;width&quot;:5120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;blue and white heart illustration&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="blue and white heart illustration" title="blue and white heart illustration" 
srcset="https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1620121478247-ec786b9be2fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNDR8fGFic3RyYWN0fGVufDB8fHx8MTc0NjQzMDY2OXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 
24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="true">Richard Horvath</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p></p><p>I&#8217;ve been delaying posting this update for a few months. Life and work have kept me pretty busy. </p><p>But, I know you&#8217;re here for the numbers first and my commentary second, so let&#8217;s get those out of the way. </p><p>Here&#8217;s where we (the Royal we, of course) clocked in at the end of a 2nd year operating full time on my own. </p><h3><strong>Money: </strong></h3><p>Profit Margin: 85% </p><p>I 2x&#8217;d my 2023 revenue. Not bad for a one-man show. </p><p>I did not bill 2x the hours of 2023. This is a good thing. </p><h3><strong>Clients: </strong></h3><p>I had 9 clients throughout the course of the year. </p><p>Of that total, 3 were repeat customers. And 2 were referrals.</p><p>1 client found me. </p><p>I was ghosted by a handful of prospects (3, I think?) after talking multiple times, proposing a plan of attack, and getting into the weeds on pricing. </p><h3><strong>Projects/ Engagements: </strong></h3><p>I worked on 10 different engagements throughout 2024. Some took up less than 1 full day of billable time. Others lasted the entire year. </p><p>3 projects did not include writing code. </p><p>4 of the projects involved Airflow, with 3 of them requiring HEAVY usage of it.</p><p>2 projects involved Dagster. </p><p>5 of the projects used dbt to some degree. The implementations varied wildly in terms of maturity and complexity.</p><p></p><p>Ok, now it&#8217;s time for the commentary. 
</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! Drop your email below for occasional emails.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div><hr></div><h3><strong>Crossroads</strong></h3><p>I offer a service that generates some decent cashflow. I sometimes hesitate to call it a <em>business</em>. Some of you might think that&#8217;s a dumb distinction, and you might be right. </p><p>There is a myth that some people tell themselves when they do independent consulting. The myth that you&#8217;ll work on a handful of projects, find the common thread across some problem-space or industry, and build a product from it. </p><p>I do not know a single person who has pulled this off. </p><p>But I know plenty of people who have lasted a few seasons trying to be a contractor, or a fractional Head of X, or an independent consultant. Then they stop. </p><p>Maybe they can&#8217;t sell, maybe they can&#8217;t market themselves as well as they thought, or maybe their personal lives changed and they need more stability. More predictability. </p><p>Maybe they just got bored. </p><p>The challenge with transitioning from being a consultant to building a product is that if you&#8217;re any good at solving problems, you&#8217;ll be making a fair amount of money. And switching gears from doing that to building a product is hard. 
</p><p>Not to mention expensive. </p><p>And this is the crossroads where I find myself.</p><p></p><div><hr></div><h3><strong>Professional Attention Span</strong></h3><p>This might ruffle some feathers. </p><p>And I may run the risk of coming off the wrong way with this comment, but my professional attention span has plummeted. </p><p>You might be able to re-label it as <em>professional patience</em> or even <em>tolerance, </em>but the short of it is that I have learned to value my time, energy and attention at a much higher level than when I was not working for myself. </p><p></p><div><hr></div><h3><strong>Illusions (Delusions?) of Grandeur</strong></h3><p>I used to post on LinkedIn regularly. I stopped a couple years ago because my feed was turning into an echo chamber of self-aggrandizing and unhelpful commentary all centered around <em>data. </em></p><p>Actually, my feed is still an echo chamber.</p><p>I stopped posting on Twitter (or even logging in for that matter) for mostly the same reason. Similarly, my attempts at BlueSky lasted all of a few months. </p><p>That&#8217;s not to say that I was gaining little from these platforms. </p><p>They provided value at times.</p><p>But they are tiring. </p><p>Not everyone can be an expert. That&#8217;s just reality. If everyone is an expert, no one is.</p><p>I&#8217;m fairly knowledgeable about the world of data and analytics. I do not consider myself &#8220;an expert&#8221;. </p><p>The breadth and depth of knowledge needed to be considered an expert is prohibitive to most - there are too many nuances, too many subcategories, too many parts to the whole.</p><p>But whenever I log into LinkedIn, it sure seems like there are <em>a lot</em> of experts.</p><p>And they all know how that  problem - <em>your problem -</em> should<em> </em>be solved. How to do it the<em> correct </em>way<em>. 
</em></p><p>There are many known solutions to common problems we see as engineers of any discipline - data engineers, backend engineers, frontend engineers, even machine learning engineers. </p><p>But things aren&#8217;t quite so cut-and-dried as many online want you to believe. </p><p>Problem solving is an art. The application of specific solutions is the science. </p><p></p><div><hr></div><h3><strong>Middle of the Pack</strong></h3><p>At the risk of sounding overly simplistic, there are two types of people I see everywhere online. </p><p>One is the previously mentioned expert; sometimes they have a strong personal brand. They have the answers. They know their stuff!</p><p>The other is the newcomer. Wet behind the ears, and figuring things out. They have questions. They want to learn. </p><p>As you could probably guess, I consider myself to be neither of these, but I can relate to both. </p><p>But, the group I most closely identify with professionally is also not found in either bucket. And they&#8217;re really hard to find online. </p><p>Am I doing $1M annually? Not yet. </p><p>Am I struggling to pay the bills? Also no.</p><p>Those two &#8220;profiles&#8221; are all over the internet, especially in the &#8220;solopreneurship&#8221; and independent consulting communities. Where are the people in the middle? </p><p>Turns out, they&#8217;re just getting shit done and living their lives. </p><p>If this is you, it can <em>feel </em>like being in the middle of the pack. But that likely means you&#8217;re comparing yourself against the wrong group.</p><p>Because the people really crushing it are often the ones you&#8217;ve probably never heard of. </p><p></p><div><hr></div><h3><strong>On Operating Solo</strong></h3><p>I spend a lot of time thinking. </p><p>Too much time, in fact. </p><p>Past a certain point, this is one of the most dangerous places to spend time when working alone. 
</p><p>I&#8217;m not talking about the instances when you&#8217;re in a &#8220;flow state&#8221;. I&#8217;m talking about a mundane Tuesday when there is not a ton going on. </p><p>The biggest downside of working independently? The opportunities for serendipity are lacking. </p><p>But the opportunities for over-analysis and preoccupation abound.</p><p>Sure, I talk to client teams every day. And work closely with them. </p><p>But that&#8217;s not the same as working with your immediate team. I do miss some of those things.</p><p>What I&#8217;ve come to realize is that my ability to understand clients&#8217; needs, execute on the technical components, and communicate with clarity are the things that have helped me along this journey. </p><p>Those are inherently social activities.</p><p>But my current setup is not particularly conducive to fostering the social aspects of professional life, at least not on a daily basis. </p><p>Which brings us full-circle to the crossroads I mentioned towards the top of this post. Funny how that works. </p><p></p><div><hr></div><h3><strong>Unprecedented Times! Uncertainty! Fear!</strong></h3><p>The current state of the market - globally, in the US, and more specifically within tech - is volatile. There is no question. </p><p>But, volatility breeds opportunity. </p><p>I have no idea how things will play out, but I am operating under the assumption that I can take advantage of the current market conditions. </p><p>There really isn&#8217;t much of an alternative. </p><p>At this point in my career, I strongly prefer making the decisions over being told what decisions were made.</p><p>Regardless of which direction the economy may head, these things will always remain true: </p><ul><li><p>Companies will still have problems they do not know how to solve </p></li><li><p>Knowledgeable and trustworthy outside experts will be the fastest way to address those problems</p></li></ul><p>Does that mean it will always be smooth sailing? 
</p><p>Of course not.</p><p>But it does mean that knowing how to solve problems is how you weather the storm. And if the storm never comes, you keep on solving those problems just the same. </p><p></p><div><hr></div><p></p><p>If any of this resonated with where you find yourself professionally, I&#8217;d love to connect. </p><p>If you&#8217;re doing it on your own or as part of a small team, that&#8217;s even better. </p><p>You can find me on <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a> (even if I&#8217;m not posting regularly).</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Drop your email below to receive occasional emails from yours truly. </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Don't overcomplicate your data strategy]]></title><description><![CDATA[It&#8217;s easy to make things complicated.]]></description><link>https://datajargon.substack.com/p/dont-overcomplicate-your-data-strategy</link><guid isPermaLink="false">https://datajargon.substack.com/p/dont-overcomplicate-your-data-strategy</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Thu, 17 Oct 2024 12:31:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c3d3fbfd-57db-4059-aea2-e10728a90cea_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s easy to 
make things complicated. </p><p>It requires less commitment, less resolve. </p><p>You can hide your business&#8217;s lack of clarity (or your own) with confusing terms and countless stakeholders and unclear requirements. </p><p>The result? An abstract mandate from an exec to &#8220;figure out our data strategy&#8221;. </p><p>But if you ask 10 different people what constitutes a data strategy, you&#8217;ll get 10 different answers. </p><p>The quality of your strategy has nothing to do with the size of your business. But &#8220;good&#8221; data strategies generally follow the same themes:</p><ol><li><p>A data strategy is not made up of tools. </p></li><li><p>A data strategy has less to do with &#8220;the data&#8221; than you think. </p></li><li><p>A data strategy is never actually finished. </p></li><li><p>An incomplete data strategy does not mean it is an ineffective one.</p></li></ol><p>Good data strategies are not complicated. </p><p>Hell, good <em>strategies </em>are not complicated. </p><p>Effective data strategies:</p><ol><li><p>Meet consumers where they live</p></li><li><p>Provide a consistent experience</p></li><li><p>Extend to all  </p></li></ol><p>Simple, but not easy. </p><p></p><h4>Meet consumers where they live</h4><p>If your account team spends their time in Salesforce, you&#8217;re not going to have a good time telling them to use Metabase or Omni or some other tool. Don&#8217;t push for something new. Put the data where they spend their time. </p><p>Do your customers want to export data for their own use? Data sharing can take many forms, from warehouse-native integrations to cloud storage. The methodology is one part of a larger whole. </p><p>The entry points you enable are arguably as important as the data you provide. Keep this in mind; we&#8217;ll come back to this one in a minute. 
</p><p>Don&#8217;t lose sight of the bigger picture - the most important factor is whether or not the data you share actually provides value to those who consume it.</p><p></p><h4>Provide a consistent experience</h4><p>Consistency means a few things in this context. </p><p>Behavior needs to be consistent across systems. </p><p>Data needs to be consistent across access points - for instance, those customer exports and that Omni dashboard. Is data a core component of your in-app experience? Great, you better make sure everything jibes across these various points of access. The only way to do this is to ensure they are logical and well-maintained. </p><p>The &#8220;single source of truth&#8221; label everyone likes to throw around is a bit of a misnomer; there is seldom a singular place to go to get all the data you want. Instead, a well-structured and well-understood set of resources (i.e. tables, datasets, etc.) is what you need. </p><p>A data platform is an effective way to eventually make this &#8220;single source of truth&#8221; a reality, but it can often become just a formal entry point into an ugly mess behind the scenes. </p><p>Beware the &#8220;lipstick on a pig&#8221; phenomenon. You still need to establish standards and adhere to them. </p><p>Remember, tooling does not make a strategy. </p><p>Data platforms are also never quite finished, even when you think they may be. The pace of change may slow, and the changes may shrink in size and scope, but there will always be change. </p><p>It&#8217;s simply the reality of providing one centralized point of entry into a much broader and heterogeneous set of information.</p><p>Iterations of work to build and harden your data platform for many use cases are a requisite. Don&#8217;t be afraid of this part.</p><p>Those iterations also serve another purpose - refining your data strategy. </p><p></p><h4>Extend to All</h4><p>It might seem odd to use platform development to inform your strategy. 
Shouldn&#8217;t your strategy inform your development work? </p><p>If you are an established business with stable revenue, stable tech, and well understood growth levers, sure. If you are not, prepare for a lot of iteration. </p><p>But one thing is for certain - your data strategy defines how <em>everyone </em> interacts with data at your business. This means your customers, internal users and the many teams across the organization. </p><p>They may have different SLAs or require different sets of data. But your customers - both the ones who pay you and your coworkers on different teams - can be serviced by the same strategy.</p><p></p><p>If you got this far and found yourself thinking &#8220;this is not an exhaustive list of criteria for a data strategy&#8221;, you&#8217;re right. </p><p>Itemizing every single thing that <em>can </em>fall under the umbrella of &#8220;data strategy&#8221; results in more noise than value.</p><p>There is an elegance in simplicity. However, simplicity does not suggest oversimplification. </p><p>Some data problems are hard. They require a lot of work, take up many resources, and have many moving parts. </p><p>An effective data strategy does not need to be hard to design or implement. </p><p>In fact, it should not be.</p><p>The simpler your data strategy, the easier it will be for your team to deliver on it. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Drop your email below to stay in the loop. 
New posts monthly-ish, or (more realistically) every couple months.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[25 Unpopular and Contradictory Opinions on Data and Engineering]]></title><description><![CDATA[If your Data Product does not generate revenue, it&#8217;s an internal tool]]></description><link>https://datajargon.substack.com/p/25-unpopular-and-contradictory-opinions</link><guid isPermaLink="false">https://datajargon.substack.com/p/25-unpopular-and-contradictory-opinions</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Fri, 23 Aug 2024 13:01:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rNNS!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b14ac3f-a096-45c9-903d-80cb4ba1191a_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ol><li><p>If your Data Product does not generate revenue, it&#8217;s an internal tool</p></li><li><p>Many companies will get more value out of a half-decent Business Intelligence team than they will trying to build Data Products</p></li><li><p>Data Engineering is a form of Software Engineering</p></li><li><p>Data engineering projects often do not require the complexity seen in software engineering projects</p></li><li><p>Complex data engineering projects are not more complex than software engineering projects; they just use a specialized set of tools</p></li><li><p>Centralized data platforms are good</p></li><li><p>Centralized data teams often are not</p></li><li><p>Your data team structure should complement the rest of the organization</p></li><li><p>More 
data analysts probably won&#8217;t make your data team more effective but more analysts might help your organization use data more effectively</p></li><li><p>More data engineers might make your data platform better</p></li><li><p>&#8220;Data&#8221; as a standalone function has created more confusion than any other function. UX and User Research might be a close second</p></li><li><p>Data PMs are only warranted when your product is VERY technically complex</p></li><li><p>Data PMs are often a byproduct of misaligned business goals</p></li><li><p>The data infrastructure design and tooling at Meta, Netflix and Airbnb are not going to solve your startup&#8217;s problems</p></li><li><p>Most of your tooling decisions will not matter </p></li><li><p>A handful of your tooling choices will cause acute frustration</p></li><li><p>Don&#8217;t use multiple vendor tools when one would suffice; the marginal additional value for a dedicated solution is seldom worth the additional complexity</p></li><li><p>Even the most pragmatic engineers get caught up in the vendor hype cycle</p></li><li><p>Many companies using Snowflake, BigQuery and Redshift could just be using rollup tables in Postgres</p></li><li><p>Save a few exceptions, there is now little differentiation between Data Warehouses; most novel feature sets are now table stakes</p></li><li><p>That data catalog you&#8217;ve been scoping out is probably a waste of time </p></li><li><p>A metadata catalog is very much worth it</p></li><li><p>You don&#8217;t need a new tool to have a metadata catalog</p></li><li><p>Your organization&#8217;s Data Strategy has less to do with its data than you think</p></li><li><p>If you think your Data Strategy is complete, you should revisit it</p><p></p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" 
data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! Drop your email below for monthly-ish content on data, engineering and business.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[The Corner Store vs The Wholesaler]]></title><description><![CDATA[This is about more than just local grocers and suppliers.]]></description><link>https://datajargon.substack.com/p/the-corner-store-vs-the-wholesaler</link><guid isPermaLink="false">https://datajargon.substack.com/p/the-corner-store-vs-the-wholesaler</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Tue, 21 May 2024 13:02:28 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="398" height="502.3804524886878" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:6974,&quot;width&quot;:5525,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;grayscale photo of UNKs 
store&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="grayscale photo of UNKs store" title="grayscale photo of UNKs store" srcset="https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1582140110399-ff82473c6ebc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMHx8Y29ybmVyJTIwc3RvcmV8ZW58MHx8fHwxNzE2Mjk0MDExfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 
6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The people at the corner store know you. They&#8217;ve seen you at your best and your worst. They know you enjoy that one weird drink flavor that no one else buys, and often ask what you&#8217;re getting up to. </p><p>They know your breakfast order. And they make a joke when you change it up from your usual. </p><p>You don&#8217;t get that type of attention from the wholesaler. </p><p>The wholesaler spends their time analyzing margins. Forecasting fulfillment and renegotiating contracts. Our delivery utilization is below the norm, they say in long internal email chains. </p><p>There are 8 people CC&#8217;d on your emails with their team. Your recent call with the wholesaler had even fewer people engaged than the last one. A video call filled with the gray squares of disabled laptop cameras and disabled microphones. </p><p>Your usual restock is one of dozens, even hundreds that they have lined up this week. </p><p>This is not a story about local grocers and suppliers. </p><p>This is a story about consulting.</p><p>Consulting firms are either the corner store or they are the wholesaler. They cannot be both, at least not at the same time. </p><p>I run what I consider to be a corner store. 
I am hired to solve problems for specific teams. Those problems happen to be related to systems processing and producing data. I provide a specialty set of services and tailor those services to the company&#8217;s needs. </p><p>Its current setup will not scale to tens of millions in revenue. And that is ok. It&#8217;s not intended to right now.</p><p>But a wholesaler would not be ok with this.</p><p>The wholesaler is a necessary force. They operate at a scale that some businesses require. But many do not.</p><p>You could argue that most do not. </p><p>Even those with some venture funding in the bank often don&#8217;t need wholesalers. They need corner stores. </p><p>They need someone to solve their problems. To anticipate what comes next. To actually care about the outcomes. </p><p>You lose that level of detail when dealing with wholesalers.</p><p>I inherited a project built by a wholesaler a few years back. It had cost the business somewhere around $300k over the course of the build. It followed the well-established formula for what &#8220;should work&#8221;.  </p><p>But it didn&#8217;t solve the problems we faced well enough. It was cumbersome. Fragile. And obviously expensive. </p><p>The business didn&#8217;t need a wholesaler. They needed a corner store. </p><p>They just didn&#8217;t realize it. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! 
New posts every month-ish.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Lean Data Engineering with Dagster and DuckDB]]></title><description><![CDATA[A simple pattern for aggregating the data you already have]]></description><link>https://datajargon.substack.com/p/lean-data-engineering-with-dagster</link><guid isPermaLink="false">https://datajargon.substack.com/p/lean-data-engineering-with-dagster</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Tue, 05 Mar 2024 14:00:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qEwc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qEwc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qEwc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 424w, https://substackcdn.com/image/fetch/$s_!qEwc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 848w, 
https://substackcdn.com/image/fetch/$s_!qEwc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 1272w, https://substackcdn.com/image/fetch/$s_!qEwc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qEwc!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512" width="680" height="680" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:512,&quot;width&quot;:512,&quot;resizeWidth&quot;:680,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qEwc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 424w, https://substackcdn.com/image/fetch/$s_!qEwc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 848w, 
https://substackcdn.com/image/fetch/$s_!qEwc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 1272w, https://substackcdn.com/image/fetch/$s_!qEwc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8234df-42fa-4adb-aa51-ebda55d4fbe1_800x512 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Your average data engineer patching things together</figcaption></figure></div><p></p><blockquote><p>This post was 
inspired by the work that <a href="https://www.linkedin.com/in/jacobmatson/">Jacob Matson</a>, <a href="https://www.linkedin.com/in/petefein/">Pete Fein</a> and I have done helping companies make use of their data lakes.</p><p>If you&#8217;d rather read the code, you can find it all in Github: <em><strong><a href="https://github.com/JoeNaso/lean-data-engineering/tree/main">Lean Data Engineering</a></strong></em>.</p><p>But make sure to check out the <strong>Implementing This Pattern in the Real World </strong>section before giving it a go. </p></blockquote><p></p><p>Data lakes are nice because you can throw anything you want in there and move on. This is also why they cause a lot of trouble.</p><p>But, we like quick wins. We also like a short time-to-value. So, we&#8217;re going to cover a quick-ish pattern for taking what&#8217;s in your data lake and making it useful. Limited fancy tech. Limited infrastructure. </p><p>Just python, SQL and a couple tools you may have already heard of. </p><p>We&#8217;re talking lean data engineering.</p><div><hr></div><p>There are plenty of different - and difficult - ways to get data from one place to another. But, we&#8217;re keeping this simple. </p><p>Believe it or not, Dagster, DuckDB and a warehouse of choice (we&#8217;re using Snowflake) will take you very far.&nbsp;</p><p>AWS and GCP have made it especially easy to create systems designed for high-volume data. In the time it takes to read this post, you could launch your streaming pipelines with AWS Kinesis or GCP PubSub. And landing that event stream in S3 or GCS is trivial.</p><p>But, it&#8217;s only half the battle.&nbsp;</p><p>The last mile - bringing that data into your data warehouse and actually using it - is the real challenge.&nbsp;</p><p>Data lakes are convenient, but they reframe traditional data engineering challenges. Technical projects always have some governance and cost components. 
Within the changing context of a data lake, though, they can quickly become real challenges.&nbsp;</p><p>They present a set of data engineering problems not found in traditional pipelines. They change the narrative.&nbsp;</p><p>There is no friendly package of data + compute. Instead, it becomes &#8220;here is all the data; provide your own compute&#8221;.</p><p>It&#8217;s common to see companies dumping data into their data lake, ingesting it into their warehouse, and processing it repeatedly. It&#8217;s a classic trap: a system that is easy to create and hard to maintain.</p><p>The Data team might use the data appropriately, but this system puts your Analytics Engineers, Data Analysts and Data Scientists on a slippery slope. The opportunity to deploy needless dbt models won&#8217;t go away. And with that comes ever-growing cloud bills. Let&#8217;s not forget that you&#8217;ve also created a security risk.&nbsp;</p><p>PII and other sensitive information will land in your warehouse even when you don&#8217;t want it to. It&#8217;s one unfortunate side effect of this design.</p><p>We&#8217;ve all seen it happen.</p><p>Even the most cost-conscious teams slip up from time to time.</p><p>But we can side-step these issues by returning to an ETL pattern of old. By applying aggregations to our data lake and pushing that data to our warehouse.</p><p>And with some specific design patterns, we can build a lean pipeline that fits our current infrastructure.</p><p>Plus, we&#8217;ll get the data where it needs to be faster and cheaper.</p><h3>Data Lake + Dagster + DuckDB</h3><p>There are plenty of vendors who will copy your data lake into your warehouse for a fee. But, we&#8217;re not heading down that path. We&#8217;re keeping it lean.&nbsp;</p><p>You&#8217;ll hear a lot of data engineers say to &#8220;push it down to the database&#8221;, meaning let the DB do the heavy lifting. 
We&#8217;re going to do something similar, but with lighter-weight tooling.&nbsp;</p><p>We can&#8217;t aggregate data directly in the data lake, because we don&#8217;t have access to a compute engine.&nbsp;</p><p>We don&#8217;t want to push the aggregations into Snowflake. That requires we copy everything to Snowflake first, and would be far too costly.&nbsp;</p><p>Instead, we&#8217;re going to aggregate partitions of data in-process using DuckDB, and dump that to a location accessible by Snowflake.&nbsp;</p><p>Anything downstream of that (dbt models, BI reports, etc.) is derived from our aggregated data. We&#8217;re using our pipeline to perform work we&#8217;d be doing anyway.&nbsp;</p><p>In theory, you could forgo dbt altogether and treat DuckDB like a data modeling layer, but that might be going too far. We still want some separation of concerns between data available for use and data ready for consumption (reporting).&nbsp;</p><p>We can break our path forward into a few steps:</p><ol><li><p>Pre-processing and Aggregation</p></li><li><p>Staging and Loading</p></li><li><p>Reporting and Presentation&nbsp;</p></li></ol><p>We&#8217;re going to use Dagster to execute the first 2 steps above. If we opt for layering some lightweight dbt models on top of our data, we can orchestrate that work with Dagster, too. But, the method you choose for that section of the pipeline doesn&#8217;t matter too much.&nbsp;</p><p>We&#8217;re already in a good place with data capture - the Engineering team is dumping an event stream into S3. We don&#8217;t need to rely on webhooks from vendor systems, and we&#8217;re capturing more data per event than we will use for analytics purposes. With Kinesis, a new file lands in S3 approximately every 15 minutes. A typical payload looks like this:&nbsp;</p><pre><code>{
  "a_unique_identifier": 1234, 
  "another_id": "xyz-abc",
  "event_type": "signup",
  "occurred_at": "2024-03-01T15:00:00.141626",
  "source": "data-producer-app", 
  "raw": {"some": "nested", "json": "data"}
}</code></pre><p></p><p>The files are timestamp-partitioned JSON blobs, and there are a lot of them.&nbsp;</p><p>But before we do anything, we need to handle two important requirements:&nbsp;</p><ol><li><p>Granularity at a daily grain</p></li><li><p>Removing PII</p></li></ol><p>Daily granularity is easy enough - we need to run this pipeline at least once a day, and aggregate our output data on <em>at least </em>a per-day basis.&nbsp;</p><p>We can fine-tune our handling of PII as it&#8217;s getting processed by DuckDB. A vendored solution would be too costly, especially one that requires this level of configuration.</p><p>In years past, you might have opted for <a href="https://pandas.pydata.org/docs/reference/index.html">Pandas</a>, or some generic CSV processing. But we can cut a lot of boilerplate code by opting for DuckDB. This also helps us reduce our in-process memory footprint (more on this later).&nbsp;</p><p>And, this whole thing is ephemeral - we don&#8217;t need to persist our DuckDB instance. We don&#8217;t need to maintain state in the pipeline, so we can more easily scale horizontally.</p><p>That also means we can re-run jobs with ease. </p><p>Our tooling needs to read from and write to s3, make it easy to convert one file format to another, and have a friendly API. DuckDB handles all these requirements with some readily available tooling.&nbsp;</p><p>We&#8217;re primarily interacting with it via SQL, and it will work with limited domain knowledge. Not to mention we can drop it into Dagster without having to explicitly integrate the two tools.&nbsp;</p><p>From there, we produce some Parquet files and drop them into an s3 bucket. We wrap it up by having Dagster trigger a load from s3 into Snowflake.</p><p>This might sound a bit abstract, but it&#8217;s conceptually pretty simple. There are a few moving pieces though, so let&#8217;s look at some code.</p><h3>DuckDB.build()</h3><p>DuckDB is doing the majority of the work in this pipeline.
Since Dagster is python based, we&#8217;re going to package up our DuckDB interface as a python object, as well. Dagster provides a native integration with DuckDB by way of Resources, but we don&#8217;t need that here.&nbsp;</p><p>Our class has a .build() method that spits out a DuckDB instance loaded with a specific configuration.</p><p>Now we can read json, interact with s3, and handle memory spillage predictably.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-l-R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-l-R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 424w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 848w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-l-R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png" width="1456" height="1523" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1523,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:353664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-l-R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 424w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 848w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!-l-R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce74fa7-7f72-414e-a2fc-d1069fac57e3_1484x1552.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Pseudo-builder pattern for DuckDB</figcaption></figure></div><p></p><h3><strong>Incremental Processing with Partitions</strong></h3><p>Since our data lake will be growing over time, we need to logically partition our source data. This is especially important given our memory constraints.&nbsp;</p><p>We don&#8217;t have access to any metadata indicating the size of the file we&#8217;re about to ingest.&nbsp;We don&#8217;t control the data-producing system. And since we&#8217;re processing gzipped JSON, we could see a wide range of compressed file sizes.</p><p>Incrementally processing our data at specific intervals is the easiest path forward.
And it allows enough flexibility to tweak things as needed.</p><p>We can use Dagster&#8217;s Multipart Partitions to manage the partition keys we&#8217;ll use to filter, parse and process only a subset of files at a time. This is the basis for our incremental processing.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mqvE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mqvE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 424w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 848w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 1272w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mqvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png" width="1456" height="605" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mqvE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 424w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 848w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 1272w, https://substackcdn.com/image/fetch/$s_!mqvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27a9be-0a7a-4ae0-bf65-3a182e8bced3_1482x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We&#8217;re processing events from multiple producers. So, we&#8217;re factoring time and origin into our partitions.&nbsp;</p><p>We&#8217;re going to use this definition in a few places so we&#8217;re only ever handling a subset of data at a time. We have some helper functions and utilities to make filtering the s3 files easier, but they are not that interesting.&nbsp;</p><p>Within Dagster, two functions make up the vast majority of the pipeline so far - finding some files in s3 and handing them to DuckDB.&nbsp;</p><p>Those functions are tidy, too. They find files we need to process and execute queries against them. </p><p>In one case, we cast JSON to Parquet. In another, we do a rollup. 
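</p><p>As a rough, stand-alone sketch of the JSON-to-Parquet step (not the exact code from this pipeline): the function name <code>json_to_parquet</code> and the path arguments are hypothetical, local paths stand in for the s3 locations, and dropping the <code>raw</code> column assumes that is where the PII lives. The column names come from the sample payload above, and DuckDB&#8217;s <code>read_json_auto</code> infers the schema.</p><pre><code>import duckdb


def json_to_parquet(con, source_glob, target_path):
    """Convert raw JSON event files into a single Parquet file."""
    # read_json_auto infers column names and types, so no schema is declared.
    # Only the listed columns are kept; "raw" (assumed to hold PII) is dropped.
    con.execute(
        f"""
        COPY (
            SELECT
                a_unique_identifier,
                another_id,
                event_type,
                CAST(occurred_at AS TIMESTAMP) AS occurred_at,
                "source"
            FROM read_json_auto('{source_glob}')
        ) TO '{target_path}' (FORMAT PARQUET)
        """
    )</code></pre><p>Called as <code>json_to_parquet(duckdb.connect(), "events/*.json", "events.parquet")</code>, this reads every matching file in one query. Listing columns explicitly, rather than using <code>SELECT *</code>, means a new sensitive field has to be opted in before it can reach the warehouse.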
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xnzc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xnzc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 424w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 848w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xnzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png" width="1456" height="913" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:913,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xnzc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 424w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 848w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!xnzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F751dd90a-3b8f-4383-8b7d-b73db3da97d3_1900x1192.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Without the hive partitioning and file type conversion, executing a daily rollup would be much more difficult. </p><p>By converting the JSON data to Parquet, we can now just point DuckDB to those files, aggregate them, and ship them off to another location in s3.
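</p><p>A minimal sketch of what such a rollup could look like, under a few assumptions: the function name <code>daily_rollup</code> is hypothetical, the Parquet files are assumed to sit under hive-style directories such as <code>producer=app-a/</code>, and local paths stand in for s3. It is an illustration of the idea rather than the pipeline&#8217;s actual query.</p><pre><code>import duckdb


def daily_rollup(con, parquet_glob, target_path):
    """Roll event-level Parquet files up to one row per day, producer and event type."""
    # hive_partitioning exposes path segments such as producer=app-a as columns,
    # so "producer" below comes from the directory name, not from the file contents.
    con.execute(
        f"""
        COPY (
            SELECT
                CAST(occurred_at AS DATE) AS event_date,
                producer,
                event_type,
                COUNT(*) AS event_count
            FROM read_parquet('{parquet_glob}', hive_partitioning = true)
            GROUP BY ALL
            ORDER BY ALL
        ) TO '{target_path}' (FORMAT PARQUET)
        """
    )</code></pre><p>Swapping <code>CAST(occurred_at AS DATE)</code> for <code>date_trunc('hour', occurred_at)</code> would give an hourly grain with the same query shape.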
</p><p>And, it can happen without having to give DuckDB a predefined schema.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LhIq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LhIq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 424w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 848w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LhIq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png" width="1456" height="930" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320715,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LhIq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 424w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 848w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!LhIq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4515f678-8d81-4b4f-b9c8-c0478ff5d031_2148x1372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3><strong>Ship it off to Snowflake</strong></h3><p>We&#8217;re 80% of the way.&nbsp;</p><p>We already did most of the heavy lifting in-transit with DuckDB and Dagster. The final step is loading our pre-processed data into Snowflake.&nbsp;</p><p>We want to keep things relatively streamlined in Dagster and in Snowflake, so we&#8217;re defaulting to <code>CREATE IF NOT EXISTS</code> for a Parquet file format, Rollup Stage and Rollup Table.&nbsp;</p><p>Dagster&#8217;s Resource pattern makes it easy to inject interfaces into your databases and cloud storage into your job context.&nbsp;</p><p>Instead of managing all of our <code>CREATE IF NOT EXISTS</code> statements with inline sql, we&#8217;re going to set up a ConfigurableResource as a &#8220;public interface&#8221; into our data warehouse. 
This is how we&#8217;ll execute the commands we need to take the staged Parquet files, and ensure they land where we want in Snowflake.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xPNL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xPNL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 424w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 848w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xPNL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png" width="1382" height="1552" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1552,&quot;width&quot;:1382,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:351775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xPNL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 424w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 848w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!xPNL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb19fc10-bfbc-4f1e-b2e1-7f50ff98a030_1382x1552.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The last remaining piece? Copy our staged data into the rollup table using our SnowflakeResource, and we&#8217;re done.&nbsp;</p><p>From here, anyone with a connection to Snowflake can make use of our newly minted rollups. </p><p>And backfills are easy, too, thanks to the partitions. </p><p></p><h3><strong>Quick Aside: If Money Were No Object</strong></h3><p>If we weren&#8217;t considering cost, Snowflake and BigQuery offer plenty of options for getting your data from where it lives to where you want it.&nbsp;&nbsp;</p><p>Snowflake Dynamic Tables and Tasks work well for transferring files from s3 into your warehouse for additional processing. In BigQuery, the Data Transfer Service provides similar tooling.&nbsp;</p><p>But, you&#8217;re still incurring storage and processing cost at multiple points. You&#8217;d still be copying the &#8220;raw&#8221; data in your warehouse where you&#8217;d have to do more transformation. 
Adding dbt into this mix quickly takes you from &#8220;data-driven&#8221; to &#8220;cost-center&#8221;.&nbsp;</p><p></p><h3>Implementing This Pattern in the Real World</h3><p>Examples and walkthroughs are nice, but in the real world, there are always challenges. Here are some &#8220;gotchas&#8221; that came up when implementing this pattern for a real-world Marketing SaaS.</p><h4><strong>Your DuckDB configuration matters, and it depends on the platform you&#8217;re using</strong></h4><p>More specifically, your DuckDB Memory Limit and Temp Directory configurations matter. If you&#8217;re running this pipeline on an EC2 instance with comically large RAM, you won&#8217;t have to worry too much about memory. If you&#8217;re running it on a small instance, or through Dagster Cloud, you will have memory limits. </p><p>And those memory limits are important.&nbsp;</p><p>Since our data is loaded as gzipped JSON and written as Parquet, we don&#8217;t know the initial memory requirement until the file type changes. The collection of gzipped files may be small or large. Because of this, you need to be able to spill to your local disk, since RAM alone may not be enough for in-memory processing. We settled on a 10GB limit for DuckDB since Dagster Cloud defaults to 16GB.&nbsp;</p><p>On high-volume days, we saw activity skyrocket. As a result, we needed to plan for the edge cases.</p><h4><strong>Just use Parquet</strong></h4><p>Speaking of JSON and Parquet, DuckDB will cooperate much better with Parquet files than gzipped JSON. No surprise that an in-memory database will work better on files with attached metadata, so if you have the opportunity to update your Kinesis pipeline to write your event batches as Parquet, you should do it.&nbsp;</p><h4><strong>Preserve a lower-level grain in the pipeline when you can</strong></h4><p>Any experienced data engineer has been here. The requirements expect daily aggregations.
But if you can process things on an hourly basis with minimal cost or performance impact, it&#8217;s probably worth doing so from the start.&nbsp;</p><p>By presenting daily aggregations to the user, but keeping hourly rollups under the hood, you keep things flexible. You can change the granularity of reporting in the future without making major changes to do it.</p><h4><strong>Build once, share twice</strong></h4><p>We dump this data into Snowflake in this scenario, but there&#8217;s no reason to stop there. Other systems (including the production application!) can use the aggregates that land in s3. Making use of the data you have does not just mean it gets shared via a BI report.</p><h4><strong>Backfill configuration is trivial, but execution is not</strong></h4><p>I also happened to crash our Dagster Cloud deployment because of backfills. Pair intermittent high-volume periods with a large number of backfill jobs, and the result is poor UI performance. Unfortunately, I caused issues for other Dagster customers, too.&nbsp;</p><p>Simply put, we were creating so many "JobRun" objects that the UI failed to load. And, other Dagster Cloud customers were seeing degraded performance as a result.</p><p>Dagster was <em>quick </em>to solve that issue for everyone. Props for moving fast.</p><p>But, this brings up two changes I'd make next time.</p><p>First, change the way DuckDB aggregates a single batch of data. Rather than running one job per hour, we could run one job per day and write local Parquet files partitioned by hour. You maintain the same granularity, but significantly reduce the number of jobs tied to your Dagster instance.</p><p>Second, use a hybrid deployment rather than a strictly serverless design.
At least this way, we could manage the RAM and CPU settings ourselves.</p><p></p><div><hr></div><p> If you have thoughts or comments, give me a shout on <a href="https://twitter.com/itsjoenaso">Twitter</a> or <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a>. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! New posts every month-ish.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Numbers: One Year into a Solo Consulting Business]]></title><description><![CDATA[A realistic look at the first year out on my own.]]></description><link>https://datajargon.substack.com/p/the-numbers-one-year-into-a-solo</link><guid isPermaLink="false">https://datajargon.substack.com/p/the-numbers-one-year-into-a-solo</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Mon, 15 Jan 2024 13:15:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b14ac3f-a096-45c9-903d-80cb4ba1191a_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m a little late to the 2023 recap party, so we&#8217;re keeping this one brief. </p><p>Brief, yet informative. 
</p><p>The last 4 months have been surprisingly eventful in a professional sense, and that continued right up to the New Year. </p><p>In fact, it&#8217;s continued into the new year. I was not expecting that. </p><p>Here is a snapshot of my <em>almost </em>first calendar year working for myself and consulting in the data engineering space. </p><p>Rather than waxing poetic about life as a solopreneur, I&#8217;ll try to keep it pithy.</p><h3><strong>Projects</strong></h3><p>~35% Greenfield/ New</p><p>~55% Brownfield/ Existing</p><p>~10% Non-Development</p><p>I&#8217;m not surprised by this mix, but the way these projects play out differs greatly. </p><p>Greenfield projects are seldom <em>truly</em> greenfield. Someone internally has run a POC, convinced management, or decided that XYZ thing needs to get done. </p><p>I expect a larger chunk of Brownfield projects in 2024 - companies have no shortage of problems to solve. And coming off the ZIRP-era heyday for data tools, I expect many companies will realize that there is plenty of low-hanging fruit with regard to cost-optimization and performance improvements. </p><p>In case you didn&#8217;t know&#8230; you don&#8217;t need to throw more money at your problem to solve it. Sometimes, you just need someone who&#8217;s seen it before. </p><h3><strong>Customers</strong></h3><p>~20% Referrals</p><p>~20% Inbound </p><p>~20% Repeat</p><p>I can&#8217;t share all the details here for obvious reasons (read: contractual). But of the wider range of customers I worked with this year, I&#8217;m very happy that 1) people within my network are willing to vouch for me, 2) things I say on the internet are taken seriously (by some) and 3) customers recognize I can help solve their problems. 
</p><p>You know what they say - the easiest customer to sell to is one who's already paid you.</p><p>50% of my clients were venture-backed businesses, all providing some SaaS solution.</p><p>The other 50% were bootstrapped or spinout companies. </p><h3>Timing</h3><p>Shortest Duration Project: 3 Weeks</p><p>Longest Duration Project: 6 Months</p><p>Shortest Time to Close: 1 Day</p><p>Longest Time to Close: 4 Months</p><p>If I had to guess, the typical project duration was 3 months. Not bad. </p><p>I&#8217;ve learned I much prefer to come into an organization with a specific problem to solve over doing a &#8220;staff aug&#8221;-like engagement where I am embedded on a team. </p><p>Project-based engagements still mean working closely with the existing team. But the markers of success (and sense of urgency) when working as an outside expert are much more concrete than when you&#8217;re a hired hand picking up to-dos from the Kanban board. </p><h3>Money</h3><p>Price Increases: 2</p><p>Profit Margin: 91% (Lol)</p><p>Yes, bootstrapped service businesses have low overhead. Yes, 91% is ridiculous. </p><p>There are only 2 options to scale this business - prices and people. While I&#8217;ve managed teams before, I am not confident the economics work when it comes to hiring additional hands. Sub-contracting, however, is a different story.</p><p>By the time I am officially one year into this ride, I will have eclipsed my previous full-time salary by a meaningful margin. I still have a couple of months until I reach that milestone. </p><h3>A More Personal Note</h3><p>The biggest news this year wasn&#8217;t starting Purview Labs. </p><p>We got a dog - a rescue from a farm in South Texas. His name is Phil. He sits next to my desk when he wants pets, runs in circles when he&#8217;s excited, and sometimes looks goofy in photos. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!poZR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!poZR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 424w, https://substackcdn.com/image/fetch/$s_!poZR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 848w, https://substackcdn.com/image/fetch/$s_!poZR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!poZR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!poZR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg" width="382" height="509.2458791208791" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:382,&quot;bytes&quot;:1391868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!poZR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 424w, https://substackcdn.com/image/fetch/$s_!poZR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 848w, https://substackcdn.com/image/fetch/$s_!poZR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!poZR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5bd131-27df-4d61-89ec-d3cdd3ddae21_3024x4032.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Woof - Phil</figcaption></figure></div><p>I tend to keep my business persona separate from my personal one, at least when it comes to my online presence. </p><p>I don&#8217;t think that&#8217;s totally feasible anymore. Nor is it necessarily good for business. </p><p>I may be sharing a bit more about this side of life in 2024. Maybe.</p><p> </p><h3>Misc. Observations</h3><ul><li><p>I work almost every day in some capacity. I don&#8217;t wear myself thin, but it&#8217;s part of the journey right now. I&#8217;m not sure this approach is for everyone, and I&#8217;m not sure this is something I want to do for the long-term.</p></li><li><p>Long-term consulting work is still not the end goal. I find myself debating the merits of selling a product vs selling services. </p></li><li><p>Both can be wildly successful and, if done right, complementary offers. 
</p></li><li><p>I&#8217;ve expanded my network more in ~10 months of working for myself than in most of my W-2 roles. Conversations are different when you come into them as your own brand rather than someone who does XYZ thing at Some-VC-Funded-Startup.</p></li><li><p>Content works.</p></li><li><p>Creating content is time consuming.</p></li><li><p>Creating useful content even more so.</p></li><li><p>Content about tools and technologies gets significantly more readership and engagement than content about business. The data world has perpetual shiny object syndrome. </p></li><li><p>Tools like ChatGPT are not going to replace engineers. But they do help flatten the learning curve of new things. <em>Mastery </em>is only ever achieved through practical application of knowledge, though.</p></li><li><p>The difference between the Airflow and Dagster dev experience is night and day. I discounted Dagster for too long. </p></li><li><p>I haven&#8217;t been Duck-pilled, but DuckDB has its place in specific workflows. It is not a magic bullet, but it is a useful open source technology.</p></li><li><p>The broader data ecosystem is following a similar arc to that of the &#8220;internet marketer&#8221;. Lots of online courses. Lots of online personalities. It can be tiring to sort through the noise. </p></li><li><p>I still don&#8217;t have a company logo or branding. And I&#8217;m not sure I need one.</p></li></ul><p></p><p>If you made it this far, give me a shout on <a href="https://twitter.com/itsjoenaso">Twitter</a> or <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a>. One of my favorite parts of the last year has been talking to new people. 
</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! New posts every month-ish.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Business of Open Source]]></title><description><![CDATA[The open-core business model in the world of Data is not as simple as it seems]]></description><link>https://datajargon.substack.com/p/the-business-of-open-source</link><guid isPermaLink="false">https://datajargon.substack.com/p/the-business-of-open-source</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Mon, 02 Oct 2023 15:01:39 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5935" height="3957" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3957,&quot;width&quot;:5935,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Open LED 
signage&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Open LED signage" title="Open LED signage" srcset="https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1569017388730-020b5f80a004?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxvcGVuJTIwc291cmNlfGVufDB8fHx8MTY5NjE5NTU3N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 
15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">If a data tool is open source but no one talks about it, does it even matter? </figcaption></figure></div><p>At first glance, open source software sounds like a utopian ideal - useful products packaged up for anyone to use. Of course, they require a base level of technical acumen, but if you&#8217;re already interested in using open source tooling, you probably have some chops. </p><p>Recently, there has been a strong push for open-core businesses - companies whose core offering comes in two flavors: a do-it-yourself version (open source) and a managed-for-you version (hosted). </p><p>Is your team too small to afford a data engineer or someone else to run your infrastructure? Pay us for this hot new managed service and we&#8217;ll handle it for you. It&#8217;s that easy. </p><p>Never miss the data tool hype-train again! </p><p>There is a certain mystique about open source and open-core businesses. They appear to be exemplars of a one-for-all and all-for-one mindset where the people who build the product aren&#8217;t just shilling it for profit. They are here to help the average man or woman. </p><p>But that&#8217;s a naive and simplified view of something far more complex. 
</p><p>The boom in data tools over the last 5-10 years has made this exceedingly obvious. Never before has the <a href="https://dagster.io/blog/fake-stars">GitHub Stars to VC-backed startup</a> pipeline been stronger. </p><p>But does the idea of open-core even make sense for businesses in today&#8217;s market? The answer depends on who you talk to. </p><p>Either way, let&#8217;s take a look at the concept of open source and what it means in the world of data.</p><p></p><div><hr></div><h2><strong>What&#8217;s Open Source?</strong>  </h2><p>Fundamentally, Open Source (OSS) is a means of releasing software that allows the end user to distribute, extend, or otherwise alter that software for their own use. It&#8217;s the backbone of software today (not just data-related tooling), and you&#8217;d be hard pressed to find an engineer who has not worked with a variety of open source tools. I&#8217;d venture to say finding such a person would be impossible. </p><p>Almost every popular programming language is open source. In fact, most of them have an ecosystem dedicated to the distribution of OSS packages: </p><blockquote><p>javascript &#8212;&gt; npm</p><p>python &#8212;&gt; pip </p><p>rust &#8212;&gt; cargo</p><p>java &#8212;&gt; maven/ gradle</p><p>&#8230; there are many other languages and package managers, but you get the idea</p></blockquote><p></p><p>Are you reading this post on a Firefox browser? That&#8217;s open source. You can introduce changes to the Firefox project if you really want to. </p><p>Chrome is not open source. But a lot of the stuff under the hood, namely the Chromium web browser that powers all instances of Chrome, is an open source project. </p><p>Confused? Welcome to the world of modern software. There&#8217;s layers to this game. </p><p>There is often a difference between the OSS version of a piece of software and the commercially available version of it. 
The vast majority of Chrome&#8217;s features come from the open source Chromium browser, but there are some which are only supported by Chrome - not Chromium. </p><p>You could install Chromium instead, but you&#8217;ll lose some functionality. For instance, you&#8217;ll need to manually install every update. </p><p>It does get more complicated. Google offers Chrome at no cost, but under a proprietary license; so, even though you aren&#8217;t paying for the features they add on top of Chromium, it&#8217;s still a closed-source project. For our purposes, we don&#8217;t need to focus on licensing, though it does have a <a href="https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/">meaningful impact</a> at scale. </p><p>OSS typically follows this pattern - a free, community-supported version of a product, and a commercialized cousin with some additional, often very useful, features. It&#8217;s like the current version of you and that other version of you that hits the gym, strictly eats whole foods, and doesn&#8217;t endlessly scroll social media before going to bed.</p><p>This works really well when multi-national mega-corps like Google are incubating the product and issuing free licenses for its use. But many of the OSS tools you&#8217;re likely to use in the data world fall under the category of open-core, where a managed version is available if you&#8217;re willing to pay the price. </p><p>They also operate at a very different scale of business. </p><p></p><div><hr></div><h2><strong>The Economics of Open Source</strong> </h2><p>To the average person in another industry, this model probably makes no sense. The incentives on both sides appear misaligned.</p><blockquote><p>Build software that anyone can use.</p><p>Let them use it for free.</p><p>Pay your own people to maintain it.</p><p>Users can contribute to any open project available in the wild. 
</p></blockquote><p>This model breaks the mold of being compensated for providing value, at least with regard to financial compensation. Often, resources are dedicated to projects that generate no revenue, or in many cases, average people are contributing to projects that are assets of a company with which they are in no way officially affiliated. </p><p>That second scenario is a really interesting positive side-effect of open-core businesses: your features are sometimes built by your own users. It is literally free labor and free product development. Quite amazing.</p><p>By design, OSS projects rely on contributions (of code) by the public. The alternative path is that the company spends its own resources to develop and maintain a software package that others can use for free.</p><p>Can you name an industry other than tech where this would work? Non-profits don&#8217;t count. </p><p>The closest parallel may be a loss-leader for e-commerce or retail, but even that is not an entirely apt comparison.</p><p>Don&#8217;t get me wrong - OSS is fundamental to modern software. </p><p>In the early days of OSS, this approach offered an alternative to lock-in with tools built by large tech companies. In the mid-1980s, Postgres - one of the most popular database options today - was born (in part) as a response to the marketing aggression and growing market dominance of Oracle and its relational database. </p><p>There is something incredible about choosing to build a free, widely available alternative to a technology just because you don&#8217;t like the incumbent. </p><p>But OSS has grown significantly since then. </p><p>As engineers and developers have collectively gravitated towards marketplaces like GitHub, the cycle time between creating some software and others finding that software has sped up. 
There is no doubt platforms like GitHub have contributed to the rise in OSS - more engineers can contribute to these projects, and it&#8217;s easier than ever for OSS projects to be found. </p><p>And this has led to a partial change in what OSS represents. </p><p>The <a href="https://www.technologyreview.com/2022/04/21/1050788/the-changing-economics-of-open-source/">MIT Tech Review and ThoughtSpot</a> suggest that OSS might be as much an effort in corporate branding as it is a popular business model. The modern OSS community has built an ecosystem where many projects aspire to be spun off into venture-backed businesses, especially in data.</p><p>In many cases, there is nothing wrong with this. The initial contributors set off to solve a problem, grew distribution, identified a market, and created a business. </p><p>At face value, that&#8217;s great. It&#8217;s the exact model that you find with digital marketing gurus and Twitter/LinkedIn influencers: build an audience first. </p><p>But the reality of the &#8220;OSS to VC-backed company&#8221; pipeline might not be as sexy as it seems. </p><p></p><div><hr></div><h2><strong>Open Source and Open Core</strong></h2><p>The open-core SaaS model is pretty simple: </p><ol><li><p>Build a thing that solves a problem</p></li><li><p>Get people to use it</p></li><li><p>Get some portion of those users to pay you to manage the thing for them</p></li></ol><p>That&#8217;s fundamentally the OSS to product-business pipeline. Maybe you throw some professional services in there, or you start pricing by usage (everyone&#8217;s <em><a href="https://datajargon.substack.com/i/109882966/on-billing-in-todays-data-ecosystem">favorite </a></em><a href="https://datajargon.substack.com/i/109882966/on-billing-in-todays-data-ecosystem">pricing model</a>). The formula is well-defined, with many software products fitting this mold. 
</p><p>Take a small slice of popular data tool vendors, and you&#8217;re likely to find more than a handful of open source/ open-core products - <a href="https://www.getdbt.com/">dbt</a>, <a href="https://dagster.io/">Dagster</a>, <a href="https://www.astronomer.io/">Astronomer</a>, <a href="https://grafana.com/">Grafana</a>, <a href="https://snowplow.io/">Snowplow</a>, <a href="https://www.starburst.io/">Starburst</a>. There are plenty more. </p><p>Even big names like Snowflake use <a href="https://github.com/apple/foundationdb">open source technologies</a> as fundamental components of their offering. Databricks&#8217; <a href="https://www.databricks.com/product/data-lakehouse">own website</a> states their Data Lakehouse product &#8220;is underpinned by widely adopted open source projects Apache Spark&#8482;, Delta Lake and MLflow&#8221;. </p><p>This is a new kind of leverage. </p><p>It&#8217;s a distribution network, with free labor, that becomes a business.</p><p>It&#8217;s the repackaging of well-distributed, widely available software into new products. </p><p>Sounds great, right? </p><p>All of the companies mentioned above are venture-backed, and all of them - save Snowflake - are currently privately held. </p><p>With the exception of Grafana, which is probably in a somewhat favorable position due to Francisco Partners&#8217; recent acquisition of New Relic and the never-ending chatter about Datadog&#8217;s outrageous pricing model, you&#8217;d be hard-pressed to find a contender for a future public offering.</p><p>Yet the pipeline remains strong. Curiously so, if you ask me.</p><p></p><h3><strong>Point-Solutions and Systems-of-Record</strong></h3><p>OSS exists on a spectrum. On one end you have <a href="https://qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code">tiny single-use packages</a> - some might even call them toys. 
On the other, you have <a href="https://hadoop.apache.org/">large scale</a> <a href="https://projects.apache.org/project.html?spark">enterprise technologies</a> which historically have grown out of big tech companies. </p><p>The distinction may seem obvious with the examples above; they are clearly extremes. JavaScript&#8217;s LeftPad fiasco is internet lore at this point, while technologies like Hadoop and Spark are some of the most popular and impactful OSS data tools released in the past two decades. But they still illustrate the broader point - where the OSS lands on the spectrum matters. </p><p>Hadoop and Spark were incubated at Yahoo! and UC Berkeley&#8217;s AMPLab, respectively. They are fundamental components of technology for thousands of businesses and technical organizations. </p><p>These two tools have spawned <a href="https://www.cloudera.com/products.html">multiple commercially</a> <a href="https://en.wikipedia.org/wiki/Hortonworks">viable businesses</a>, and have been co-opted by platform providers like <a href="https://aws.amazon.com/emr/">AWS</a>.</p><p>Will the typical OSS data tool available today reach this scale? It&#8217;s unlikely in my opinion, and that&#8217;s mainly a function of <em>where</em> these new tools exist on the spectrum of available OSS. </p><p>On one end, you have single-use-case tooling and utilities for &#8220;end users&#8221; (aka analysts and other reasonably technical roles). They solve specific needs, often tied to existing workflows within the business. On the other, you have databases, infrastructure-as-code, and system observability. These are wide-reaching and flexible tools which (in some cases) act as the system-of-record. 
</p><p>Tech businesses can survive without point solutions; they cannot function without systems-of-record.</p><p>Over the last 5-10 years and in the data space specifically, it seems as though fewer lower-level technologies have followed the OSS-to-product-business formula than one-off utilities and point-solution packages. The exception to this statement is likely <a href="https://voltrondata.com/">Voltron Data</a>, from the team behind PyArrow (and Pandas, back in the day). </p><p>And even if that is not the case, there is no denying that point-solutions have been marketing to practitioners aggressively in an attempt to drum up FOMO. You can make a case that tools like dbt, Dagster, Prefect, and many others fall into this camp. </p><p>dbt, itself a point-solution on our OSS spectrum, illustrates another unique phenomenon for this market. While individuals may have gripes about its implementation details - namespacing of models, excessive YAML configurations, lack of governance - dbt Labs has been able to plug its own product gaps reasonably well with many of the surrounding tools in the ecosystem. </p><p>Is this a product moat? </p><p>You could make that case. Though a more realistic assessment is <em>sort of. </em>These complementary tools are, after all, separate businesses and separate concerns. And down the line, they&#8217;ll likely be fighting for mind-share and market-share. </p><p>It&#8217;s tough to say how defensible this positioning is over the long term. If your product encourages users to actively supplement its feature set with additional tooling, the market will figure out how to do this efficiently. It&#8217;s just one of the second-order effects of the open-core model. 
</p><p> </p><p></p><div><hr></div><h2><strong>Second Order Effects</strong></h2><p>Some vendors - for instance, ClickHouse - have apparently <a href="https://altinity.com/blog/is-clickhouse-moving-away-from-open-source">reduced their OSS focus</a> in favor of proprietary, closed-source feature development. Their CTO even weighed in publicly <a href="https://github.com/ClickHouse/ClickHouse/issues/44767#issuecomment-1689746194">on the matter</a>: </p><blockquote><p>It's good to have a small, limited number of modifications exclusive to ClickHouse Cloud, but only those that do not compromise the features or operation in self-managed usages, but in the same way, are crucial and distinguishing for the Cloud.</p></blockquote><p>This shouldn&#8217;t be surprising. There is a rich history of large tech platforms like AWS and others releasing managed offerings of these open source options. AWS, specifically, is comically guilty of taking OSS projects and releasing managed versions of them - Managed OpenSearch (Elasticsearch), Managed Airflow, Managed Flink, Managed Kafka (MSK), the list goes on. That is only a small subset. </p><p>You may have gathered that AWS itself has a less than favorable relationship with the broader OSS community - they are arguably the<em> </em>tech giant with the least notable contributions to OSS packages. Or, at least those packages they can&#8217;t readily commercialize. </p><p>All this is to say the business of open source software is tricky, and the path to commercialization is often tied to a brand. </p><p>And the branding aspect of OSS cannot be overstated, especially for products that exist closer to the point-solution end of our spectrum. </p><p>Apache Airflow - a very popular and equally hated orchestration tool - is now primarily maintained by employees of <a href="http://astronomer.io">Astronomer</a>, a company whose primary offer is a SaaS for managed Airflow deployments. 
They are not the exception here.</p><p>This relationship between OSS tool and well-branded corporate sponsor has spawned an entirely new category of role in tech - Developer Relations. </p><h3>Your Friends, DevRel</h3><p>In other industries, it&#8217;s called Community Building or Community Marketing. At one point, they were known as &#8220;Evangelists&#8221;. But today, it seems tech companies prefer to use different terminology in order to hide the fact that they are, after all, selling another product. </p><p>I am sure that many businesses have seen both improved traction and improved brand awareness from DevRel efforts, but don&#8217;t be fooled. These roles are about driving the bottom line, and that rings especially true for your VC-backed point-solutions. Until the recent boom in open-core businesses, DevRel was primarily associated with only a few companies, many of which are quite large - Apple, Google, NewRelic, Twilio. </p><p>Today, this function has become commonplace among small to mid-sized tech companies building developer tools. If you work in data, I guarantee you&#8217;ve come across some blog posts, YouTube videos, or Slack channels where the brand&#8217;s Developer Advocate or Head of Developer Experience is the person leading the charge. </p><h3>Software Re-bundling</h3><p>The OSS-induced spread of point solutions has led to software sprawl at many companies. Pair that with the end of the ZIRP era, and it&#8217;s likely that many businesses are going to be shortening their list of accounts payable; for many, it&#8217;s already happening.</p><p>Businesses can push back against software sprawl by cutting licenses and opting not to renew contracts. But how do vendors, especially those who may have strong ties to these open-core tools, play this game? </p><p>Re-bundling. </p><p>This is a positive externality, at least with regard to the broader market. 
The average consumer (a business adopting OSS tooling) benefits from another organization packaging up various tools into one offer; it&#8217;s an efficiency gain for the end customer, and the re-bundler is able to play labor arbitrage by packaging up OSS tools. </p><p>It is an all-around fantastic model. </p><p>We are already seeing this happen in the data space with companies like <a href="https://datacoves.com/product">Datacoves</a>, <a href="https://www.paradime.io/">Paradime.io</a>, and <a href="https://www.y42.com/">Y42</a>. There are others in the market, as well, including some service-heavy players like <a href="https://www.5x.co/product">5X Data</a>. </p><p>Other single-platform tools like <a href="https://mozartdata.com/">Mozart Data</a> and <a href="https://www.keboola.com/">Keboola</a> also fit this mold.</p><p>Almost all of these businesses use dbt Core, the open-core version of dbt Labs&#8217; product. You have to admire when independent businesses make open-core software a transparent component within their own offering. </p><p>These re-bundled offers are a win for everyone involved. The open-core business gains contributors to - and users of - their OSS tooling, the end customers get a friendlier experience thanks to another company handling all the tedious leg-work, and the re-bundlers fill a market need by reducing customer-facing complexity. </p><h3>Community Alternatives</h3><p>The final side-effect of open-core businesses is the creation of 100% open solutions. And it builds on the re-bundling/ re-packaging of tools mentioned above. </p><p>Think of it like knowledge sharing; a callback to the true spirit of OSS.</p><p>Projects like <a href="https://github.com/matsonj/nba-monte-carlo">MDS In a Box</a>, a 100% open-source codebase for running the &#8220;Modern Data Stack&#8221;, embody this. 
They include <em>all </em>the features and components of an analytics toolkit, but remain accessible to everyone.</p><p>Similarly, we&#8217;ve seen releases of fully <a href="http://open-source playbooks">open-source playbooks</a> on how to mimic dbt Cloud behavior without incurring additional cost related to a recent pricing change from dbt Labs&#8217; hosted cloud service.  </p><p>Are these options as robust as a fully-managed service? Hard to say, but that does not discount the fact that the option remains. </p><p>The very tooling that started as OSS and evolved into venture-backed open-core businesses is now being independently co-opted by other open-source projects. All to illustrate that you don&#8217;t need to pay for a managed service. </p><p>Perhaps we&#8217;ve come full circle.</p><p>It&#8217;s clear that as long as OSS tools are around, there will always be an alternative to the paid service. </p><p></p><div><hr></div><h2>So, Where Does This Leave Us?</h2><p>The open source path is attractive in some cases, and for good reason. For personal and professional recognition, it&#8217;s hard to beat. </p><p>But as a means to build a business, that path is not as friendly.</p><p>Building a product business is already difficult. Transitioning from an open-source positioning appears to be even more difficult. Not only are you developing the technology, but you&#8217;re also managing the perception and opinion of the broader community - the very people helping you develop your product. </p><p>The current focus on GitHub Stars feels akin to the Dot-Com era of measuring eyeballs on a webpage. No mention of profit, of building sustainable revenue streams. Just get users; growth-at-all-costs.</p><p>Is open-core the best way to build a business in the data space? </p><p>Unlikely, when you consider the wide range of tools that fall into the world of &#8220;data&#8221;. 
</p><p>Is open-core the best way to leverage the broader community, both in terms of distribution and product development? </p><p>Yes. </p><p>But, it would be naive to think that the open-core path is a guaranteed win. If anything, past experience and recent market headwinds show that scale matters,  product category matters, and the open-core model introduces competition in some very clever and unexpected ways. </p><div><hr></div><p>Think I missed the mark, or want to share your thoughts? </p><p>Find me on <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a> or <a href="https://twitter.com/itsjoenaso">Twitter</a>.</p><div><hr></div>
]]></content:encoded></item><item><title><![CDATA[DBT vs SDF vs SQLMesh]]></title><description><![CDATA[A data modeling tool comparison, with examples]]></description><link>https://datajargon.substack.com/p/dbt-vs-sdf-vs-sqlmesh</link><guid isPermaLink="false">https://datajargon.substack.com/p/dbt-vs-sdf-vs-sqlmesh</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Tue, 22 Aug 2023 16:43:16 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="2768" height="1848" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1848,&quot;width&quot;:2768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;white printing paper with numbers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="white printing paper with numbers" title="white printing paper with numbers" srcset="https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1529078155058-5d716f45d604?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxleGNlbHxlbnwwfHx8fDE2OTI2NDMwNjV8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption">Excel, the one data modeling tool everyone knows</figcaption></figure></div><p></p><blockquote><p>For some of you, that headline might not mean much. That&#8217;s ok, and this one might not be for you. I&#8217;m mixing in some technical posts every once in a while. </p><p>This is the first of the series. </p></blockquote><p></p><p>If you&#8217;re reading this, I&#8217;m going to assume you know what DBT is. You don&#8217;t need to know what SQLMesh is, or have an opinion on Semantic Data Fabric, but the more familiar you are with the world of data warehouses and data modeling, the better. </p><p>This is going to be a side-by-side comparison of 3 tools:</p><ol><li><p>DBT (<a href="https://www.getdbt.com/">https://www.getdbt.com/</a>)</p></li><li><p>Semantic Data Fabric (<a href="https://www.sdf.com/">https://www.sdf.com/</a>)</p></li><li><p>SQLMesh (<a href="https://sqlmesh.com/">https://sqlmesh.com/</a>)</p></li></ol><p><strong>Some guidelines:</strong></p><ul><li><p>I have no direct affiliation with any of these tools other than being a user </p></li><li><p>We&#8217;re using the same raw data set for each tool </p><ul><li><p>Specifically, some dumbed-down NFL Play By Play data available for download using either R or Python (<a href="https://github.com/nflverse/nflverse-pbp">here</a> or <a href="https://github.com/cooperdff/nfl_data_py">here</a>)</p></li><li><p>You can see how we munge/ clean the raw data prior to loading to the DB <a href="https://github.com/JoeNaso/nfl-pipeline/blob/main/cli.py#L43">here</a></p></li></ul></li><li><p>The schema of the Play-by-Play data can be found below:</p><ul><li><p><a href="https://github.com/JoeNaso/nfl-pipeline/tree/main/data-models">Data Models</a></p></li></ul></li><li><p>All the code (and some other stuff) can be found in this repo:</p><ul><li><p><a 
href="https://github.com/JoeNaso/nfl-pipeline">NFL Pipeline Repo</a></p></li></ul></li><li><p>While we could easily add Snowflake or BigQuery into the mix for any of these tools, we&#8217;re going to use DuckDB so we don&#8217;t have to deal with cloud resources, associated costs, and additional configuration</p></li><li><p>We&#8217;re keeping things <em>pretty </em>simple, and we&#8217;re doing that on purpose</p><ul><li><p>Simple project setup, simple table definitions, simple outputs</p></li><li><p>While most of these tools offer non-SQL modeling options (i.e. Python), we&#8217;re sticking to SQL for this walkthrough</p></li></ul></li><li><p>Simple also means we&#8217;re not testing/ comparing <em>every </em>feature of these tools. There is just too much to discuss across all 3 pieces of software</p></li></ul><p>The tools outlined here are going to be evaluated based on a few categories: </p><ul><li><p>Compute and Utilities </p></li><li><p>Speed of Development</p></li><li><p>Quality of Life</p></li></ul><p>There is no score. There is no winner. This is only a comparison that you can hopefully use to decide which tool fits best with your workflow. </p><div><hr></div><h2>The Setup</h2><p>I exported all the play-by-play data to some Parquet files that are stored in S3; this data was slightly cleaned in a pre-processing step that took the raw Play-By-Play data from the Python package linked above, stripped out malformed characters, and dropped some columns. You can find the <a href="https://github.com/JoeNaso/nfl-pipeline/blob/main/cli.py#L43">simple</a> <a href="https://github.com/JoeNaso/nfl-pipeline/blob/main/cli.py#L64">pipeline</a> in <a href="https://github.com/JoeNaso/nfl-pipeline/blob/main/pipeline/db/duck.py#L52">GitHub</a>.</p><p>We&#8217;re using DuckDB on a local machine to access this data, and each of the three tools mentioned above is configured to use DuckDB to access the data. 
</p><p>You won&#8217;t find the DuckDB &#8220;instance&#8221; in the repo (it&#8217;s gitignored), but the pipeline will look for a local DB and create one if it doesn&#8217;t exist. </p><p>The (large) schema of our raw data can be <a href="https://github.com/JoeNaso/nfl-pipeline/tree/main/data-models">found in the repo</a>; just know that each row of data represents a single play. We can move throughout the course of a single game to rewind or fast-forward to a certain game state.</p><p>In our case, we&#8217;re using this play-by-play data to do some simple aggregations. </p><p>We&#8217;ll be replicating the same tables in each tool, using native patterns for that tool. </p><p>If you&#8217;re a Kimball die-hard, married to Data Vault, or an Inmon purist, you might be a bit annoyed at what you see. These data models aren&#8217;t 100% by-the-book; they&#8217;re only meant to illustrate how different tools use different patterns. </p><p>And, as a result, produce different developer experiences. </p><p></p><div><hr></div><h2>DBT</h2><p>There are plenty of tutorials about setting up a DBT project, how to do X with DBT, and much more online. I don&#8217;t want to focus on that.</p><p>Instead, we&#8217;re going to talk about the <em>experience</em> of using DBT. </p><p></p><h4>Compute and Utilities</h4><p>The decision to use DuckDB was driven by DBT, even though the other tools support a &#8220;cost-less&#8221; local developer experience. I simply do not want to pay for every execution of this project. The other tools in this write up enable us to entirely bypass the compute costs of designing and tweaking a data warehouse. Unfortunately, DBT is not such a tool. </p><p>There is effectively no way to validate the outputs of your DBT models without running them. This means that a single model change - even one as simple as casting a value to another type - results in an incremental cost. 
</p><p>You also cannot evaluate your tests without incurring some compute overhead. </p><p>The local development experience is solid enough - there are a lot of useful extensions in the ecosystem, baked-in utilities to construct a graph of model dependencies, and a well-documented project configuration. But, it comes at a price. </p><p>Take the relatively simple `base_games.sql` model below. We&#8217;re basically just joining the play-by-play table to itself, and doing some minor aggregations to create a single record of the &#8220;final&#8221; outcome of the game. </p><pre><code>with plays_home as (
    select
        game_id,
        home_team as team,
        count(distinct play_id) as total_plays
    from {{ ref('pbp_all')}}
    where posteam = home_team
    {{ group_by(2) }}

),
plays_away as (
    select
        game_id,
        away_team as team,
        count(distinct play_id) as total_plays
    from {{ ref('pbp_all')}}
    where posteam = away_team
    {{ group_by(2) }}

),
plays as (
    select
        plays_home.game_id,
        plays_home.team as home_team,
        plays_home.total_plays as total_plays_home, 
        plays_away.team as away_team,
        plays_away.total_plays as total_plays_away
    from plays_away
    inner join plays_home
        on plays_away.game_id = plays_home.game_id
        
)

select
    distinct plays.game_id,
    week, 
    season_type,
    plays.home_team,
    plays.away_team,
    md5(plays.home_team) as home_team_masked, 
    md5(plays.away_team) as away_team_masked,
    total_away_score,
    total_home_score,
    total_plays_home,
    total_plays_away,
    case 
        when total_away_score &gt; total_home_score then plays.away_team
        else plays.home_team 
    end as victor
from {{ ref('pbp_all') }}
inner join plays
    on pbp_all.game_id = plays.game_id</code></pre><p>To validate the output of this model, we would run it on our local machine, which would connect to a database, potentially start up a server/warehouse, and execute that logic in our sandbox schema (as is DBT convention). </p><p>If we deploy this model to staging, run some tests on it, and then promote it to production, we&#8217;ve now incurred compute charges on <em>at least</em> 4 separate occasions.</p><p>In our small `base_games` table, that is relatively insignificant, but when you multiply this across 100s or 1000s of production models, you introduce a significant multiplier effect. </p><p></p><h4>Speed of Development</h4><p>This is easily where DBT shines - once you know the patterns and the configuration framework, you can introduce new models as quickly as you can throw together some SQL. </p><p>But, this is a double-edged sword.</p><p>Anyone who works in the space knows this. It&#8217;s easy to introduce new models, but that often leads to messy projects, duplicate logic, and database objects that persist long after their utility (or accuracy) has expired. </p><p>I&#8217;ve designed, redesigned or contributed to well over a dozen DBT projects. You can make things happen quickly, but that flexibility is also what contributes to many of the anti-patterns that have become prevalent in the data ecosystem. </p><p></p><h4>Quality of Life</h4><p>There is a big ecosystem surrounding DBT - packages, SaaS vendors, blogs, whatever else. </p><p>When it comes to quickly covering ground in a warehouse project, you&#8217;ll probably be able to move <em>very </em>fast with this tool. That is of course assuming you have the appropriate supporting tools and permissions. </p><p>In fact, many of the quality-of-life components of DBT are enabled by surrounding tools. Especially the third-party packages. 
</p><p>So, while we all have a few ideas about what could be improved with DBT (name-spacing, for one), the quality of life when using it is quite high. The challenges come when building at scale - duplicate logic, inconsistent definitions, poor input data. </p><p>Some of those pieces are out of the reach of DBT entirely; others are not. </p><p>In my experience, anyone who says that DBT doesn&#8217;t enable them to move quickly is doing something wrong or is blocked by organizational constraints.</p><p>A quick comment on lineage - DBT <em>kind of </em>shows lineage, but only at a high level. That is, table to table. The tool&#8217;s rise to popularity almost certainly helped spawn a suite of supplemental tools, or at least helped them also rise to prominence. This feature - or rather, the lack thereof - is a drawback of DBT at scale. </p><p></p><div><hr></div><h2>SDF</h2><p>SDF stands for Semantic Data Fabric, a newcomer product with roots at Meta.  I need to caveat everything I mention below because SDF is still in beta. So, as of this post, some features are not yet available.</p><p>Of the tools in this post, this one feels most different. And there are some clear reasons why. </p><p>First off, it&#8217;s written in Rust, which means you install a binary. There is nothing to <code>`pip install`</code>, no package extensions to include, and no virtual environment to maintain. </p><p>This lends itself nicely to development speed, as you don&#8217;t have to battle with package dependencies, but that also means it might feel foreign to many less experienced engineers and analysts. </p><p>Python is undoubtedly the language of choice in business analytics (sorry, R), and you&#8217;re bound to find candidates who are much more familiar with Python than Rust. That said, you&#8217;re not writing any Rust code when you&#8217;re using SDF. </p><p>You just&#8230; write SQL. 
</p><p>No macros, no ref statements, no jinja templating (Again: They&#8217;re in Beta).</p><p>But, just like DBT, there is a fair amount of YAML-based configuration. I did not notice a significant difference in YAML when setting up DBT and SDF, but like most things in software, it depends. </p><p>Most of the YAML that SDF relies on is not generated manually - it&#8217;s created when you run SDF. In fact, for our small NFL project, the system-generated YAML exceeds 6600 lines.</p><p>YAML - the workhorse of modern data tools.</p><p></p><h4>Compute and Utilities</h4><p>Here&#8217;s the thing - what SDF might lack in convenience utilities it makes up for in development experience. </p><p>You can &#8220;run&#8221; your production warehouse from your local machine, complete with propagating changes throughout downstream models without incurring a cent of compute cost. </p><p>This is different from both the DBT approach and the SQLMesh approach (see below), which require you to run a SQL command against your database or execute a query against it without generating a materialization, respectively.  </p><p>SDF essentially keeps a cache of your current project definition in a hidden folder on your machine. Change a column or datatype? You can apply those changes to your local project state without incurring compute cost. </p><p>You might think of it as something like <code>`terraform plan`</code>, if you&#8217;re familiar with IaC patterns. </p><p>That DBT model I shared above? In SDF, <code>`base_games.sql`</code> looks like this: </p><pre><code><code>with plays_home as (
    select
        game_id,
        home_team,
        count(distinct play_id) as total_plays
    from pbp_all
    where posteam = home_team
    group by 1, 2

),
plays_away as (
    select
        game_id,
        away_team,
        count(distinct play_id) as total_plays
    from pbp_all
    where posteam = away_team
    group by 1, 2

),
plays as (
    select
        plays_home.game_id,
        plays_home.home_team,
        plays_home.total_plays as total_plays_home, 
        plays_away.away_team,
        plays_away.total_plays as total_plays_away
    from plays_away
    inner join plays_home
       on plays_away.game_id = plays_home.game_id
        
)

select
    distinct plays.game_id,
    week, 
    season_type,
    plays.home_team,
    plays.away_team,
    md5(plays.home_team) as home_team_masked, 
    md5(plays.away_team) as away_team_masked,
    total_away_score,
    total_home_score,
    total_plays_home,
    total_plays_away,
    case 
        when total_away_score &gt; total_home_score then plays.away_team
        else plays.home_team 
    end as victor
from pbp_all
inner join plays
    on pbp_all.game_id = plays.game_id
;</code></code></pre><p>Looks pretty familiar, right? </p><p>It&#8217;s just SQL without the `{{ ref() }}` and `{{ group_by() }}` macros. </p><p>Some of you might not like this - you lose the convenience utilities we all know with DBT. Others might look at this and declare that SQL-based data models are the way things are <em>supposed </em>to be. I&#8217;ll let you decide. </p><p>I did find myself pleasantly surprised by how <em>clean</em> working with SDF felt. </p><p>But, I also found myself mulling over the fact that if I were to do something complex in SDF, I would be writing much more code. </p><p>If I were to translate some DBT projects I&#8217;ve seen into SDF models, I know it would be a challenge. </p><p>However, that speaks more about the state of data modeling at many businesses today than it does about SDF. </p><p>That thought alone might be just one reason why a tool like SDF is a strong choice for companies with strong domain-driven data models. After all, there is clearly a reason behind the naming choice of &#8220;semantic data fabric&#8221; (see: <a href="https://www.gartner.com/smarterwithgartner/data-fabric-architecture-is-key-to-modernizing-data-management-and-integration">what is data fabric</a>). </p><p>The SDF team did build in some nice utilities and tooling to help wrangle the inevitable sprawl of data models, though. <code>`Classifiers`</code> provide an SDF-native mechanism to apply categorical tagging to your columns and their child columns throughout your project. </p><p>By pairing <code>`Classifiers` and `Function` </code>definitions, you can reclassify data points from sensitive to non-sensitive as needed, and have that change propagate across all relevant models. </p><p>For example, we&#8217;re naively hashing the team names in order to simulate what would normally be done through some data masking mechanism. In our case, we&#8217;re just applying an md5 hash to a string.  
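</p><p>To make the masking step concrete, here is a minimal sketch in plain Python that sits outside the SDF project. It assumes a throwaway in-memory SQLite table standing in for <code>`base_games`</code>, a couple of made-up rows, and a hypothetical <code>`md5_hex`</code> helper registered under the name <code>`md5`</code>, since SQLite has no built-in equivalent of the <code>`md5()`</code> call used in the models above.</p>

```python
import hashlib
import sqlite3


def md5_hex(value):
    """Return the hex MD5 digest of a string, or None for NULL input."""
    if value is None:
        return None
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def mask_teams(rows):
    """Load (game_id, home_team, away_team) rows and return them with masked columns."""
    conn = sqlite3.connect(":memory:")
    # Assumption for this sketch: SQLite lacks md5(), so expose the Python helper under that name.
    conn.create_function("md5", 1, md5_hex)
    conn.execute(
        "create table base_games (game_id text, home_team text, away_team text)"
    )
    conn.executemany("insert into base_games values (?, ?, ?)", rows)
    masked = conn.execute(
        """
        select
            game_id,
            md5(home_team) as home_team_masked,
            md5(away_team) as away_team_masked
        from base_games
        order by game_id
        """
    ).fetchall()
    conn.close()
    return masked


# Made-up sample rows, not taken from the real play-by-play data.
SAMPLE_ROWS = [
    ("game_1", "Dallas Cowboys", "New York Giants"),
    ("game_2", "New York Giants", "Dallas Cowboys"),
]

if __name__ == "__main__":
    for row in mask_teams(SAMPLE_ROWS):
        print(row)
```

<p>The same input always produces the same digest, so masked columns can still be joined and grouped on. An unsalted hash over a small, guessable set of values like team names is only obfuscation, which fits the &#8220;naive&#8221; label above. 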
</p><p>So, something like `Dallas Cowboys` becomes `5a40b7cccc2a230895881416edda4622`.</p><pre><code>--- workspace.sdf.yml
classifier:
  name: MASKER
  labels:
    - name: hash_me
    - name: it_was_hashed

--- src/functions.sdf.yml
function:
  name: md5
  reclassify:
    - from: MASKER.hash_me
      to: MASKER.it_was_hashed

--- base_games.sql
select
    distinct plays.game_id,
    week, 
    season_type,
    plays.home_team,
    plays.away_team,
    md5(plays.home_team) as home_team_masked, 
    md5(plays.away_team) as away_team_masked,
    &lt;... more query stuff happening here ...&gt;</code></pre><p>And when running SDF, we see that we&#8217;ve now reclassified our data from something needing to be hashed (aka hash_me) to a clean value (now flagged as it_was_hashed). </p><p>This reclassification follows our home_team and away_team values to every model where this data is used. </p><pre><code>Schema nfl_sdf.pub.base_game
+------------------+-----------+-------------+----------------------+
| column_name      | data_type | is_nullable | classifier           |
+------------------+-----------+-------------+----------------------+
| game_id          | varchar   | true        |                      |
| week             | int       | true        |                      |
| season_type      | varchar   | true        |                      |
| home_team        | varchar   | true        | MASKER.hash_me       |
| away_team        | varchar   | true        | MASKER.hash_me       |
| home_team_masked | varchar   | true        | MASKER.it_was_hashed |
| away_team_masked | varchar   | true        | MASKER.it_was_hashed |
| total_away_score | double    | true        |                      |
| total_home_score | double    | true        |                      |
| total_plays_home | bigint    | true        |                      |
| total_plays_away | bigint    | true        |                      |
| victor           | varchar   | true        | MASKER.hash_me       |
+------------------+-----------+-------------+----------------------+</code></pre><p></p><h4>Speed of Development</h4><p>I&#8217;ve never set up an SDF project before. It&#8217;s only been publicly available for a few weeks. </p><p>And while setting this thing up had a few hiccups, it was nothing beyond what you&#8217;d expect for a tool in Beta. First attempts at standing up a new tool can take some getting used to. For what it&#8217;s worth, SQLMesh also took me for a little ride. </p><p>I will give the team credit, though; they responded to my Slack messages and clarified questions QUICKLY. Even to the point of sending me resources multiple times over. </p><p>It&#8217;s clear that this approach is different from DBT. </p><p>It&#8217;s perhaps less about the convenience methods and more about building fast, stable software. In SDF&#8217;s case, that means fast run times, building a native graph of dependencies without the need for Python and jinja wrappers, and evaluating changes against your local project cache. </p><p>Writing and running models with SDF gave me the sense I could feel the underlying Rust framework - a markedly different experience than working with a Python-based tool like DBT. </p><p>If you&#8217;re not familiar with using statically typed languages with strong type-checking, it might be the right time for you to try something new. </p><p></p><h4>Quality of Life</h4><p>There are some things about SDF that stand out as nice quality of life additions. While the feedback cycle of making changes locally and &#8220;running&#8221; them against your project is great, I found that small details like syntax errors that include references to your bad code are super useful (thanks, Rust compiler). This beats the nondescript errors you often find in many databases, hands down.</p><pre><code><code>error: Ambiguous column 'home_team' found, available are 'pbp_all.play_id', 'pbp_all.game_id', 'pbp_all.old_game_id', 'pbp_all.home_team', 'pbp_all.away_team', ...
  --&gt; src/base/base_game.sql:42:4</code></code></pre><p>While it might not yet have the robust ecosystem that DBT has developed, I get the impression that <em>right now</em>, your ability to use SDF effectively depends mostly on your ability to effectively model data using SQL. </p><p>In the future, I&#8217;d expect that Polars, PySpark, and other tools can be plugged into the mix. Similarly, some of that YAML configuration will likely be supplemented by Python or Rust.</p><p>Finally, when it comes to lineage, SDF excels. Simple as that. Between the query parsing and Classifiers, SDF makes it really easy to find where a column originates, and how it may be modified throughout your project. </p><p>I obviously only scratched the surface of SDF, but so far, I like what I&#8217;ve seen. </p><p></p><div><hr></div><h2>SQLMesh </h2><p>I was intrigued when SQLMesh was released; to me, it was a meaningful step away from the DBT paradigm. </p><p>After having read through the docs and setting up the project, I initially found myself thinking of SQLMesh as a new tool built on top of DBT patterns. </p><p>But that&#8217;s not entirely correct.</p><p>Opinionated tools are generally a good thing (even if you don&#8217;t agree with their opinions). In the case of SQLMesh, there is no shortage of opinions:</p><ul><li><p>Pipeline definitions should be in SQL, rather than configured via YAML</p><ul><li><p>Python is an option, too</p></li></ul></li><li><p>`SELECT` expressions are expected to follow a certain convention </p><ul><li><p>unique column names (this is a given)</p></li><li><p>explicit types </p></li><li><p>explicit aliases are recommended</p></li></ul></li><li><p>Inline model configurations provide metadata, as well as control (orchestration) </p></li><li><p>&#8220;Backfilling&#8221; is a core concept within the SQLMesh workflow</p></li></ul><p>If you look at these requirements independent of other tools, they make a lot of sense. 
Even things like explicit types and explicit aliases are generally just good practice. </p><p>But when you put DBT and SQLMesh side-by-side, you begin to get the sense that these choices were very <em>intentional. </em></p><p>YAML is everywhere in DBT: from docs to project configs to tests to packages. You can&#8217;t escape it. That&#8217;s not the case with SQLMesh, and it&#8217;s nice to see a modeling tool that is lighter on the markup. </p><p>This does mean properties that would typically be defined in YAML are now pushed down to individual models, though.</p><p>There are at least <a href="https://sqlmesh.readthedocs.io/en/stable/concepts/models/overview/#properties">14 different model-level</a> configuration options for &#8220;full-refresh&#8221; models alone; incremental models add another 3 options. And, interestingly, those are independent of the <code>`model_defaults`</code> section in the project-level <code>`config.yaml`</code> or <code>`config.py`</code> file. </p><p>Of course, many of these properties are optional, but they do add overhead to the project configuration. For a larger-scale project, these are probably quite useful. For smaller deployments, they feel excessive.</p><p>Where some tools <em>require</em> that you write SQL, SQLMesh almost asks you to swap some out for Python. I can&#8217;t speak to the friendliness of the API, but this certainly fits the toolset of today&#8217;s data teams. </p><p></p><h4>Compute and Utilities</h4><p>Much like SDF, SQLMesh keeps a local cache of your project. But, it&#8217;s used differently - specifically for Plans, or what you can think of as a project diff. </p><p>Environments and Plans add additional complexity to SQLMesh projects. </p><p>Under the hood, there is a robust state machine tracking changes to your models and their dependencies. The &#8220;state&#8221; of your project - aka Plan - gets applied to an Environment. 
This gets increasingly complex as the scope of your changes is determined.</p><p>I am all for tracking changes and enabling teams to understand how their tables have changed over time. But, SQLMesh seems to take this to the extreme: </p><blockquote><p>Every time a model is changed as part of a plan, a new variant of this model gets created behind the scenes [&#8230;]. In turn, each model variant's data is stored in a separate physical table.</p><p><em>Each model variant gets its own physical table while environments only contain references to these tables.</em></p></blockquote><p>This sounds like an interesting feature that enables teams to rewind model changes or fast-forward to a designated model state. But, I&#8217;m curious how teams would use this in a real-life production setting outside CI/CD. </p><p>Given that databases like Snowflake enable <a href="https://docs.snowflake.com/en/user-guide/data-time-travel">Time-Travel</a> - also something commonly used in CI/CD - I&#8217;m curious when one would be used over the other. If you&#8217;re not using Snowflake, then maybe a feature like this has more utility. Regardless, this part of SQLMesh adds significant complexity. </p><p>One other thing about SQLMesh that stands out is the Virtual Update feature; I initially equated it to something like React&#8217;s Virtual DOM - a representation of the DOM (Document Object Model) that is used to determine <em>what </em>needs to change on a webpage as a user interacts with it and changes state - but the docs indicate that &#8220;the process of promoting a change to production is reduced to reference swapping&#8221;. So, it&#8217;s not like the Virtual DOM at all. </p><p>In fact, it appears more like an internal mechanism for blue-green deployments. 
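</p><p>To make the blue-green comparison concrete, here is a minimal sketch of what reference swapping could look like in plain SQL. It assumes a DuckDB- or Postgres-style <code>`create or replace view`</code>, and the table and view names are made up for illustration. This is not SQLMesh&#8217;s actual implementation, just the general idea of an environment holding a reference to a physical table.</p><pre><code>-- Illustrative sketch only; hypothetical names, not SQLMesh internals.
-- Each model variant lives in its own physical table.
create table base_game__v1 as
select 1 as game_id, 21 as total_home_score;

create table base_game__v2 as
select 1 as game_id, 24 as total_home_score;

-- The "environment" is only a reference (here, a view) to one variant.
create or replace view prod_base_game as
select * from base_game__v1;

-- Promotion swaps the reference; no data is copied or recomputed.
create or replace view prod_base_game as
select * from base_game__v2;

-- Readers of prod_base_game now see the v2 rows.
select * from prod_base_game;</code></pre><p>Rolling back in this sketch would just mean pointing the view at <code>`base_game__v1`</code> again. 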
</p><p>The more I read through the docs for SQLMesh, the more it appeared that many features of the tool were inspired by existing paradigms in other tools - DBT, Airflow, various pipeline vendors.</p><p>Testing within SQLMesh is done through two mechanisms - Tests and Audits. SQLMesh Tests effectively act as unit tests, checking that some piece of logic produces the expected output. Audits, however, allow you to attach assertions to your models; these are more akin to data quality checks. Much like DBT tests, these are SQL-based assertions that pass when the query returns no rows. </p><p>The distinction between the two makes sense, and follows the established DBT pattern of data tests and custom logic tests. </p><p>SQLMesh&#8217;s implementation feels more robust, possibly because of the wide range of Audits available by default. These Audits seem to blend the default testing patterns of DBT with those available in packages like <a href="https://github.com/calogica/dbt-expectations">DBT-Expectations</a>, Calogica&#8217;s testing toolkit inspired by <a href="https://docs.greatexpectations.io/docs/">Great Expectations</a>.</p><p>It is unclear if there are other mechanisms to test Python-based models (beyond Python-native testing utilities, of course), given that Tests and Audits are SQL-based.</p><p>You can see that an Audit is included in the SQLMesh <code>`base_game.sql`</code> model: </p><pre><code>MODEL (
  name base_sqlmesh.base_game,
  kind FULL,
  grain game_id,
  audits [assert_game_scores_positive],
);

with plays_home as (
    select
        game_id,
        home_team,
        count(distinct play_id) as total_plays
    from raw.pbp_all
    where posteam = home_team
    group by 1, 2

),
plays_away as (
    select
        game_id,
        away_team,
        count(distinct play_id) as total_plays
    from raw.pbp_all
    where posteam = away_team
    group by 1, 2

),
plays as (
    select
        plays_home.game_id,
        plays_home.home_team,
        plays_home.total_plays as total_plays_home, 
        plays_away.away_team,
        plays_away.total_plays as total_plays_away
    from plays_away
    inner join plays_home
       on plays_away.game_id = plays_home.game_id
        
)

select
    distinct plays.game_id,
    week, 
    season_type,
    plays.home_team,
    plays.away_team,
    md5(plays.home_team) as home_team_masked, 
    md5(plays.away_team) as away_team_masked,
    total_away_score,
    total_home_score,
    total_plays_home,
    total_plays_away,
    case 
        when total_away_score &gt; total_home_score then plays.away_team
        else plays.home_team 
    end as victor
from raw.pbp_all
inner join plays
    on pbp_all.game_id = plays.game_id
;</code></pre><p>And of course, our Audit `<code>assert_game_scores_positive.sql`</code>:</p><pre><code>AUDIT (
  name assert_game_scores_positive,
);

SELECT *
FROM @this_model
WHERE total_away_score &lt; 0 OR total_home_score &lt; 0
;
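</code></pre><p>Because an Audit passes only when its query returns no rows, most data quality rules can be phrased as &#8220;select the rows that break the rule&#8221;. As a sketch, here is a hypothetical second Audit that checks the <code>`game_id`</code> grain declared on the model. The audit name is made up for illustration, and it would still need to be added to the model&#8217;s <code>`audits`</code> list to run.</p><pre><code>AUDIT (
  name assert_game_id_unique,
);

-- Any game_id that shows up more than once violates the declared grain.
SELECT
    game_id,
    count(*) AS row_count
FROM @this_model
GROUP BY game_id
HAVING count(*) &gt; 1
;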
</code></pre><p></p><h4>Speed of Development</h4><p>I found setting up my first working SQLMesh project to be as challenging as the SDF project. It wasn&#8217;t necessarily complicated, just something with which I was not familiar.</p><p>SQLMesh&#8217;s Environments and Plans are a good example of this. Conceptually, they are not difficult to grok, but the way they interact with table changes, Virtual Updates, and &#8220;Backfilling&#8221; takes some getting used to. </p><p>SQLMesh lets you spin up models as quickly as any other tool, but making use of its advanced features requires following specific paradigms within the tool. </p><p>For many, these will be new concepts. </p><p></p><h4>Quality of Life</h4><p>One small, though impactful thing that SQLMesh provides is the ability to configure your project in different ways: declaratively with YAML or dynamically with Python. The <code>`config.yaml`</code> file you need to set up your connections and project specs can instead be a <code>`config.py`</code> file, enabling more complex pathing logic, loading env vars, and more. </p><p>This project uses a <code>`config.py`</code> in order to access a DuckDB instance located at the project root. Using YAML would mean a hard-coded path, effectively making the project useless across computers. </p><p>SQLMesh feels like a lot of software you find in the Python-based data tool category - there is the quick-and-dirty way to get things working, and then there is the idiomatic approach. </p><p>The idiomatic approach always unlocks a new toolkit, but it can be challenging to embrace. </p><p>I encourage everyone to look at <a href="https://sqlmesh.readthedocs.io/en/stable/concepts/macros/sqlmesh_macros/">SQLMesh Macros</a> as an example of something that may be very powerful when done right, but requires a foreign syntax to get there. </p><p></p><div><hr></div><h2>Final Considerations</h2><p>DBT, SDF and SQLMesh solve the same problem in different ways. 
</p><p>You can&#8217;t say that one is better than the others, mostly because there are parts to each of them which support different workflows. Plus, there are many features to each tool, so it&#8217;s quite difficult to say you should only use X in some situations but only Y in others. </p><p>I stand by my earlier comment that using the idiomatic approach (read: tool-specific) will unlock new methods to solve your problem, but it&#8217;s not always straightforward to get to this point. </p><p>Especially when you or your team is accustomed to solving that problem in a specific way. </p><p>Perhaps a wildly under-appreciated fact is that these 3 tools can only take you so far. Speed of development and ease of use do not make up for data-producing systems that lack governance and well-structured access patterns.</p><p>Data modeling as a practice existed long before DBT, Cloud Warehouses, and compute credits. Some companies have apparently forgotten this, or worse yet, never knew this was the case. </p><p>At the end of the day, you&#8217;re not going to go wrong by using DBT, SDF or SQLMesh - in fact, choosing one is almost certainly better than not using any of them. Yes, there are tradeoffs in choosing one vs the others, but that&#8217;s the nature of software. </p><p>Some things should not factor into your decisions, mostly because they are table stakes: </p><ul><li><p>Speed of getting started</p><ul><li><p>You can get a fully functional project running in well under an hour no matter which tool you&#8217;re using</p></li></ul></li><li><p>Support for specialized systems and platforms</p><ul><li><p>If you&#8217;re using such a specialized toolset that none of these tools fit what you need, you&#8217;re probably looking into the wrong category of solutions altogether</p></li></ul></li><li><p>Deployments</p><ul><li><p>The options for deploying data tools like these are virtually ENDLESS. 
They are all going to support a paid deployment mechanism in one way or another, but prudent data teams know you <em>don&#8217;t </em>need to rely on a vendor for this. In fact, the most cost-conscious of them will purposely look elsewhere. </p></li></ul></li></ul><p>Instead, your decision should come down to: </p><ul><li><p>Incremental cost of delivery </p></li><li><p>How well the tool fits into your existing systems and development workflow </p></li><li><p>Whether you can support this tool within your workflow for a long time </p></li></ul><p>Do they have 100% feature parity? No. But, chances are they will in short order. This category - which saw <a href="https://www.serra.io/">new entrants to the market</a> even as I was writing this post - is on a path towards feature commoditization. </p><p>Over the next 12-18 months, I expect most tools in the space to spend significant resources attempting to further differentiate. We&#8217;ll see how it plays out. </p><p></p><div><hr></div><p>Think I missed the mark, or want to share your thoughts? Give me a shout on <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a> or <a href="https://twitter.com/itsjoenaso">Twitter</a>.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Data Jargon Newsletter! 
New posts every month-ish.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a Solo Consulting Business: 6 Months In]]></title><description><![CDATA[Quick thoughts on a calculated bet to go out on my own.]]></description><link>https://datajargon.substack.com/p/building-a-solo-consulting-business</link><guid isPermaLink="false">https://datajargon.substack.com/p/building-a-solo-consulting-business</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Wed, 12 Jul 2023 15:32:56 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5650" height="3767" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3767,&quot;width&quot;:5650,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;yellow arrow road sign&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="yellow arrow road sign" title="yellow arrow road sign" 
srcset="https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1463680942456-e4230dbeaec7?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxsZWZ0JTIwb3IlMjByaWdodHxlbnwwfHx8fDE2ODkwMTk1ODh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" 
width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cue Yogi Berra quote </figcaption></figure></div><p>I wasn&#8217;t going in completely blind (or without work lined up), but I wasn&#8217;t exactly battle-hardened, either. </p><p>I was moonlighting for 9 months as a consultant while working full-time  before I decided to go off on my own. I had known I would do this for many years, but the time finally had arrived.</p><p>A spike in interest rates, a tightening of budgets, and an increased scrutiny on vendor spend seemed to be the right scenario for increased demand for outside expertise. </p><p>So far, it&#8217;s been an enjoyable ride working with a number of clients on data engineering and data architecture problems, but it&#8217;s only been about 6 months. </p><p>We don&#8217;t hear many people in the space talk about running a solo shop and the dynamics you deal with along the way. That might be because:</p><ol><li><p>Many people do consulting as a side gig, not full-time</p></li><li><p>There are different approaches to service delivery within consulting/ contract work (ie. design-only, staff augmentation, short-term freelance projects, etc)</p></li><li><p>They haven&#8217;t figured out how to scale predictably, or they may not want to</p></li></ol><p>I certainly have not figured out the predictability component, but I have noticed a few constraints in the market. 
You can find them below.</p><p>If you&#8217;re more interested in commentary on project scopes, scaling a solo business, and other topics related to running a consulting business, feel free to jump ahead.</p><p></p><div><hr></div><p></p><h2>Constraints on a one-person business</h2><p></p><h4>Constraint 1: You can only work a certain number of hours in a day. </h4><p>For the sake of my sanity, and in order to maintain high quality output, it&#8217;s not feasible to burn the midnight oil and wake up at 5am to start the day. This puts a natural limit on my billable hours; this applies to everyone. Even the internet gurus.</p><p>You have limited leverage. Even with project-based pricing, you are essentially working on an hourly basis. Yes, there is money to be made when you know how to solve a problem, but consulting does not result in paid conversions while you&#8217;re sleeping or middle-of-the-night upgrades like you&#8217;d find at a self-service SaaS business.</p><p>There are ways around this - content, communities, courses - but those would be by-and-large more distracting than beneficial for me right now.</p><p></p><h4>Constraint 2: Your clients dictate the market rate of your services</h4><p>This might be obvious: if you work with startups, you&#8217;re going to be at a different price point than if you&#8217;re working with Fortune 500 organizations. I&#8217;ve been keeping up with Twitter, Reddit and other places online in order to stay as plugged in as is reasonable (and beneficial), and it shocks me that some people forget this. </p><p>Even the sharpest GTM expert would be hard-pressed selling services to pre-product-market fit companies - they just don&#8217;t have money to burn. But people still try to do just that. A fool&#8217;s errand if you ask me. </p><p>You need to go where the money is - figuratively, of course. 
This can take many forms, and applies to technology as much as any other specialty.</p><p>In the world of data and engineering, going where the money is might mean:</p><ol><li><p>Expanding revenue channels for a customer by productizing some underutilized or underdeveloped component of their platform</p></li><li><p>Reducing cloud expenses </p></li><li><p>Working on their core product lines</p></li></ol><p>That last bullet is probably the most important; many consultants in the space focus on the Modern Data Stack, or some flavor of it. In my opinion, that has become a commoditized offer.</p><p>If you&#8217;re not driving more revenue for clients or reducing their technology expenses, you are also just another expense. You&#8217;re fighting an uphill battle. </p><p>Set your sights on solving bigger problems. </p><p></p><h4>Constraint 3: The projects dictate your earning potential </h4><p>Startups may be willing to pay less on an hourly basis than Mega Corps, but the scale of the projects you work on matters. 300+ hour projects for a smaller client will always be more meaningful (both to the client AND to your bank account) than a 20 hour project at Mega Corp. Personally, I will take the longer-term, higher impact work over a single brand name.</p><p>But, if you are looking for short-duration, high-volume freelance engagements, then the Mega Corp logo might be a prudent thing to add to your portfolio. </p><p>It depends on your expertise, your goals, and your ability to actually deliver what you say you&#8217;ll deliver.</p><p>Naturally, this ties directly to point 3 above. Projects that are core to your client&#8217;s product lines will almost always be the longer-term, higher impact work. </p><p>Just don&#8217;t forget - your ability to deliver on the projects you commit to is perhaps the most important piece of the puzzle. 
</p><p>Especially if you try to scale.</p><p></p><div><hr></div><p></p><h2>Thoughts on Scope, Scale, and Market Equilibrium</h2><p>Many consultants present themselves in a coaching or advisory capacity - they provide suggestions, guidance, and general know-how related to a domain. They may sub-contract the hands-on work, or they may not be involved in that capacity at all. </p><p>In the world of data engineering, designing the system and building the system are very different skillsets. </p><p>But, until you reach a certain scale of business (we&#8217;ll talk about this in a bit), you need to be able to do both. </p><p>After a few years of running teams, I happen to like working on implementations (aka writing code) again. It&#8217;s nice to be hands-on solving problems once more. Will I keep doing this for a long time? Hard to say, but designing systems AND building them has been fun so far. </p><p>The real challenge is looming, though. That is how to surpass the point at which your obligations to your book of business exceed your individual capacity to deliver. All while maintaining your standard of quality, of course.</p><p>If the market roughly dictates the price a client is willing to pay, and you have reached your serviceable capacity, you could sub-contract. But, like we discussed in <strong>Constraint #2, </strong>your clients dictate the market rate of your services (at least to some degree). So, if you want to scale via sub-contracting without losing money, you need to play your cards right in terms of target client and pricing. </p><p>It takes a fair amount of trial-and-error to find a good balance between the &#8220;right&#8221; type of customer, the number of simultaneous customer projects, and your rate. </p><p>Balancing the growth of your business with your ability to service your clients is not difficult to do well, but it&#8217;s incredibly easy to screw up. 
</p><p>And, as a one-person business, screwing it up means there&#8217;s only one person to blame. </p><p></p><div><hr></div><h2>Miscellaneous thoughts that no one tells you about operating a business of one </h2><p></p><ul><li><p>Closing projects is not that difficult if you know your area of expertise as well as you think you do. But, finding those projects can be.</p></li><li><p>You need to deliver quick wins to buy breathing room on longer term projects (this applies to W-2 life just the same as solopreneur life, actually)</p></li><li><p>Time to business value is arguably more important in a client-contractor relationship than for an employer-employee </p></li><li><p>The admin work associated with a new client takes a meaningful amount of time, as do invoicing and billing</p></li><li><p>You&#8217;ll think a lot about leverage, efficiency, and what you consider &#8220;good enough&#8221;</p></li><li><p>&#8220;Niche-ing down&#8221; a technology-focused service business is challenging, arguably more so than a product business</p></li><li><p>Writing/ creating content is incredibly time consuming when you have <em>real </em>deliverables to produce</p></li></ul><p></p><p></p><p>If you&#8217;re building a single person consultancy, or just generally operate in the data and engineering space, I&#8217;d love to connect. Give me a shout on <a href="https://www.linkedin.com/in/josephmnaso/">LinkedIn</a> or <a href="https://twitter.com/itsjoenaso">Twitter</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This is the Data Jargon Newsletter. Monthly-ish, sometimes more. Drop your email below. 
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Mutual Embiggening]]></title><description><![CDATA[A rare business model in the world of Data]]></description><link>https://datajargon.substack.com/p/mutual-embiggening</link><guid isPermaLink="false">https://datajargon.substack.com/p/mutual-embiggening</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Thu, 01 Jun 2023 13:12:55 GMT</pubDate><enclosure url="https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://datajargon.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="1080" height="720" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;group of people walking on the 
stairs&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="group of people walking on the stairs" title="group of people walking on the stairs" srcset="https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/reserve/NV0eHnNkQDHA21GC3BAJ_Paris%20Louvr.jpg?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzMnx8YnVzeSUyMGJ1c2luZXNzfGVufDB8fHx8MTY3OTUwMjIwMA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 
7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@martinirc">Jos&#233; Mart&#237;n Ram&#237;rez Carrasco</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p></p><p>Mike Evans, cofounder of Grubhub, released a memoir in 2022. It&#8217;s called <em><a href="https://www.amazon.com/Hangry-Mike-Evans/dp/0306925532">Hangry</a>. </em>It&#8217;s an interesting story of a startup with a marketplace business model. He glosses over a lot of the dirty work of building a business of that variety, but he repeatedly comes back to the concept of <em>mutual embiggening. </em></p><p>A strangely comical term, but an altogether sound business principle.  </p><p>The idea is pretty straightforward - as a business, you grow as your customers grow. In the case of Grubhub, the more orders placed through their platform, the more sales restaurants are making, and in turn, the more revenue Grubhub generates from those restaurants. Everyone wins. Mutual embiggening. </p><p>This idea and general business model is not new. It exists in many verticals. </p><p>But it is a rare sight in the world of data. 
</p><p></p><h3>What is the purpose of <em>data?</em></h3><p>Plenty of tech and non-tech companies alike have data teams of some flavor. Analytics, Machine Learning, Business Intelligence, Decision Science - they&#8217;re all flavors within the wide spectrum of &#8220;data teams&#8221;. </p><p>But, what is the general purpose or mandate of a Data team?</p><p>The knee-jerk reaction might be &#8220;to drive better decision-making&#8221; or something filled with more meaningless jargon (i.e., &#8220;to uncover actionable insights!&#8221;&#8230; please excuse my virtual eye roll). </p><p>Really, there are only 3 things a &#8220;data team&#8221; of some flavor is meant for:</p><ol><li><p>Improve operations of the business through objectivity</p></li><li><p>Provide an alternative mechanism for monetization</p></li><li><p>Act as a signal to outsiders that this company &#8220;knows what they&#8217;re doing&#8221; even if they actually don&#8217;t </p></li></ol><p>Those might seem too generic - where does the data science team fit in? Why are we ignoring data engineering? </p><p>In reality, those teams also fall into these categories. If you&#8217;re building machine learning pipelines, you&#8217;re either working to improve some existing inefficiency in the business, or you&#8217;re producing a monetizable feature that powers some user experience. No need to complicate things.</p><p>What do these flavors of data team work look like? #1 usually takes the form of dashboards, reports or some automation of an existing workflow. It might mean sending data from one tool to another, delivering some metrics on a regular schedule to the weekly Exec meeting, or just dumping a spreadsheet for an analyst. This is work that is meant to highlight improvements in other parts of the business. </p><p>The second is what you&#8217;ll hear referred to as a &#8220;data product&#8221;. This is often the goal for most teams, but getting to this point can be a challenge. 
Sometimes a company may think they&#8217;re on the path to building a data product only to fall into the trap of internal business intelligence reporting. </p><p>Or, they may have ambitions of monetizing their internal datasets only to be derailed by a lack of internal data management, poor data modeling, and plenty of unexpected work to make things usable. </p><p>Data products are the best opportunity for mutual embiggening, though. But we need to talk about billing before we can get there. </p><p></p><h3>On Billing in Today&#8217;s Data Ecosystem</h3><p>For the last 5-10 years, a wave of optimism has washed over the tech industry. Data has been viewed as a &#8220;more is better&#8221; commodity - a phenomenon that has produced some wildly popular products in the space (Snowflake, BigQuery, Fivetran, Census, etc., etc.&#8230; the list goes on). </p><p><a href="https://mad.firstmark.com/">FirstMark&#8217;s ML/AI/Data Landscape</a> now lists an apparent 1,400+ companies, and there are plenty missing.  Each one of those businesses is battling for your budget. </p><p>And while there are plenty of tools that provide real value in specific scenarios, looking at their pricing pages and billing models should leave you with one significant takeaway - these companies by and large show no alignment between your usage and the value you derive from the tool. </p><p>Usage-based billing has taken over.  OpenView Partners says it themselves - <a href="https://openviewpartners.com/usage-based-pricing/">usage-based billing has nearly doubled</a> within B2B SaaS over the last 5 years. </p><p>And guess what? SaaS companies in the data space are all B2B; let&#8217;s be real (Side note - you could make the case a product like <a href="https://www.whoop.com/">Whoop</a> is a data product, but that is the exception).</p><p>At a high level, this is fine. These companies are making money charging for a product that people want. 
</p><p>But, what&#8217;s shocking is how foreign Mike Evans&#8217;s <em>mutual embiggening </em>concept is to the data ecosystem. Businesses are designed to make money, but why has this entire industry turned away from a model that serves so many other businesses incredibly well? Call me crazy, but it can feel like a cash-grab at times. </p><p>In other industries, the cost-value alignment of pricing models is obvious. The businesses fortunate enough to fit the &#8220;mutual embiggening&#8221; paradigm seem to do pretty well, regardless of size - Grubhub, Postscript (a former employer of yours truly), and ConvertKit are good examples. </p><p>The prevalent theme? They make money as their customers make money - delivery orders, ecommerce, and newsletter subscriptions, respectively. </p><p>And this is the core problem. </p><p>Many of the companies on the FirstMark MAD Landscape are service providers or internal tools, not revenue drivers. And without that alignment, you can&#8217;t embrace <em>mutual embiggening</em>. Even data SaaS businesses that claim cost savings are still just an expense at the end of the day. The economics are tough. </p><p></p><h3>The Chosen Few</h3><p>There is another way to look at this mutual growth business model - it has to do with consumables. </p><p>We need to be careful here since it&#8217;s easy to conflate consumption with usage, but for this conversation, they have to be viewed differently. Consumption-based products are valued and priced based on the outcomes they provide; in our B2B SaaS scenario, that is usually tied to some other application making use of the product. Usage-based products provide an interface into a service of some sort, and charge based on the activity on the platform. There is no concept of &#8220;consumption&#8221; in another application context. </p><p>For instance, a software product that provides observability into errors in your running code (&#224; la Grafana or Datadog) is a usage-based product. 
They are providing a service (introspection and observability), not a consumable. </p><p>On the other hand, a company like <a href="https://www.safegraph.com/">SafeGraph</a> sells a consumable - they provide data to be consumed and processed by their customers. Coincidentally, their website explicitly says they offer &#8220;<a href="https://www.safegraph.com/pricing">No pay-per-use policies</a>&#8221;. </p><p>SafeGraph also happens to embody a <em>mutual embiggening </em>paradigm. Their data sets offer clear revenue generation opportunities, which in turn drive more consumption. It&#8217;s a win-win. </p><p>The fact is many SaaS businesses within the Data space just don&#8217;t provide opportunities for revenue generation and monetization for their customers. The select few that do are data-as-a-product companies that offer datasets for enrichment purposes. <a href="https://clearbit.com/">Clearbit</a> is another such example. </p><p>Data Teams are constantly pushed to move towards data products and monetization, but the reality is that the incentives often don&#8217;t align. Even within the broader data industry, pricing models overwhelmingly lean into a &#8220;we charge you for using our product&#8221; position rather than one of mutual benefit. </p><p>And until <em>mutual</em> <em>embiggening </em>becomes a more common practice for both vendors and data teams alike, the unspoken truth is that these vendors and teams will remain cost centers. </p><p></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This is the Data Jargon Newsletter. Published monthly-ish. Drop your email below. 
It&#8217;s free.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Hiding in Plain Sight: The High Cost of a Data Team]]></title><description><![CDATA[The math doesn't add up for many companies.]]></description><link>https://datajargon.substack.com/p/hiding-in-plain-sight-the-high-cost</link><guid isPermaLink="false">https://datajargon.substack.com/p/hiding-in-plain-sight-the-high-cost</guid><dc:creator><![CDATA[Joe Naso]]></dc:creator><pubDate>Tue, 16 May 2023 14:02:55 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="1080" height="720" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;a close up of a black tie on a yellow background&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a close up of a black tie on a yellow background" title="a close up of a black tie on a yellow background" 
srcset="https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1641661546856-37cad4aceff5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxhcnJvdyUyMHVwJTIwcmlnaHR8ZW58MHx8fHwxNjc5NDM1MDc4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" 
width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Your monthly data team and infrastructure cost</figcaption></figure></div><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://datajargon.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>What is the cost of doing business? </h2><p>There is no single answer to that question. For some, it&#8217;s paid ads, salaries paid for content marketing, or the dollars they shell out on SaaS tools. We tend to look at receipts as the &#8220;cost&#8221; of things - how much is a new Salesforce license, how long has this Ad Set been running on Facebook, is this new tool actually worth the $30,000 upfront annual contract price tag? </p><p>This view of &#8220;cost&#8221; is accurate from an accounting perspective, but it is too narrow a view when it comes to the productivity cost of a team. </p><p>Technical roles - engineers, data scientists, analysts or anyone that is expected to write code - have an inherently higher cost than many other roles. If you&#8217;re working in sales, perhaps your total compensation is up there, but you&#8217;re offsetting your own price tag by the deals you bring in. This is fairly straightforward. </p><p>What is not obvious, though, is that for some of these technical roles to even <em>do their job</em>, you&#8217;re looking at tens of thousands of dollars in overhead. 
And that is on the conservative side. </p><p>Here are the average base salaries for the most common data team roles by market, according to <a href="https://www.indeed.com/career/data-engineer/salaries">Indeed</a>: </p><pre><code>|<strong> Role           </strong>|<strong> New York, NY </strong>|<strong> San Francisco, CA </strong>|<strong> Austin, TX </strong>|
| -------------- | ------------ | ----------------- | ---------- |
| <strong>Data Analyst</strong>   | $84,376      | $95,171           | $75,288    |
| <strong>Data Scientist</strong> | $135,280     | $140,487          | $119,553   |
| <strong>Data Engineer</strong>  | $138,676     | $147,978          | $120,578   |</code></pre><p>You don&#8217;t need to read into the fact that <em>Analytics Engineer</em> is not included; it&#8217;s too nascent a title for Indeed to report.</p><p>These salaries are not obscene, but don&#8217;t forget they are only the <em>base</em> and skew lower than many actual salaries, as these are averages across all levels of experience. These roles are unlikely to get large cash bonuses, and far more likely to receive equity top-ups, so the total cash compensation is fairly accurate here. </p><p>But, let&#8217;s not forget - this is just to show up. </p><p>Tack on a minimum 10% for payroll and insurance, as well as whatever miscellaneous perks you choose, and we&#8217;re now looking at a sizable chunk of change. </p><p>A &#8220;standard&#8221; data team in today&#8217;s startup ecosystem is anywhere from 3-8% of headcount. For a Series A business of 30ish employees, we&#8217;re looking at roughly 1-2 people. By the time you&#8217;re at Series B or C, you&#8217;re definitely north of 150 employees, and probably have a data team of around 5. </p><p>That&#8217;s a combined annual payroll of roughly $250,000 to $650,000 for a team of two to five if we use a conservative $130,000/team member. </p><p>For a company that is lucky to be hitting $15M+ in revenue at Series A and $30M+ in revenue by Series C, these numbers are meaningful. </p><p></p><h2>The cost of software, tools and systems</h2><p>There is no escaping the price tag associated with salaried employees; those salaries are supposed to be a fair exchange for work performed, anyway. But these employees - charged with &#8220;making the business data-driven&#8221; or &#8220;uncovering insights&#8221; - need tools to do their jobs. </p><p>(Side note: Don&#8217;t use those phrases when hiring. They are complete corporate fluff. 
But, we digress)</p><p>Many of those tools are expensive by default, and are designed to make it easy for your teams to inadvertently drive your monthly cost higher. Go figure. </p><p></p><h4>The Warehouse and Pipelines</h4><p>Let&#8217;s focus on Snowflake here, since it is arguably the fastest-growing option in the cloud warehouse space. Sure, Amazon&#8217;s Redshift has an edge in market share (22.16% vs 19.73% as of <a href="https://www.nasdaq.com/articles/1-big-reason-why-snowflake-is-outperforming-the-rest-of-the-software-industry">2022</a>), but talk to anyone in the data space and they&#8217;re likely moving off of Redshift to some newer option. </p><p>I work with Snowflake a lot. I like the tool. But I&#8217;ve found there are plenty of companies that make the move to Snowflake and start off just fine, only to wake up one quarter and realize their annual spend has topped $100k. In some cases I&#8217;ve seen Series C startups spending well north of $300k annually with minimal monetization efforts related to that infrastructure. There are plenty of examples of $1M+ ARR Snowflake users, as well. I know of multiple who went through layoffs in 2022. </p><p>But, Snowflake and other cloud warehouses are useless without data in them. And their utility decreases as the &#8220;freshness&#8221; of that data wanes. Stale data = less useful for the business. Coincidentally for Snowflake, stale data also means a lower ARR. </p><p>As a business you can do only a few things. </p><p>Option 1 - point some data engineers at the problem and let them build pipelines for you. This may take weeks, or it may take months depending on your maturity as a business and technology org. This is also an expensive undertaking, both in terms of absolute dollars and opportunity costs for the team. </p><p>Option 2 is a more sane choice - pay for more tools to make it easier to get data from your applications (mobile, web, whatever) into Snowflake. 
This, too, has a price, but it&#8217;s well worth the increase in your team&#8217;s velocity. </p><p>So, you adopt a pipeline vendor. Fivetran, Airbyte, or something else. They have their own costs, often obscured by the fact that they charge based on some bespoke &#8220;credit&#8221; methodology. Maybe 1 credit is 100MB of data, maybe it is 400K monthly updated rows, maybe it is something else altogether. </p><p>You start small. You want to transfer some of your production application data from Postgres to Snowflake. You decide to stick with only a handful of tables at first. Things are stable... $2,500 per month seems reasonable, right? </p><p>By this point we&#8217;ve passed the $300k mark and are inching closer to a $500k total cost for a single team to do its job. And that is on the conservative side with a small team. </p><p>I don&#8217;t care how much money you&#8217;ve raised. That is a lot of upfront investment for something that your exec team has been told will improve the business. </p><p>Will it, though? That depends. </p><p></p><h4>The hidden drivers of cost</h4><p>Before we get into the softer side of business/data alignment and how that rears its ugly head in terms of cost, let&#8217;s talk about the way you go from a &#8220;reasonable&#8221; operating cost for data activities to something that scares your CFO. </p><p>It has three words - Usage-based Billing. </p><p>You might hear people call it consumption-based billing, too. This business model has become the de facto standard for many, if not all, notable data vendors today. And it happens to be the biggest driver of unexpected costs for data teams and their finance counterparts. </p><p>The economics are pretty simple - you pay for what you use. The &#8220;gotcha&#8221;, however, is that the vendors commonly used by data teams, especially within the data warehouse and data pipeline categories, are designed to drive higher and higher usage. 
</p><p>It&#8217;s a genius business model for the vendor, but a potential black hole for your cash as a customer. At first, the time-to-value is great - your team configures Fivetran and next thing you know, data shows up in your Snowflake instance. </p><p>You can talk your way through the initial usage bills since a lot of your initial data is loaded for free. But over time, scope creep sets in, and your once $500/month usage has now passed $1,000/month. Due to some inevitable misalignment with the business, a cool $2,000+/month is soon within reach. </p><p>You started with a small scope of only &#8220;the essential&#8221; tables from production. Then the BizOps team asks for a slight twist on an existing analysis; this requires new tables. You set up Fivetran to transfer more tables. This increases your usage. </p><p>You eventually realize that the Engineering team didn&#8217;t include a nice <em>updated_at </em>field in some of the new tables you need, so you&#8217;re going to have to do full-table loads until they deploy a change. </p><p>Congratulations, you got got. </p><p>As a product-led motion, this is the goal. A customer signs up, sees immediate value, and comes back for more. But what&#8217;s missing on the customer side of the equation is <em>knowledge</em>. Too many teams - many of whom are young and inexperienced, or following rote &#8220;best practices&#8221; shared by other influential data vendors - wind up inflating their SaaS vendor bills because it&#8217;s so freaking easy to do. </p><p>They literally do not know how to make this workflow happen in other ways. </p><p>You almost can&#8217;t blame the teams doing the implementation here - they&#8217;re doing what they&#8217;re <em>supposed </em>to be doing. What they&#8217;ve been <em>told </em>to do. </p><p>But this is the pattern that leads from a reasonable spend for business and data analytics infrastructure to something that balloons significantly over time. 
</p><p>And this is business as usual for many companies today. Really makes you wonder. </p><p></p><h4>The Compute Challenge</h4><p>Ok, by this point it&#8217;s pretty clear how usage-based billing can get you in a hole you never anticipated, but we have only scratched the surface. Compute is the other part of the puzzle. </p><p>In Snowflake, you pay for storage (more often than not a relatively small part of the total bill) and for compute. Compute is basically just the time some server is running and executing the commands you give it. </p><p>The Modern Data Stack lives off of compute credits. Every piece of the stack - from loading data, to processing that data, to running tests, to introducing new code via pull request, to ad-hoc queries, to dashboards, to understanding lineage - requires compute. </p><p>Some of those activities incur higher compute costs than others, but the point remains. They all need a server to run some commands. And doing that costs money. </p><p>You might get pushback from your Looker account rep that they cache query results and try to minimize the need for querying the database, but the fact remains you&#8217;re still paying more than $100k in platform fees and generating a larger cloud warehouse bill every time someone at your company hits refresh. </p><p>Don&#8217;t get me wrong. It&#8217;s not wrong to use Looker, or any of the tools in the categories I mentioned. But it&#8217;s important that you know the &#8220;hidden&#8221; costs associated with the tools your teams adopt. </p><p>The kicker to our earlier Fivetran debacle? You get hit with a one-two punch: Fivetran bills you for loading the data, and Snowflake dings you for the compute to process the load. </p><p>This is one thing you can categorize as &#8220;the cost of doing business&#8221;. There is no way around it. 
</p><p></p><h4>The Alignment Problem</h4><p>The final nail in the coffin for your warehouse and pipeline costs has little to do with the absolute cost of the technology, and more to do with the business problems that depend on its outputs. </p><p>Analytics as a function is intended to improve business operations in some regard. It doesn&#8217;t matter if you&#8217;re working on business intelligence questions, product analytics questions, or building a predictive model of some sort. If you&#8217;re working on some problem set within a business, and you call yourself a data or analytics person, your job is ultimately to help the business perform <em>better. </em></p><p>But, that&#8217;s really hard to do.</p><p>Plenty of companies wind up over-indexing on solving problems that really don&#8217;t matter, or changing focus so rapidly that whatever effort went into one problem is lost on the next. This is one part of how you accumulate tech debt, but that&#8217;s another topic. </p><p>Misalignment between teams does not just look like frustrated coworkers and incorrect deliverables. It also translates to increased costs, and in our case, those costs are borne by the data team. </p><p>Remember that BizOps request from earlier? We can use any team in their place. Without alignment in terms of analytical focus, metric definitions, or even problem areas, your data team will be eating up compute credits in an effort to catch up with the needs of the rest of the org. </p><p>You won&#8217;t hear many teams or companies talk about this, but misalignment is an implicit driver of cost. It requires work and rework. And it often happens over and over. </p><p></p><h3>So, what&#8217;s the alternative?</h3><p>We&#8217;ve successfully hit a theoretical run rate of somewhere between $500k and $1M for a single team to do its job. Pretty wild. </p><p>Frankly, I don&#8217;t see how many companies can justify this kind of spend. 
Especially when many early data hires are young and inexperienced. Much of the cost incurred with today&#8217;s data stacks accumulates over time by way of design decisions and tooling selection. </p><p>An easy way to trim off a sizable chunk of your future warehouse and pipeline bills? </p><p>Bring in someone who&#8217;s done it before. </p><p>Not as a full-time hire; just to stand things up so your infrastructure cost does not scale faster than the value you&#8217;re generating from it. </p><p>We do it all the time at <a href="https://purviewlabs.io/">Purview Labs</a>. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://datajargon.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This is the Data Jargon Newsletter. Published monthly-ish. Drop your email below. </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>