Ian Whitestone (🔎 📈 🐍), https://ianwhitestone.work//feed.xml, generated by Jekyll on 2023-10-04T14:37:11+00:00

Recent posts in this feed:
- Extend the Runway: A deep dive into Snowflake costs at Coalesce 2022 (2022-10-27): https://ianwhitestone.work//extend-the-runway
- Calculating cost per query in Snowflake (2022-10-12): https://ianwhitestone.work//cost-per-query
- Snowflake Performance Tuning and Cost Optimization (2022-09-28): https://ianwhitestone.work//snowflake-toronto-user-group-sep
- Snowflake Optimization Power Hour (2022-09-15): https://ianwhitestone.work//snowflake-optimization-power-hour
- Snowflake Architecture Overview (2022-09-12): https://ianwhitestone.work//snowflake-architecture
- What’s up with DuckDB? (2022-08-17): https://ianwhitestone.work//whats-up-with-duckdb

Farewell, Shopify ❤️ (2022-06-30): https://ianwhitestone.work//farewell-shopify

<h2>👋</h2> <p>After three and a half beautiful years, today’s my last day at Shopify. Most people silently exit a company. I was planning to do just that, until my director <a href="https://www.linkedin.com/in/mike-develin-20616b59/">Mike</a> encouraged me to write a farewell note. He said we need to celebrate people moving on and not let it go unnoticed. After some brief thought, I agreed. It would be a shame to not reflect on what I’ve gained from this experience, and acknowledge that it’s exactly because of my time here that I’m ready for what’s next. It’s tough to cover everything I’ve learned here, so I’m going to focus on two buckets which have had the largest impact on me.</p> <h2 id="product-craft"><strong>product craft</strong></h2> <p>As an incoming data scientist, learning about the art of product was something I did not anticipate. I joined a brand new product area with ~7 other people (1 PM, 4 devs and 2 UX), which eventually grew into an org of over 100 people responsible for <a href="https://www.shopify.ca/markets">Shopify Markets</a> and a new tax platform. Of course, this wasn’t a linear journey. We had false starts and features we ended up killing. There were even periods early on where we as a team were close to getting dissolved. All of these bumps along the way came with great universal lessons. I learned to fall in love with problems, and not solutions. To dream big, but start small. I saw first hand how big opportunities will always be sitting right in front of you, you just have to reach and grab them. This was relevant 3 years ago at Shopify and remains true today. Work hard, eyes open.</p> <p>Being tightly embedded in a multi-disciplinary group will give you the opportunity to learn from experts in other crafts, you just need to take some initiative. I got to witness <a href="https://twitter.com/HeatherMcGaw">Heather</a> run world class user research sessions, because I simply asked to join. I learned how we manage complex product rollouts, handle production incidents, and develop in large scale codebases because I invested in relationships with our amazing devs and became a sponge.</p> <p>Regardless of what team you work on, one of the best features of working at Shopify is you get the closest thing possible to root level access to Tobi’s brain. Every couple months, I’d do a slack search for <code class="language-plaintext highlighter-rouge">from:@Tobi Lütke</code> and learn how he was thinking about the way things were built.
One day it was <em>“Don’t stack abstractions”</em> in response to a discussion around abstracting <a href="https://guides.rubyonrails.org/active_record_basics.html#what-is-active-record-questionmark">ActiveRecord</a><sup>1</sup>. Another time it was the importance of setting good defaults in our product so everything just works out of the box. When deciding whether or not something should be built, he’d talk about the importance of having strong opinions and building based on that, rather than waiting for customer demand. Getting front row access to Tobi’s principled thinking and relentless focus on simplicity was easily one of the best things about working here.</p> <h2 id="data-craft"><strong>data craft</strong></h2> <p>As close as I was to the product, I still spent 90% of my time living and breathing data. Shopify’s data team came about in 2014, back when none of the <a href="https://mattturck.com/data2021/">“Modern Data Stack”</a> existed. Like other big tech companies from that era, they were forced to build many of the frameworks and tools that exist today as standalone companies.</p> <p>As a <a href="https://ianwhitestone.work/slides-v2/data-science-at-shopify.html">full stack data scientist</a>, you get exposure to the data stack end to end and the people who built it. From data extraction and all the pitfalls with change data capture or deletes. To event tracking with kafka and the joys of duplicates, missed events and late arriving data. Out of memory errors, disk spill and lost containers<sup>2</sup>. Slow SQL queries and figuring out when it makes sense to build a new data model. We exist to add value with data, and navigating this stack and learning the ins and outs of each system was one of the favourite parts of my job.</p> <p>Of course, I wasn’t alone in these endeavours. Across all crafts at Shopify, you’ll be surrounded with senior members who’ve been at it for 5 times as long as you have<sup>3</sup>. Take advantage of these opportunities and learn from the best. Be vocal and share your feedback about the platform. I did this frequently, and as a result got to participate in helping shape some of the new tooling we built.</p> <p>Working in an end to end nature also allows you to see the full data value chain. I got to work on analysis that unblocked key product decisions, ran experiments that resulted in shipping changes that positively impacted millions of merchant’s businesses, and built data-driven products that abstracted away some of the <a href="https://www.shopify.ca/blog/us-canada-sales-tax-insights">gnarlier aspects of commerce</a>. Getting exposure to all these things takes time and persistence. Be patient, and the opportunities will come.</p> <h2 id="onwards">onwards!</h2> <p>So, what’s next? A piece of advice that’s stuck with me for a long time is something my Dad said to me; that <em>“the worst thing that can happen in life is if you look back and say what if?”</em> <sup>4</sup>. While I could happily spend my career here, I’ve always wanted to take a shot at entrepreneurship and start a company<sup>5</sup>. With kids and a mortgage a few years out, it’s quickly become clear that now is the best time. As scared as I am, I know that 80 year old Ian in a rocking chair would be full of regret if he didn’t try this.</p> <p>Without question, I’d have nowhere close to the level of confidence required to take this leap if it weren’t for my time at Shopify. 
So thanks to Tobi for creating this incredible place, and thanks to everyone I got to work with along the way. I am forever grateful.</p> <h3 id="notes">notes</h3> <p><sup>1</sup> After being asked to elaborate, Tobi expanded on his point: <em>“Abstractions are bad unless they make something new possible or something that you really need to do 10x easier. The abstractions in rails are the ones that sit at this sweetspot. Stay close to vanilla rails as you can while solving the problem you have. Only deviate if you know exactly what you are doing. Never listen to architecture astronauts. Existence of arguments in favour of an abstraction doesn’t even nearly clear the bar for adopting it.”</em></p> <p><sup>2</sup> I’m intentionally highlighting many of the more challenging aspects of working in data. Of course it’s not always like this. Yet, when things break and you push their limits is when you’ll be forced to go deep and really understand the ins and outs of how something works.</p> <p><sup>3</sup> Special shout out to Karl Taylor, Michael Styles and Khaled Hammouda, who taught me pretty much everything I know about Spark.</p> <p><sup>4</sup> Jeff Bezos said <a href="https://www.youtube.com/watch?v=jwG_qR6XmDQ">something similar</a> when deciding to leave D.E. Shaw to start Amazon.</p> <p><sup>5</sup> More on this later, but I plan to build a B2B SaaS company in the data space. I’m happiest when I’m on the steepest part of the learning curve, and there’s no doubt that entrepreneurship and wearing all the hats required to build a company will bring this.</p>👋Unpacking the Spark Web UI2021-11-14T00:00:00+00:002021-11-14T00:00:00+00:00https://ianwhitestone.work//spark-web-ui<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" /> <!-- <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> --> <!-- Twitter cards --> <meta name="twitter:site" content="@ianwhitestone" /> <meta name="twitter:creator" content="@ianwhitestone" /> <meta name="twitter:title" content="Unpacking the Spark Web UI" /> <meta name="twitter:description" content="A quick overview of how to navigate the Spark Web UI" /> <meta name="twitter:card" content="summary_large_image" /> <meta name="twitter:image" content="https://ianwhitestone.work/images/spark-web-ui/cover.png" /> <!-- end of Twitter cards --> <ul id="markdown-toc"> <li><a href="#example-job--data" id="markdown-toc-example-job--data">Example Job &amp; Data</a></li> <li><a href="#navigating-the-ui" id="markdown-toc-navigating-the-ui">Navigating the UI</a> <ul> <li><a href="#jobs" id="markdown-toc-jobs">Jobs</a></li> <li><a href="#stages" id="markdown-toc-stages">Stages</a></li> <li><a href="#sql" id="markdown-toc-sql">SQL</a></li> <li><a href="#plans" id="markdown-toc-plans">Plans</a></li> <li><a href="#storage-environment-and-executors" id="markdown-toc-storage-environment-and-executors">Storage, Environment and Executors</a></li> </ul> </li> <li><a href="#notes" id="markdown-toc-notes">Notes</a></li> <li><a href="#generating-the-dataset" id="markdown-toc-generating-the-dataset">Generating the dataset</a> <ul> <li><a href="#simulating-skewness" id="markdown-toc-simulating-skewness">Simulating Skewness</a></li> <li><a href="#transaction-dataset" id="markdown-toc-transaction-dataset">Transaction Dataset</a></li> <li><a href="#shop-dimension-dataset" 
id="markdown-toc-shop-dimension-dataset">Shop Dimension Dataset</a></li> </ul> </li> </ul> <p><br /></p> <p align="center"> <img width="80%" src="/images/spark-web-ui/cover.png" /> </p> <p><br /> The <a href="https://spark.apache.org/docs/latest/web-ui.html">Spark Web UI</a> provides an interface for users to monitor and inspect details of their Spark application. You can leverage it to answer a host of questions like:</p> <ul> <li>How long did my job take to run?</li> <li>How did the Spark optimizer decide to execute my job?</li> <li>How much disk spill was there in each stage? In each executor?</li> <li>What stage took the longest?</li> <li>Is there significant data skew?</li> </ul> <p>These capabilities make the Web UI incredibly useful. Unfortunately, it is not the easiest thing to understand. In this post I’ll provide a quick tour of the Web UI by leveraging a simple Spark job as a reference point. If your new to Spark or need a refresher on things like “jobs”, “stages” and “tasks”, I encourage you to read my <a href="/spark-from-100ft/">high level intro of Spark</a> first. It’s also important to note that everything shown in this post is using Spark v2.4.4 <sup>1</sup>.</p> <p>Onwards!</p> <h1 id="example-job--data">Example Job &amp; Data</h1> <p>We’ll imagine we have a bunch of e-commerce data, and we want to find out the maximum transaction value on each day in each country. For this example, we’ll have two datasets to help us answer this question. A <strong><code class="language-plaintext highlighter-rouge">transactions</code></strong> model with 1 row per transaction, and information like the transaction timestamp and amount.</p> <table> <thead> <tr> <th>transaction_id</th> <th>shop_id</th> <th>created_at</th> <th>currency_code</th> <th>amount</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>123</td> <td>2021-01-01 12:55:01</td> <td>USD</td> <td>25.99</td> </tr> <tr> <td>2</td> <td>123</td> <td>2021-01-01 17:22:05</td> <td>USD</td> <td>13.45</td> </tr> <tr> <td>3</td> <td>456</td> <td>2021-01-01 19:04:59</td> <td>CAD</td> <td>10.22</td> </tr> </tbody> </table> <p>The transactions model will also have a reference (<code class="language-plaintext highlighter-rouge">shop_id</code>) that links it to another model, <strong><code class="language-plaintext highlighter-rouge">shop_dimension</code></strong>, which has 1 row per shop and some metadata for that shop.</p> <table> <thead> <tr> <th>shop_id</th> <th>shop_country_name</th> <th>shop_country_code</th> </tr> </thead> <tbody> <tr> <td>123</td> <td>Canada</td> <td>CA</td> </tr> <tr> <td>456</td> <td>United States</td> <td>US</td> </tr> </tbody> </table> <p>Head to the <a href="#generating-the-dataset">notes section</a> to see the code I used to generate these two datasets. 
Using plain SQL, we could find the max transaction value per country &amp; day with:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">sd</span><span class="p">.</span><span class="n">shop_country_code</span><span class="p">,</span> <span class="n">trxns</span><span class="p">.</span><span class="n">created_at_date</span><span class="p">,</span> <span class="k">MAX</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span> <span class="k">AS</span> <span class="n">max_transaction_value</span> <span class="k">FROM</span> <span class="n">transactions</span> <span class="k">AS</span> <span class="n">trxns</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">shop_dimension</span> <span class="k">AS</span> <span class="n">sd</span> <span class="k">ON</span> <span class="n">trxns</span><span class="p">.</span><span class="n">shop_id</span><span class="o">=</span><span class="n">sd</span><span class="p">.</span><span class="n">shop_id</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span> </code></pre></div></div> <p>And in PySpark, the code would look something like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_skewed_df</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'shop_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'shop_country_code'</span><span class="p">,</span> <span class="s">'created_at_date'</span><span class="p">)</span> <span class="p">.</span><span class="n">agg</span><span class="p">(</span> <span class="n">F</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="s">'amount'</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'max_transaction_value'</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span> <span class="n">result</span> <span class="o">=</span> <span class="n">output</span><span class="p">.</span><span class="n">collect</span><span class="p">()</span> </code></pre></div></div> <h1 id="navigating-the-ui">Navigating the UI</h1> <h2 id="jobs">Jobs</h2> <p><code class="language-plaintext highlighter-rouge">.collect()</code> is an action, and actions trigger jobs in Spark. If you click on the <strong>Jobs</strong> tab of the UI, you’ll see a list of completed or actively running jobs. From this view, we can see a few things:</p> <ul> <li>The action that triggered the job (<code class="language-plaintext highlighter-rouge">collect at &lt;ipython-input-320-...&gt;</code>)</li> <li>The time it took (6.7 min)</li> <li>The number of stages (4) and tasks (1493)</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-3.png" /> </p> <p>When we click into our job we can see some more details, particularly around the stages. Our job has 4 stages, which makes sense since a new stage is created whenever there is a shuffle. 
We have:</p> <ul> <li>2 stages for the initial reading of each dataset (1 per dataset)</li> <li>1 for the join</li> <li>1 for the aggregation</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-4.png" /> </p> <h2 id="stages">Stages</h2> <p>From the detailed job view, we can zoom into any of the stages. I clicked on the third one (Stage 89<sup>2</sup>) where the join on <code class="language-plaintext highlighter-rouge">shop_id</code> is happening. Spark throws a bunch of information at us:</p> <ul> <li>High level stats like: <ul> <li>Shuffle Read: Total shuffle bytes of records read during the shuffle</li> <li>Shuffle Write: Bytes of records written to disk in order to be read by a shuffle in a future stage</li> <li>Shuffle Spill (Memory): The uncompressed size of data that was spilled to memory during the shuffle</li> <li>Shuffle Spill (Disk): The compressed size of data that was spilled to disk during the shuffle</li> </ul> </li> <li>Summary metrics (duration, shuffle, etc.) across all tasks, broken down by percentile</li> <li>Aggregated metrics by executor</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-1.png" /> </p> <p>When looking at a given stage, it can often be tricky to figure out what is actually happening in that stage. To help with this, you can use the DAG visualization to get a high level sense of what the stage is doing. Below, you can see two datasets being shuffled and merged together. Pairing this with the knowledge of our query from above, you can ultimately deduce that this is where the join is happening.</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-2.png" /> </p> <p>I intentionally <a href="#generating-the-dataset">generated a very skewed dataset</a> by having a small % of shops make up a large % of all transactions. The impact of this on our job quickly becomes evident.</p> <ol> <li>We can see there is ~20GB of disk spill happening. This is because there isn’t enough memory available to complete the tasks (shuffling and joining), so Spark must write data down to disk. This is both expensive (slow) and can potentially take down the entire node if there is too much disk spill.</li> <li>Looking at the summary metrics across all tasks, we can see that some tasks are taking much longer than others (max time = 4.9 min vs. median time = 17 seconds, that’s 17 times as long!). Similarly, some tasks have much more disk spill than others (max disk spill = 4.7GB vs. median disk spill=36.1MB, 133 times as big!). This is a direct result of our skew: performing the join for shops with a large number of transactions (records) takes longer and spills more because the data is too big!</li> <li>Looking at the aggregated metrics per executor, we can see that some executors (like #61) are spilling more data to disk than others. This is likely a function of some executors having to deal with much larger partitions than others, again thanks to the skew.</li> </ol> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-3.png" /> </p> <h2 id="sql">SQL</h2> <p>For most dataframe jobs<sup>3</sup>, the SQL tab can be leveraged to visualize how Spark is executing your query. You can find the query of interest by selecting the one associated with your job:</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-6.png" /> </p> <p>You’ll then be presented with a nice graphical visualization of your job.
I personally find these the most useful to diagnose what’s going on. We can see each dataset being read in and the associated size of each, the shuffle operation before the join and the eventual join. You can leverage the summary stats on this page to see things similar to what we saw on the Stage page, like the disk spill from the join!</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/spark-web-ui-7-1.png" /> </p> <p>You can hover over different parts of the query to learn more, like which dataset is being scanned (it will show the full GCS path) or how many partitions are being used in the shuffle - in this example, it is 200, the default value set by Spark (see the <code class="language-plaintext highlighter-rouge">hashpartition(shop_id#3104, 200)</code> that appears when I hover over the Exchange block).</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-7-2.gif" /> </p> <h2 id="plans">Plans</h2> <p>At the bottom of the page, you can see the different plans Spark created for your query. I only ever look at the Physical Plan, since that is what actually gets executed<sup>4</sup>:</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-8.png" /> </p> <p>The Physical Plan tells you how the Spark optimizer will execute your job, in written form. You can use it to understand things like what join strategies are being used. Did Spark decide to try and do a broadcast join? Or you can see what filters have been pushed down to the Parquet level. The graphical representation above is generally easier to use as a starting point, but sometimes you’ll need to go into the physical plan in order to get more details not shown visually.</p> <p>Note that you can also get the physical plan outside of the Web UI, by calling the <code class="language-plaintext highlighter-rouge">explain()</code> method on your dataframe object:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_skewed_df</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'shop_id'</span><span class="p">)</span> <span class="p">...</span> <span class="p">)</span> <span class="n">output</span><span class="p">.</span><span class="n">explain</span><span class="p">()</span> </code></pre></div></div> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-physical-plan.png" /> </p> <h2 id="storage-environment-and-executors">Storage, Environment and Executors</h2> <p>I won’t go over the Storage, Environment or Executors tab, since I barely ever use these. You can read more about their use cases <a href="https://spark.apache.org/docs/latest/web-ui.html">here</a>. Very quickly:</p> <ul> <li><strong>Storage</strong> will show information about any persisted dataframes (i.e. 
if you called <code class="language-plaintext highlighter-rouge">df.persist()</code> or <code class="language-plaintext highlighter-rouge">df.cache()</code><sup>4</sup>)</li> <li><strong>Environment</strong> will tell you about the different environment and configuration variables that were set for the Spark job</li> <li><strong>Executors</strong> has information about each executor in your cluster, like disk space, the number of cores, memory usage, and more</li> </ul> <h1 id="notes">Notes</h1> <p><sup>1</sup> In Spark v3, there were some changes introduced, such as improved SQL metrics and plan visualization. Learn more <a href="https://canali.web.cern.ch/docs/WhatsNew_Spark3_Performance_Monitoring_DataAI_Summit_EU_Nov2020_LC.pdf">here</a> and <a href="https://www.waitingforcode.com/apache-spark/whats-new-apache-spark-3-ui-changes/read">here</a>.</p> <p><sup>2</sup> This is Stage 89 cause I’d run a bunch of Spark jobs prior to this one you are seeing.</p> <p><sup>3</sup> I’m not sure under what scenarios you wouldn’t see this when executing a Spark Job with dataframes.</p> <p><sup>4</sup>See this <a href="https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/">post</a> for an explanation of the differences between each plan type.</p> <p><sup>5</sup> Curious about the difference between <code class="language-plaintext highlighter-rouge">cache</code> and <code class="language-plaintext highlighter-rouge">persist</code>, see <a href="https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist">here</a>. Wondering when you should be using them? See <a href="https://stackoverflow.com/questions/44156365/when-to-cache-a-dataframe">here</a>.</p> <h1 id="generating-the-dataset">Generating the dataset</h1> <p>For the purposes of this example, I wanted the join key (<code class="language-plaintext highlighter-rouge">shop_id</code>) to be skewed in order to show how skew can be detected in the Web UI. This is also quite common in practice, no matter what your domain is. Any time you have an event-level dataset, it’s quite possible that certain users/accounts/shops generate a large portion of those events. For this example (shops generating e-commerce transactions), we could rank &amp; sort each shop based on their total transaction count, and then plot the cumulative % of total transactions as we include each shop. 
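</p>

<p>Here’s one way that curve could be computed with pandas; this is a sketch of the idea rather than the exact code behind the plots, and it assumes a transactions dataframe with a <code class="language-plaintext highlighter-rouge">shop_id</code> column like the one generated below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

def cumulative_share_of_transactions(trxns: pd.DataFrame) -&gt; pd.Series:
    # Transactions per shop, with the highest-volume shops first
    counts = trxns.groupby("shop_id").size().sort_values(ascending=False)
    # Cumulative % of all transactions as each additional shop is included
    return 100 * counts.cumsum() / counts.sum()

# e.g. cumulative_share_of_transactions(df).reset_index(drop=True).plot()
</code></pre></div></div>

<p>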
You can see what this would theoretically look like for a skewed and un-skewed dataset:</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/example-trxns-skew-1.png" /> </p> <h2 id="simulating-skewness">Simulating Skewness</h2> <p>To simulate a high degree of skewness, I sampled from a chi-squared distribution.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ids</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">chisquare</span><span class="p">(</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span><span class="o">*</span><span class="mi">100000</span> <span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">100</span><span class="p">);</span> </code></pre></div></div> <p>The resulting shop transaction frequency plot looks like this:</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/example-trxns-skew-2.png" /> </p> <p>Running some quick analysis on this, we can see that 11% of all transactions come from a single shop in this artifical dataset:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">ids</span><span class="p">).</span><span class="n">describe</span><span class="p">()</span> <span class="n">count</span> <span class="mf">1.000000e+04</span> <span class="n">mean</span> <span class="mf">3.370065e+04</span> <span class="n">std</span> <span class="mf">8.156347e+04</span> <span class="nb">min</span> <span class="mf">1.000000e+00</span> <span class="mi">25</span><span class="o">%</span> <span class="mf">4.600000e+01</span> <span class="mi">50</span><span class="o">%</span> <span class="mf">2.676000e+03</span> <span class="mi">75</span><span class="o">%</span> <span class="mf">2.833600e+04</span> <span class="nb">max</span> <span class="mf">1.702097e+06</span> <span class="o">&gt;&gt;&gt;</span> <span class="mf">100.0</span><span class="o">*</span><span class="n">ids</span><span class="p">[</span><span class="n">ids</span> <span class="o">==</span> <span class="mi">1</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">ids</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="mf">11.19</span> </code></pre></div></div> <h2 id="transaction-dataset">Transaction Dataset</h2> <p>Both datasets were generated through a combination of pandas and numpy. The generated <code class="language-plaintext highlighter-rouge">transactions</code> dataset had 6.5 million rows (I played around with this until each file was ~120MB, a good aproximate size (compressed) for a single partition in Spark). 
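</p>

<p>If you want to repeat that sizing exercise, one way to check the compressed size without writing anything out is to serialize the dataframe to an in-memory parquet buffer, the same trick used for the GCS uploads below. A minimal sketch (the helper name and MB conversion are mine, not from the original code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import io

def compressed_parquet_size_mb(df):
    # Serialize to an in-memory parquet buffer and measure the compressed size
    buffer = io.BytesIO()
    df.to_parquet(buffer)
    return buffer.getbuffer().nbytes / 1024 / 1024
</code></pre></div></div>

<p>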
You can see I leverage the same chi-squared distribution from above to randomly generate shop_ids, with the smaller shop_ids occuring much more frequently. While I didn’t leverage this in this post, I also made the dataset skewed by <code class="language-plaintext highlighter-rouge">currency_code</code>, by specifying that 80% of transactions would be USD, 2% CAD, 10% EUR, etc. All transactions were set to occur across 10 days.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="mi">6500000</span> <span class="c1"># 6.5 million rows </span> <span class="n">currencies</span> <span class="o">=</span> <span class="p">[</span><span class="s">'USD'</span><span class="p">,</span> <span class="s">'CAD'</span><span class="p">,</span> <span class="s">'EUR'</span><span class="p">,</span> <span class="s">'GBP'</span><span class="p">,</span> <span class="s">'DKK'</span><span class="p">,</span> <span class="s">'HKD'</span><span class="p">]</span> <span class="n">currency_probas</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.015</span><span class="p">,</span> <span class="mf">0.015</span><span class="p">]</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s">'transaction_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="s">'shop_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">chisquare</span><span class="p">(</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">)</span><span class="o">*</span><span class="mi">100000</span> <span class="p">),</span> <span class="s">'_days_since_base'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">),</span> <span class="s">'currency_code'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span> <span class="n">currencies</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span> <span class="p">),</span> <span class="s">'amount'</span><span class="p">:</span> <span class="n">np</span><span 
class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">exponential</span><span class="p">(</span><span class="mi">50</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">)</span> <span class="p">})</span> <span class="n">df</span><span class="p">[</span><span class="s">'base_date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">days</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">TimedeltaIndex</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'_days_since_base'</span><span class="p">],</span> <span class="n">unit</span><span class="o">=</span><span class="s">'D'</span><span class="p">)</span> <span class="n">df</span><span class="p">[</span><span class="s">'created_at_date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">base_date</span> <span class="o">+</span> <span class="n">days</span> </code></pre></div></div> <p>I then converted the pandas dataframe to parquet and wrote to Google Cloud Storage (GCS):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_path</span> <span class="o">=</span> <span class="s">"gs://my_bucket/in-the-trenches-with-spark/"</span> <span class="n">_parquet_bytes</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span> <span class="n">df</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">_parquet_bytes</span><span class="p">)</span> <span class="n">parquet_bytes</span> <span class="o">=</span> <span class="n">_parquet_bytes</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()</span> <span class="n">gcs_helper</span><span class="p">.</span><span class="n">writeBytes</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">base_path</span><span class="p">,</span> <span class="s">'transactions_skewed_part_1.parquet'</span><span class="p">),</span> <span class="n">parquet_bytes</span><span class="p">)</span> </code></pre></div></div> <p>6.5 million rows is small. I wanted something 500x as big. 
You can’t generate that in memory in one-go, so you’d either have to repeat what I did above 500 times, or just make 500 copies of the dataset with a simple bash script (much quicker).</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">NUM_FILES</span><span class="o">=</span>500 <span class="nv">BASE_PATH</span><span class="o">=</span><span class="s2">"gs://my_bucket/in-the-trenches-with-spark"</span> <span class="k">for </span>i <span class="k">in</span> <span class="si">$(</span><span class="nb">seq </span>1 <span class="nv">$NUM_FILES</span><span class="si">)</span> <span class="k">do </span>gsutil <span class="nb">cp</span> <span class="s2">"</span><span class="nv">$BASE_PATH</span><span class="s2">/transactions_skewed_part_1.parquet"</span> <span class="s2">"</span><span class="nv">$BASE_PATH</span><span class="s2">/transactions_skewed_part_</span><span class="nv">$i</span><span class="s2">.parquet"</span> <span class="k">done</span> </code></pre></div></div> <p>Note, this will naturally result in multiple rows with the same <code class="language-plaintext highlighter-rouge">transaction_id</code>, etc..but for the purposes of the examples used in this post, it doesn’t matter.</p> <h2 id="shop-dimension-dataset">Shop Dimension Dataset</h2> <p>The shop dimension dataset was created in a similar fashion, with certain countries (like the US) appearing more often than hours - this introduces another source of skew!.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shop_df_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">shop_id</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span> <span class="n">country_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'United States'</span><span class="p">,</span> <span class="s">'Canada'</span><span class="p">,</span> <span class="s">'Germany'</span><span class="p">,</span> <span class="s">'United Kingdom'</span><span class="p">,</span> <span class="s">'Denmark'</span><span class="p">,</span> <span class="s">'Hong Kong'</span><span class="p">]</span> <span class="n">country_codes</span> <span class="o">=</span> <span class="p">[</span><span class="s">'US'</span><span class="p">,</span> <span class="s">'CA'</span><span class="p">,</span> <span class="s">'DE'</span><span class="p">,</span> <span class="s">'GB'</span><span class="p">,</span> <span class="s">'DK'</span><span class="p">,</span> <span class="s">'HK'</span><span class="p">]</span> <span class="n">shop_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s">'shop_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">shop_df_size</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="s">'shop_country_code'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">country_codes</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span 
class="n">shop_df_size</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span><span class="p">),</span> <span class="s">'shop_country_name'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">country_names</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span><span class="p">),</span> <span class="s">'attribute_1'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">),</span> <span class="s">'attribute_2'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">),</span> <span class="p">})</span> </code></pre></div></div> <p>For this dataset, I just split it up into five files.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_files</span> <span class="o">=</span> <span class="mi">5</span> <span class="n">dfs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array_split</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">num_files</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">num_files</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span> <span class="n">_parquet_bytes</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span> <span class="n">dfs</span><span class="p">[</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">_parquet_bytes</span><span class="p">)</span> <span class="n">parquet_bytes</span> <span class="o">=</span> <span class="n">_parquet_bytes</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()</span> <span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">base_path</span><span class="p">,</span> <span class="s">'shop_dimension_{0}.parquet'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">print</span><span class="p">(</span><span class="s">'Writing parquet file {0}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">path</span><span class="p">))</span> <span class="n">gcs_helper</span><span class="p">.</span><span 
class="n">writeBytes</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">parquet_bytes</span><span class="p">)</span> </code></pre></div></div> <p>The resulting datasets are shown below (using <a href="https://gethue.com/">Apache Hue</a>’s file explorer):</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/dummy-data.png" /> </p>ianwhitestoneSpark from 100ft2021-11-07T00:00:00+00:002021-11-07T00:00:00+00:00https://ianwhitestone.work//spark-from-100ft<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" /> <!-- <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> --> <!-- Twitter cards --> <meta name="twitter:site" content="@ianwhitestone" /> <meta name="twitter:creator" content="@ianwhitestone" /> <meta name="twitter:title" content="Spark from 100ft" /> <meta name="twitter:description" content="A high level overview of how Spark works for beginners or those looking for a refresher" /> <meta name="twitter:card" content="summary_large_image" /> <meta name="twitter:image" content="https://ianwhitestone.work/images/spark-from-100ft/cover.png" /> <!-- end of Twitter cards --> <ul id="markdown-toc"> <li><a href="#architecture-overview--common-terminology" id="markdown-toc-architecture-overview--common-terminology">Architecture Overview &amp; Common Terminology</a></li> <li><a href="#example-1-aggregating-transaction-amounts-by-app" id="markdown-toc-example-1-aggregating-transaction-amounts-by-app">Example 1: Aggregating transaction amounts by app</a> <ul> <li><a href="#sample-code" id="markdown-toc-sample-code">Sample Code</a></li> <li><a href="#execution-overview" id="markdown-toc-execution-overview">Execution Overview</a> <ul> <li><a href="#stage-1" id="markdown-toc-stage-1">Stage 1</a></li> <li><a href="#shuffle--stage-2" id="markdown-toc-shuffle--stage-2">Shuffle + Stage 2</a></li> </ul> </li> </ul> </li> <li><a href="#example-2-enrich-a-set-of-user-events-in-a-particular-timeframe" id="markdown-toc-example-2-enrich-a-set-of-user-events-in-a-particular-timeframe">Example 2: Enrich a set of user events in a particular timeframe</a> <ul> <li><a href="#sample-code-1" id="markdown-toc-sample-code-1">Sample Code</a></li> <li><a href="#execution-overview-1" id="markdown-toc-execution-overview-1">Execution Overview</a></li> </ul> </li> <li><a href="#notes" id="markdown-toc-notes">Notes</a></li> </ul> <p align="center"> <img width="50%" src="/images/spark-from-100ft/cover.png" /> </p> <p><a href="https://en.wikipedia.org/wiki/Apache_Spark">Apache Spark</a> is an open-source framework for large-scale data analytics. Large-scale data processing is achieved by leveraging a cluster of computers and dividing the work among them. Spark came after the <a href="https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop MapReduce</a> framework, offering much faster perforamnce since data is retained in memory instead of being written to disk after each step. It’s available in multiple languages (Scala, Java, Python, R) and offers batch and stream based processing, a machine learning library, and graph data processing. Based on my experience, it is most commonly used for batch data processing. It is also rarely understood. 
To help with that, this is a quick post for beginners to better understand Spark at a high level (~100ft +/- some), or those with some experience looking for a refresher.</p> <h1 id="architecture-overview--common-terminology">Architecture Overview &amp; Common Terminology</h1> <p align="center"> <img width="100%" src="/images/spark-from-100ft/cluster-overview.png" /> </p> <p>A Spark cluster consists of a single <strong>driver</strong> and (usually) a bunch of <strong>executors</strong>. The <strong>driver</strong> is responsible for the orchestration of the job. Your Spark code is submitted to the <strong>driver</strong>, which converts your program into a bunch of <strong>tasks</strong> that run on the <strong>executors</strong>. The <strong>driver</strong> is generally not interacting directly with the data<sup>1</sup>. Instead, the work happens on the <strong>executors</strong>. Conceptually, you can think of an <strong>executor</strong> as a “single computer”<sup>2</sup> with a single Java VM running Spark. It has dedicated memory, CPUs and disk space<sup>3</sup>. <strong>Executors</strong> run tasks in parallel across multiple threads (cores), so parallelism in a Spark cluster is achieved both across and within executors.</p> <p>With Spark, your dataset will be split up into a bunch of distributed “chunks”, which we call <strong>partitions</strong>. A <strong>task</strong> is then a unit of work that is run on a single partition, on a single executor.</p> <p>Broadly speaking, there are two types of work: <strong>transformations</strong> and <strong>actions</strong>. A <strong>transformation</strong> is anything that creates a new dataset (filter, map, sort, group by, join, etc.). An <strong>action</strong> is anything that triggers the actual execution<sup>4</sup> of your Spark code (count, collect, write, top, take).</p> <p>If we look at the following PySpark code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">event_logs_df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'event_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">event_dimension_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'event_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'user_id'</span><span class="p">,</span> <span class="s">'event_at'</span><span class="p">,</span> <span class="s">'event_type'</span><span class="p">])</span> <span class="p">.</span><span class="n">collect</span><span class="p">()</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">join</code>, and <code class="language-plaintext highlighter-rouge">select</code> are all <strong>transformations</strong> and <code class="language-plaintext highlighter-rouge">collect</code> (which asks for all executors to send their data back to the driver) is an <strong>action</strong>.</p> <p>An action triggers a <strong>job</strong>, which is a way 
to group together all the <strong>tasks</strong> involved in that computation. A <strong>job</strong> will consist of a collection of <strong>stages</strong>, which are in turn a collection of <strong>transformations</strong>. A new <strong>stage</strong> gets created whenever there is a <strong>shuffle</strong>.</p> <p>A <strong>shuffle</strong> is a mechanism for redistributing data so that it’s grouped differently across partitions. <strong>Shuffles</strong> are required by sort-merge joins, sort, groupBy, and distinct operations. If you think about making a distributed join work, you can imagine that you’d need to re-distribute (shuffle) your data such that all records with the same join key(s) are written to the same <strong>partition</strong> (and consequently the same <strong>executor</strong>). Only once these records are living on the same machine can Spark do the corresponding join to match the records in each dataset. <strong>Shuffles</strong> are complex &amp; costly operations since they involve serializing and copying data across <strong>executors</strong> in a cluster.</p> <p>Let’s try and ground all this in some examples.</p> <h1 id="example-1-aggregating-transaction-amounts-by-app">Example 1: Aggregating transaction amounts by app</h1> <h2 id="sample-code">Sample Code</h2> <p>Imagine we have a dataset that contains 1 row per transaction. Each transaction has some information about it, like when it was <code class="language-plaintext highlighter-rouge">created_at</code>, the <code class="language-plaintext highlighter-rouge">api_client_id</code> that was responsible for the transaction, and the <code class="language-plaintext highlighter-rouge">amount</code> (# of units) that were processed in the transaction.</p> <p>Say we want to bucket these <code class="language-plaintext highlighter-rouge">api_client_ids</code> into a particular <code class="language-plaintext highlighter-rouge">app_grouping</code> and see how much each <code class="language-plaintext highlighter-rouge">app_grouping</code> has processed since 2020-01-01. 
Written in SQL, this would look something like this:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">WITH</span> <span class="n">trxns_cleaned</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="k">CASE</span> <span class="k">WHEN</span> <span class="n">api_client_id</span><span class="o">=</span><span class="mi">123</span> <span class="k">THEN</span> <span class="s1">'A'</span> <span class="k">WHEN</span> <span class="n">api_client_id</span> <span class="k">IN</span> <span class="p">(</span><span class="mi">456</span><span class="p">,</span> <span class="mi">789</span><span class="p">)</span> <span class="k">THEN</span> <span class="s1">'B'</span> <span class="k">ELSE</span> <span class="s1">'C'</span> <span class="k">END</span> <span class="k">AS</span> <span class="n">app_grouping</span><span class="p">,</span> <span class="n">amount</span> <span class="k">FROM</span> <span class="n">transactions</span> <span class="k">WHERE</span> <span class="n">created_at</span> <span class="o">&gt;=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2020-01-01'</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="n">app_grouping</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span> <span class="k">AS</span> <span class="n">amount_processed</span> <span class="k">FROM</span> <span class="n">trxns_cl</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span> </code></pre></div></div> <p>And the corresponding PySpark code could look like this (assuming we’ll write the final results to disk somewhere as a set of Parquet files):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">trxns_cleaned</span> <span class="o">=</span> <span class="p">(</span> <span class="n">df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'created_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">withColumn</span><span class="p">(</span> <span class="s">'app_grouping'</span><span class="p">,</span> <span class="n">F</span><span class="p">.</span><span class="n">when</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'api_client_id'</span><span class="p">)</span> <span class="o">==</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="mi">123</span><span class="p">),</span> <span class="s">'A'</span><span class="p">)</span> <span class="p">.</span><span class="n">when</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'api_client_id'</span><span class="p">).</span><span class="n">isin</span><span class="p">([</span><span class="mi">456</span><span class="p">,</span> <span class="mi">789</span><span class="p">]),</span> <span class="s">'B'</span><span class="p">)</span> <span class="p">.</span><span 
class="n">otherwise</span><span class="p">(</span><span class="s">'C'</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_cleaned</span> <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'app_grouping'</span><span class="p">)</span> <span class="p">.</span><span class="n">agg</span><span class="p">(</span> <span class="n">F</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="s">'amount'</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'amount_processed'</span><span class="p">)</span> <span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'app_grouping'</span><span class="p">,</span> <span class="s">'amount_processed'</span><span class="p">])</span> <span class="p">)</span> <span class="n">output</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">"result.parquet"</span><span class="p">)</span> </code></pre></div></div> <h2 id="execution-overview">Execution Overview</h2> <p>The <code class="language-plaintext highlighter-rouge">output.write</code> line above is an <strong>action</strong>, which will trigger the <strong>job</strong> represented below. In this example job, we can see that Spark will read a bunch of files from cloud storage. Each file maps to one <strong>partition</strong>, the default behaviour in Spark. Our example job has two stages due to the shuffle required by the <code class="language-plaintext highlighter-rouge">groupBy</code> transformation.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-1-full.png" /> </p> <h3 id="stage-1">Stage 1</h3> <p>In the first stage, we can see four different <strong>tasks</strong> being performed on each partition:</p> <ul> <li><strong>FileScan</strong>: this operation the reads the selected columns from the file<sup>5</sup> into memory</li> <li><strong>Filter</strong>: Remove any transactions created before 2020-01-01</li> <li><strong>Project</strong>: Select the columns we care about and create the new <code class="language-plaintext highlighter-rouge">app_grouping</code> column</li> <li><strong>HashAggregate</strong>: An initial aggregation that occurs on each partition prior to shuffling, as part of the <code class="language-plaintext highlighter-rouge">groupBy app_grouping</code> operation. 
<p align="center"> <img width="75%" src="/images/spark-from-100ft/example-1-part-1.png" /> </p> <p>You can see what some example data looks like in a single <strong>partition</strong> after each <strong>task</strong> (transformation) is performed on it:</p> <p align="center"> <img width="85%" src="/images/spark-from-100ft/example-1-part-1-w-data.png" /> </p> <h3 id="shuffle--stage-2">Shuffle + Stage 2</h3> <p>In order to aggregate all the transaction amounts processed by each <code class="language-plaintext highlighter-rouge">app_grouping</code>, we need to first perform a <strong>shuffle</strong> to move all records for each <code class="language-plaintext highlighter-rouge">app_grouping</code> across all <code class="language-plaintext highlighter-rouge">partitions</code> in stage 1 onto the same <code class="language-plaintext highlighter-rouge">partition</code> in stage 2. Because partitions will live on different <strong>executors</strong>, this <strong>shuffle</strong> will have to distribute data across the network. Additionally, the new partitions must be small enough to fit on a single executor.<sup>6</sup></p> <p align="center"> <img width="75%" src="/images/spark-from-100ft/example-1-part-2-w-executors.png" /> </p> <p>This is best understood by looking at some example data. You can imagine that each partition will contain data for all three <code class="language-plaintext highlighter-rouge">app_groupings</code>: A, B and C. All the A’s need to get sent to the same partition, all the B’s to another partition, etc. Once the data has been distributed into these new partitions, a final <code class="language-plaintext highlighter-rouge">HashAggregate</code> step can be performed to finish summing the <code class="language-plaintext highlighter-rouge">amounts</code> processed by each <code class="language-plaintext highlighter-rouge">app_grouping</code>. A final <code class="language-plaintext highlighter-rouge">Project</code> transformation is applied to select the desired columns prior to writing the results back to disk.</p> <p align="center"> <img width="85%" src="/images/spark-from-100ft/example-1-part-2-w-data.png" /> </p> <h1 id="example-2-enrich-a-set-of-user-events-in-a-particular-timeframe">Example 2: Enrich a set of user events in a particular timeframe</h1> <h2 id="sample-code-1">Sample Code</h2> <p>Let’s pretend we work at <del>Facebook</del> Meta and have a dataset of <code class="language-plaintext highlighter-rouge">user_event_logs</code>, which contains 1 row for every user event. The user events are categorized by an <code class="language-plaintext highlighter-rouge">event_id</code>, which can be looked up in another dataset we’ll call <code class="language-plaintext highlighter-rouge">user_event_dimension</code>. For example, <code class="language-plaintext highlighter-rouge">event_id = 1</code> may be a “Like” and <code class="language-plaintext highlighter-rouge">event_id = 2</code> could be a “Post”.</p> <p>We want to create a dataset with all user events since 2020-01-01. Instead of seeing the <code class="language-plaintext highlighter-rouge">event_id</code>, we want to see the actual <code class="language-plaintext highlighter-rouge">event_type</code> so we’ll join to the <code class="language-plaintext highlighter-rouge">user_event_dimension</code> to enrich our dataset.</p>
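<p>Before diving into the query, here’s a rough sketch of how the two input DataFrames used below might get set up. The <code class="language-plaintext highlighter-rouge">SparkSession</code> boilerplate and the Parquet paths are placeholders assumed purely for illustration, not part of the original example:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations for the two datasets described above
user_event_logs_df = spark.read.parquet("s3://some-bucket/user_event_logs/")
user_event_dimension_df = spark.read.parquet("s3://some-bucket/user_event_dimension/")
</code></pre></div></div>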
<p>Here’s what this data pull would look like in plain SQL:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">WITH</span> <span class="n">cleaned_logs</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">event_id</span><span class="p">,</span> <span class="n">event_at</span> <span class="k">FROM</span> <span class="n">user_event_logs</span> <span class="k">WHERE</span> <span class="n">event_at</span> <span class="o">&gt;=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2020-01-01'</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">event_at</span><span class="p">,</span> <span class="n">event_type</span> <span class="k">FROM</span> <span class="n">cleaned_logs</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">user_event_dimension</span> <span class="k">ON</span> <span class="n">cleaned_logs</span><span class="p">.</span><span class="n">event_id</span><span class="o">=</span><span class="n">user_event_dimension</span><span class="p">.</span><span class="n">event_id</span> </code></pre></div></div> <p>And the corresponding PySpark:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">user_event_logs_df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'event_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">user_event_dimension_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'event_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'user_id'</span><span class="p">,</span> <span class="s">'event_at'</span><span class="p">,</span> <span class="s">'event_type'</span><span class="p">])</span> <span class="p">.</span><span class="n">collect</span><span class="p">()</span> <span class="p">)</span> </code></pre></div></div> <h2 id="execution-overview-1">Execution Overview</h2> <p>The <code class="language-plaintext highlighter-rouge">.collect</code> line above is an <strong>action</strong>, which will trigger the <strong>job</strong> represented below. Our example job has three stages: one for each dataset, and one post-shuffle stage for the join. Similar to the <code class="language-plaintext highlighter-rouge">groupBy</code> in the previous example, all data for each join key needs to be co-located on the same executor in order to perform the operation. In this example, that means all <code class="language-plaintext highlighter-rouge">event_id</code>s from each dataset must get sent to the same executor.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-2-no-broadcast.png" /> </p>
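<p>As a quick aside, none of the transformations chained above do any work on their own; everything is deferred until the <code class="language-plaintext highlighter-rouge">.collect()</code> action fires (see note 4 below). A minimal sketch of the same pipeline, written out step by step, makes this easier to see:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each of these returns immediately -- Spark is only building up a plan
filtered = user_event_logs_df.filter(F.col('event_at') &gt;= F.lit('2020-01-01'))
joined = filtered.join(user_event_dimension_df, on='event_id')
enriched = joined.select(['user_id', 'event_at', 'event_type'])

# Only now does Spark read the files, shuffle, join and send rows back to the driver
rows = enriched.collect()
</code></pre></div></div>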
<p>You can see how this shakes out below with some example data. Each dataset is read, with the <code class="language-plaintext highlighter-rouge">user_event_logs</code> dataset (in green) being filtered. After the shuffle, all the “Likes” are sent to the same executor, along with all the “Posts” and “Shares”. Once they are co-located, the join can happen and our final dataset with the new set of columns (<code class="language-plaintext highlighter-rouge">user_id</code>, <code class="language-plaintext highlighter-rouge">event_type</code> and <code class="language-plaintext highlighter-rouge">event_at</code>) can be sent back to the driver for further analysis.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-2-no-broadcast-w-data.png" /> </p> <h1 id="notes">Notes</h1> <p><sup>1</sup> In most batch Spark applications, the driver doesn’t actually read or process the data. It may do things like index your filesystem to find out how many files exist in order to figure out how many partitions there will be, but the actual reading and processing of the data will happen on the executors. In a common Spark ETL job, data or results will generally never come back to the driver. Some exceptions to this are things like broadcast joins, or intermediate operations that calculate results used later in the job (e.g. calculating an array of frequently occurring values and then using those in a downstream filter/operation), since these operations send data back to the driver.</p> <p><sup>2</sup> Oftentimes, you’ll actually have multiple executors living in containers on the same compute instance, so they aren’t actually their own physical computers, but instead virtual ones.</p> <p><sup>3</sup> An executor is shown as having its own disk space in the diagram, but again, because multiple executors may live on the same host machine, this will not always be true.</p> <p><sup>4</sup> Spark code is lazily evaluated. This means that your code won’t actually execute until you intentionally call a particular <strong>action</strong> that triggers the evaluation. Some advantages of this are described <a href="https://stackoverflow.com/questions/38027877/spark-transformation-why-is-it-lazy-and-what-is-the-advantage">here</a>.</p> <p><sup>5</sup> With popular file formats like Parquet, you can read in only the columns you care about, rather than reading in all columns (which happens when you read a CSV or any plain text file).</p> <p><sup>6</sup> In this diagram it looks like each executor only gets 1 partition in some cases. In reality this will not be the case, and it would be really inefficient. Executors will hold and process many partitions.</p>ianwhitestoneIn the trenches with Spark2021-11-02T00:00:00+00:002021-11-02T00:00:00+00:00https://ianwhitestone.work//spark-trenches