Ian Whitestone (🔎 📈 🐍), https://ianwhitestone.work//feed.xml, generated by Jekyll on 2023-10-04T14:37:11+00:00

Recent posts in this feed:
- Extend the Runway: A deep dive into Snowflake costs at Coalesce 2022 (2022-10-27): https://ianwhitestone.work//extend-the-runway
- Calculating cost per query in Snowflake (2022-10-12): https://ianwhitestone.work//cost-per-query
- Snowflake Performance Tuning and Cost Optimization (2022-09-28): https://ianwhitestone.work//snowflake-toronto-user-group-sep
- Snowflake Optimization Power Hour (2022-09-15): https://ianwhitestone.work//snowflake-optimization-power-hour
- Snowflake Architecture Overview (2022-09-12): https://ianwhitestone.work//snowflake-architecture
- What’s up with DuckDB? (2022-08-17): https://ianwhitestone.work//whats-up-with-duckdb

Farewell, Shopify ❤️ (2022-06-30): https://ianwhitestone.work//farewell-shopify

<h2>👋</h2> <p>After three and a half beautiful years, today’s my last day at Shopify. Most people silently exit a company. I was planning to do just that, until my director <a href="https://www.linkedin.com/in/mike-develin-20616b59/">Mike</a> encouraged me to write a farewell note. He said we need to celebrate people moving on and not let it go unnoticed. After some brief thought, I agreed. It would be a shame to not reflect on what I’ve gained from this experience, and acknowledge that it’s exactly because of my time here that I’m ready for what’s next. It’s tough to cover everything I’ve learned here, so I’m going to focus on two buckets which have had the largest impact on me.</p> <h2 id="product-craft"><strong>product craft</strong></h2> <p>As an incoming data scientist, learning about the art of product was something I did not anticipate. I joined a brand new product area with ~7 other people (1 PM, 4 devs and 2 UX), which eventually grew into an org of over 100 people responsible for <a href="https://www.shopify.ca/markets">Shopify Markets</a> and a new tax platform. Of course, this wasn’t a linear journey. We had false starts and features we ended up killing. There were even periods early on where we as a team were close to getting dissolved. All of these bumps along the way came with great universal lessons. I learned to fall in love with problems, and not solutions. To dream big, but start small. I saw first hand how big opportunities will always be sitting right in front of you, you just have to reach and grab them. This was relevant 3 years ago at Shopify and remains true today. Work hard, eyes open.</p> <p>Being tightly embedded in a multi-disciplinary group will give you the opportunity to learn from experts in other crafts, you just need to take some initiative. I got to witness <a href="https://twitter.com/HeatherMcGaw">Heather</a> run world class user research sessions, because I simply asked to join. I learned how we manage complex product rollouts, handle production incidents, and develop in large scale codebases because I invested in relationships with our amazing devs and became a sponge.</p> <p>Regardless of what team you work on, one of the best features of working at Shopify is you get the closest thing possible to root level access to Tobi’s brain. Every couple months, I’d do a slack search for <code class="language-plaintext highlighter-rouge">from:@Tobi Lütke</code> and learn how he was thinking about the way things were built.
One day it was <em>“Don’t stack abstractions”</em> in response to a discussion around abstracting <a href="https://guides.rubyonrails.org/active_record_basics.html#what-is-active-record-questionmark">ActiveRecord</a><sup>1</sup>. Another time it was the importance of setting good defaults in our product so everything just works out of the box. When deciding whether or not something should be built, he’d talk about the importance of having strong opinions and building based on that, rather than waiting for customer demand. Getting front row access to Tobi’s principled thinking and relentless focus on simplicity was easily one of the best things about working here.</p> <h2 id="data-craft"><strong>data craft</strong></h2> <p>As close as I was to the product, I still spent 90% of my time living and breathing data. Shopify’s data team came about in 2014, back when none of the <a href="https://mattturck.com/data2021/">“Modern Data Stack”</a> existed. Like other big tech companies from that era, they were forced to build many of the frameworks and tools that exist today as standalone companies.</p> <p>As a <a href="https://ianwhitestone.work/slides-v2/data-science-at-shopify.html">full stack data scientist</a>, you get exposure to the data stack end to end and the people who built it. From data extraction and all the pitfalls with change data capture or deletes. To event tracking with kafka and the joys of duplicates, missed events and late arriving data. Out of memory errors, disk spill and lost containers<sup>2</sup>. Slow SQL queries and figuring out when it makes sense to build a new data model. We exist to add value with data, and navigating this stack and learning the ins and outs of each system was one of the favourite parts of my job.</p> <p>Of course, I wasn’t alone in these endeavours. Across all crafts at Shopify, you’ll be surrounded with senior members who’ve been at it for 5 times as long as you have<sup>3</sup>. Take advantage of these opportunities and learn from the best. Be vocal and share your feedback about the platform. I did this frequently, and as a result got to participate in helping shape some of the new tooling we built.</p> <p>Working in an end to end nature also allows you to see the full data value chain. I got to work on analysis that unblocked key product decisions, ran experiments that resulted in shipping changes that positively impacted millions of merchant’s businesses, and built data-driven products that abstracted away some of the <a href="https://www.shopify.ca/blog/us-canada-sales-tax-insights">gnarlier aspects of commerce</a>. Getting exposure to all these things takes time and persistence. Be patient, and the opportunities will come.</p> <h2 id="onwards">onwards!</h2> <p>So, what’s next? A piece of advice that’s stuck with me for a long time is something my Dad said to me; that <em>“the worst thing that can happen in life is if you look back and say what if?”</em> <sup>4</sup>. While I could happily spend my career here, I’ve always wanted to take a shot at entrepreneurship and start a company<sup>5</sup>. With kids and a mortgage a few years out, it’s quickly become clear that now is the best time. As scared as I am, I know that 80 year old Ian in a rocking chair would be full of regret if he didn’t try this.</p> <p>Without question, I’d have nowhere close to the level of confidence required to take this leap if it weren’t for my time at Shopify. 
So thanks to Tobi for creating this incredible place, and thanks to everyone I got to work with along the way. I am forever grateful.</p> <h3 id="notes">notes</h3> <p><sup>1</sup> After being asked to elaborate, Tobi expanded on his point: <em>“Abstractions are bad unless they make something new possible or something that you really need to do 10x easier. The abstractions in rails are the ones that sit at this sweetspot. Stay close to vanilla rails as you can while solving the problem you have. Only deviate if you know exactly what you are doing. Never listen to architecture astronauts. Existence of arguments in favour of an abstraction doesn’t even nearly clear the bar for adopting it.”</em></p> <p><sup>2</sup> I’m intentionally highlighting many of the more challenging aspects of working in data. Of course it’s not always like this. Yet, when things break and you push their limits is when you’ll be forced to go deep and really understand the ins and outs of how something works.</p> <p><sup>3</sup> Special shout out to Karl Taylor, Michael Styles and Khaled Hammouda, who taught me pretty much everything I know about Spark.</p> <p><sup>4</sup> Jeff Bezos said <a href="https://www.youtube.com/watch?v=jwG_qR6XmDQ">something similar</a> when deciding to leave D.E. Shaw to start Amazon.</p> <p><sup>5</sup> More on this later, but I plan to build a B2B SaaS company in the data space. I’m happiest when I’m on the steepest part of the learning curve, and there’s no doubt that entrepreneurship and wearing all the hats required to build a company will bring this.</p>👋Unpacking the Spark Web UI2021-11-14T00:00:00+00:002021-11-14T00:00:00+00:00https://ianwhitestone.work//spark-web-ui<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" /> <!-- <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> --> <!-- Twitter cards --> <meta name="twitter:site" content="@ianwhitestone" /> <meta name="twitter:creator" content="@ianwhitestone" /> <meta name="twitter:title" content="Unpacking the Spark Web UI" /> <meta name="twitter:description" content="A quick overview of how to navigate the Spark Web UI" /> <meta name="twitter:card" content="summary_large_image" /> <meta name="twitter:image" content="https://ianwhitestone.work/images/spark-web-ui/cover.png" /> <!-- end of Twitter cards --> <ul id="markdown-toc"> <li><a href="#example-job--data" id="markdown-toc-example-job--data">Example Job &amp; Data</a></li> <li><a href="#navigating-the-ui" id="markdown-toc-navigating-the-ui">Navigating the UI</a> <ul> <li><a href="#jobs" id="markdown-toc-jobs">Jobs</a></li> <li><a href="#stages" id="markdown-toc-stages">Stages</a></li> <li><a href="#sql" id="markdown-toc-sql">SQL</a></li> <li><a href="#plans" id="markdown-toc-plans">Plans</a></li> <li><a href="#storage-environment-and-executors" id="markdown-toc-storage-environment-and-executors">Storage, Environment and Executors</a></li> </ul> </li> <li><a href="#notes" id="markdown-toc-notes">Notes</a></li> <li><a href="#generating-the-dataset" id="markdown-toc-generating-the-dataset">Generating the dataset</a> <ul> <li><a href="#simulating-skewness" id="markdown-toc-simulating-skewness">Simulating Skewness</a></li> <li><a href="#transaction-dataset" id="markdown-toc-transaction-dataset">Transaction Dataset</a></li> <li><a href="#shop-dimension-dataset" 
id="markdown-toc-shop-dimension-dataset">Shop Dimension Dataset</a></li> </ul> </li> </ul> <p><br /></p> <p align="center"> <img width="80%" src="/images/spark-web-ui/cover.png" /> </p> <p><br /> The <a href="https://spark.apache.org/docs/latest/web-ui.html">Spark Web UI</a> provides an interface for users to monitor and inspect details of their Spark application. You can leverage it to answer a host of questions like:</p> <ul> <li>How long did my job take to run?</li> <li>How did the Spark optimizer decide to execute my job?</li> <li>How much disk spill was there in each stage? In each executor?</li> <li>What stage took the longest?</li> <li>Is there significant data skew?</li> </ul> <p>These capabilities make the Web UI incredibly useful. Unfortunately, it is not the easiest thing to understand. In this post I’ll provide a quick tour of the Web UI by leveraging a simple Spark job as a reference point. If your new to Spark or need a refresher on things like “jobs”, “stages” and “tasks”, I encourage you to read my <a href="/spark-from-100ft/">high level intro of Spark</a> first. It’s also important to note that everything shown in this post is using Spark v2.4.4 <sup>1</sup>.</p> <p>Onwards!</p> <h1 id="example-job--data">Example Job &amp; Data</h1> <p>We’ll imagine we have a bunch of e-commerce data, and we want to find out the maximum transaction value on each day in each country. For this example, we’ll have two datasets to help us answer this question. A <strong><code class="language-plaintext highlighter-rouge">transactions</code></strong> model with 1 row per transaction, and information like the transaction timestamp and amount.</p> <table> <thead> <tr> <th>transaction_id</th> <th>shop_id</th> <th>created_at</th> <th>currency_code</th> <th>amount</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>123</td> <td>2021-01-01 12:55:01</td> <td>USD</td> <td>25.99</td> </tr> <tr> <td>2</td> <td>123</td> <td>2021-01-01 17:22:05</td> <td>USD</td> <td>13.45</td> </tr> <tr> <td>3</td> <td>456</td> <td>2021-01-01 19:04:59</td> <td>CAD</td> <td>10.22</td> </tr> </tbody> </table> <p>The transactions model will also have a reference (<code class="language-plaintext highlighter-rouge">shop_id</code>) that links it to another model, <strong><code class="language-plaintext highlighter-rouge">shop_dimension</code></strong>, which has 1 row per shop and some metadata for that shop.</p> <table> <thead> <tr> <th>shop_id</th> <th>shop_country_name</th> <th>shop_country_code</th> </tr> </thead> <tbody> <tr> <td>123</td> <td>Canada</td> <td>CA</td> </tr> <tr> <td>456</td> <td>United States</td> <td>US</td> </tr> </tbody> </table> <p>Head to the <a href="#generating-the-dataset">notes section</a> to see the code I used to generate these two datasets. 
Using plain SQL, we could find the max transaction value per country &amp; day with:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">sd</span><span class="p">.</span><span class="n">shop_country_code</span><span class="p">,</span> <span class="n">trxns</span><span class="p">.</span><span class="n">created_at_date</span><span class="p">,</span> <span class="k">MAX</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span> <span class="k">AS</span> <span class="n">max_transaction_value</span> <span class="k">FROM</span> <span class="n">transactions</span> <span class="k">AS</span> <span class="n">trxns</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">shop_dimension</span> <span class="k">AS</span> <span class="n">sd</span> <span class="k">ON</span> <span class="n">trxns</span><span class="p">.</span><span class="n">shop_id</span><span class="o">=</span><span class="n">sd</span><span class="p">.</span><span class="n">shop_id</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span><span class="mi">2</span> </code></pre></div></div> <p>And in PySpark, the code would look something like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_skewed_df</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'shop_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'shop_country_code'</span><span class="p">,</span> <span class="s">'created_at_date'</span><span class="p">)</span> <span class="p">.</span><span class="n">agg</span><span class="p">(</span> <span class="n">F</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="s">'amount'</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'max_transaction_value'</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span> <span class="n">result</span> <span class="o">=</span> <span class="n">output</span><span class="p">.</span><span class="n">collect</span><span class="p">()</span> </code></pre></div></div> <h1 id="navigating-the-ui">Navigating the UI</h1> <h2 id="jobs">Jobs</h2> <p><code class="language-plaintext highlighter-rouge">.collect()</code> is an action, and actions trigger jobs in Spark. If you click on the <strong>Jobs</strong> tab of the UI, you’ll see a list of completed or actively running jobs. From this view, we can see a few things:</p> <ul> <li>The action that triggered the job (<code class="language-plaintext highlighter-rouge">collect at &lt;ipython-input-320-...&gt;</code>)</li> <li>The time it took (6.7 min)</li> <li>The number of stages (4) and tasks (1493)</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-3.png" /> </p> <p>When we click into our job we can see some more details, particularly around the stages. Our job has 4 stages, which makes sense since a new stage is created whenever there is a shuffle. 
We have:</p> <ul> <li>2 stages for the initial reading of each dataset (1 per dataset)</li> <li>1 for the join</li> <li>1 for the aggregation</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-4.png" /> </p> <h2 id="stages">Stages</h2> <p>From the detailed job view, we can zoom into any of the stages. I clicked on the third one (Stage 89<sup>2</sup>) where the join on <code class="language-plaintext highlighter-rouge">shop_id</code> is happening. Spark throws a bunch of information at us:</p> <ul> <li>High level stats like: <ul> <li>Shuffle Read: Total shuffle bytes of records read during the shuffle</li> <li>Shuffle Write: Bytes of records written to disk in order to be read by a shuffle in a future stage</li> <li>Shuffle Spill (Memory): The uncompressed size of data that was spilled to memory during the shuffle</li> <li>Shuffle Spill (Disk): The compressed size of data that was spilled to disk during the shuffle</li> </ul> </li> <li>Summary metrics (duration, shuffle, etc.) across all tasks, broken down by percentile</li> <li>Aggregated metrics by executor</li> </ul> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-1.png" /> </p> <p>When looking at a given stage, it can often be tricky to figure out what is actually happening in that stage. To help with this, you can use the DAG visualization to get a high level sense of what the stage is doing. Below, you can see two datasets being shuffled and merged together. Pairing this with the knowledge of our query from above, you can ultimately deduce that this is where the join is happening.</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-2.png" /> </p> <p>I intentionally <a href="#generating-the-dataset">generated a very skewed dataset</a> by having a small % of shops make up a large % of all transactions. The impact of this on our job quickly becomes evident.</p> <ol> <li>We can see there is ~20GB of disk spill happening. This is because there isn’t enough memory available to complete the tasks (shuffling and joining), so Spark must write data down to disk. This is both expensive (slow) and can potentially take down the entire node if there is too much disk spill.</li> <li>Looking at the summary metrics across all tasks, we can see that some tasks are taking much longer than others (max time = 4.9 min vs. median time = 17 seconds, that’s 17 times as long!). Similarly, some tasks have much more disk spill than others (max disk spill = 4.7GB vs. median disk spill=36.1MB, 133 times as big!). This is a direct result of our skew: performing the join for shops with a large number of transactions (records) takes longer and spills more because the data is too big!</li> <li>Looking at the aggregated metrics per executor, we can see that some executors (like #61) are spilling more data to disk than others. This is likely a function of some executors having to deal with much larger partitions than others, again thanks to the skew.</li> </ol> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-5-3.png" /> </p> <h2 id="sql">SQL</h2> <p>For most dataframe jobs<sup>3</sup>, the SQL tab can be leveraged to visualize how Spark is executing your query. You can find the query of interest by selecting the one associated with your job:</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-6.png" /> </p> <p>You’ll then be presented with a nice graphical visualization of your job.
I personally find these the most useful to diagnose what’s going on. We can see each dataset being read in and the associated size of each, the shuffle operation before the join and the eventual join. You can leverage the summary stats on this page to see things similar to what we saw on the Stage page, like the disk spill from the join!</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/spark-web-ui-7-1.png" /> </p> <p>You can hover over different parts of the query to learn more, like which dataset is being scanned (it will show the full GCS path) or how many partitions are being used in the shuffle - in this example, it is 200, the default value set by Spark (see the <code class="language-plaintext highlighter-rouge">hashpartition(shop_id#3104, 200)</code> that appears when I hover over the Exchange block).</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-7-2.gif" /> </p> <h2 id="plans">Plans</h2> <p>At the bottom of the page, you can see the different plans Spark created for your query. I only ever look at the Physical Plan, since that is what actually gets executed<sup>4</sup>:</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-web-ui-8.png" /> </p> <p>The Physical Plan tells you how the Spark optimizer will execute your job, in written form. You can use it to understand things like what join strategies are being used. Did Spark decide to try and do a broadcast join? Or you can see what filters have been pushed down to the Parquet level. The graphical representation above is generally easier to use as a starting point, but sometimes you’ll need to go into the physical plan in order to get more details not shown visually.</p> <p>Note that you can also get the physical plan outside of the Web UI, by calling the <code class="language-plaintext highlighter-rouge">explain()</code> method on your dataframe object:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_skewed_df</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'shop_id'</span><span class="p">)</span> <span class="p">...</span> <span class="p">)</span> <span class="n">output</span><span class="p">.</span><span class="n">explain</span><span class="p">()</span> </code></pre></div></div> <p align="center"> <img width="100%" src="/images/spark-web-ui/spark-physical-plan.png" /> </p> <h2 id="storage-environment-and-executors">Storage, Environment and Executors</h2> <p>I won’t go over the Storage, Environment or Executors tab, since I barely ever use these. You can read more about their use cases <a href="https://spark.apache.org/docs/latest/web-ui.html">here</a>. Very quickly:</p> <ul> <li><strong>Storage</strong> will show information about any persisted dataframes (i.e. 
if you called <code class="language-plaintext highlighter-rouge">df.persist()</code> or <code class="language-plaintext highlighter-rouge">df.cache()</code><sup>4</sup>)</li> <li><strong>Environment</strong> will tell you about the different environment and configuration variables that were set for the Spark job</li> <li><strong>Executors</strong> has information about each executor in your cluster, like disk space, the number of cores, memory usage, and more</li> </ul> <h1 id="notes">Notes</h1> <p><sup>1</sup> In Spark v3, there were some changes introduced, such as improved SQL metrics and plan visualization. Learn more <a href="https://canali.web.cern.ch/docs/WhatsNew_Spark3_Performance_Monitoring_DataAI_Summit_EU_Nov2020_LC.pdf">here</a> and <a href="https://www.waitingforcode.com/apache-spark/whats-new-apache-spark-3-ui-changes/read">here</a>.</p> <p><sup>2</sup> This is Stage 89 cause I’d run a bunch of Spark jobs prior to this one you are seeing.</p> <p><sup>3</sup> I’m not sure under what scenarios you wouldn’t see this when executing a Spark Job with dataframes.</p> <p><sup>4</sup>See this <a href="https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/">post</a> for an explanation of the differences between each plan type.</p> <p><sup>5</sup> Curious about the difference between <code class="language-plaintext highlighter-rouge">cache</code> and <code class="language-plaintext highlighter-rouge">persist</code>, see <a href="https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist">here</a>. Wondering when you should be using them? See <a href="https://stackoverflow.com/questions/44156365/when-to-cache-a-dataframe">here</a>.</p> <h1 id="generating-the-dataset">Generating the dataset</h1> <p>For the purposes of this example, I wanted the join key (<code class="language-plaintext highlighter-rouge">shop_id</code>) to be skewed in order to show how skew can be detected in the Web UI. This is also quite common in practice, no matter what your domain is. Any time you have an event-level dataset, it’s quite possible that certain users/accounts/shops generate a large portion of those events. For this example (shops generating e-commerce transactions), we could rank &amp; sort each shop based on their total transaction count, and then plot the cumulative % of total transactions as we include each shop. 
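</p>

<p>Here’s one way that curve could be computed with pandas; this is a sketch of the idea rather than the exact code behind the plots, and it assumes a transactions dataframe with a <code class="language-plaintext highlighter-rouge">shop_id</code> column like the one generated below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

def cumulative_share_of_transactions(trxns: pd.DataFrame) -&gt; pd.Series:
    # Transactions per shop, with the highest-volume shops first
    counts = trxns.groupby("shop_id").size().sort_values(ascending=False)
    # Cumulative % of all transactions as each additional shop is included
    return 100 * counts.cumsum() / counts.sum()

# e.g. cumulative_share_of_transactions(df).reset_index(drop=True).plot()
</code></pre></div></div>

<p>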
You can see what this would theoretically look like for a skewed and un-skewed dataset:</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/example-trxns-skew-1.png" /> </p> <h2 id="simulating-skewness">Simulating Skewness</h2> <p>To simulate a high degree of skewness, I sampled from a chi-squared distribution.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ids</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">chisquare</span><span class="p">(</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span><span class="o">*</span><span class="mi">100000</span> <span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">100</span><span class="p">);</span> </code></pre></div></div> <p>The resulting shop transaction frequency plot looks like this:</p> <p align="center"> <img width="80%" src="/images/spark-web-ui/example-trxns-skew-2.png" /> </p> <p>Running some quick analysis on this, we can see that 11% of all transactions come from a single shop in this artifical dataset:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">ids</span><span class="p">).</span><span class="n">describe</span><span class="p">()</span> <span class="n">count</span> <span class="mf">1.000000e+04</span> <span class="n">mean</span> <span class="mf">3.370065e+04</span> <span class="n">std</span> <span class="mf">8.156347e+04</span> <span class="nb">min</span> <span class="mf">1.000000e+00</span> <span class="mi">25</span><span class="o">%</span> <span class="mf">4.600000e+01</span> <span class="mi">50</span><span class="o">%</span> <span class="mf">2.676000e+03</span> <span class="mi">75</span><span class="o">%</span> <span class="mf">2.833600e+04</span> <span class="nb">max</span> <span class="mf">1.702097e+06</span> <span class="o">&gt;&gt;&gt;</span> <span class="mf">100.0</span><span class="o">*</span><span class="n">ids</span><span class="p">[</span><span class="n">ids</span> <span class="o">==</span> <span class="mi">1</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">ids</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="mf">11.19</span> </code></pre></div></div> <h2 id="transaction-dataset">Transaction Dataset</h2> <p>Both datasets were generated through a combination of pandas and numpy. The generated <code class="language-plaintext highlighter-rouge">transactions</code> dataset had 6.5 million rows (I played around with this until each file was ~120MB, a good aproximate size (compressed) for a single partition in Spark). 
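</p>

<p>If you want to repeat that sizing exercise, one way to check the compressed size without writing anything out is to serialize the dataframe to an in-memory parquet buffer, the same trick used for the GCS uploads below. A minimal sketch (the helper name and MB conversion are mine, not from the original code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import io

def compressed_parquet_size_mb(df):
    # Serialize to an in-memory parquet buffer and measure the compressed size
    buffer = io.BytesIO()
    df.to_parquet(buffer)
    return buffer.getbuffer().nbytes / 1024 / 1024
</code></pre></div></div>

<p>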
You can see I leverage the same chi-squared distribution from above to randomly generate shop_ids, with the smaller shop_ids occuring much more frequently. While I didn’t leverage this in this post, I also made the dataset skewed by <code class="language-plaintext highlighter-rouge">currency_code</code>, by specifying that 80% of transactions would be USD, 2% CAD, 10% EUR, etc. All transactions were set to occur across 10 days.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="mi">6500000</span> <span class="c1"># 6.5 million rows </span> <span class="n">currencies</span> <span class="o">=</span> <span class="p">[</span><span class="s">'USD'</span><span class="p">,</span> <span class="s">'CAD'</span><span class="p">,</span> <span class="s">'EUR'</span><span class="p">,</span> <span class="s">'GBP'</span><span class="p">,</span> <span class="s">'DKK'</span><span class="p">,</span> <span class="s">'HKD'</span><span class="p">]</span> <span class="n">currency_probas</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.015</span><span class="p">,</span> <span class="mf">0.015</span><span class="p">]</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s">'transaction_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="s">'shop_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">chisquare</span><span class="p">(</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">)</span><span class="o">*</span><span class="mi">100000</span> <span class="p">),</span> <span class="s">'_days_since_base'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">),</span> <span class="s">'currency_code'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span> <span class="n">currencies</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span> <span class="p">),</span> <span class="s">'amount'</span><span class="p">:</span> <span class="n">np</span><span 
class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">exponential</span><span class="p">(</span><span class="mi">50</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">)</span> <span class="p">})</span> <span class="n">df</span><span class="p">[</span><span class="s">'base_date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">days</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">TimedeltaIndex</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'_days_since_base'</span><span class="p">],</span> <span class="n">unit</span><span class="o">=</span><span class="s">'D'</span><span class="p">)</span> <span class="n">df</span><span class="p">[</span><span class="s">'created_at_date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">base_date</span> <span class="o">+</span> <span class="n">days</span> </code></pre></div></div> <p>I then converted the pandas dataframe to parquet and wrote to Google Cloud Storage (GCS):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_path</span> <span class="o">=</span> <span class="s">"gs://my_bucket/in-the-trenches-with-spark/"</span> <span class="n">_parquet_bytes</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span> <span class="n">df</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">_parquet_bytes</span><span class="p">)</span> <span class="n">parquet_bytes</span> <span class="o">=</span> <span class="n">_parquet_bytes</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()</span> <span class="n">gcs_helper</span><span class="p">.</span><span class="n">writeBytes</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">base_path</span><span class="p">,</span> <span class="s">'transactions_skewed_part_1.parquet'</span><span class="p">),</span> <span class="n">parquet_bytes</span><span class="p">)</span> </code></pre></div></div> <p>6.5 million rows is small. I wanted something 500x as big. 
You can’t generate that in memory in one-go, so you’d either have to repeat what I did above 500 times, or just make 500 copies of the dataset with a simple bash script (much quicker).</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">NUM_FILES</span><span class="o">=</span>500 <span class="nv">BASE_PATH</span><span class="o">=</span><span class="s2">"gs://my_bucket/in-the-trenches-with-spark"</span> <span class="k">for </span>i <span class="k">in</span> <span class="si">$(</span><span class="nb">seq </span>1 <span class="nv">$NUM_FILES</span><span class="si">)</span> <span class="k">do </span>gsutil <span class="nb">cp</span> <span class="s2">"</span><span class="nv">$BASE_PATH</span><span class="s2">/transactions_skewed_part_1.parquet"</span> <span class="s2">"</span><span class="nv">$BASE_PATH</span><span class="s2">/transactions_skewed_part_</span><span class="nv">$i</span><span class="s2">.parquet"</span> <span class="k">done</span> </code></pre></div></div> <p>Note, this will naturally result in multiple rows with the same <code class="language-plaintext highlighter-rouge">transaction_id</code>, etc..but for the purposes of the examples used in this post, it doesn’t matter.</p> <h2 id="shop-dimension-dataset">Shop Dimension Dataset</h2> <p>The shop dimension dataset was created in a similar fashion, with certain countries (like the US) appearing more often than hours - this introduces another source of skew!.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shop_df_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">shop_id</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span> <span class="n">country_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'United States'</span><span class="p">,</span> <span class="s">'Canada'</span><span class="p">,</span> <span class="s">'Germany'</span><span class="p">,</span> <span class="s">'United Kingdom'</span><span class="p">,</span> <span class="s">'Denmark'</span><span class="p">,</span> <span class="s">'Hong Kong'</span><span class="p">]</span> <span class="n">country_codes</span> <span class="o">=</span> <span class="p">[</span><span class="s">'US'</span><span class="p">,</span> <span class="s">'CA'</span><span class="p">,</span> <span class="s">'DE'</span><span class="p">,</span> <span class="s">'GB'</span><span class="p">,</span> <span class="s">'DK'</span><span class="p">,</span> <span class="s">'HK'</span><span class="p">]</span> <span class="n">shop_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s">'shop_id'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">shop_df_size</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="s">'shop_country_code'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">country_codes</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span 
class="n">shop_df_size</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span><span class="p">),</span> <span class="s">'shop_country_name'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">country_names</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">currency_probas</span><span class="p">),</span> <span class="s">'attribute_1'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">),</span> <span class="s">'attribute_2'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">shop_df_size</span><span class="p">),</span> <span class="p">})</span> </code></pre></div></div> <p>For this dataset, I just split it up into five files.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_files</span> <span class="o">=</span> <span class="mi">5</span> <span class="n">dfs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array_split</span><span class="p">(</span><span class="n">shop_df</span><span class="p">,</span> <span class="n">num_files</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">num_files</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span> <span class="n">_parquet_bytes</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span> <span class="n">dfs</span><span class="p">[</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">_parquet_bytes</span><span class="p">)</span> <span class="n">parquet_bytes</span> <span class="o">=</span> <span class="n">_parquet_bytes</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()</span> <span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">base_path</span><span class="p">,</span> <span class="s">'shop_dimension_{0}.parquet'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">print</span><span class="p">(</span><span class="s">'Writing parquet file {0}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">path</span><span class="p">))</span> <span class="n">gcs_helper</span><span class="p">.</span><span 
class="n">writeBytes</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">parquet_bytes</span><span class="p">)</span> </code></pre></div></div> <p>The resulting datasets are shown below (using <a href="https://gethue.com/">Apache Hue</a>’s file explorer):</p> <p align="center"> <img width="100%" src="/images/spark-web-ui/dummy-data.png" /> </p>ianwhitestoneSpark from 100ft2021-11-07T00:00:00+00:002021-11-07T00:00:00+00:00https://ianwhitestone.work//spark-from-100ft<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" /> <!-- <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> --> <!-- Twitter cards --> <meta name="twitter:site" content="@ianwhitestone" /> <meta name="twitter:creator" content="@ianwhitestone" /> <meta name="twitter:title" content="Spark from 100ft" /> <meta name="twitter:description" content="A high level overview of how Spark works for beginners or those looking for a refresher" /> <meta name="twitter:card" content="summary_large_image" /> <meta name="twitter:image" content="https://ianwhitestone.work/images/spark-from-100ft/cover.png" /> <!-- end of Twitter cards --> <ul id="markdown-toc"> <li><a href="#architecture-overview--common-terminology" id="markdown-toc-architecture-overview--common-terminology">Architecture Overview &amp; Common Terminology</a></li> <li><a href="#example-1-aggregating-transaction-amounts-by-app" id="markdown-toc-example-1-aggregating-transaction-amounts-by-app">Example 1: Aggregating transaction amounts by app</a> <ul> <li><a href="#sample-code" id="markdown-toc-sample-code">Sample Code</a></li> <li><a href="#execution-overview" id="markdown-toc-execution-overview">Execution Overview</a> <ul> <li><a href="#stage-1" id="markdown-toc-stage-1">Stage 1</a></li> <li><a href="#shuffle--stage-2" id="markdown-toc-shuffle--stage-2">Shuffle + Stage 2</a></li> </ul> </li> </ul> </li> <li><a href="#example-2-enrich-a-set-of-user-events-in-a-particular-timeframe" id="markdown-toc-example-2-enrich-a-set-of-user-events-in-a-particular-timeframe">Example 2: Enrich a set of user events in a particular timeframe</a> <ul> <li><a href="#sample-code-1" id="markdown-toc-sample-code-1">Sample Code</a></li> <li><a href="#execution-overview-1" id="markdown-toc-execution-overview-1">Execution Overview</a></li> </ul> </li> <li><a href="#notes" id="markdown-toc-notes">Notes</a></li> </ul> <p align="center"> <img width="50%" src="/images/spark-from-100ft/cover.png" /> </p> <p><a href="https://en.wikipedia.org/wiki/Apache_Spark">Apache Spark</a> is an open-source framework for large-scale data analytics. Large-scale data processing is achieved by leveraging a cluster of computers and dividing the work among them. Spark came after the <a href="https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop MapReduce</a> framework, offering much faster perforamnce since data is retained in memory instead of being written to disk after each step. It’s available in multiple languages (Scala, Java, Python, R) and offers batch and stream based processing, a machine learning library, and graph data processing. Based on my experience, it is most commonly used for batch data processing. It is also rarely understood. 
To help with that, this is a quick post for beginners to better understand Spark at a high level (~100ft +/- some), or those with some experience looking for a refresher.</p> <h1 id="architecture-overview--common-terminology">Architecture Overview &amp; Common Terminology</h1> <p align="center"> <img width="100%" src="/images/spark-from-100ft/cluster-overview.png" /> </p> <p>A Spark cluster consists of a single <strong>driver</strong> and (usually) a bunch of <strong>executors</strong>. The <strong>driver</strong> is responsible for the orchestration of the job. Your Spark code is submitted to the <strong>driver</strong>, which converts your program into a bunch of <strong>tasks</strong> that run on the <strong>executors</strong>. The <strong>driver</strong> is generally not interacting directly with the data<sup>1</sup>. Instead, the work happens on the <strong>executors</strong>. Conceptually, you can think of an <strong>executor</strong> as a “single computer”<sup>2</sup> with a single Java VM running Spark. It has dedicated memory, CPUs and disk space<sup>3</sup>. <strong>Executors</strong> run tasks in parallel across multiple threads (cores), so parallelism in a Spark cluster is achieved both across and within executors.</p> <p>With Spark, your dataset will be split up into a bunch of distributed “chunks”, which we call <strong>partitions</strong>. A <strong>task</strong> is then a unit of work that is run on a single partition, on a single executor.</p> <p>Broadly speaking, there are two types of work: <strong>transformations</strong> and <strong>actions</strong>. A <strong>transformation</strong> is anything that creates a new dataset (filter, map, sort, group by, join, etc.). An <strong>action</strong> is anything that triggers the actual execution<sup>4</sup> of your Spark code (count, collect, write, top, take).</p> <p>If we look at the following PySpark code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">event_logs_df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'event_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">event_dimension_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'event_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'user_id'</span><span class="p">,</span> <span class="s">'event_at'</span><span class="p">,</span> <span class="s">'event_type'</span><span class="p">])</span> <span class="p">.</span><span class="n">collect</span><span class="p">()</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">join</code>, and <code class="language-plaintext highlighter-rouge">select</code> are all <strong>transformations</strong> and <code class="language-plaintext highlighter-rouge">collect</code> (which asks for all executors to send their data back to the driver) is an <strong>action</strong>.</p> <p>An action triggers a <strong>job</strong>, which is a way 
to group together all the <strong>tasks</strong> involved in that computation. A <strong>job</strong> will consist of a collection of <strong>stages</strong>, which are in turn a collection of <strong>transformations</strong>. A new <strong>stage</strong> gets created whenever there is a <strong>shuffle</strong>.</p> <p>A <strong>shuffle</strong> is a mechanism for redistributing data so that it’s grouped differently across partitions. <strong>Shuffles</strong> are required by sort-merge joins, sort, groupBy, and distinct operations. If you think about making a distributed join work, you can imagine that you’d need to re-distribute (shuffle) your data such that all records with the same join key(s) are written to the same <strong>partition</strong> (and consequently the same <strong>executor</strong>). Only once these records are living on the same machine can Spark do the corresponding join to match the records in each dataset. <strong>Shuffles</strong> are complex &amp; costly operations since they involve serializing and copying data across <strong>executors</strong> in a cluster.</p> <p>Let’s try and ground all this in some examples.</p> <h1 id="example-1-aggregating-transaction-amounts-by-app">Example 1: Aggregating transaction amounts by app</h1> <h2 id="sample-code">Sample Code</h2> <p>Imagine we have a dataset that contains 1 row per transaction. Each transaction has some information about it, like when it was <code class="language-plaintext highlighter-rouge">created_at</code>, the <code class="language-plaintext highlighter-rouge">api_client_id</code> that was responsible for the transaction, and the <code class="language-plaintext highlighter-rouge">amount</code> (# of units) that were processed in the transaction.</p> <p>Say we want to bucket these <code class="language-plaintext highlighter-rouge">api_client_ids</code> into a particular <code class="language-plaintext highlighter-rouge">app_grouping</code> and see how much each <code class="language-plaintext highlighter-rouge">app_grouping</code> has processed since 2020-01-01. 
Written in SQL, this would look something like this:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">WITH</span> <span class="n">trxns_cleaned</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="k">CASE</span> <span class="k">WHEN</span> <span class="n">api_client_id</span><span class="o">=</span><span class="mi">123</span> <span class="k">THEN</span> <span class="s1">'A'</span> <span class="k">WHEN</span> <span class="n">api_client_id</span> <span class="k">IN</span> <span class="p">(</span><span class="mi">456</span><span class="p">,</span> <span class="mi">789</span><span class="p">)</span> <span class="k">THEN</span> <span class="s1">'B'</span> <span class="k">ELSE</span> <span class="s1">'C'</span> <span class="k">END</span> <span class="k">AS</span> <span class="n">app_grouping</span><span class="p">,</span> <span class="n">amount</span> <span class="k">FROM</span> <span class="n">transactions</span> <span class="k">WHERE</span> <span class="n">created_at</span> <span class="o">&gt;=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2020-01-01'</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="n">app_grouping</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span> <span class="k">AS</span> <span class="n">amount_processed</span> <span class="k">FROM</span> <span class="n">trxns_cl</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span> </code></pre></div></div> <p>And the corresponding PySpark code could look like this (assuming we’ll write the final results to disk somewhere as a set of Parquet files):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">trxns_cleaned</span> <span class="o">=</span> <span class="p">(</span> <span class="n">df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'created_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">withColumn</span><span class="p">(</span> <span class="s">'app_grouping'</span><span class="p">,</span> <span class="n">F</span><span class="p">.</span><span class="n">when</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'api_client_id'</span><span class="p">)</span> <span class="o">==</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="mi">123</span><span class="p">),</span> <span class="s">'A'</span><span class="p">)</span> <span class="p">.</span><span class="n">when</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'api_client_id'</span><span class="p">).</span><span class="n">isin</span><span class="p">([</span><span class="mi">456</span><span class="p">,</span> <span class="mi">789</span><span class="p">]),</span> <span class="s">'B'</span><span class="p">)</span> <span class="p">.</span><span 
class="n">otherwise</span><span class="p">(</span><span class="s">'C'</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">trxns_cleaned</span> <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'app_grouping'</span><span class="p">)</span> <span class="p">.</span><span class="n">agg</span><span class="p">(</span> <span class="n">F</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="s">'amount'</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'amount_processed'</span><span class="p">)</span> <span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'app_grouping'</span><span class="p">,</span> <span class="s">'amount_processed'</span><span class="p">])</span> <span class="p">)</span> <span class="n">output</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">"result.parquet"</span><span class="p">)</span> </code></pre></div></div> <h2 id="execution-overview">Execution Overview</h2> <p>The <code class="language-plaintext highlighter-rouge">output.write</code> line above is an <strong>action</strong>, which will trigger the <strong>job</strong> represented below. In this example job, we can see that Spark will read a bunch of files from cloud storage. Each file maps to one <strong>partition</strong>, the default behaviour in Spark. Our example job has two stages due to the shuffle required by the <code class="language-plaintext highlighter-rouge">groupBy</code> transformation.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-1-full.png" /> </p> <h3 id="stage-1">Stage 1</h3> <p>In the first stage, we can see four different <strong>tasks</strong> being performed on each partition:</p> <ul> <li><strong>FileScan</strong>: this operation the reads the selected columns from the file<sup>5</sup> into memory</li> <li><strong>Filter</strong>: Remove any transactions created before 2020-01-01</li> <li><strong>Project</strong>: Select the columns we care about and create the new <code class="language-plaintext highlighter-rouge">app_grouping</code> column</li> <li><strong>HashAggregate</strong>: An initial aggregation that occurs on each partition prior to shuffling, as part of the <code class="language-plaintext highlighter-rouge">groupBy app_grouping</code> operation. 
<p align="center"> <img width="75%" src="/images/spark-from-100ft/example-1-part-1.png" /> </p> <p>You can see what some example data looks like in a single <strong>partition</strong> after each <strong>task</strong> (transformation) is performed on it:</p> <p align="center"> <img width="85%" src="/images/spark-from-100ft/example-1-part-1-w-data.png" /> </p> <h3 id="shuffle--stage-2">Shuffle + Stage 2</h3> <p>In order to aggregate all the transaction amounts processed by each <code class="language-plaintext highlighter-rouge">app_grouping</code>, we need to first perform a <strong>shuffle</strong> to move all records for each <code class="language-plaintext highlighter-rouge">app_grouping</code> across all <code class="language-plaintext highlighter-rouge">partitions</code> in stage 1 onto the same <code class="language-plaintext highlighter-rouge">partition</code> in stage 2. Because partitions will live on different <strong>executors</strong>, this <strong>shuffle</strong> will have to distribute data across the network. Additionally, the new partitions must be small enough to fit on a single executor.<sup>6</sup></p> <p align="center"> <img width="75%" src="/images/spark-from-100ft/example-1-part-2-w-executors.png" /> </p> <p>This is best understood by looking at some example data. You can imagine that each partition will contain data for all three <code class="language-plaintext highlighter-rouge">app_groupings</code>: A, B and C. All the A’s need to get sent to the same partition, all the B’s to another partition, etc. Once the data has been distributed into these new partitions, a final <code class="language-plaintext highlighter-rouge">HashAggregate</code> step can be performed to finish summing the <code class="language-plaintext highlighter-rouge">amounts</code> processed by each <code class="language-plaintext highlighter-rouge">app_grouping</code>. A final <code class="language-plaintext highlighter-rouge">Project</code> transformation is applied to select the desired columns prior to writing the results back to disk.</p> <p align="center"> <img width="85%" src="/images/spark-from-100ft/example-1-part-2-w-data.png" /> </p> <h1 id="example-2-enrich-a-set-of-user-events-in-a-particular-timeframe">Example 2: Enrich a set of user events in a particular timeframe</h1> <h2 id="sample-code-1">Sample Code</h2> <p>Let’s pretend we work at <del>Facebook</del> Meta and have a dataset of <code class="language-plaintext highlighter-rouge">user_event_logs</code>, which contains 1 row for every user event. The user events are categorized by an <code class="language-plaintext highlighter-rouge">event_id</code>, which can be looked up in another dataset we’ll call <code class="language-plaintext highlighter-rouge">user_event_dimension</code>. For example, <code class="language-plaintext highlighter-rouge">event_id = 1</code> may be a “Like” and <code class="language-plaintext highlighter-rouge">event_id = 2</code> could be a “Post”.</p> <p>We want to create a dataset with all user events since 2020-01-01. Instead of seeing the <code class="language-plaintext highlighter-rouge">event_id</code>, we want to see the actual <code class="language-plaintext highlighter-rouge">event_type</code> so we’ll join to the <code class="language-plaintext highlighter-rouge">user_event_dimension</code> to enrich our dataset.</p>
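<p>Before diving into the query, here’s a rough sketch of how the two input DataFrames used below might get set up. The <code class="language-plaintext highlighter-rouge">SparkSession</code> boilerplate and the Parquet paths are placeholders assumed purely for illustration, not part of the original example:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations for the two datasets described above
user_event_logs_df = spark.read.parquet("s3://some-bucket/user_event_logs/")
user_event_dimension_df = spark.read.parquet("s3://some-bucket/user_event_dimension/")
</code></pre></div></div>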
<p>Here’s what this data pull would look like in plain SQL:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">WITH</span> <span class="n">cleaned_logs</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">event_id</span><span class="p">,</span> <span class="n">event_at</span> <span class="k">FROM</span> <span class="n">user_event_logs</span> <span class="k">WHERE</span> <span class="n">event_at</span> <span class="o">&gt;=</span> <span class="nb">TIMESTAMP</span><span class="s1">'2020-01-01'</span> <span class="p">)</span> <span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">event_at</span><span class="p">,</span> <span class="n">event_type</span> <span class="k">FROM</span> <span class="n">cleaned_logs</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">user_event_dimension</span> <span class="k">ON</span> <span class="n">cleaned_logs</span><span class="p">.</span><span class="n">event_id</span><span class="o">=</span><span class="n">user_event_dimension</span><span class="p">.</span><span class="n">event_id</span> </code></pre></div></div> <p>And the corresponding PySpark:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">output</span> <span class="o">=</span> <span class="p">(</span> <span class="n">user_event_logs_df</span> <span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'event_at'</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">F</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'2020-01-01'</span><span class="p">))</span> <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">user_event_dimension_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'event_id'</span><span class="p">)</span> <span class="p">.</span><span class="n">select</span><span class="p">([</span><span class="s">'user_id'</span><span class="p">,</span> <span class="s">'event_at'</span><span class="p">,</span> <span class="s">'event_type'</span><span class="p">])</span> <span class="p">.</span><span class="n">collect</span><span class="p">()</span> <span class="p">)</span> </code></pre></div></div> <h2 id="execution-overview-1">Execution Overview</h2> <p>The <code class="language-plaintext highlighter-rouge">.collect</code> line above is an <strong>action</strong>, which will trigger the <strong>job</strong> represented below. Our example job has three stages: one for each dataset, and one post-shuffle stage for the join. Similar to the <code class="language-plaintext highlighter-rouge">groupBy</code> in the previous example, all data for each join key needs to be co-located on the same executor in order to perform the operation. In this example, that means all <code class="language-plaintext highlighter-rouge">event_id</code>s from each dataset must get sent to the same executor.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-2-no-broadcast.png" /> </p>
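<p>As a quick aside, none of the transformations chained above do any work on their own; everything is deferred until the <code class="language-plaintext highlighter-rouge">.collect()</code> action fires (see note 4 below). A minimal sketch of the same pipeline, written out step by step, makes this easier to see:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each of these returns immediately -- Spark is only building up a plan
filtered = user_event_logs_df.filter(F.col('event_at') &gt;= F.lit('2020-01-01'))
joined = filtered.join(user_event_dimension_df, on='event_id')
enriched = joined.select(['user_id', 'event_at', 'event_type'])

# Only now does Spark read the files, shuffle, join and send rows back to the driver
rows = enriched.collect()
</code></pre></div></div>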
<p>You can see how this shakes out below with some example data. Each dataset is read, with the <code class="language-plaintext highlighter-rouge">user_event_logs</code> dataset (in green) being filtered. After the shuffle, all the “Likes” are sent to the same executor, along with all the “Posts” and “Shares”. Once they are co-located, the join can happen and our final dataset with the new set of columns (<code class="language-plaintext highlighter-rouge">user_id</code>, <code class="language-plaintext highlighter-rouge">event_type</code> and <code class="language-plaintext highlighter-rouge">event_at</code>) can be sent back to the driver for further analysis.</p> <p align="center"> <img width="100%" src="/images/spark-from-100ft/example-2-no-broadcast-w-data.png" /> </p> <h1 id="notes">Notes</h1> <p><sup>1</sup> In most batch Spark applications, the driver doesn’t actually read or process the data. It may do things like index your filesystem to find out how many files exist in order to figure out how many partitions there will be, but the actual reading and processing of the data will happen on the executors. In a common Spark ETL job, data or results will generally never come back to the driver. Some exceptions to this are things like broadcast joins, or intermediate operations that calculate results used later in the job (e.g. calculating an array of frequently occurring values and then using those in a downstream filter/operation), since these operations send data back to the driver.</p> <p><sup>2</sup> Oftentimes, you’ll actually have multiple executors living in containers on the same compute instance, so they aren’t actually their own physical computers, but instead virtual ones.</p> <p><sup>3</sup> An executor is shown as having its own disk space in the diagram, but again, because multiple executors may live on the same host machine, this will not always be true.</p> <p><sup>4</sup> Spark code is lazily evaluated. This means that your code won’t actually execute until you intentionally call a particular <strong>action</strong> that triggers the evaluation. Some advantages of this are described <a href="https://stackoverflow.com/questions/38027877/spark-transformation-why-is-it-lazy-and-what-is-the-advantage">here</a>.</p> <p><sup>5</sup> With popular file formats like Parquet, you can read in only the columns you care about, rather than reading in all columns (which happens when you read a CSV or any plain text file).</p> <p><sup>6</sup> In this diagram it looks like each executor only gets 1 partition in some cases. In reality this will not be the case, and it would be really inefficient. Executors will hold and process many partitions.</p>ianwhitestoneIn the trenches with Spark2021-11-02T00:00:00+00:002021-11-02T00:00:00+00:00https://ianwhitestone.work//spark-trenches