<![CDATA[Hardcoded Variable]]>https://kunaljubce.substack.comhttps://substackcdn.com/image/fetch/$s_!tByI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8cabe0-929d-475f-aaf1-eb2dd2775e61_240x240.pngHardcoded Variablehttps://kunaljubce.substack.comSubstackFri, 24 Apr 2026 21:37:24 GMT<![CDATA[Snowflake Northstar Conference: Key Takeaways]]>https://kunaljubce.substack.com/p/snowflake-northstar-key-takeawayshttps://kunaljubce.substack.com/p/snowflake-northstar-key-takeawaysMon, 01 Jul 2024 15:57:13 GMT

As the rest of the world marched out of the first half of 2024, we - the software engineers of Hyderabad - attended the inaugural edition of the Snowflake Northstar conference, held in Hyderabad on June 29. Paraphrasing Snowflake’s website -

Snowflake’s first Northstar Conference helps you expand your skills, explore new tools through hands-on workshops with a chance to earn a badge, and participate in interactive sessions led by experts as they discuss emerging trends shaping the future of data.

Eight hours’ worth of keynotes, breakout sessions, and hands-on workshops later, we emerged from the venue with a better understanding of Snowflake’s ecosystem in general and a host of key takeaways in particular.

What is Snowflake?

For the uninitiated, Snowflake started off in 2012 as a cloud data warehouse and has since evolved into an end-to-end “data-as-a-service” company that lets users store, analyze, and transform data. Users can further develop native and interactive applications, self-service dashboards, and machine learning models, all while scaling compute as needed for the volume of data under processing. Its main USP has always been the separation of compute and storage, which unleashed a whole new paradigm of data processing and application development in the industry.

Covering the nitty-gritty of Snowflake’s services is beyond the scope of this post; you would be best served learning more about them on their official website. Below, we will focus solely on the key takeaways from the Snowflake Northstar conference.

Northstar Takeaways

Snowflake’s foray into observability with Snowflake Trail

Snowflake Trail is a set of capabilities that will allow developers, data engineers, and data ops teams to get enhanced visibility into data quality, pipelines and applications. Built on top of existing observability foundations such as Task History and Dynamic Tables observability, Trail promises to provide effortless telemetry with a view to reducing “time to detect” (TTD) and “time to resolution” (TTR) by helping teams detect and debug issues near instantly. Much of this is still in PrPr (Private Preview), so I am definitely looking forward to this being generally available (GA).

Snowflake Trail, encompassing a set of observability experiences in Snowflake

My 2 cents

Definitely an exciting feature to look forward to. Given that their documentation further promises seamless integration with popular observability tools like Grafana, Monte Carlo, Datadog, etc., I am curious to see what level of monitoring can be natively set up on Snowflake going forward.

Thanks for reading Hardcoded Variable! If you like this, subscribe for free to receive new posts and support my work.

Serverless Tasks

While this feature has been generally available for a while now, I was introduced to it at the conference. The essence of serverless tasks is being able to rely on Snowflake itself to manage and allocate compute resources. From the official documentation -

Snowflake automatically resizes serverless compute resources as required for each workload. Snowflake determines the ideal size of serverless compute resources for a given run based on a dynamic analysis of statistics for the most recent runs of the same task.

This has its own set of limitations though. Serverless tasks can’t invoke -

  • UDFs that contain Java or Python code.

  • Stored procedures written in Scala (using Snowpark) or stored procs which call UDFs that contain Java or Python code.

My 2 cents

While this could make sense for tasks that call compute-intensive stored procedures or UDFs, the incompatibility with Java- and Python-based UDFs and stored procedures severely limits its usefulness.

Snowpark - Native App development and Data Engineering

Deep diving into Snowpark was one of my key objectives for attending the conference, and this is where I reaped the biggest reward. The Snowpark show started right from the keynote, where we were introduced to the Snowpark architecture and how it helps process data at scale.


We then joined the hands-on lab for Native App Development on Snowpark, where we got a sneak peek into using the Snowflake CLI to bundle all dependencies and source code together and deploy a Streamlit app from local to Snowflake. This was followed by another post-lunch hands-on workshop where we deployed a data engineering project on Snowpark using the same Snowflake CLI.

My 2 cents

A blast of a session, where we learned how to use Snowpark and the Snowflake CLI to build and deploy applications at scale, seamlessly. While two hours is a very short time to understand the ins and outs, we got enough high-level detail to plan end-to-end application development, and I am definitely fired up to try this out.

Snow on Snow

This was a session where the Snowflake data engineering team demonstrated how they built a complete governance solution for their Data Operations team to track pipeline failures, integrating pipeline status logs from tables, on-call roster schedules from PagerDuty, and job-criticality information from look-up tables. This information is processed round the clock to identify and categorize failures per their defined priority and reduce TTR (time to resolution).

The cherry on the cake was feeding the failures into Snowflake’s very own Cortex ML API to get actionable fix recommendations and summarized failure notifications, so that ops teams don’t have to spend a lot of time analyzing the issue and identifying the root cause, and can instead focus on solving the problem and restoring the status quo.

My 2 cents

For a solution that runs 24x7, I have my reservations about the cost implications, given the notoriously stealthy and expensive Snowflake compute. Also, how would this work during Snowflake downtimes? As a lead engineer or engineering manager, I would be squeamish about a solution that monitors critical external processes, such as Airflow jobs, yet is not highly available beyond the 99.9% mentioned in their own report from 2021. I did not find a revision to this report since then.

Was it worth the effort?

It most certainly was! While some of their offerings - such as serverless tasks, Streamlit in Snowflake, and newer features like Trail - still have a long way to maturity, they continue to evolve Snowpark and the developer experience around it. Their aggressive approach to building greater outreach with the data engineering community also struck a chord with me, and I will watch their product releases with greater interest in the coming days.

Further Reading:

Observability in Snowflake: A New Era with Snowflake Trail

Snowflake Trail for Observability

How dynamic tables work | Snowflake Documentation

Introducing Snowflake CLI | Snowflake Documentation

Serverless Tasks on Snowflake

Data Apps Availability Whitepaper (snowflake.com)


]]>
<![CDATA[Database Change Management - The Alembic way - Part 2]]>https://kunaljubce.substack.com/p/dcm-alembic-part2https://kunaljubce.substack.com/p/dcm-alembic-part2Sat, 08 Jun 2024 15:16:19 GMTRecap

In Part 1 of this series, we explored the need for Database Change Management by navigating a worst-case scenario on an application release day. We then ventured into Alembic as a DCM tool, set it up on our local system, and executed our first database change via migration files to create the account table.

If you are new to this space and have never worked with Alembic or similar Database Change Management tools, I strongly suggest you click the link below and go through Part 1 before delving into more advanced uses of Alembic -

Moving forward

In this post, we will look into a few more migrations and venture into the promised land of seamless rollback/downgrade using Alembic.

Generate migrations and apply…again

Let’s now create another table called customers. Let’s summon the powers of alembic once more to generate a migrations file and customize it thereafter -

λ git main* → pipenv run alembic revision -m "create customers table"
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/versions/bfa17a2f890c_create_customers_table.py ...  done

Notice something interesting? Look closely at line 16, i.e. down_revision, and compare it with the down_revision in the previous file.

  • The down_revision in the first file was set to None, indicating that this is the first migration file, i.e. the base of the revision chain.

  • The down_revision in the second file holds the revision id of the first migration file, i.e. 07dccd222d6a, indicating the lineage of the migrations.
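This lineage is what lets Alembic order migrations without relying on file names. As a rough illustration (plain Python, not Alembic internals), we can reconstruct the execution order of the two files above by following the down_revision pointers, starting from None:

```python
# Toy model of Alembic's revision lineage: each migration points at its
# parent via down_revision, and the chain starts at None (the base).
migrations = [
    {"revision": "bfa17a2f890c", "down_revision": "07dccd222d6a",
     "message": "create customers table"},
    {"revision": "07dccd222d6a", "down_revision": None,
     "message": "create account table"},
]

def order_by_lineage(migs):
    """Return migrations in the order they must be applied."""
    by_parent = {m["down_revision"]: m for m in migs}
    ordered, parent = [], None
    while parent in by_parent:
        current = by_parent[parent]
        ordered.append(current)
        parent = current["revision"]
    return ordered

for m in order_by_lineage(migrations):
    print(m["revision"], m["message"])
# 07dccd222d6a create account table
# bfa17a2f890c create customers table
```

Note how the file order on disk doesn’t matter; the down_revision pointers alone determine the sequence.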

We will see how this all ties up in a little bit. For now, let’s apply this migration -

λ git main* → pipenv run alembic upgrade head
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade 07dccd222d6a -> bfa17a2f890c, create customers table

Wham bam, van Damme! Not only do we see our customers table created as expected, but we also see the version_num column in the alembic_version table updated to the latest revision id.


Trial by fire…or in this case, ALTER

Let’s do a third, and final, migration to apply an ALTER statement to the account table -

λ git main* → pipenv run alembic revision -m "alter account table"
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/versions/696ee3ca71f6_alter_account_table.py ...  done

Using the ORM, we rename name to bank_name - typically the kind of quick change that gets missed while checking in code changes! Applying the migration, we can see the change reflected as desired -
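The upgrade()/downgrade() pair for a rename is symmetric. Here is a sketch of what the two functions might look like - the real migration file uses `from alembic import op`, while this self-contained version uses a minimal stand-in `op` that just records the DDL calls, so the names beyond "account", "name", and "bank_name" are illustrative:

```python
class FakeOp:
    """Stand-in for alembic.op that records DDL calls instead of executing them."""
    def __init__(self):
        self.calls = []
    def alter_column(self, table_name, column_name, new_column_name=None):
        self.calls.append(("alter_column", table_name, column_name, new_column_name))

op = FakeOp()

def upgrade():
    # Rename account.name to account.bank_name.
    op.alter_column("account", "name", new_column_name="bank_name")

def downgrade():
    # Exact inverse of upgrade(): put the old column name back.
    op.alter_column("account", "bank_name", new_column_name="name")

upgrade()
print(op.calls[-1])  # ('alter_column', 'account', 'name', 'bank_name')
```

The point to internalize: every upgrade step must ship with a downgrade step that undoes it exactly, or the rollback chain breaks.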

λ git main* → pipenv run alembic upgrade head                     
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade bfa17a2f890c -> 696ee3ca71f6, alter account table

Downgrade/Rollback changes

Now that we have integrated alembic into our database, let’s look at how we can roll back changes. Personally, I consider this - the ability to roll back changes seamlessly - a truly distinguishing feature and a lifesaver in production settings.

Say, we want to undo our ALTER change -

λ git main* → pipenv run alembic downgrade -1                     
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running downgrade 696ee3ca71f6 -> bfa17a2f890c, alter account table

Swift as a squirrel!

We can also roll back multiple steps at a time. Let’s roll back to the start of our project, when we did not have any tables -

λ git main* → pipenv run alembic downgrade base
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running downgrade bfa17a2f890c -> 07dccd222d6a, create customers table
INFO  [alembic.runtime.migration] Running downgrade 07dccd222d6a -> , create account table

Tables dropped and alembic_version truncated.

Closing Thoughts

As we see, integrating Alembic into our database migrations strategy lets us seamlessly upgrade and downgrade our databases with single commands, all the while maintaining a clear audit trail of changes in the repository. As our applications become more and more microservice-oriented, it becomes even more crucial to have iron-clad control of our databases. Tools like alembic can go a long way toward easing those nightmares and putting the ghosts of release day to bed.

]]>
<![CDATA[Database Change Management - The Alembic way - Part 1]]>https://kunaljubce.substack.com/p/dcm-alembic-part1https://kunaljubce.substack.com/p/dcm-alembic-part1Sun, 02 Jun 2024 04:48:00 GMTSetting the stage

It’s the most anticipated release day! You and your team have checked all your application code into the staging branch of your git repo. You have pushed through, meticulously resolved all bugs reported in lower environments, and received approval for the release. The DevOps team needs to sync the production database with staging before application scripts can be executed. For this, they take the list of checked-in DDL and DML statements and start executing them as a script. A number of commands execute successfully and then - Bam!

ORA-00903: invalid table name

What a nightmare! Someone missed checking in the DDL script that created a table. No pressure - you can still use the equivalent of a GET_DDL() function, provided by most DBMSs these days, to get the table’s DDL.

A second later, you realize the problem is actually way bigger than it looks. Your script had more than 200 DDL and DML statements, and there’s no way to know which one failed. Did it fail because of a missing table or a typo in the table name? Since this is production, you can’t simply undo and redo your steps. Nor is there a way to downgrade and remove everything executed so far.

If Hell ever existed on Earth, this must be what it feels like, right?

All of this nightmare can be avoided with better database change management. What in freezing hell is that?

Database Change Management (DCM)

It can be referred to as the end-to-end process of managing the lifecycle of a database as it changes or evolves over time. As this article summarizes it aptly -

Database Change Management (DCM) is a critical part of maintaining and managing a database. It involves tracking and controlling changes to the database, including both the structure of the database (such as the tables, columns, and relationships between data) and the data itself. The structure change is dictated via Data Definition Language (DDL), while the data change is dictated via Data Manipulation Language (DML).

In enterprise production settings, DCM is primarily overseen by a specialized DevOps or migrations team. They have a specific set of credentials and grants to execute the database change scripts, i.e. DDLs and DMLs, in the production environment, as provided to them by the development teams. With little business- or application-level understanding of these scripts and their dependencies, they have no way to identify a missing statement and restore the status quo.

In this series of posts, we will explore DCM using Alembic in detail and figure out its nuances in trying to keep our databases in sync and in control.

Ode to Alembic

Alembic is an open-source database change management and migrations tool developed by the authors of SQLAlchemy. It provides for the creation, management, and invocation of change management scripts for a relational database using SQLAlchemy as the underlying engine.

To migrate changes to a database using Alembic, we add a migration script for the intended change, update the model definition, perform the migration, and then start using the database under the migrated schema.

  • The migration script contains a set of ids, revision and down_revision, essentially a set of hashes that track the sequence of the scripts.

  • The revision id, as the name suggests, denotes the hash id of the current revision.

  • The down_revision denotes the hash id of the previous revision i.e. the revision id of the previous migration file in the sequence.

  • All our DDL and DML statements are placed in ORM syntax inside this migration script.

  • When we perform the migration for the first time, Alembic creates a table called alembic_version in the schema. Going forward, this table stores the last executed migration’s revision id.

  • After every subsequent migration applied to this database, Alembic updates the alembic_version table to hold only the last executed migration’s revision id.
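The single-row behaviour of alembic_version is easy to picture with a toy simulation (plain Python, not Alembic code) - each migration replaces, rather than appends to, the stored revision:

```python
# Simulate the alembic_version table: it only ever holds the
# revision id of the last migration applied.
alembic_version = []  # at most one row

def apply_migration(revision_id):
    alembic_version.clear()        # replace, never append
    alembic_version.append(revision_id)

apply_migration("07dccd222d6a")    # first migration
apply_migration("bfa17a2f890c")    # second migration
print(alembic_version)             # ['bfa17a2f890c'] - only the latest survives
```

This is why Alembic can always tell exactly where a database stands in the revision chain: one lookup in alembic_version, then walk the down_revision pointers.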

Lots of talk, show me the code!

Installation

We begin by installing alembic in our virtual environment, in this case, pipenv -

λ git main* → pipenv install alembic

Creating a virtualenv for this project...
Pipfile: /Users/a81045729/Documents/constant_variables/alembic-demo/Pipfile
Using default python from /opt/homebrew/Cellar/pipenv/2023.12.1/libexec/bin/python (3.12.2) to create virtualenv...
⠏ Creating virtual environment...created virtual environment CPython3.12.2.final.0-64 in 2713ms
  creator CPython3macOsBrew(dest=/Users/a81045729/.local/share/virtualenvs/alembic-demo-aZ2Jziiy, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, via=copy, app_data_dir=/Users/a81045729/Library/Application Support/virtualenv)
    added seed packages: pip==24.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

✔ Successfully created virtual environment!
Virtualenv location: /Users/a81045729/.local/share/virtualenvs/alembic-demo-aZ2Jziiy
Creating a Pipfile for this project...
Installing alembic...
Resolving alembic...
Added alembic to Pipfile's [packages] ...
✔ Installation Succeeded
Pipfile.lock not found, creating...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success!
Locking [dev-packages] dependencies...
Updated Pipfile.lock (2385f9483d502934bb4ff98ac15bc099128435489335087cfd4aa60a7a87d23d)!
Installing dependencies from Pipfile.lock (87d23d)...
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.

Let’s validate the alembic version to make sure the installation succeeded -

λ git main* → pipenv run alembic --version
alembic 1.13.1

Alembic init

Now that that’s done, we can start using alembic by creating the migration environment - a directory of scripts specific to a particular application. It is created once and then maintained along with the application’s source code. To create this migration environment, we run the alembic init command -

λ git main* → pipenv run alembic init alembic

  Creating directory '/Users/a81045729/Documents/constant_variables/alembic-demo/alembic' ...  done
  Creating directory '/Users/a81045729/Documents/constant_variables/alembic-demo/alembic/versions' ...  done
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/script.py.mako ...  done
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/env.py ...  done
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/README ...  done
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic.ini ...  done
  Please edit configuration/connection/logging settings in '/Users/a81045729/Documents/constant_variables/alembic-demo/alembic.ini' before
  proceeding.

Now we see a directory structure generated as below, with the exception of the files under the versions directory. These files, also called migration files, will be generated when we start generating migrations.

Configuration

We need to set up alembic to connect to our database. For this, we modify the alembic.ini file and update the sqlalchemy.url config parameter -

sqlalchemy.url = postgresql://username:password@localhost:5432/database

Here, since we have configured our Postgres database locally, its host is localhost with the default port of 5432. We can leave everything else as is and save the changes.


Generate migrations

To start generating migration files, we execute the alembic revision command -

λ git main* → pipenv run alembic revision -m "create account table"
  Generating /Users/a81045729/Documents/constant_variables/alembic-demo/alembic/versions/07dccd222d6a_create_account_table.py ...  done

As we can see, the file 07dccd222d6a_create_account_table.py is the same file visible in the versions/ directory in the above screenshot. Let’s check out this file’s contents -

Huh! A lot of Python code without any SQL statements!

Customize migrations

Worry not - this Python code is at the heart of how alembic works. For now, let’s ignore everything else and focus on the two functions:

  • upgrade() - Here we provide our DDL statement - in this case, creating the account table - using SQLAlchemy ORM syntax.

  • downgrade() - This does the exact opposite of upgrade(). So, if we create a table while upgrading, we drop the same table while downgrading.

Let’s fill out these two functions now, keeping everything else intact -
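The screenshot of the filled-in file hasn’t survived here, but a typical version looks roughly like the sketch below. The real file starts with `from alembic import op` and `import sqlalchemy as sa`; here a minimal stand-in `op` records the calls so the sketch runs standalone, and the column definitions are assumed for illustration (the post never shows them):

```python
class FakeOp:
    """Stand-in for alembic.op that records DDL calls instead of executing them."""
    def __init__(self):
        self.created, self.dropped = [], []
    def create_table(self, name, *columns):
        self.created.append(name)
    def drop_table(self, name):
        self.dropped.append(name)

op = FakeOp()

# revision identifiers, used by Alembic (as generated for this migration).
revision = "07dccd222d6a"
down_revision = None

def upgrade():
    op.create_table(
        "account",
        # In the real file, columns go here, e.g. (assumed):
        # sa.Column("id", sa.Integer, primary_key=True),
        # sa.Column("name", sa.String(100), nullable=False),
    )

def downgrade():
    # The exact inverse of upgrade(): drop the table we created.
    op.drop_table("account")

upgrade()
print(op.created)  # ['account']
```

Everything else generated by Alembic - the docstring, the revision identifiers - stays untouched; only the two function bodies are filled in.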

Before applying the migration, let’s check the state of our database, jerry. As evident below, it has only one empty table - alembic_version.

Apply migrations

Finally, it’s time to apply our first alembic migration and then check the status of our database. In the log and screenshot below, note how the account table is now created and the version_num in the alembic_version table is updated to the revision id -

λ git main* → pipenv run alembic upgrade head
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 07dccd222d6a, create account table

Huh, that was fun!

Closing Thoughts

We have only dipped our toes in the sea of alembic, and we can already visualize a lineage for the changes we apply to our databases. In the next post, we will dig deeper into Alembic’s capabilities, like managing ALTER statements and rolling back upgrades (also called downgrades).


]]>
<![CDATA[Snowguard - A Snowflake near-real time monitor for straggling queries]]>https://kunaljubce.substack.com/p/snowguard-a-snowflake-near-real-timehttps://kunaljubce.substack.com/p/snowguard-a-snowflake-near-real-timeThu, 11 Apr 2024 13:07:09 GMT

Setting the stage

This post is inspired by our implementation of a Straggling Query Monitor (I fancy the name - Snowguard) for a Fortune 50 company, a global behemoth in its domain. As the backbone of our data platform, our Snowflake instances host over 800 TB of data with a daily ingestion rate of over 6 TB, including full and incremental loads. These data-intensive computations are triggered by 300+ dbt workloads and 800+ Airflow jobs. All of this is developed, maintained, and monitored by a team of 250+ members, ranging from engineers to analysts to data scientists to techno-functional business folks.

In terms of monthly query analytics, we are talking about over 6 million queries (yes, 6+ million) and over USD 200,000 in incurred costs. Let that sink in! Add the fact that Snowflake virtual warehouses (those nifty compute engines for Snowflake queries) can run for days, weeks, and months at a stretch without terminating, and you have a recipe for disaster! True to form, we frequently found orphaned or abandoned queries running for hours and even days on end, only to be terminated with no outcome achieved, costing us thousands of dollars in wasted credits.

The pursuit of solutions

We started looking into off-the-shelf capabilities that would flag these straggler queries in near real time and help reduce our credit wastage. However, from native Snowflake solutions such as Resource Monitors to third-party SaaS offerings like Slingshot by Capital One, a common limitation stood out - they all surfaced information about long-running queries AFTER those queries had completed, directly breaching our “near real-time” requirement.

Seeds of Snowguard

We started working on a prototype solution with a few assumptions -

  1. It had to be a Snowflake native solution as we wanted to avoid increasing the chances of integration failure by adopting any third-party tool or technology.

  2. It didn’t matter how frequently our monitor had to run throughout the day to flag straggling queries.

  3. Action on flagged queries had to be taken manually, to prevent terminating legitimate long-running queries such as back-fills on large tables. Governance guardrails should not end up stifling actual progress.

  4. The alerts needed to be sent to a Slack channel for cross-team visibility and swift resolution.

Road to Revolution

Keeping these assumptions in mind, we started working on a parameterized stored procedure that would extract query information from the information_schema.query_history table function, based on filters passed in the procedure call. This function returns detailed information for all queries (running, succeeded, or failed) in the account within a specified time range.

SELECT
    warehouse_name,
    warehouse_size,
    database_name,
    schema_name,
    user_name,
    role_name,
    start_time,
    end_time,
    -- duration_sec is referenced below but its definition did not survive here;
    -- elapsed seconds since the query started is the likely intent.
    DATEDIFF('second', start_time, CURRENT_TIMESTAMP()) AS duration_sec,
    query_id,
    query_type,
    LEFT(query_text, 100) AS query_text,
    error_message,
    CURRENT_TIMESTAMP AS record_insert_time,
    'https://app.snowflake.com/abcbaabc/hola1234/#/compute/history/queries?user=ALL&uuid=' || query_id AS query_hist_url
FROM
    TABLE(information_schema.query_history(
        END_TIME_RANGE_START =>
            DATEADD('minutes', -TO_NUMBER(:P_LOOKBACK_MIN), CURRENT_TIMESTAMP()),
        RESULT_LIMIT => 10000))
WHERE duration_sec >= :P_DURATION_SEC
  AND execution_status = 'RUNNING'
ORDER BY duration_sec DESC;

In this query, P_LOOKBACK_MIN and P_DURATION_SEC are parameters passed into the stored procedure call. The results of this query are then aggregated into an ARRAY object and stored via bind variables:

SELECT
    COUNT(*),
    (SELECT CURRENT_ACCOUNT()),
    ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'warehouse_name', warehouse_name,
            'warehouse_size', warehouse_size,
            'database_name', database_name,
            'schema_name', schema_name,
            'user_name', user_name,
            'role_name', role_name,
            'start_time', start_time,
            'end_time', end_time,
            'duration_sec', duration_sec,
            'query_id', query_id,
            'query_type', query_type,
            'query_text', query_text,
            'error_message', error_message,
            'record_insert_time', record_insert_time,
            'query_hist_url', query_hist_url)
    ) INTO :row_count, :account_id, :result_arr
FROM (
    -- This is where we put the above query as a subquery
);

Now we check whether any records were returned from the subquery, i.e. len(result_arr) > 0. If yes, we insert these records into a table - SQM_RESULTS. We automated this procedure using Snowflake tasks to execute every 10 minutes.
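The flagging condition itself is simple and can be sketched in a few lines of Python (field names mirror the query above; the thresholds are the procedure’s parameters, and the sample data is made up for illustration):

```python
def flag_stragglers(queries, duration_threshold_sec):
    """Return queries still RUNNING past the duration threshold."""
    return [
        q for q in queries
        if q["execution_status"] == "RUNNING"
        and q["duration_sec"] >= duration_threshold_sec
    ]

# Hypothetical query_history rows: one straggler, one finished, one fresh.
queries = [
    {"query_id": "q1", "execution_status": "RUNNING",   "duration_sec": 7200},
    {"query_id": "q2", "execution_status": "SUCCEEDED", "duration_sec": 9000},
    {"query_id": "q3", "execution_status": "RUNNING",   "duration_sec": 30},
]
print([q["query_id"] for q in flag_stragglers(queries, 3600)])  # ['q1']
```

Note that q2 ran even longer than q1 but is excluded - it already completed, and the whole point of Snowguard is catching queries while they are still running.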

For the other half of the solution, we turned to Monte Carlo.

Monte Carlo is a data observability and lineage-tracking tool that provides a granular understanding of the health and quality of data across its lifecycle. By providing a wide range of data health monitors - freshness, volume, anomaly, and custom SQL - Monte Carlo enables organisations to take full control of their data quality.

For Snowguard, we set up Monte Carlo custom SQL monitors on the SQM_RESULTS table to create an alert for every new record inserted. These alerts are then sent to our team Slack channel via Slack notifications configured in Monte Carlo.

Eureka!

An anxious few days of wait followed and then voila! We had our first straggler query alert sent to our Slack channel.

This alert carries not just query metadata such as database, schema, and warehouse, but also the query history URL, which lets teams view the flagged query in a single click.

The curious case of false negatives

With our solution live and flagging alerts every time a query crossed the time threshold, our credit wastage was significantly reduced. But what followed was a long lull - no alerts in the channel for a few weeks on the trot. Was this because there were no straggling queries, or had our monitor stopped working?

Boy, those edge cases sneak up on you, eh?

Further debugging using Snowflake’s account_usage.query_history view revealed that there were indeed no long-running queries during this period. But now we had an interesting problem - setting up an alert if Snowguard itself stops working!

For this -

  1. We added another INSERT statement at the end of our stored procedure to record the status of every execution, along with its execution timestamp, in a new table - SQM_STATUS. Now, every 10 minutes, this table gets a new SUCCESS entry if everything ran as expected, or a FAILURE/missing entry if something failed along the way.

  2. We then set up another Monte Carlo custom SQL monitor to run every 15 minutes and check for an entry in the SQM_STATUS table with status = SUCCESS. Any instance where no results are returned is reported to the Slack channel as a possible health issue with Snowguard.

Voila - we now had an alert set up to track our beloved Snowguard’s health.
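The health check in step 2 above reduces to a freshness test on the SQM_STATUS table: is there a SUCCESS entry within the look-back window? A sketch in plain Python (the production check is a Monte Carlo custom SQL monitor; row shape and timestamps here are assumptions):

```python
from datetime import datetime, timedelta

def snowguard_healthy(status_rows, now, window_minutes=15):
    """True if any SUCCESS entry falls inside the look-back window."""
    cutoff = now - timedelta(minutes=window_minutes)
    return any(
        row["status"] == "SUCCESS" and row["ts"] >= cutoff
        for row in status_rows
    )

now = datetime(2024, 4, 1, 12, 0)
rows = [{"ts": datetime(2024, 4, 1, 11, 50), "status": "SUCCESS"}]
print(snowguard_healthy(rows, now))   # True
print(snowguard_healthy([], now))     # False - alert the Slack channel
```

The window (15 minutes) is deliberately wider than the task cadence (10 minutes), so one delayed run doesn’t trigger a false alarm but a genuinely dead monitor does.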

Architecture diagram

Having delved into the technical details of the Snowguard implementation, I feel that closing this newsletter without an architecture diagram would leave it incomplete. The diagram shows the control flow through the various components and should help you design similar solutions at your workplace.

Do you have comments, questions, or suggestions on how to make this better? Any thoughts on alternate technology stacks that could enable similar or better governance on Snowflake?

Let me know your thoughts in the comments below!

]]>
<![CDATA[Six months of Data Platform!]]>https://kunaljubce.substack.com/p/six-months-of-data-platformhttps://kunaljubce.substack.com/p/six-months-of-data-platformSun, 19 Nov 2023 08:01:56 GMTMay 2023 marked a big change in my professional life when I transitioned from a data engineer to a data platform engineer. As I traded my well-acquainted path for an obscure trail, I was more apprehensive about my skills than ever before. Having survived six months of this phenomenal change, I will lay out the key changes and realizations so that the next generation of platform engineers has a better understanding of what they are getting into.

Before we delve any deeper, let’s answer the basic question - How can we define data engineering? Quoting Joe Reis and Matt Housley from their book, “Fundamentals of Data Engineering” -

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

Ok! Then how do we define data platform engineering? Quoting and rephrasing extracts from two articles by VMware and Monte Carlo -

Data platform engineering is the design, development, and maintenance of the infrastructure (also called platform or data platform) that supports storage, processing, and analysis of data. This data platform serves as the central repository and processing hub for an organization’s entire data ecosystem, and plays a crucial role in collecting, cleansing, transforming, and applying data to derive valuable business insights.

In this post, I will narrate the difference between a data platform engineer and a data engineer - how they are very different in what they do, yet very similar in how they connect.

Thanks for reading Hardcoded Variable! Subscribe for free to receive new posts and support my work.

Different KPIs

KPIs, or Key Performance Indicators, are the benchmarks against which every professional's or team's performance is measured.

Some KPIs of data engineering are data quality, data consistency, and robustness of data pipelines. Making sure that transactions on member accounts or updates to customer data are consistently propagated across all applications is a "must-achieve" benchmark for every data engineering team in any industry. Other KPIs include performance-optimized data processing, effective management of sensitive data, and availability of near-real-time data.
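
The data-consistency KPI can be made concrete with a small sketch. The check below is purely illustrative - the record shapes, the `account_id` key, and the function name are all hypothetical, and a real pipeline would compare keyed snapshots at scale rather than Python lists - but it shows the essence of verifying that updates propagated downstream:

```python
# Illustrative data-quality check: verify that every account update in the
# source system also landed in the downstream application's store.
# (Hypothetical record shapes; real pipelines compare keyed snapshots at scale.)

def find_unpropagated_updates(source_records, target_records, key="account_id"):
    """Return source keys that never reached the target system."""
    target_keys = {rec[key] for rec in target_records}
    return sorted(rec[key] for rec in source_records if rec[key] not in target_keys)

source = [{"account_id": 1}, {"account_id": 2}, {"account_id": 3}]
target = [{"account_id": 1}, {"account_id": 3}]

missing = find_unpropagated_updates(source, target)
print(missing)  # [2]
```

A check like this would typically run as a post-load validation step, alerting the team when the missing list is non-empty.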

The KPIs of a data platform team, on the other hand, are user delight, platform security, and robustness - for example, a multi-cluster Spark development environment that a new team member can easily set up, or team-based access control for applications containing sensitive data. Other KPIs include application cost optimization, establishment of governance practices, and automation.

To understand this difference a little better, we can refer to the "data value chain" conceptualized by Aurimas Griciūnas -

Same hammer, different nails

While DEs (Data Engineers) and DPs (Data Platform engineers) work with many of the same tools and technologies, their purposes and objectives are very different.

As a DE working with Spark or Airflow, I was concerned with consuming data from a variety of sources, cleansing it of noise and inconsistencies, transforming it per business needs, and loading it into different targets - all the while keeping an eye on performance. I also wrote tests to make sure the code base was not susceptible to major failures from further changes or bad data.
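
To make that DE workflow concrete, here is a deliberately simplified, plain-Python stand-in for such a pipeline. A real implementation would use Spark DataFrames and real source/target systems; every function and field name below is hypothetical, but the shape (extract, clean, transform, load) is the same:

```python
# Schematic extract-clean-transform-load flow. In a real DE setting each stage
# would be a Spark transformation; plain lists keep the example self-contained.

def extract():
    # Stand-in for reading from a source system (file, queue, database).
    return [
        {"customer": "alice", "amount": "120.50"},
        {"customer": "bob", "amount": None},        # noisy record
        {"customer": "carol", "amount": "89.99"},
    ]

def clean(records):
    # Drop records with missing amounts (the "noise and inconsistencies").
    return [r for r in records if r["amount"] is not None]

def transform(records):
    # Hypothetical business rule: cast amounts and flag large transactions.
    return [
        {**r, "amount": float(r["amount"]), "large": float(r["amount"]) > 100}
        for r in records
    ]

def load(records, target):
    # Stand-in for writing to a warehouse table or downstream topic.
    target.extend(records)
    return target

warehouse = load(transform(clean(extract())), target=[])
print(warehouse[0])  # {'customer': 'alice', 'amount': 120.5, 'large': True}
```

The tests I mention above would assert on exactly these stage boundaries - for instance, that `clean` never lets a null amount through to `transform`.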

As a DP working with the same technologies, I am more focused on building internal, user-facing data products. Some examples include upgrading Spark/Airflow to the newest version (to support the latest features, performance improvements, and security updates), setting up linting and automated tests in CI processes (for the pipelines developed by DE teams), developing source system connectors with an eye on code reusability, and providing a solid bedrock of dev/QA/prod environments for various teams to work in seamlessly.

Shared vision, different stakeholders, diverse peers

Both DE and DP teams are formed with a shared vision to use tools and technologies to aggregate data, add context to it to drive better decision making, and boost revenue and profits. The curated data serves application teams to build better applications for customers, business intelligence teams to develop better visualizations for actionable insights, or data science teams to develop better prescriptive and predictive models. Or, at best, a combination of all three!

However, the key stakeholders and peer groups of these two teams can differ widely. As a DE, I worked with business analysts to get a better understanding of the data, the application, and the business served, and with downstream BI/Data Science/Business teams to understand how best to transform the data to make it usable. Once done, we developed data pipelines while coordinating with QA and UAT teams for testing, and with platform/DevOps teams for deployment.

One of the main stakeholders for the data platform team is the data engineering team itself. We gather tech requirements from data teams, engage with third-party software vendors of suitable tools and technologies to discuss available and desired features, conduct PoCs (Proofs of Concept), propose solutions to the DE teams for final confirmation, and, on agreement, work with info-sec teams to review license agreements and security clearances. If and when the software is onboarded onto the platform architecture, we work with IAM teams to set up the initial access control systems, develop a framework for the DE teams to start working with, and plan the future roadmap of the software.

Looking behind the curtains

One of the biggest transitions as a DP has been working with applications I previously took for granted. For example, setting up the forms used to request access to different applications taught me about role-based vs. team-based access control. I worked with various IAM teams to implement authentication and authorization protocols and developed access control mechanisms for new users. I have also been working on CI/CD integrations for automated deployments and setting up governance mechanisms to monitor resource consumption and optimize costs.
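
The role-based vs. team-based distinction can be sketched in a few lines. The roles, teams, and resource names below are all hypothetical, but the lookup pattern captures the essential difference: one model keys permissions off a role the user holds, the other off the team that owns the resource:

```python
# Role-based: permissions attach to a role each user holds.
ROLE_PERMISSIONS = {
    "analyst": {"read_dashboard"},
    "engineer": {"read_dashboard", "deploy_pipeline"},
}

# Team-based: access attaches to the team that owns the resource.
TEAM_RESOURCES = {
    "payments-team": {"payments_db"},
    "growth-team": {"growth_db"},
}

def can_act(user_role, action):
    """Role-based check: does the user's role grant this action?"""
    return action in ROLE_PERMISSIONS.get(user_role, set())

def can_access(user_teams, resource):
    """Team-based check: does any of the user's teams own this resource?"""
    return any(resource in TEAM_RESOURCES.get(team, set()) for team in user_teams)

print(can_act("analyst", "deploy_pipeline"))         # False
print(can_access(["payments-team"], "payments_db"))  # True
```

In practice the choice shapes the access-request forms themselves: a role-based form asks "what do you need to do?", while a team-based form asks "which team's data do you need?".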

Final Thoughts

While my experience in this new role has been exciting, it has also helped fill application-level knowledge gaps and provided additional context to take on more challenging assignments in the future. A symbiotic relationship between data platform and data engineering teams is critical to driving efficient data-driven decision making and improving outcomes. As we move further toward microservices architectures and incorporate cutting-edge technologies into our data products to address the specific needs of increasingly complex systems, platform teams will assume greater and more critical responsibility for security, robustness, ease of use, and cost efficiency.

Let me know if you have any suggestions or questions in the comments.

P.S. Since you are here, you might be wondering - What made this guy start writing on Substack? Why should I follow him? Here’s a brief introduction about me, why Substack and what you can expect from me -


]]>
<![CDATA[A start to remember, a journey to reminisce!]]>https://kunaljubce.substack.com/p/a-start-to-remember-a-journey-tohttps://kunaljubce.substack.com/p/a-start-to-remember-a-journey-toMon, 10 Jul 2023 12:26:00 GMT
No, that’s not me!

Beginnings are hard.

In 2007, I began my high school journey in a new school and ended the year with the lowest score in Mathematics in my life, ever. In 2009, I joined my engineering college and scored in the 50th percentile in my class. In 2013, fresh out of graduation, I joined my first job at a civil consultancy firm for a salary of six thousand rupees (approx. $80) per month. In my first software job, multiple managers told me that I was not good enough and would not be able to make a career in this field.

Beginnings have always been hard for me. But if there’s one thing I have learnt -

It gets harder before it gets easier.

1. Why this, why now?

Why this? Every article I write here will be from the blueprint I have followed over the years (and continue to do so), firsthand. Be it an implementation detail, a technical deep-dive or an occasional post about football - I want to assure you that it is something I have personally encountered, worked on or analyzed.

Why now? At 32 years of age and approaching a decade of professional experience, I believe this is the right time for me to start contributing back to the community at large.

Because, while beginnings are hard, the journey should never be lonely!

2. What lies ahead?

Hopefully, a newsletter with a difference. Introducing -

Hardcoded Variable

3. What’s in it for my prospective readers?

Without you, my dear readers, this entire effort is pointless. You are at the center of every topic I write about. I will strive to provide you with crisp and curated content, without the fluff, with the sole objective that it helps you think, understand, learn, grow, realize, and contemplate. Regardless of the topic or category I write about, my motivation is (and will always be) to give you something that you probably won’t get anywhere else. Definitely not in one place.

Tech implementation articles will be centered around learnings, experiences, and techniques acquired and implemented first-hand, not the boilerplate, generic stuff crowding up your Google search results. Deep dives will be based on topics I have personally delved into and whose learnings I want to propagate further.

Most of all, it’s going to be open-ended. I am not the expert; I am one of you! My posts are intended to help you engage, think, and wonder.

The pre-match huddle:

Pictures show. Stories tell. Silence deafens. Experiences last a lifetime.

As I prepare the draft for my first post (after this one), I urge you all to be patient with me as I navigate the journey of content creation. Even with all my efforts, typos may creep in and inaccuracies are always a possibility. So always feel free to share feedback, add comments, ask questions, and correct me if I am wrong - I'm listening!

Last, but not least, please feel free to hit the "Subscribe now" button below if you want to stay connected for my next article.


Subscribe now


]]>