Bridging the gap between eras using Debezium and CDC
https://leevs.dev/bridging-the-gap-between-eras-using-debezium-and-cdc/
Mon, 27 May 2024 15:00:56 +0000

Achieving real-time data processing with legacy systems can be challenging. This post explores how Change Data Capture (CDC) and Debezium can transform your data pipeline. By moving from batch processing to real-time streaming, you can enhance data freshness, scalability, and responsiveness. Learn how Debezium and Kafka simplify data ingestion, enabling immediate action on critical events and reducing processing complexities. Discover the benefits of modernizing your legacy systems with CDC and Debezium.

The post Bridging the gap between eras using Debezium and CDC appeared first on Leev's.

In a data-driven world, data freshness is a major KPI for modern data teams. Fresh data is much more valuable, and leads to better and more accurate insights.

But not all systems were born equal, and when dealing with legacy systems, achieving that can be quite a challenge and might involve some scalability issues. How can you transform such a system into a reactive component?

Let’s start with the basics –

Batch vs Streaming

There’s a well-known saying about distributed systems: because of the complexity they bring, you should only use them when absolutely necessary. In a way, the same goes for data streaming, which is more complex than batch processing. In a batch system, we process known, finite-size inputs. We can always fix a bug in our code and replay the batch. But in a stream, we process unbounded datasets which have no beginning or end.

Having said that, from a business perspective, there are clear benefits to using streaming instead of batching, as in many cases a business event loses its value over time.

Batching vs Streaming – event time and action over time

A good example of this is fraud detection services – detecting fraudulent activities, such as credit card fraud or cybersecurity threats, requires immediate action to prevent further damage. Batch processing of transaction data introduces delays, during which fraudsters can continue their malicious activities. Streaming analytics enables real-time fraud detection systems to identify suspicious patterns as they occur, allowing for instant alerts and preventative measures.

The closer we can bring the action-taking to the business event, the better.

But what can you do if a legacy system is involved as part of the pipeline? A system that doesn’t “know” how to stream. In our case, an old ETL pipeline which is sourced from a legacy Oracle database.

Introducing Change Data Capture

One of the ways to overcome this problem is by leveraging a common data pattern called “Change Data Capture” (CDC). It allows us to transform any static database into an event stream we can act upon.

The CDC process involves listening for, identifying, and tracking changes occurring in a database, and creating an event based on them.

CDC can be implemented using three main methods – trigger-based, query-based, and log-based CDC.

A trigger-based CDC works by employing database triggers. In this approach, triggers are set up on tables within a database, and they are designed to activate automatically in response to specific data manipulation events, such as INSERT, UPDATE, or DELETE operations.

Query-based CDC works by periodically querying the source database for any modifications. It is mainly used when direct modification of the source database is not feasible.

The last, and most efficient, method is log-based CDC. This approach leverages database transaction logs to capture and track changes made to data in a source database. It provides a near-real-time and efficient way to identify modifications, additions, or deletions without putting a significant load on the source database.

Debezium

When dealing with CDC and databases, the most common tool for the job is Debezium.

“Debezium is a set of distributed services that capture row-level changes in your databases so that your applications can see and respond to those changes. Debezium records in a transaction log all row-level changes committed to each database table. Each application simply reads the transaction logs they’re interested in, and they see all of the events in the same order in which they occurred.”

Debezium documentation

Debezium is actively maintained and supports the most common databases in the world – MySQL, MongoDB, PostgreSQL, Oracle, Spanner, and more.

Although you can run Debezium as a standalone Debezium Server or inside your JVM-based code using Debezium Engine, the most common practice is to deploy it as a Connector on top of a Kafka Connect cluster to leverage the benefits of the platform like orchestration, parallelism, failure handling, monitoring, and more.

source: Debezium Architecture

Problem Statement

Our story begins with an old ETL pipeline that receives a dump of data every day; the dump is later loaded and processed on our data platform.

On the left side, we have the source backend which contains the Oracle database. Once a day, a new dump of the data is transferred to the landing bucket for processing.

On the right side, we have a service which holds the “Processing Logic”, that is, loading the data from the landing bucket, doing some data transformation and cleanup, making some business operations, and finally ingesting the result dataset to the Data Warehouse.

As data consumers, we have some ML Models, Data Analysts, and operational dashboards.

To begin with, we need to identify the problems with this solution:

As mentioned before, there are some obvious downsides to using such a method. As we deal with a daily batch of data, the freshness of our data is low.

Then, there are scalability issues and error handling. As companies continue to grow, data volumes keep piling up to a point where the batches we process might take way too long to complete. This affects both the operational and the business side – recovering from an error and re-processing a batch takes much longer over time, and our business decision-making is slower.

Now let’s make it better with Debezium!

The new architecture is fully based on Kafka, Kafka Connect and Debezium, to power (near) real-time ingestion to the Data Warehouse:

Instead of getting a daily dump of the data and processing it with an in-house service, we can leverage open source technologies which are developed and tested by thousands of engineers around the world.

In the new solution, the Debezium Oracle Connector is hosted on the Kafka Connect platform, which simplifies the orchestration, scaling, error handling, and monitoring of such services. The connector uses Oracle LogMiner to query the archived and online redo logs to capture the history of activity in the Oracle database.
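For illustration, registering such a connector against the Kafka Connect REST API might look roughly like this. Hostnames, credentials, topic prefix, and table names here are placeholders, and the exact property set depends on your Debezium version – check the connector documentation:

```json
{
  "name": "oracle-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.hostname": "oracle.example.internal",
    "database.port": "1521",
    "database.user": "dbzuser",
    "database.password": "********",
    "database.dbname": "ORCLCDB",
    "topic.prefix": "server1",
    "table.include.list": "DEBEZIUM.CUSTOMERS",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.oracle"
  }
}
```

POSTing this JSON to the Connect cluster's `/connectors` endpoint is all it takes to start capturing – the platform handles task placement, offsets, and restarts.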

Using Kafka Connect SMTs (more on that later), we can clean up and transform our CDC messages, and ingest them into Kafka.

To ingest the data into BigQuery and GCS, we can simply use other open-source Kafka connectors, which we can now scale independently.

The new pipeline is far more efficient, allows us to capture changes in real time, and reduces most of the complexity of the old pipeline. No service to maintain, battle-tested connectors and platform, a simple way to transform data, and the ability to independently scale each one of the components in the system.

Debezium Snapshot

On Oracle, the redo logs usually do not retain the complete history of the database. Consequently, the Debezium Oracle connector cannot reconstruct the entire database history from these logs. To allow the connector to create a baseline of the current database state, it takes an initial consistent snapshot of the database when it starts for the first time. This process can be controlled via a parameter called snapshot.mode.
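As a sketch, two commonly used modes look like this in the connector configuration (mode names have shifted between Debezium versions, so verify against the docs for your release):

```
"snapshot.mode": "initial"      -- take a consistent snapshot, then stream changes
"snapshot.mode": "schema_only"  -- capture only the table structure, then stream
```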

During the startup procedure, Debezium obtains a ROW SHARE MODE lock on each of the captured tables to prevent structural changes from occurring during the creation of the snapshot. Debezium holds the lock for only a short time.

Data Change Events

Once Debezium starts to capture changed data (CREATE, UPDATE, DELETE, TRUNCATE), the data change events will be emitted to Kafka. Every data change event that the Oracle connector emits has a key and a value.

An expected CREATE event might look like the following:

{
    "schema": {
        "type": "struct",
        "fields": [....],
        "optional": false,
        "name": "server1.DEBEZIUM.CUSTOMERS.Envelope"
    },
    "payload": {
        "before": null,
        "after": {
            "ID": 1004,
            "FIRST_NAME": "Anne",
            "LAST_NAME": "Kretchmar",
            "EMAIL": "[email protected]"
        },
        "source": {
            "version": "2.6.1.Final",
            "name": "server1",
            "ts_ms": 1520085154000,
            "ts_us": 1520085154000000,
            "ts_ns": 1520085154000000000,
            "txId": "6.28.807",
            "scn": "2122185",
            "commit_scn": "2122185",
            "rs_id": "001234.00012345.0124",
            "ssn": 1,
            "redo_thread": 1,
            "user_name": "user",
            "snapshot": false
        },
        "op": "c",
        "ts_ms": 1532592105975,
        "ts_us": 1532592105975741,
        "ts_ns": 1532592105975741582
    }
}

In some cases, this is exactly what you would need – a stream of changes that an external system can consume. In this case, I don’t really need the extra information, and I simply want to replicate those events into Kafka while doing some basic transformations.

To do that, we can leverage Single Message Transformations.

Single Message Transformations

Single Message Transformations (SMTs) are applied to messages as they pass through Kafka Connect. SMTs modify incoming messages after a source connector generates them but before they are written to Kafka. They also transform outgoing messages before they are sent to a sink connector.

You can find more about SMTs in Debezium or Confluent documentation, but SMTs can be as easy as:

"transforms": "ExtractField",
"transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
"transforms.ExtractField.field": "id"

The configuration snippet above shows how to use ExtractField to extract the field named "id".

Before: {"id": 42, "cost": 4000}

After: 42

In this case, we want to strip out most of the excess metadata and keep only the value of the “after” event – that is, the updated row – plus some extra fields like ts_ms and scn for tracking.

To do so, we simply use the unwrap SMT:

    transforms: unwrap
    transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
    transforms.unwrap.add.fields: source.ts_ms,ts_ms,scn

Now each one of our events in Kafka will consist of a key and a clean value.

Extend with a Custom SMT

Assume one of our downstream systems needs to make an aggregation based on the key of the messages, but it only has access to the message value. Can we fix that with an SMT? Yes!

Unfortunately, there is no SMT which can take the message key and insert it as a field in the value, but SMTs are built the same way as the whole Kafka Connect platform – predefined interfaces which are easy to extend.

Introducing – KeyToField – SMT for Kafka Connect / Debezium.

With this custom SMT that I built, we can do exactly that – add the message key as a field. The SMT can also handle complex keys with multiple fields in it, using the transforms.keyToField.field.delimiter variable that will be used when concatenating the key fields.

A custom SMT can be as easy as a single Java file implementing the relevant methods.

@Override
public R apply(R record) {
    if (record.valueSchema() == null) {
        return applySchemaless(record);
    } else {
        return applyWithSchema(record);
    }
}

The apply method gets the record and performs the action based on the record schema (or lack thereof).

We update the schema, put the new field as a string, and return the new record with the new schema and value:

private R applyWithSchema(R record) {
    ...
    updatedSchema = makeUpdatedSchema(value.schema());
    ...
    final String keyAsString = extractKeyAsString(record.keySchema(), record.key());
    updatedValue.put(fieldName, keyAsString);
    return record.newRecord(
            record.topic(),
            record.kafkaPartition(),
            record.keySchema(),
            record.key(),
            updatedSchema,
            updatedValue,
            record.timestamp()
    );
}

As simple as that.

Our final SMT configuration now looks like:

transforms: unwrap,insertKeyField
# Extract record
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.add.fields: source.ts_ms,ts_ms,scn
# Add key as field
transforms.insertKeyField.type: com.github.eladleev.kafka.connect.transform.keytofield.KeyToFieldTransform
transforms.insertKeyField.field.name: PK

With that, our pipeline is complete and relatively straightforward – Debezium generates the events from the source database, we transform the data using SMTs and ingest it into Kafka, and from there two connectors take care of the ingestion to BigQuery and GCS.

Conclusion

Dealing with legacy systems that rely on batch processing introduces significant challenges in terms of data freshness, scalability, and timely business decision-making.

The transition to a modern data architecture leveraging Change Data Capture (CDC) with Debezium and Kafka addresses these issues effectively. By capturing and streaming real-time changes from the legacy Oracle database to Kafka, we significantly enhance data freshness and reduce latency. This new architecture facilitates near-real-time data processing and decision-making, enabling businesses to respond more quickly.

In summary, the integration of Debezium and CDC into the data pipeline transforms a legacy batch processing system into a reactive, real-time data streaming architecture. This shift not only improves data freshness and scalability but also enhances the overall efficiency and responsiveness of the business, paving the way for more informed and timely decision-making.

How to collect Kafka configuration as metrics?
https://leevs.dev/how-to-collect-kafka-configuration-as-metrics/
Fri, 11 Dec 2020 14:27:43 +0000

There are a few useful configuration parameters that might be beneficial to collect when working with Kafka – but how?

The post How to collect Kafka configuration as metrics? appeared first on Leev's.

Motivation

Unlike some other distributed systems (e.g. Aerospike), Kafka does not expose its configuration as metrics.
It makes sense in a way: config parameters and metrics are not the same thing, and most of the time you cannot represent configurations (strings) as metrics.

But there are a few useful configuration parameters that might be beneficial to collect in order to improve the visibility and alerting over Kafka in particular.

A good example might be the log.retention.ms parameter per topic, which can be integrated into Kafka’s dashboards to extend their visibility, or into an alerting query to create smarter alerts or automation based on topic retention.

So how can we do so?

“Kafka Configs Metrics Exporter”

I decided to create a Prometheus exporter to collect those metrics, using Kafka’s AdminClient to quickly pull all the config parameters that can be represented as metrics – that is, those that can be converted into an int.
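The core filtering idea – keep only the config values that parse as numbers, drop the string-valued ones – can be sketched in a few lines of Go. This is a simplified illustration of the concept, not the exporter’s actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// numericConfigs filters a map of Kafka config parameters down to
// the ones that can be exposed as numeric metrics, dropping
// string-valued configs that a metrics system cannot represent.
func numericConfigs(configs map[string]string) map[string]float64 {
	metrics := make(map[string]float64)
	for name, value := range configs {
		if v, err := strconv.ParseFloat(value, 64); err == nil {
			metrics[name] = v
		}
	}
	return metrics
}

func main() {
	configs := map[string]string{
		"log.retention.ms":  "604800000",
		"cleanup.policy":    "delete", // string-valued: skipped
		"max.message.bytes": "1048588",
	}
	m := numericConfigs(configs)
	fmt.Println(len(m), m["log.retention.ms"] == 604800000)
}
```

In the real exporter, the input map would come from an AdminClient DescribeConfigs call, and the resulting numbers would be registered as Prometheus gauges labeled by topic.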

After deploying the exporter and pointing it to your desired Kafka cluster, you will be able to query your metrics data source and create amazing graphs and alerts based on it!

Dashboard Sample

The exporter and its documentation are available on GitHub:

https://github.com/EladLeev/kafka-config-metrics

TinyGo: A Go Compiler For Small Places
https://leevs.dev/tinygo/
Sat, 03 Oct 2020 16:17:10 +0000

TinyGo – a Go compiler for small places.
So what can we do with it?

The post TinyGo: A Go Compiler For Small Places appeared first on Leev's.

I started my journey with Go more than a year ago. Since then I have tried to find more and more opportunities to play with the language.

At first, I didn’t know that I would like it that much. I thought I was just adding another language to my toolbox. But as time went by and I wrote more and more Go code, I understood how much I love this language and how much I miss it when writing code in another language.
Isn’t it love? 

When I’m learning something new – a new system, a new language, or anything else – I love to get my hands dirty with it. I feel that I learn much faster: I face errors, make mistakes, read more, and in general gain more experience with it.
And the same goes for Golang.

After gaining enough confidence, I tend to search for something that will take me out of my comfort zone.
And there you have it – TinyGo.

We never expected Go to be an embedded language and so its got serious problems

Rob Pike

TinyGo

TinyGo is a project to bring the Go programming language to micro-controllers and modern web browsers by creating a new compiler based on LLVM.
We can look at it as an “extension” to Go, as it still needs the original Go installation, but allows you to compile almost the same code to a MUCH smaller binary.

In a talk that Ron Evans (one of the creators of the project) gave, he demonstrated how a simple “Hello World” Go program can be compiled into a binary less than 1% (!!) of the size of the original binary created by the go build command.

1,148,709 bytes vs 12,376 bytes

These capabilities, among others, allow us to flash our almost identical code onto tiny microcontrollers like the Arduino Uno (which has 32KB of memory) and become embedded developers in a few minutes.

The project is notably active and supports the most common boards out there.
On well-known boards, like the Uno, the project provides almost full support for interfaces like GPIO and UART.

After a quick installation, you can start to write your code and communicate with the different components in the Arduino universe.


Finding a project

I always liked to play with boards like Arduino and Raspberry Pi, and the combination of them with the Go language sounds amazing to me.
So, as expected, I pulled out my million-piece Arduino kit, installed everything I needed, and started to play with TinyGo.

I started with tiny projects like mimicking police car lights with the board, two LED lights, and two resistors –


Using a few lines of code I managed to communicate with the LED lights and create the effect –
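The code looked roughly like the following TinyGo sketch. This is an illustration, not the exact program: the pin numbers are assumptions for an Uno-style board, and it must be built and flashed with the tinygo CLI (e.g. `tinygo flash -target=arduino .`) rather than go build:

```go
package main

import (
	"machine"
	"time"
)

func main() {
	// Pin assignments are assumptions -- adjust to your wiring.
	red := machine.D12
	blue := machine.D13
	red.Configure(machine.PinConfig{Mode: machine.PinOutput})
	blue.Configure(machine.PinConfig{Mode: machine.PinOutput})

	for {
		// Alternate the two LEDs like a police strobe.
		red.High()
		blue.Low()
		time.Sleep(150 * time.Millisecond)
		red.Low()
		blue.High()
		time.Sleep(150 * time.Millisecond)
	}
}
```

The machine package is TinyGo’s hardware abstraction – the same source compiles for different boards by changing the -target flag.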

Next, I decided to connect my board to a Buzzer, which is supported of course, and to play some music!


Don’t you think it’s cool? – and what about Doom or GOT theme songs?

The real challenge

After I finished learning the basics I started to search for a bigger project.
For a long time I’ve wanted to build a feeding system for my dog – nothing too complicated, but something that could be controlled via email or a command. No doubt it would be useful.

A short Google search revealed a good and inspiring project.

Unfortunately, the motor that I got with the kit is unsupported by the project drives and I couldn’t make it work with it.
So the only thing I have to do now is to wait for my new motor to arrive by mail (it’s COVID-19 days after all).


This is my first post about TinyGo. I hope to proceed with the project soon enough and log the process in this blog.

Stay tuned.

My journey from Python to Go
https://leevs.dev/golang/
Tue, 19 Mar 2019 13:19:00 +0000

The post My journey from Python to Go appeared first on Leev's.

Originally posted on AppsFlyer blog

I love Python. It has been my go-to language for the past five years. Python is very friendly and easy to learn while still remaining super effective.

You can use it for almost anything — from creating a simple script and web development, through data visualizations, to machine learning.

But the growth in the maturity of Go, the strong user base, and the fact that more and more companies have decided to move to Go after successful benchmarking, made me read more extensively about Go, and think about how I can add it into my tool set and apply its benefits to my work.

But this post is not going to talk about which programming language is better — Python or Go, there are plenty of posts and comparisons about this topic online, and it really depends on the use case in my opinion.

In this post, I’m going to tell you about my journey from Python to Go, and provide you with some tips and expose you to some of the resources that helped me succeed on this journey and live to tell the tale.

GIF by Egon Elbre

Key differences that I encountered

Of course, as a first step I went over the amazing official “Tour of Go”, which definitely gave me strong basic knowledge of the Go syntax.

In order to strengthen that knowledge, I read the Go for Python Programmers ebook that allowed me to continue on to the next step — which I believe to be the most educational — trying and failing.

I took common functions that I had previously been using in Python like JSON serialization or working with HTTP calls, and tried to write them in Go.

By applying similar concepts from Python in Go, and still embracing the static nature of the language, I encountered some of the key differences between Go and Python.


Project Layout

First and foremost, Python typically does not require a specific directory hierarchy, whereas Go does.

Go uses a “standard” layout which is a bit more complex and creates slightly more work, but the upside is a well-structured code base that encourages modular code and keeps things orderly as a project grows in size.

The official “How to Write Go Code” has a section that explains exactly how to build your workspace.

Statically and strongly typed

Go is a statically typed language, which can make you feel uncomfortable at first because of your habits from dynamically typed languages like Python or Ruby.

There is no doubt that dynamic languages are more error-prone, and it takes more effort in terms of input validation to prevent common syntax or parsing errors. Think about a function that calculates the sum of two integers; there is really no guarantee that a user won’t pass a string into the function — which will cause a TypeError.
This scenario can’t happen in Go, as you need to declare the type of each variable — which types your function accepts, and which type it returns.

At first it was a bit annoying, and it felt like it made my coding speed much slower, but after a short time reading and writing Go, you really get used to it, and it actually saves time and makes your code much more robust.
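To illustrate the point, the same sum function in Go declares its types up front, and the compiler rejects a call with a string at build time:

```go
package main

import "fmt"

// sum only accepts two ints; passing a string is a compile-time
// error rather than a runtime TypeError as in Python.
func sum(a, b int) int {
	return a + b
}

func main() {
	fmt.Println(sum(2, 3)) // 5
	// sum("2", 3) // compile error: cannot use "2" (untyped string) as int
}
```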

Native concurrency

Go has native concurrency support using goroutines and channels, which can be really handy nowadays.

At first, the concept of channels can be a bit tricky, and it’s easy to look at it as some kind of data structure or queuing implementation. After some reading though, they become more straightforward, and you can really enjoy the value they bring, and take full advantage of them.

A simple visualization of goroutines and channels by Ivan Daniluk —

package main

func main() {
	// create new channel of type int
	ch := make(chan int)

	// start new anonymous goroutine
	go func() {
		// send 42 to channel
		ch <- 42
	}()

	// read from channel
	<-ch
}

For more examples, take a look at the Hootsuite real life implementation of goroutines, channels and the select statement, or this great explanation from ArdanLabs.

Working with JSON

Well, no more json.loads() for you.
In Python, deserializing JSON objects is super simple — just use json.loads and that’s it!
But in Go, as a statically typed language, this simple operation can be trickier.

In Go, you parse the JSON into a struct you’ve defined before. Any field which won’t fit in the struct will be ignored, which is a good thing. Think of it as a predefined protocol between the two sides. You really don’t need to be “surprised” by the data you received in the JSON, and the JSON fields and types need to be “agreed” upon by both sides.

{
    "first": "Elad",
    "last": "Leev",
    "location": "IL",
    "id": "93"
}

type AccountData struct {
    First    string `json:"first"`
    Last     string `json:"last"`
    Location string `json:"location"`
    ID       string `json:"id"`
}

Of course, you can still deserialize JSON without structs, but it should be avoided if possible; it’s always better to embrace the static nature of the language.

To better understand JSON encoding in Go, you can take a look at this post, or use “Go By Example” — the ultimate cheat sheet resource you will ever have.
Too lazy to convert your JSON into a Go struct? No problem — this tool will do it for you.

Clean code

The Go compiler will always try to keep your code as clean as possible.
It treats unused variables as a compilation error, and moreover, Go took the unique approach of having the machine take care of most formatting issues.
Go will run the gofmt program on save or compile, and it will take care of the large majority of formatting issues.

You don’t care about one of the values? Again — no problem! Just assign it to the blank identifier, _ (underscore).

A must read doc that contains info about Go formatting is “Effective Go”.

My journey continues

Finding the Right Library and Frameworks

I had really gotten used to my Python frameworks and libraries like Flask, Jinja2, Requests, and even Kazoo, and I was worried that I wouldn’t find the right ones for Go.
But as you can guess, the great Go community has its own unique libraries that can make you completely forget about the old ones.
Here are some of my preferences —

Python Requests => net/http
The built-in net/http package provides HTTP client and server implementations that are really great and super easy to use.

Flask + Jinja2 => Gin
Gin is an HTTP web framework with a really simple API — path parameters, file uploads, route grouping (/api/v1, /api/v2), custom log formats, serving static files, HTML rendering, and really powerful custom middleware.
Take a look at this benchmark.

CLI Creation => Cobra
Cobra is both a library for creating powerful CLI applications as well as a program to generate applications and command files.
Many of the most widely used Go projects are built using Cobra, including Kubernetes, etcd, and OpenShift.

Some other libraries that I highly recommend are Viper, Gonfig, and this awesome list — Awesome-Go.

Other Resources

There are a few other resources that significantly helped me on my journey to Go –

[1] Francesc Campoy — you definitely need to check out his YouTube channel and GitHub profile.
Francesc also has a few great workshops — Go Tooling in Action and Web Applications Workshop.

[2] GopherCon Videos

[3] Go Web Examples

[4] Golang Weekly, and the Gopher Academy and Golang News Twitter accounts.

Summary

As an avid Python user for five years, I was concerned that the transition to Go would be a painful one.

But I was really psyched to see that there is a truly strong community, contributing and maintaining excellent resources to help you succeed with your transition to Go.

Go is one of the fastest growing programming languages today, and I expect that Google will manage to make Go the go-to language for writing cloud applications and infrastructure.

This is an exciting time for Go and I encourage you all to check it out and become Gophers!
