python Archives - Sefik Ilkin Serengil (https://sefiks.com/tag/python/) - "Code wins arguments"

A Gentle Introduction to Event Driven Architecture in Python, Flask and Kafka
https://sefiks.com/2025/09/04/a-gentle-introduction-to-event-driven-architecture-in-python-flask-and-kafka/
Thu, 04 Sep 2025 12:17:37 +0000

The post A Gentle Introduction to Event Driven Architecture in Python, Flask and Kafka appeared first on Sefik Ilkin Serengil.

In today’s fast-paced and scalable application development landscape, event-driven architecture (EDA) has emerged as a powerful pattern for building systems that are decoupled, reactive, and highly extensible. Instead of relying on tightly coupled and synchronous workflows, EDA enables services to communicate through events, promoting flexibility, resilience, and parallel processing. One of the key advantages of EDA is that messages can seamlessly travel between different modules and even across domain boundaries, while their flow can be monitored and traced much more effectively than traditional logs—providing deeper visibility into system behavior and data flow.

Photo of Paper Cup on Top of the Table by Pexels

In this blog post, we’ll explore how to implement an event-driven system in Python and Flask using Kafka as a message broker. Kafka allows us to produce and consume messages efficiently, and it enables scalable, parallel processing across multiple machines, overcoming the limitations of threading, which confines execution to a single machine and its CPU cores. While Kafka is a robust and popular choice, we’ll also touch on alternative queueing mechanisms like RabbitMQ and Celery, demonstrating how EDA can be implemented flexibly depending on your stack and use case. Whether you’re building microservices, data pipelines, or real-time applications, adopting EDA can bring significant gains in scalability, observability, and maintainability.


Events in Real Life

Have you ever thought about how Starbucks manages queues so efficiently? You walk in, and one barista takes your name and your order, then writes it on a paper cup. That’s their job — fast and focused. Then that cup moves on to another barista who makes your coffee. Meanwhile, someone else might be restocking coffee beans from the supply chain to make sure nothing runs out.

Each person is doing a different part of the job — at their own pace — but everything flows together smoothly. That’s not a coincidence. It’s a system where each step is triggered by an event: a new customer, a new order, a low inventory alert. That’s basically how event-driven architecture works in software.

If you’re preparing for system design interviews, you’ll hear about this model a lot — and for good reason. It’s a smart way to build systems that scale easily, respond quickly, and handle real-world complexity with elegance.

Key Concepts in Event Driven Architecture

There are a few main parts in event-driven systems:

  • Event: This is a message that says something happened — for example, “user registered.”
  • Producer: The part of the system that creates and sends the event.
  • Consumer: The part that listens for events and reacts to them.
  • Broker: Usually, there’s a middleman like Kafka or RabbitMQ that holds these events and delivers them to the right consumers.

One cool thing about events is that they can be designed as chains — where one event triggers another, which triggers the next, and so on.

Or sometimes, when you get an event, you can split it into multiple smaller tasks. Each task can be processed independently and in parallel. This helps break down complex work into manageable pieces and makes scaling easier.
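These pieces can be sketched with an in-memory stand-in for the broker, one queue per topic; the class and method names below are illustrative, not Kafka's API:

```python
import queue

class EventBus:
    """A toy broker: holds events per topic and hands them to consumers."""

    def __init__(self):
        self.topics = {}

    def produce(self, topic, event):
        # producer side: append an event to the topic's queue
        self.topics.setdefault(topic, queue.Queue()).put(event)

    def consume(self, topic):
        # consumer side: take the next event, or None if there is nothing
        q = self.topics.get(topic)
        return q.get_nowait() if q and not q.empty() else None

bus = EventBus()
bus.produce("user.registered", {"user_id": 42})  # an event: something happened
print(bus.consume("user.registered"))  # {'user_id': 42}
```

A real broker adds persistence, partitioning, and delivery guarantees on top of this basic produce/consume contract.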

More Reasonable In Python

In Python, we don’t like long-running for loops because they run serially, one after another, which slows things down. While threading or multiprocessing can help, it depends on the number of CPU cores you have. Event-driven systems take a different approach. You can design a system with many servers and many workers running in parallel.

Scalability is simple — if you add a new server that consumes messages from a topic, your app can process more events at the same time without changing your code.

Decoupling Request Handling from Processing

Another big advantage is how your app handles requests. In a traditional system, when you send a request, you often have to wait until the server finishes all the processing before getting a response.

In event-driven architecture, your app can respond immediately, saying, “Request received,” and give you a unique ID. Behind the scenes, the request is stored as an event in a topic.

Another job or worker listens for these events and does the actual processing, like sending emails or updating a database. To keep users informed, you can expose another endpoint where they can check the status of their request using the ID you provided. This way, your main app stays responsive all the time because it’s not doing the heavy lifting immediately.

Even if the worker goes down temporarily, events stay safely in the topic and get processed later when the worker is back up. This makes your system much more reliable.
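This flow can be sketched without any particular web framework; the list and dict below are hypothetical stand-ins for a Kafka topic and a status store:

```python
import uuid

pending_events = []  # stands in for the topic
statuses = {}        # stands in for a status table

def handle_request(payload):
    """Respond immediately with a unique ID; enqueue the real work as an event."""
    request_id = str(uuid.uuid4())
    statuses[request_id] = "received"
    pending_events.append({"id": request_id, "payload": payload})
    return {"status": "received", "id": request_id}

def worker():
    """A separate job: consumes events and does the heavy lifting."""
    while pending_events:
        event = pending_events.pop(0)
        # ... send emails, update a database, etc. ...
        statuses[event["id"]] = "done"

def check_status(request_id):
    """The extra endpoint users poll with the ID they were given."""
    return statuses.get(request_id, "unknown")

resp = handle_request({"email": "user@example.com"})
print(check_status(resp["id"]))  # received
worker()
print(check_status(resp["id"]))  # done
```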

Common Tools and Technologies

Back in the 90s, enterprise applications used heavy and complex message queues, called MQs.

Once, I had to implement an event-driven-like system using a temporary database table and a trigger. Whenever a new record was added to the main table, the trigger would insert its metadata into the temporary table. I was continuously polling this temporary table to detect new records. After processing a record, I would delete its corresponding entry from the temporary table.
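That old workaround can be recreated in a few lines with SQLite, which also supports triggers; the table and trigger names below are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table main_table (id integer primary key, payload text);
    create table work_queue (id integer primary key, main_id integer);
    -- the trigger copies metadata of every new record into the work table
    create trigger on_main_insert after insert on main_table
    begin
        insert into work_queue (main_id) values (new.id);
    end;
""")

con.execute("insert into main_table (payload) values ('hello')")

# the polling loop body: detect new records, process, then delete the marker
rows = con.execute("select id, main_id from work_queue").fetchall()
for row_id, main_id in rows:
    # ... process the record identified by main_id ...
    con.execute("delete from work_queue where id = ?", (row_id,))

print(len(rows))  # 1
```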

Today, things are much lighter and easier to use. Popular tools include Kafka, RabbitMQ, and Celery.

Personally, I like consuming messages with Flask. When you build a message bus this way, it feeds incoming events to web service methods exposed in Flask. This approach is great because it makes it easy to monitor, debug, and test your event consumers using familiar HTTP endpoints.

Real-Life Example Scenario

Let’s look at a real-life example involving CCTV cameras and facial recognition. Imagine a busy public center with hundreds of people walking in. The CCTV records images continuously — each new image is fed as an event into the system.

A job consumes the image event and detects faces in the image. For each detected face, it creates a new event and puts it into another topic. Once done, that job’s task is complete.

Another job consumes these face events, runs facial recognition, and converts the faces into numerical representations called embeddings. These embeddings are sent as new events to yet another topic.

The final job listens for these embeddings and searches them against a database of wanted people. If it finds a match, it triggers an event that alerts the authorities.

Finally, criminals are reported to the police with their latest location — all done asynchronously and efficiently through a chain of events.

Traditional Approach

Consider the following snippet. When the analyze method receives an image, it first calls DeepFace's extract_faces function, which returns a list of faces. Then, it calls DeepFace's analyze function for each pre-detected and pre-extracted face.

def analyze(self, image: str):
    faces = self.deepface_service.extract_faces(image)
    self.logger.info(f"extracted {len(faces)} faces")
    for idx, face in enumerate(faces):
        demography = self.deepface_service.analyze(face)
        self.logger.info(
            f"{idx+1}-th face analyzed: {demography['age']} years old "
            f"{demography['dominant_gender']} "
        )

Python’s standard for loops are synchronous and blocking, meaning each iteration completes before the next one starts. In contrast, JavaScript for loops can start asynchronous operations (like Promises) in parallel, allowing multiple tasks to run concurrently.

In Python, achieving parallel or asynchronous execution requires using asyncio, threading, or multiprocessing, because a normal for loop alone will not run tasks in parallel. And even if loops could run concurrently as in JavaScript, execution would still be limited to the cores of a single machine; it is still not suitable for scaling across multiple machines.
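For completeness, single-machine parallelism with a thread pool looks like this; `analyze_face` is a placeholder for real work, and the approach still cannot scale beyond one machine's cores:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_face(face):
    return face * 2  # placeholder for an expensive analysis step

faces = [1, 2, 3, 4]

# run the loop body concurrently on one machine's threads
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze_face, faces))

print(results)  # [2, 4, 6, 8]
```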

Event Driven Approach

Instead, we will publish each item from the for loop to a Kafka topic, and a separate job will consume and process them.

def analyze(self, image: str):
    faces = self.deepface_service.extract_faces(image)
    self.logger.info(f"extracted {len(faces)} faces")
    for idx, face in enumerate(faces):
        encoded_face = base64.b64encode(face.tobytes()).decode("utf-8")
        self.event_service.produce(
            topic_name="faces.extracted",
            key="extracted_face",
            value={
                "face_index": idx,
                "encoded_face": encoded_face,
                "shape": face.shape,
            },
        )


Producing to Kafka will be very fast because it is an asynchronous operation — the message is queued, but it is not verified whether it has actually been written to the topic (unless explicitly requested).

With the Flask-Kafka package, we can listen to a Kafka topic as if it were a web service. You can also test the service using an HTTP POST with the payload of a message placed on the topic. In other words, it’s the same whether you put a message on the faces.extracted Kafka topic or send an HTTP POST to the localhost:5000/analyze/extracted/face endpoint with the same payload.

@bus.handle("faces.extracted")
@blueprint.route("/analyze/extracted/face", methods=["POST"])
def analyze_extracted_face(input_args):
    event = json.loads(input_args.value)
    container: Container = blueprint.container

    try:
        container.core_service.analyze_extracted_face(
            face_index=event["face_index"],
            encoded_face=event["encoded_face"],
            shape=event["shape"],
        )
        return {"status": "ok", "message": "analyzing face asynchronously"}, 200
    except Exception as err:  # pylint: disable=broad-exception-caught
        container.logger.error(
            f"Exception while analyzing single face - {err}"
        )
        container.logger.error(traceback.format_exc())
        return {"status": "error", "detail": str(err)}, 500

This way, the analyze_extracted_face function will be triggered whenever a message is placed on the faces.extracted topic, and it will perform the analysis for that individual face.

    def analyze_extracted_face(
        self,
        face_index: int,
        encoded_face: str,
        shape: Tuple[int, int, int],
    ):
        decoded_face = base64.b64decode(encoded_face)
        face = np.frombuffer(decoded_face, dtype=np.float64).reshape(shape)

        demography = self.deepface_service.analyze(face)
        self.logger.info(
            f"{face_index+1}-th face analyzed: {demography['age']} years old "
            f"{demography['dominant_gender']} "
        )

When starting the service with Gunicorn, faces will be analyzed in parallel according to the number of workers specified in the command (in my experiments, I used 2 workers). If we run this service on multiple machines, each machine will serve with the same number of workers. In other words, for scaling, it will be sufficient to adjust the number of workers and the partition count of the Kafka topic through configuration.
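For reference, such a start command could look like the following; the module path app:app is a placeholder for your actual Flask entrypoint:

```shell
# two worker processes; scale by raising --workers and the topic's partition count
gunicorn --workers 2 --bind 0.0.0.0:5000 app:app
```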

Conclusion

Event-driven architecture represents a shift from rigid, tightly coupled systems toward flexible, scalable, and resilient designs. By leveraging tools like Kafka, RabbitMQ, or Celery, developers can decouple services, enable parallel processing, and gain better visibility into the flow of data across their applications. While implementing EDA may introduce new concepts such as brokers, producers, and consumers, the long-term benefits in terms of scalability, maintainability, and fault tolerance make it a valuable investment for modern software systems. Whether you’re orchestrating microservices, handling high-throughput data streams, or building responsive user experiences, adopting an event-driven mindset can help future-proof your architecture and keep your systems ready for growth.

I have pushed the source code of this study to GitHub. I strongly recommend pulling the repo and running the service locally; the README of the repo explains the steps to get the service up. Finally, you can support this work by starring the repo.

Digital Signatures In Python
https://sefiks.com/2025/04/19/digital-signatures-in-python/
Sat, 19 Apr 2025 22:26:31 +0000

Digital signatures play a crucial role in securing data integrity and authenticity across modern systems. Whether it’s signing documents, verifying transactions, or securing communication channels, digital signatures ensure that messages come from legitimate sources and haven’t been tampered with. In this post, we’ll explore how to implement digital signatures in Python using the LightDSA library—a lightweight and flexible cryptographic toolkit that supports multiple signature algorithms and elliptic curve configurations.

Person Holding Fountain Pen By Pexels


✍ What is LightDSA?

LightDSA is a Python library designed for generating and verifying digital signatures. It supports a variety of signature schemes including:

  • RSA
  • DSA
  • ECDSA (Elliptic Curve Digital Signature Algorithm)
  • EdDSA (Edwards-Curve Digital Signature Algorithm)

What sets LightDSA apart is its configurability, especially when it comes to elliptic curve–based algorithms like ECDSA and EdDSA.

ECDSA & EdDSA Curve Support

LightDSA provides three elliptic curve forms:

  • Weierstrass
  • Koblitz
  • Edwards

Each form supports hundreds of pre-defined curves. For example, the Bitcoin protocol uses ECDSA over the secp256k1 curve, which is a Weierstrass-form curve.

Here’s how to use custom curves in LightDSA:

# import library
from lightdsa import LightDSA

# build the curve used in bitcoin
dsa = LightDSA(
    algorithm_name = "ecdsa",
    form_name = "weierstrass", # or koblitz, edwards
    curve_name = "secp256k1" # see supported curves
)

For EdDSA:

# import library
from lightdsa import LightDSA

# build an edwards curve based eddsa
dsa = LightDSA(
    algorithm_name = "eddsa",
    form_name = "edwards", # or weierstrass, koblitz
    curve_name = "ed25519" # see supported curves
)

On the other hand, you can also use Edwards curves in ECDSA and Weierstrass curves in EdDSA, although this is not common practice.

RSA and DSA

For RSA:

# import library
from lightdsa import LightDSA

# build rsa cryptosystem
dsa = LightDSA(
    algorithm_name = "rsa",
)

For DSA:

# import library
from lightdsa import LightDSA

# build dsa cryptosystem
dsa = LightDSA(
    algorithm_name = "dsa",
)

Customizing Key Sizes

For RSA and DSA algorithms, you can increase the key size to build stronger cryptosystems. For instance, upgrading from a 2048-bit RSA key to a 4096-bit one dramatically enhances security—though it also increases computation time.

# import library
from lightdsa import LightDSA

# build rsa cryptosystem
dsa = LightDSA(
    algorithm_name = "rsa", # or dsa
    key_size=7680
)

Consider this table before setting key sizes:

Key Size Comparison

In contrast, with ECDSA and EdDSA, security is primarily dictated by the order of the elliptic curve—the number of points it defines—rather than the key size itself. This is shown in the "n (bits)" column of the supported curves.

Exporting Private and Public Keys

Once you have built the cryptosystem, you can export the private and public keys as follows:

# export private key
dsa.export_keys("secret.txt")

# export public key
dsa.export_keys("public.txt", public = True)

You must keep your private key secret.

Restoring Cryptosystems

You can restore the cryptosystem from a given secret or public key file as follows:

signer_dsa = LightDSA(
    algorithm_name = algorithm_name,
    form_name = form_name,
    curve_name = curve_name,
    key_file = "secret.txt"
)

verifier_dsa = LightDSA(
    algorithm_name = algorithm_name,
    form_name = form_name,
    curve_name = curve_name,
    key_file = "public.txt"
)

Here, you should pass the same algorithm name, form name, and curve name that you used when creating the cryptosystem.

Signing

Signing a message is very straightforward. You must have the private key to sign a message.

# sign a message
message = "Hello, world!"
signature = dsa.sign(message)

Verification

Verification is also very straightforward. You must have the public key to verify a message.

verifier_dsa.verify(message, signature)
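Under the hood, RSA-style signing and verification boil down to modular exponentiation with the private and public exponents. The sketch below is textbook RSA with tiny primes and no padding, written only for intuition; it is not LightDSA's implementation and must never be used in practice:

```python
import hashlib

# toy key generation (real RSA uses primes of 1024+ bits and proper padding)
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)
e = 17                    # public exponent
d = pow(e, -1, phi)       # private exponent

def sign(message: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)   # only the private key holder can do this

def verify(message: bytes, signature: int) -> bool:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h  # anyone with the public key can check

sig = sign(b"Hello, world!")
print(verify(b"Hello, world!", sig))  # True
```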

Why Use LightDSA?

  • Lightweight and easy to use
  • Fully configurable cryptographic backend
  • Supports modern cryptographic standards
  • Great for learning, prototyping, and even production usage

Conclusion

LightDSA makes it easy to experiment with different digital signature algorithms and elliptic curve configurations. Whether you’re developing secure systems or simply learning how modern cryptography works, it’s a fantastic tool to have in your Python toolkit.

You can support this study by starring its GitHub repo!

Elliptic Curve Cryptography In Python
https://sefiks.com/2025/04/19/elliptic-curve-cryptography-in-python/
Sat, 19 Apr 2025 20:42:20 +0000

Elliptic Curve Cryptography (ECC) has become a cornerstone of modern cryptographic systems. From signing blockchain transactions to securing encrypted messaging, ECC is everywhere. But why is it so popular? What makes it different from RSA? And how can we implement it in Python? In this blog post, we’ll briefly dive into the theory of ECC and then see how to perform elliptic curve arithmetic operations using the LightECC Python library in just a few lines of code.

Gold Bitcoin by Pexels


Why is ECC so widely used?

ECC is heavily used in blockchain systems like Bitcoin and Ethereum—for example, in wallet address generation and transaction signing. Popular signature schemes such as ECDSA (Elliptic Curve Digital Signature Algorithm) and EdDSA (Edwards-curve Digital Signature Algorithm) are based on elliptic curves.

More recently, GnuPG has shifted away from the traditional RSA default. The latest versions now use ECDH (Elliptic Curve Diffie-Hellman) for encryption and EdDSA for signing by default, breaking RSA’s long-standing dominance.

Even though ECC is not quantum-resistant, Apple still relies on ECC in iMessage—in combination with post-quantum algorithms for added security.

ECC vs. RSA: A Matter of Efficiency

At first glance, RSA might seem easier to understand because it’s based on modular exponentiation with large integers. However, for computers, ECC is far more efficient: its operations are based on simple arithmetic—point addition and scalar multiplication—making it faster and more scalable.

When we increase RSA key sizes for stronger security, the performance cost grows exponentially. In contrast, ECC key sizes can be increased with roughly linear performance overhead while achieving the same level of security. For instance, a 256-bit ECC key is roughly equivalent in security to a 3072-bit RSA key.

RSA vs ECC Key Sizes

ECC Isn’t Simple—And That’s Okay

What makes ECC seem complicated is its variety of curve forms, each with their own equations and rules for arithmetic. The most common forms are:

  • Weierstrass: Used in ECDSA (e.g., secp256k1 for Bitcoin). Uses the equation y² = x³ + ax + b
  • Koblitz: A special case over binary fields for hardware design. Uses y² + xy = x³ + ax² + b
  • Edwards: Used in EdDSA, designed for faster and more secure implementations. Uses x² + y² = 1 + dx²y²

Graphical Interpretation

Weierstrass and Koblitz: When adding two points on the curve, we draw a line through them. This line intersects the curve at a third point, and we reflect that point over the x-axis to get the result.

Edwards curves don’t have such a graphical interpretation, but their formulas can be proven by mathematical induction. Their arithmetic is typically faster and less error-prone.

Additions for Different Elliptic Curve Forms
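The chord-and-tangent rule above takes only a few lines to implement. The sketch below uses a classic textbook Weierstrass curve, y² = x³ + 2x + 2 over F₁₇, which is far too small to be secure and is not how LightECC works internally:

```python
P_MOD, A = 17, 2  # field modulus and the curve parameter a
INF = None        # the point at infinity (identity element)

def add(p, q):
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                      # p + (-p) = infinity
    if p == q:                          # doubling: slope of the tangent line
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD)
    else:                               # addition: slope of the chord
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD)
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD   # reflect the third point over the x-axis
    return (x3, y3)

def mul(k, p):
    # scalar multiplication by repeated addition (real code uses double-and-add)
    result = INF
    for _ in range(k):
        result = add(result, p)
    return result

G = (5, 1)             # a generator point on this curve
print(add(G, G))       # (6, 3)
print(mul(3, G))       # (10, 6)
```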

Let’s Code: ECC with Python using LightECC

You don’t need to implement elliptic curve arithmetic from scratch. With LightECC, you can define a curve and perform operations in just a few lines of code.

Install LightECC as

pip install lightecc

Define your curve as

from lightecc import LightECC

# build an elliptic curve
ec = LightECC(
    form_name = "edwards", # or weierstrass, koblitz. default is weierstrass.
    curve_name = "ed25519", # check out supported curves section
)

# Generator point
G = ec.G

Perform arithmetic operations as

# addition
_2G = G + G
_3G = _2G + G
_5G = _3G + _2G
_10G = _5G + _5G

# subtraction
_9G = _10G - G

# multiplication
_20G = 20 * G
_50G = 50 * G

# division
_25G = _50G / G

Division? Not So Fast…

While LightECC supports division, elliptic curve division is not efficient. In fact, the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP)—that is, figuring out k from P = k * G—is what makes ECC secure. It’s a one-way function: easy to compute, hard to reverse.

Do We Really Need to Know All These Details?

Not necessarily. As developers, we often use cryptographic libraries as black boxes. But having a high-level understanding of how ECC works—and what makes it secure—can help you choose the right tools, write safer code, and even contribute to the libraries you use.

Final Thoughts

Elliptic Curve Cryptography is not only secure and efficient but also elegant once you get past its initial complexity. With libraries like LightECC, you can experiment with curve arithmetic, signature schemes, and even encryption in pure Python.

So the next time you sign a blockchain transaction or send an encrypted message, remember: behind the scenes, a beautiful little curve is working its magic.

You can support this work by starring the LightECC repo!

How to Write Idempotent Python Codes
https://sefiks.com/2024/12/20/how-to-write-idempotent-python-codes/
Fri, 20 Dec 2024 20:25:43 +0000

In software development, ensuring that your programs behave consistently is crucial, especially when dealing with large data sets or critical transactions. One of the key concepts in this regard is idempotency — the property of a system or function where repeated executions produce the same result, even if performed multiple times. This is particularly important in scenarios like ETL (Extract, Transform, Load) processes, where interruptions or failures can cause data inconsistencies, such as duplicate entries. In this post, we’ll explore how to write idempotent Python code, the importance of ensuring idempotency in your applications, and how it can help avoid common pitfalls, such as data duplication during processes like money transfers. Through a practical example, you’ll see how to prevent the issues that arise when long-running processes are interrupted and need to be rerun. By understanding and implementing idempotent programming, you can build more reliable and efficient Python applications, ensuring smooth data processing without the risk of duplication or errors.
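As a warm-up, the money-transfer pitfall mentioned above is usually avoided with an idempotency key; the in-memory structures below are hypothetical stand-ins for persistent storage:

```python
processed = set()                      # stands in for a durable log of handled keys
balance = {"alice": 100, "bob": 0}

def transfer(idempotency_key, src, dst, amount):
    if idempotency_key in processed:
        return "already done"          # a re-run is a no-op, not a second transfer
    balance[src] -= amount
    balance[dst] += amount
    processed.add(idempotency_key)
    return "done"

transfer("tx-1", "alice", "bob", 10)
transfer("tx-1", "alice", "bob", 10)   # e.g. a retry after a crash
print(balance)  # {'alice': 90, 'bob': 10}
```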

Black Microphone Windscreen From Pexels

Prerequisites

I will use pickle to read and write data for my complex types. The following are the functions I will use for these tasks:

import pickle

def store_as_pickle(data, filename):
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

def read_from_pickle(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data

1st Use Case

As a use case, I will retrieve data from a table in one source, apply transformations to the retrieved data, and then insert it into a table in another source. To ensure idempotency, I will break these three functionalities into separate sections.

Assume the source data is stored in the table_a table of the first_database. We can begin with a generic query like the following:

import psycopg2

source_connection = psycopg2.connect(
   "postgres://postgres_user:postgres_password@postgres_host:5432/first_database"
)
source_cursor = source_connection.cursor()
source_cursor.execute(f"""
    select
    first_column,
    second_column,
    third_column,
    fourth_column,
    fifth_column
    from public."table_a";
""")

The issue is that table_a contains millions or billions of records, and fetching all of them at once could lead to performance issues on both the application and database sides. To address this, I need to retrieve the data in smaller batches.

The following snippet fetches data from the source in batches of 10K records. Once all the data is retrieved, it will be stored in source_data.pkl. If the source data has already been extracted and the file exists at that path, it will read from the file instead of querying the source, even if the process is re-run. In that way, we satisfy idempotency while reading data from the source.

import os

batch_size = 10000
page = 0

source_path = "source_data.pkl"
results = []
while True:
   # exit if source data is already extracted
   if os.path.exists(source_path):
      results = read_from_pickle(source_path)
      break

   print(f"retrieving [{(page) * batch_size} - {(page+1) * batch_size})")

   # extract data
   sub_results = source_cursor.fetchmany(batch_size)
   print(f"{len(sub_results)} records extracted from source in the {page+1}-th batch")

   # No more data to fetch
   if not sub_results:
      store_as_pickle(results, source_path)
      break # exit loop

   results = results + sub_results
   page += 1

The source data is stored in the results variable. For simplicity, I am skipping the transformation step and feeding the data directly to the target source. If you need to apply transformations, you should do so immediately after extracting the data into results.

Next, I will initialize the connection to the target data source and prepare the statement required to insert data into its table.

target_connection = psycopg2.connect(
   "postgres://user:password@another_host:5432/second_database"
)
target_cursor = target_connection.cursor()

statement = f"""
    insert into public."table_b"
    (
        first_column,
        second_column,
        third_column,
        fourth_column,
        fifth_column
    )
    values
    (
        %s, %s, %s, %s, %s
    );
"""

We are now ready to perform bulk inserts into the target data source. The following snippet processes all the results data in chunks of 1,000 records, inserting each chunk in a single operation. A chunk will only be committed if all the insert statements are executed successfully.

After committing the data for a chunk, a flat file named .checkpoint_<index> will be created to mark its completion. If you re-run this snippet, it will first check whether a checkpoint file exists for the current chunk. If the checkpoint is found, that iteration is skipped, ensuring idempotency.

import os
from tqdm import tqdm

commit_interval = 1000

pbar = tqdm(range(0, len(results), commit_interval))
for i in pbar:
    valid_from = i
    valid_until = min(i + commit_interval, len(results))

    checkpoint_file = f".checkpoint_{i}"
    if os.path.exists(checkpoint_file):
       print(f"chunk of [{valid_from}, {valid_until}) of {len(results)} is already performed")
       continue

    chunk = results[valid_from:valid_until]
    pbar.set_description(f"Inserting [{valid_from}, {valid_until}) of {len(results)}")
    target_cursor.executemany(statement, chunk)
    target_connection.commit()

    # create a flat file for this chunk
    open(checkpoint_file, "w").close()

This use case demonstrates a one-time ETL process for a large table. But what happens if the data in the source table is dynamic and continues to change over time?

2nd Use Case

To handle dynamic data, we can perform ETL on the source periodically. If new data is added to the source after the initial ETL job, I want to ensure that only the newly added data is processed during subsequent ETL runs.

The following snippet will extract data from both source and target.

batch_size = 10000
page = 0

source_cursor.execute(f"""
    select
    first_column,
    second_column,
    third_column,
    fourth_column,
    fifth_column
    from public."table_a";
""")

target_cursor.execute(f"""
    select
    first_column,
    second_column,
    third_column,
    fourth_column,
    fifth_column
    from public."table_b";
""")

source_data = []
while True:
   sub_results = source_cursor.fetchmany(batch_size)

   # No more data to fetch
   if not sub_results:
      break # exit loop

   source_data = source_data + sub_results

target_data = []
while True:
   sub_results = target_cursor.fetchmany(batch_size)

   # No more data to fetch
   if not sub_results:
      break # exit loop

   target_data = target_data + sub_results

Assume that first_column serves as the unique identifier for the data. I will store these identifiers in a set, allowing me to identify unloaded data by leveraging the set’s difference functionality. Similarly, if a record is deleted from the source after being loaded into the target, I can use the set difference again to identify and handle the missing data.

alpha = {i[0] for i in source_data}
beta = {j[0] for j in target_data}

unloaded_ids = alpha - beta
deleted_ids = beta - alpha

unloaded_records = []

for record in tqdm(source_data):
   if record[0] in unloaded_ids:
      unloaded_records.append(record)

Now, we can use the same procedure to insert unloaded data into target.

import os
from tqdm import tqdm

commit_interval = 1000

pbar = tqdm(range(0, len(unloaded_records), commit_interval))
for i in pbar:
    valid_from = i
    valid_until = min(i + commit_interval, len(unloaded_records))

    checkpoint_file = f".checkpoint_v2_{i}"
    if os.path.exists(checkpoint_file):
       print(
          f"chunk of [{valid_from}, {valid_until}) "
          f"of {len(unloaded_records)} is already performed"
       )
       continue

    chunk = unloaded_records[valid_from:valid_until]
    pbar.set_description(f"Inserting [{valid_from}, {valid_until}) of {len(unloaded_records)}")
    target_cursor.executemany(statement, chunk)
    target_connection.commit()

    # create a flat file for this chunk
    open(checkpoint_file, "w").close()

As a homework task, could you please implement a function to delete data from the target based on the deleted_ids?
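If you want to check your answer, one possible sketch could look like the following. It assumes the same target_cursor and target_connection objects from earlier and that first_column is the key column of table_b; the chunks helper and function names are illustrative, not part of the original code.

```python
def chunks(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def delete_from_target(target_cursor, target_connection, deleted_ids, commit_interval=1000):
    # delete records from the target whose identifiers no longer exist in the source
    statement = 'delete from public."table_b" where first_column = %s'
    for chunk in chunks(sorted(deleted_ids), commit_interval):
        # executemany expects a sequence of parameter tuples
        target_cursor.executemany(statement, [(i,) for i in chunk])
        target_connection.commit()
```

Committing per chunk keeps this idempotent in spirit: deleting an already-deleted identifier is a no-op, so re-running the function after an interruption is safe.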

Conclusion

In this post, we explored how to write idempotent Python code, particularly in the context of ETL processes involving large datasets. We discussed the importance of idempotency in ensuring data integrity, especially when dealing with interruptions or failures during long-running processes like data transfers. By utilizing strategies such as batch processing, checkpointing, and leveraging Python sets for identifying new or deleted data, we can maintain consistency and avoid duplication across multiple runs.

Idempotent programming not only improves the reliability of your applications but also ensures that your ETL jobs can be safely re-run without the risk of data corruption or inconsistencies. By following the practices outlined here, you can build more resilient data pipelines that handle dynamic data sources effectively.

If you’re dealing with large-scale data migrations or complex transformations, incorporating these idempotent techniques will help you achieve smooth, error-free operations every time.

The post How to Write Idempotent Python Codes appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2024/12/20/how-to-write-idempotent-python-codes/feed/ 0 17497
Best Practices to Variable Management in Python Web Services https://sefiks.com/2024/12/20/best-practices-to-variable-management-in-python-web-services/ https://sefiks.com/2024/12/20/best-practices-to-variable-management-in-python-web-services/#respond Fri, 20 Dec 2024 20:05:31 +0000 https://sefiks.com/?p=17493 Environment variables play a critical role in web services, enabling secure management of sensitive data, configuration settings, and API keys. … More

The post Best Practices to Variable Management in Python Web Services appeared first on Sefik Ilkin Serengil.

]]>
Environment variables play a critical role in web services, enabling secure management of sensitive data, configuration settings, and API keys. While developers in the Node.js ecosystem have long adopted best practices for handling environment variables, these same techniques can be easily applied to Python web services. By properly managing environment variables, you ensure that your application stays secure, flexible, and easily configurable across different environments. In this post, I will guide you through managing environment variables in Python, utilizing .env files and the python-dotenv package. I’ll show you how to integrate these practices into your local development with Docker, as well as in production environments via CI/CD pipelines. This method ensures sensitive information never makes it to your codebase, and configuration is streamlined for all stages of development. Let’s explore how to manage environment variables effectively in Python web services!

Safes with Keys and Knobs From Pexels

Using .env Files for Environment Variables

A common approach for managing environment variables is to use a .env file, which stores key-value pairs for variables you want to manage. This file can include things like database credentials, API keys, and other sensitive information. However, it is essential to make sure that the .env file is excluded from version control to avoid accidentally committing sensitive data.

First, create a .env file in your project root to store environment variables. For example:

DATABASE_URL=postgres://username:password@localhost:5432/mydatabase
SECRET_KEY=mysecretkey
LOG_LEVEL=DEBUG

To ensure the .env file is not tracked by Git, add it to your .gitignore file:

.env

While the .env file should be kept private, it’s a good practice to include a .env.example file in your repository. This file should contain all the necessary variable names without sensitive data, so others know what variables to define in their own .env file. Here’s what a .env.example file might look like:

DATABASE_URL=
SECRET_KEY=
LOG_LEVEL=DEBUG

This file can be pushed to GitHub so that anyone cloning the repository will know which variables need to be set in their own .env file.

Using the python-dotenv Package

The python-dotenv package makes it easy to load environment variables from a .env file into your application. To install it, run the following:

pip install python-dotenv

Next, you need to load the environment variables at the start of your application. This is typically done in the main initialization file or the entry point of your web service.

from dotenv import load_dotenv
load_dotenv()

This will load all the variables defined in your .env file into the environment, where you can access them using Python’s os.environ.

Accessing Environment Variables

Once loaded, you can access environment variables like this:

import os

database_url = os.getenv('DATABASE_URL')
secret_key = os.getenv('SECRET_KEY')
log_level = os.getenv('LOG_LEVEL', 'INFO')

print(f"Connecting to database at {database_url}")
print(f"Log level is {log_level}")
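As a side note, os.getenv returns None (or the supplied default) for a missing variable, while indexing os.environ raises a KeyError. A small sketch, with variable names set here just for demonstration:

```python
import os

# set a variable only for this demonstration
os.environ["DEMO_DATABASE_URL"] = "postgres://username:password@localhost:5432/mydatabase"

assert os.getenv("DEMO_DATABASE_URL").startswith("postgres://")

# a missing key falls back to the default instead of raising
assert os.getenv("DEMO_MISSING_KEY", "fallback") == "fallback"

# os.environ[...] raises KeyError for missing keys, unlike os.getenv
try:
    os.environ["DEMO_MISSING_KEY"]
except KeyError:
    print("KeyError raised as expected")
```

This is why os.getenv with an explicit default, as in the LOG_LEVEL example above, is the safer pattern for optional configuration.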

Using Docker with .env Files

When running your Python service in Docker locally, it’s important to ensure that your .env file is available inside the Docker container. You can achieve this by mounting the .env file into the container. Here’s an example of how to run your Docker container and mount the .env file:

echo "BUILDING DOCKER IMAGE"
docker build -t my-service .

if [ $? -eq 0 ]; then
    echo "RUNNING DOCKER IMAGE"
    docker run -p 5000:5000 -v $(pwd)/.env:/app/.env my-service:latest
else
    echo "Docker build failed. Exiting."
    exit 1
fi

This command binds the .env file from your local project directory to the /app/.env path inside the Docker container, ensuring that your environment variables are accessible.

Environment Variables in Production

In a production environment, especially when using services like GitHub Actions or GitLab CI/CD, environment variables can be injected directly into the Docker container without the need for an .env file. Those values can be retrieved from your repository's secrets in GitHub or GitLab, or from a secret store such as Vault.

When running the container, you can inject environment variables directly using the -e flag of docker run:

docker run -e DATABASE_URL=your-database-url -e SECRET_KEY=your-secret-key -p 5000:5000 my-service:latest

This approach ensures that your production configuration is securely injected into the container, without needing to include sensitive data in the .env file.

Conclusion

Effective variable management is a cornerstone of building secure, maintainable, and flexible Python web services. By adopting best practices from environments like Node.js, Python developers can ensure their applications handle sensitive data securely and configurations are easily managed across different environments.

In this post, we explored how to use .env files, the python-dotenv package, and Docker to manage variables in a Python web service. Whether you’re working locally or deploying to development or production, these practices ensure that your application’s configuration stays clean and secure. Remember, the goal is to never hard-code sensitive information or configuration directly into your codebase, and instead, rely on environment variables for easy configuration management.

By following these techniques, you’ll improve both the security and maintainability of your Python web services, ultimately enabling a smoother development experience. With a solid foundation in variable management, your application will be ready to scale efficiently across different stages of deployment.

Happy coding!

The post Best Practices to Variable Management in Python Web Services appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2024/12/20/best-practices-to-variable-management-in-python-web-services/feed/ 0 17493
A Minimalist Guide to Dependency Injection in Flask-based Python Web Services https://sefiks.com/2024/12/12/a-minimalist-guide-to-dependency-injection-in-flask-based-python-web-services/ https://sefiks.com/2024/12/12/a-minimalist-guide-to-dependency-injection-in-flask-based-python-web-services/#respond Thu, 12 Dec 2024 10:22:46 +0000 https://sefiks.com/?p=17480 In the world of web development, managing dependencies efficiently is crucial for building scalable and maintainable applications. One powerful design … More

The post A Minimalist Guide to Dependency Injection in Flask-based Python Web Services appeared first on Sefik Ilkin Serengil.

]]>
In the world of web development, managing dependencies efficiently is crucial for building scalable and maintainable applications. One powerful design pattern that helps achieve this is Dependency Injection (DI). DI simplifies the management of components and services in your application, making it easier to replace or extend functionality without modifying existing code. In Python web services, particularly those built with Flask, Dependency Injection can often seem like an unnecessary complexity. However, when used correctly, DI helps keep your codebase clean, modular, and testable. This is particularly important as your application grows, requiring more services and components to interact with each other. In this guide, we’ll explore how to implement a minimalist version of Dependency Injection in Flask-based Python web services. We’ll walk through the process of injecting classes and variables into your Flask application in a simple yet effective way, avoiding the overhead of heavyweight DI frameworks. By the end of this post, you’ll have the tools to build more maintainable and flexible Flask web services using this essential pattern.

Close-up of Ferrari Engine Parts in Detail From Pexels

Use Case

In traditional Flask-based Python web services, managing configuration variables and classes like a logger can quickly become repetitive and cumbersome. Without Dependency Injection (DI), we would need to initialize these components manually in each module where they are used.

Let’s look at a simple example:

  • In the absence of DI, we would have to call os.getenv() for every variable we need in every module.
  • Likewise, every module would manually initialize a Logger class, ensuring consistency and configuration in each instance.

Instead, by using DI, we can simplify this process. We initialize the environment variables and the logger class once—typically in a container—and then inject them into the relevant services or modules. This eliminates the need to manually call os.getenv() or create new logger instances throughout the codebase.  This reduces redundancy and improves maintainability by centralizing the configuration and logger initialization in one place, making our code cleaner and more modular.

Dependent Class

First, create a logger.py file under the src/commons directory. During initialization, it will accept a log level as an integer and store it. Then, in each logging method, it will check whether the action meets the required level based on the log level provided during initialization.

import logging
from datetime import datetime

# pylint: disable=broad-except
class Logger:
    def __init__(self, log_level: int):
        self.log_level = log_level

        if self.log_level == logging.DEBUG:
            logging.basicConfig(level=logging.DEBUG)

    def info(self, message):
        if self.log_level <= logging.INFO:
            self.dump_log(f"{message}")

    def debug(self, message):
        if self.log_level <= logging.DEBUG:
            self.dump_log(f"{message}")

    def warn(self, message):
        if self.log_level <= logging.WARNING:
            self.dump_log(f"{message}")

    def error(self, message):
        if self.log_level <= logging.ERROR:
            self.dump_log(f"{message}")

    def critical(self, message):
        if self.log_level <= logging.CRITICAL:
            self.dump_log(f"{message}")

    def dump_log(self, message):
        print(f"{str(datetime.now())[2:-7]} - {message}")

Next, create a variables.py file under the src/dependencies directory. This file will be responsible for loading the necessary variables from the environment.

# built-in dependencies
import os

# 3rd party dependencies
from dotenv import load_dotenv
load_dotenv() # load env vars from .env file

class Variables:
    def __init__(self):
       self.log_level = int(os.getenv("LOG_LEVEL", "20"))

Third, create a container.py file under the src/dependencies directory. This file will be responsible for loading the necessary classes and modules, which will be injected into your services later.

# project dependencies
from modules.core.service import CoreService
from dependencies.variables import Variables
from commons.logger import Logger

class Container:
    def __init__(self, variables: Variables):
       self.variables = variables
       
       # initialize logger
       logger = Logger(log_level = variables.log_level)

       # initialize core service
       self.core_service = CoreService(logger = logger)

As you can see, the Logger class is initialized in the container.py file, while the log level information is passed from the variables. This log level is retrieved from the environment variables, ensuring that the logger is configured based on the environment-specific settings. We also stored the variables in our container.

Next, create the service functionalities in the src/modules/core/service.py file. This service will receive the logger during its initialization, and its welcome method will log a “homepage called” message using it.

# project dependencies
from commons.logger import Logger

class CoreService:
    def __init__(self, logger: Logger):
        self.logger = logger

    def welcome(self):
        self.logger.info("homepage called")
        return "Welcome Home"

Now, let’s create the service endpoints in the src/modules/core/routes.py file.

# 3rd party dependencies
from flask import Blueprint, request

# project dependencies
from dependencies.container import Container

# initializations
blueprint = Blueprint("routes", __name__)

@blueprint.route("/")
def home():
   # container was initialized in app initialization once, inject it here
   container: Container = blueprint.container
   return container.core_service.welcome()

Here, we retrieved the container from the blueprint, but we haven’t stored the container in the blueprint yet. We will do this while creating the app.

Let’s create the application in the src/app.py file. As you can see, we first create the variables object, which loads the necessary variables from the environment. Then, we initialize the container while passing the variables to it. Finally, we store the container into the blueprint, completing the dependency injection process.

# 3rd party dependencies
from flask import Flask
from flask_cors import CORS

# project dependencies
from modules.core.routes import blueprint as core_blueprint
from dependencies.variables import Variables
from dependencies.container import Container

def create_app():
    app = Flask(__name__)
    CORS(app)

    core_blueprint.container = Container(variables=Variables())

    app.register_blueprint(core_blueprint)
    return app

To run our service, we will run the following shell script. This will bring the application up on localhost’s port 5000.

cd src
gunicorn --workers=8 --timeout=3600 --bind=0.0.0.0:5000 "app:create_app()"

When the localhost:5000/ endpoint is called, it will log the message using the logger we injected.

More Dependencies

As the project grows and we need to log messages in different modules, we won’t need to initialize the logger again. Instead, we will simply use it from the injected container, ensuring consistent logging across the entire application. Let’s expand the project to understand better.

Create a healthcheck service as service.py under src/modules/health. In this service, we will log a “Healthcheck service called” message using the logger available in the container. Note that the logger was initialized in the application and injected into the container, making it accessible here.

# project dependencies
from commons.logger import Logger

class HealthService:
    def __init__(self, logger: Logger):
        self.logger = logger

    def is_healthy(self):
        self.logger.info("healthcheck service called")
        return "Health check: OK", 200

Similar to the injection of the logger into the core service, we will initialize the health service once in our container and inject the logger into it here.

# project dependencies
from modules.core.service import CoreService
from modules.health.service import HealthService
from dependencies.variables import Variables
from commons.logger import Logger

class Container:
    def __init__(self, variables: Variables):
       self.variables = variables
       
       # initialize logger
       logger = Logger(log_level = variables.log_level)

       # initialize core service
       self.core_service = CoreService(logger = logger)

       # initialize health service
       self.health_service = HealthService(logger = logger)

Now, create its endpoint under src/modules/health/routes.py. The health service was already initialized in our container, with the logger passed to it. So, in the routes we will use the health service from the container instead of initializing it from scratch.

# 3rd party dependencies
from flask import Blueprint, request

# project dependencies
from dependencies.container import Container

# initializations
blueprint = Blueprint("health_routes", __name__)

@blueprint.route("/health")
def is_healthy():
   # container was initialized in app initialization once, inject it here
   container: Container = blueprint.container
   return container.health_service.is_healthy()

Finally, we will add health check endpoint’s blueprint into our existing application at src/app.py.

# 3rd party dependencies
from flask import Flask
from flask_cors import CORS

# project dependencies
from modules.core.routes import blueprint as core_blueprint
from modules.health.routes import blueprint as health_blueprint
from dependencies.variables import Variables
from dependencies.container import Container

def create_app():
    app = Flask(__name__)
    CORS(app)

    container = Container(variables=Variables())
    core_blueprint.container = container
    health_blueprint.container = container

    app.register_blueprint(core_blueprint)
    app.register_blueprint(health_blueprint)

    return app

As you can see, the Logger class is initialized once in container.py, but it is used in both src/modules/core/service.py and src/modules/health/service.py. This is possible because we injected the initialized container into our application, making the logger accessible throughout the application. Similarly, we can load the necessary variables from the environment variables once in this manner, making them accessible throughout the application without needing to load them repeatedly in each module.

Although we’ve demonstrated a basic implementation of dependency injection, it can be easily expanded to suit more complex needs. You can use the framework outlined in this post as a foundation for your own project, adapting it as necessary.
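For example, one practical payoff of this wiring is testability: because CoreService receives its logger instead of creating it, a test can pass in a fake logger and assert on what was logged. The FakeLogger below is a hypothetical test double, and CoreService is repeated in the same shape as above so the snippet is self-contained.

```python
class FakeLogger:
    """A test double that records messages instead of printing them."""
    def __init__(self):
        self.messages = []

    def info(self, message):
        self.messages.append(message)

class CoreService:
    def __init__(self, logger):
        self.logger = logger

    def welcome(self):
        self.logger.info("homepage called")
        return "Welcome Home"

fake = FakeLogger()
service = CoreService(logger=fake)

assert service.welcome() == "Welcome Home"
assert fake.messages == ["homepage called"]
```

No environment variables, no container, no Flask app is needed to exercise the service — the dependency is simply swapped at construction time.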

Conclusion

In conclusion, by implementing Dependency Injection (DI) in our Flask-based Python web service, we’ve simplified the process of managing environment variables and logging. Through DI, we centralized the initialization of crucial components like the logger and environment variables, making them easily accessible throughout the application. As the project expands, this approach ensures maintainability and scalability, as we no longer need to repeatedly initialize or configure these components in every module. DI helps streamline the architecture, reduce redundancy, and foster cleaner, more modular code—making it an essential practice for larger applications.

The post A Minimalist Guide to Dependency Injection in Flask-based Python Web Services appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2024/12/12/a-minimalist-guide-to-dependency-injection-in-flask-based-python-web-services/feed/ 0 17480
Face Anti-Spoofing for Facial Recognition In Python https://sefiks.com/2024/06/08/face-anti-spoofing-for-facial-recognition-in-python/ https://sefiks.com/2024/06/08/face-anti-spoofing-for-facial-recognition-in-python/#respond Sat, 08 Jun 2024 18:52:03 +0000 https://sefiks.com/?p=17171 In today’s world, facial recognition technology is widely being used for security purposes, but it comes with some vulnerabilities. Imagine … More

The post Face Anti-Spoofing for Facial Recognition In Python appeared first on Sefik Ilkin Serengil.

]]>
In today’s world, facial recognition technology is widely used for security purposes, but it comes with some vulnerabilities. Imagine having a facial recognition-based entrance system, where someone could gain access to your building using a printed photo of an authorized person. This is where face anti-spoofing or liveness detection comes into play. In this blog post, we’ll explore how to implement these security measures using the DeepFace library in Python. We’ll cover real-time analysis, face detection, face verification, face recognition, and even analyzing facial attributes like age, gender, emotion, and race. With just a few lines of code, you can enhance the security of your facial recognition systems and protect against spoof attacks.

Face Anti-Spoofing Test

Vlog

You can either continue to read this tutorial or watch the following video tutorial. They both give hands-on coverage of face anti-spoofing for facial recognition pipelines with DeepFace in Python.

Pre-trained Model

DeepFace utilizes MiniVision’s Silent Face Anti-Spoofing models in the background, which are licensed under the Apache License 2.0. This means the models are free to use for both private and commercial purposes. Built on the PyTorch framework, the model functions as a classification system, expecting 80×80 RGB images as input. The implementation involves feeding these inputs into the MiniFASNetV1 and MiniFASNetV2 models and summing the probabilities in the classification layer. Based on the dominant probability, the system determines whether the input image is real or spoofed.

MiniFASNet Backbone

Of course, you don’t need to understand the details of the backbone model. You can easily use it through DeepFace with just a single line of code!
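Still, for intuition, the decision step described above — summing the class probabilities of the two MiniFASNet models and picking the dominant class — can be sketched schematically. This is an illustration, not DeepFace's actual implementation; the two-class layout and class order here are assumptions for the sake of the example.

```python
def is_real_face(probs_v1, probs_v2):
    # each argument: class probabilities from one model, simplified
    # here to [p_real, p_spoof] for illustration
    p_real = probs_v1[0] + probs_v2[0]
    p_spoof = probs_v1[1] + probs_v2[1]
    # the dominant summed probability decides real vs spoof
    return p_real > p_spoof

# the summed "real" probability dominates -> treated as a real face
assert is_real_face([0.7, 0.3], [0.6, 0.4]) is True
# the summed "spoof" probability dominates -> treated as a spoof attack
assert is_real_face([0.2, 0.8], [0.4, 0.6]) is False
```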

Real Time Spoofing Test

You can perform spoofing tests using DeepFace’s real-time analysis module. This module supports both facial recognition and facial attribute analysis, including age, gender, emotion, and race prediction. In this experiment, I will disable facial attribute analysis and just perform facial recognition. By default, it can detect and analyze faces in real-time through the webcam, distinguishing between real and spoofed ones.

The Stream function simply requires the database path, which is the exact location of the folder containing your facial images. For this experiment, I put the following images of myself and Zuckerberg in the target database folder.

Facial Database
from deepface import DeepFace
DeepFace.stream(
   db_path = "/Users/sefik/Desktop/db",
   enable_face_analysis=False
)
Default Behavior of Real Time Analysis with DeepFace

As you can see from the image above, the Stream function can detect faces from the webcam and identify them from our database. However, even a printed image of Mark’s face can still be identified. Imagine implementing this technology at the entrance of Meta—someone could use a picture of Mark to gain access! This poses a significant security risk.

To address this security issue, you simply need to set the anti-spoofing argument to True in the Stream function.

from deepface import DeepFace
DeepFace.stream(
   db_path = '/Users/sefik/Desktop/db',
   enable_face_analysis=False,
   anti_spoofing = True,
)
Anti-Spoofing Enabled Real Time Analysis

Once anti-spoofing is enabled, the system can still detect faces belonging to both myself and Mark. However, it distinguishes between them by highlighting my face in green and Mark’s face in red. This indicates that a spoof attack has been detected in Mark’s face. With this added layer of security, individuals attempting to gain access using a printed image of Mark’s face will no longer be able to enter Meta’s building.

It’s important to note that the Stream function of DeepFace is designed solely for demonstration purposes and does not store analysis results anywhere. However, you can still utilize the anti-spoofing module within facial recognition pipelines using DeepFace, enhancing security in real-world applications.

Real Time Spoofing In The Browser

Meanwhile, you can run face verification tasks on images captured from your webcam in real time in your browser, with its custom UI built with ReactJS.

Face Detection

Similar to the stream function, you have the option to set the anti_spoofing argument to True in the extract faces function. This will result in an additional key being returned in the response, containing is_real and antispoofing_score keys. By checking the is_real key, you can determine the authenticity of the extracted faces.

face_objs = DeepFace.extract_faces(img_path=img_path, anti_spoofing = True)
assert face_objs[0].get("is_real") is True, "spoof attack detected"

Face Verification

To enable liveness testing in face verification, you can set the optional anti_spoofing argument to True in the verify function. Ensure that the exact image paths of the image pairs are provided as inputs to the function for accurate verification results.

DeepFace.verify(
   img1_path="dataset/img1.jpg",
   img2_path="dataset/img2.jpg",
   anti_spoofing = True
)

In the event that a spoof is detected in one of the image pairs, the function will raise a ValueError with the message “Spoof detected in given image.” This exception serves as an alert for potential security breaches during face verification.

Face Recognition

While searching for an identity in a database, the find function of DeepFace fulfills this task. By setting the anti_spoofing argument to True in the find function, it will verify whether the given image path is real or fake. It’s important to note that this process does not conduct spoofing tests on the items within your database; rather, it specifically evaluates the authenticity of the provided image path for the searched identity.

DeepFace.find(
   img_path="dataset/img1.jpg",
   db_path="/Users/sefik/Desktop/db",
   anti_spoofing=True
)

Similar to the verify function, if the given image is not real, the find function will raise a ValueError with the message “Spoof detected in the given image.” This exception serves as an indication of potential spoofing attempts during the identity search process.

Facial Attribute Analysis

Finally, spoof attacks can also be tested in age, gender, emotion, and race & ethnicity prediction tasks, similar to other functionalities within DeepFace. This comprehensive approach ensures that the system remains robust and reliable across various facial recognition tasks, safeguarding against potential security breaches posed by spoofing attempts.

DeepFace.analyze(
   img_path="dataset/img1.jpg",
   anti_spoofing = True
)

In the event that the given image is fake, the system will raise a ValueError with the message “Spoof detected in the given image.” This serves as a crucial alert, helping to prevent potential security breaches caused by spoofing attempts.

API

DeepFace comes with a restful web service for verification, represent and analysis tasks. You can set the optional anti_spoofing boolean argument to true if you want to perform anti-spoofing analysis in your API calls.

Also, you can use the containerized deepface service as well! However, PyTorch is an optional dependency, so you have to customize the Dockerfile and install this package if you want to use that feature.

React JS Based UI

DeepFace also has a UI built with React JS for real time analysis purposes.

Anti-spoofing feature is available in the facial recognition ui built with React JS and DeepFace

Liveness detection is also available in the facial attribute analysis ui with React JS and DeepFace

Conclusion

In conclusion, implementing face anti-spoofing techniques is crucial for enhancing the security of facial recognition systems. With the DeepFace library in Python, you can easily perform real-time analysis to detect and prevent spoof attacks, ensuring that only authorized individuals gain access. By incorporating features such as face detection, verification, recognition, and attribute analysis, you can build a robust and reliable security system. Whether you’re using facial recognition for building access, personal devices, or other applications, these measures will help protect against unauthorized entry and maintain the integrity of your security infrastructure. Start enhancing your facial recognition systems today with DeepFace and take a significant step towards a more secure future.

The post Face Anti-Spoofing for Facial Recognition In Python appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2024/06/08/face-anti-spoofing-for-facial-recognition-in-python/feed/ 0 17171
A Step by Step Approximate Nearest Neighbor Example In Python From Scratch https://sefiks.com/2023/12/31/a-step-by-step-approximate-nearest-neighbor-example-in-python-from-scratch/ https://sefiks.com/2023/12/31/a-step-by-step-approximate-nearest-neighbor-example-in-python-from-scratch/#respond Sun, 31 Dec 2023 09:22:24 +0000 https://sefiks.com/?p=16348 Nearest neighbor search is a fundamental problem for vector models in machine learning. It involves finding the closest vectors to … More

The post A Step by Step Approximate Nearest Neighbor Example In Python From Scratch appeared first on Sefik Ilkin Serengil.

]]>
Nearest neighbor search is a fundamental problem for vector models in machine learning. It involves finding the closest vectors to a given query vector, and has many applications, including facial recognition, natural language processing, recommendation systems, image retrieval, and anomaly detection. However, for large datasets, exact nearest neighbor (k-NN) search can be prohibitively slow. That’s where approximate nearest neighbor (ANN) search comes in. ANN algorithms trade off some degree of accuracy for faster query times, making them a useful tool for large-scale data analysis. In this blog post, we’ll provide a gentle introduction to ANN in Python, covering the math behind the algorithm as well as a Python implementation from scratch. By the end of this post, you’ll have a solid understanding of how to perform approximate nearest neighbor search and billion-scale fast vector similarity search in milliseconds!

Assorted-color Wall Paint House Photo by pexels

Vlog

You can either continue to follow this tutorial or watch the following video. They both cover the math behind approximate nearest neighbor and its implementation in python from scratch.

Popular Approximate Nearest Neighbour Libraries

There are several popular libraries available for approximate nearest neighbor search in Python, including Spotify’s Annoy and Facebook’s Faiss. These libraries implement state-of-the-art algorithms for approximate nearest neighbor search, making it easy for developers to perform efficient and scalable nearest neighbor search without requiring in-depth knowledge of the underlying processes. Annoy is a popular choice for tasks that involve indexing high-dimensional data, while Faiss is well-suited for applications that require extremely fast query times and support for large datasets. Both libraries offer a range of features, including support for different distance metrics, query types, and indexing methods. Developers can choose the library that best suits their needs based on their specific use case and dataset characteristics.

While libraries such as Annoy and Faiss are excellent choices for approximate nearest neighbor search in Python, implementing the algorithm from scratch will give you a deeper understanding of how it works. This approach can be useful for debugging, experimenting with different algorithms, and building custom solutions. It’s important to note that implementing ANN from scratch can be time-consuming, and may not always be the most efficient or practical solution. Additionally, it’s worth considering that the popular libraries have been developed and tested by teams of experts, and offer a range of advanced features and optimizations. However, for those who believe in the mantra “code wins arguments,” implementing ANN from scratch can be a valuable exercise in understanding the underlying concepts and algorithms.

I strongly recommend reading this blog post by Erik Bernhardsson, the author of the Annoy library, about how tree-based approximate nearest neighbor search works. It helped me a lot in understanding the algorithm, and afterwards I decided to implement it from scratch.

Data set

In this study, we are going to use 2-dimensional vectors because we can visualize them. However, vectors are multi-dimensional in the real world. For instance, FaceNet produces 128-dimensional vectors and VGG-Face produces 2622-dimensional vectors.

import random

import numpy as np
import matplotlib.pyplot as plt

dimensions = 2
num_instances = 100

# generating the data set
vectors = []
for i in range(0, num_instances):
    x = round(100 * random.random(), 2)
    y = round(100 * random.random(), 2)
    
    vectors.append(np.array([x, y]))

# visualize the data set
fig = plt.figure(figsize=(10, 10))
for x, y in vectors:
    plt.scatter(x, y, c = 'black', marker = 'x')
plt.show()

This will generate 100 2-dimensional vectors.

Dataset

Data Structure

We are going to split the space in half according to two randomly selected vectors. The line equidistant from these two vectors will be stored in a decision tree as a decision rule. Then, we are going to distribute all vectors according to whether they lie on the left or right of that line. Thereafter, we are going to split the space of each subset of vectors recursively.

We can split the space with a line if the vectors are 2-dimensional. But if vectors are n-dimensional, we will be able to split the space with a hyperplane.

So, we can use the following class to construct our tree. Each node of our tree will be an instance of the Node class, connected to its parent node's left or right attribute. The hyperplane attribute will store the equation of the hyperplane splitting the space, value will store the vectors at that level, and instances will store the number of vectors at that level.

class Node:
    def __init__(self, hyperplane = None, value = None, id = None, instances = 0):
        self.left = None
        self.right = None
        self.hyperplane = hyperplane
        self.value = value
        self.id = id
        self.instances = instances

Hyperplane

The hyperplane parameter in my Node class will be a list. Suppose that its value is

[1, 2, 3, 4]

Then the equation of hyperplane will be

x + 2y + 3z = 4
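As a quick sanity check of this encoding, the first entries of the list are the normal vector coefficients and the last entry is the constant term; the point below is a hypothetical example chosen to satisfy the equation:

```python
# the list [1, 2, 3, 4] encodes the hyperplane x + 2y + 3z = 4:
# the first entries are the normal vector, the last entry is the constant term
hyperplane = [1, 2, 3, 4]
normal, constant = hyperplane[:-1], hyperplane[-1]

point = [4, 0, 0]  # satisfies x + 2y + 3z = 4
assert sum(n * c for n, c in zip(normal, point)) == constant
```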

Finding the hyperplane

To determine the hyperplane equidistant from two given n-dimensional vectors, first calculate the midpoint of the two vectors by averaging their corresponding components. Next, find the direction vector pointing from the first vector to the second vector. Normalize this vector to obtain the unit vector in the same direction. This unit vector serves as the normal vector for the hyperplane. With the midpoint and the normal vector in hand, calculate the distance between the origin and the hyperplane using the dot product. Finally, formulate the equation of the hyperplane by combining the normal vector components and the distance term. This method ensures the hyperplane is equidistant from the two given vectors in the n-dimensional space.

def find_hyperplane(v1, v2):
    '''
    finds the hyperplane equidistant from
    two given n-dimensional vectors v1 and v2
    '''
    # find the midpoint of two vectors
    midpoint = (v1 + v2) / 2
    
    # find the direction vector from v1 to v2
    direction_vector = v2 - v1
    
    # find the unit vector of the direction vector
    unit_vector = direction_vector / np.linalg.norm(direction_vector)
    
    # define a normal vector to the hyperplane
    normal_vector = unit_vector
    
    # calculate the distance between midpoint and the hyperplane
    distance = np.dot(midpoint, normal_vector)
    
    # define the equation of the hyperplane
    hyperplane = np.concatenate((normal_vector, [distance]))
    
    return hyperplane

Decide a vector is on the left or right of a hyperplane

To determine whether a vector lies on the left or right side of a given hyperplane, one can calculate the signed distance from the vector to the hyperplane. The signed distance is obtained by taking the dot product of the vector and the normal vector of the hyperplane, subtracted by the hyperplane’s constant term. If the resulting signed distance is negative, the vector is considered to be on the left side of the hyperplane. Conversely, if the signed distance is positive, the vector is on the right side. In cases where the signed distance is exactly zero, indicating that the vector lies on the hyperplane, conventionally, it is still treated as being on the right side. This convention helps maintain consistency in determining the orientation relative to the hyperplane.

def is_on_left(v, hyperplane):
    # calculate the signed distance from v to the hyperplane
    signed_distance = np.dot(hyperplane[:-1], v) - hyperplane[-1]
    
    if signed_distance < 0:
        return True
    else:
        return False
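To see both helpers working together, here is a small self-contained check with two 2-dimensional vectors (the helper definitions from above are repeated so the snippet runs on its own): splitting [0, 0] and [2, 0] yields the vertical line x = 1, and each vector falls on its own side.

```python
import numpy as np

def find_hyperplane(v1, v2):
    midpoint = (v1 + v2) / 2
    normal = (v2 - v1) / np.linalg.norm(v2 - v1)
    return np.concatenate((normal, [np.dot(midpoint, normal)]))

def is_on_left(v, hyperplane):
    return np.dot(hyperplane[:-1], v) - hyperplane[-1] < 0

v1, v2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
h = find_hyperplane(v1, v2)  # the vertical line x = 1

assert is_on_left(v1, h)      # [0, 0] lies on the left of x = 1
assert not is_on_left(v2, h)  # [2, 0] lies on the right
```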

Splitting the space recursively

In the process of constructing the entire tree, the split nodes function is pivotal. Within a set of vectors, the algorithm selects two random points and determines the hyperplane that is equidistant from these chosen vectors. Subsequently, the vectors are divided into left and right nodes based on their respective positions relative to the identified hyperplane. To facilitate the recursive building of the tree, a Node class is created, and its left nodes are populated by invoking the split nodes function recursively. The same recursive approach is employed to set the right nodes of the Node class. This recursive partitioning continues until the number of vectors within a given level falls below or equals a predefined threshold, which is set at 5 in this particular experiment. This stepwise process ensures the systematic construction of the tree, with nodes being recursively split until the specified threshold is met.

subset_size = 5   # stop splitting when a node holds this many vectors or fewer
hyperplanes = []  # keeps every split's hyperplane, e.g. for visualization

def split_nodes(vectors, ids = None):
    if ids is None:
        ids = [*range(0, len(vectors))]
    
    # pick two random points
    point_1st_idx = 0; point_2nd_idx = 0
    
    while point_1st_idx == point_2nd_idx:
        point_1st_idx = random.randint(0, len(vectors) - 1)
        point_2nd_idx = random.randint(0, len(vectors) - 1)
    
    v1 = vectors[point_1st_idx]
    v2 = vectors[point_2nd_idx]
    
    # find the hyperplane equidistant from those two vectors
    hyperplane = find_hyperplane(v1, v2)
    hyperplanes.append(hyperplane)
    
    # split vectors into left and right nodes
    left_nodes = []
    right_nodes = []
    
    left_ids = []
    right_ids = []
    
    for idx, vector in enumerate(vectors):
        is_left_node = is_on_left(v=vector, hyperplane=hyperplane)
        
        if is_left_node is True:
            left_nodes.append(vector)
            left_ids.append(ids[idx])
        else:
            right_nodes.append(vector)
            right_ids.append(ids[idx])

    assert len(left_nodes) + len(right_nodes) == len(vectors)
    
    current_node = Node(
        hyperplane=hyperplane,
        value=vectors,
        id=ids,
        instances=len(vectors)
    )
    
    if len(left_nodes) > subset_size:
        current_node.left = split_nodes(
            vectors=left_nodes,
            ids=left_ids
        )
    else:
        current_node.left = Node(
            value=left_nodes,
            id=left_ids,
            instances=len(left_nodes)
        )
    
    if len(right_nodes) > subset_size:
        current_node.right = split_nodes(
            vectors=right_nodes,
            ids=right_ids
        )
    else:
        current_node.right = Node(
            value=right_nodes,
            id=right_ids,
            instances=len(right_nodes)
        )
    
    return current_node

Once our recursive split nodes function is ready, we can construct our tree.

tree = split_nodes(vectors)

Search

Now, we can use the built tree to find the nearest neighbours.

# search nearest neighbours to this vector
v = [50, 50]

# find k nearest neighbours
k = 3

node = tree
while node.instances >= k and node.hyperplane is not None:
   parent = node
   if is_on_left(v, node.hyperplane) is True:
      node = node.left
   else:
      node = node.right

print(f'nearest neighbor vectors: {node.value}')

Visualization

Once we have built our tree, we can find the 5 nearest neighbors of a given vector as shown below. The red colored point is the target vector, whereas the blue colored ones are the nearest neighbors.

Nearest Neighbor Results

Training Steps

We get interesting visualizations when we plot the tree building steps. The two randomly selected vectors are shown with red x markers. The hyperplane equidistant from them is shown with a red line for the current iteration, whereas previous iterations' hyperplanes are shown with black lines. Also, the equation of the current iteration's hyperplane (red line) is shown at the top of the graph. To sum up, the tree can be constructed in 28 steps. However, please note that you should build this tree offline.

Step 1
Step 2
Step 3
Step 28

Random Forest

By the nature of the algorithm, we picked 2 random points in each iteration. You may consider building many trees, as in the random forest algorithm, to obtain a more robust structure and avoid depending on a lucky or unlucky random point selection.

Time complexity

Suppose that we have n vectors in our data set. To find the k nearest neighbors of a given vector, we first need to compute the distance from the given vector to every vector in the dataset. That requires n calculations; in other words, the complexity of this part is O(n), where n is the number of instances in our dataset. Then, we need to sort those n values. Python's built-in sorting functionality uses Timsort, whose complexity is O(n log n). To sum up, we need to perform O(n) + O(n log n) operations in total.

In our experiment, we had 100 instances. So, taking logarithms base 10 for simplicity, we have to perform about 300 operations to find the k nearest neighbours with exact nearest neighbor search.

O(100) + O(100 × log 100) = O(100) + O(100 × 2) = O(100) + O(200) = 300

On the other hand, once the tree is built, you can list the 5 nearest neighbors in just 4 steps. That is 75x faster even for a small dataset. Of course, with the exact nearest neighbor approach you do not have to build a tree; approximate nearest neighbor comes with extra space complexity, but most of the time it is worth it!

root: 100 instances
├── go to left: 47 instances
│   ├── go to left: 28 instances
│   │   ├──go to left: 11 instances
│   │   │   ├──go to left: 5 instances
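For comparison, the brute-force exact search described above can be sketched in a few lines of numpy: compute all n distances, then sort. The data here is randomly generated for illustration, so only the shape of the result is checked:

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((100, 2)) * 100  # 100 random 2-dimensional vectors

def exact_knn(query, vectors, k):
    # O(n) distance computations followed by an O(n log n) sort
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]

neighbors = exact_knn(np.array([50, 50]), vectors, k=5)
assert len(neighbors) == 5
```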

Conclusion

In this blog post, we’ve covered the basics of approximate nearest neighbor search in Python, including the mathematical concepts behind the algorithm and a Python implementation from scratch. We’ve also introduced two popular libraries for ANN – Spotify’s Annoy and Facebook’s Faiss – and discussed their strengths and weaknesses. ANN algorithms are a powerful tool for large-scale data analysis, allowing us to trade off some degree of accuracy for faster query times. However, it’s important to keep in mind that the level of approximation depends on a heuristic approach, and we will not always get the exact nearest neighbours. With the knowledge and tools presented in this blog post, you should be well on your way to performing efficient nearest neighbor search in your Python-based data analysis projects.

I pushed the source code of this study to GitHub. If you do like this work, please star⭐ its repo.

The post A Step by Step Approximate Nearest Neighbor Example In Python From Scratch appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2023/12/31/a-step-by-step-approximate-nearest-neighbor-example-in-python-from-scratch/feed/ 0 16348
Digital Signature Algorithm (DSA) In Python From Scratch https://sefiks.com/2023/06/14/digital-signature-algorithm-dsa-in-python-from-scratch/ https://sefiks.com/2023/06/14/digital-signature-algorithm-dsa-in-python-from-scratch/#respond Wed, 14 Jun 2023 20:38:52 +0000 https://sefiks.com/?p=16589 In today’s digital landscape, where memes, emojis, and cat videos reign supreme, it’s easy to overlook the serious business of … More

The post Digital Signature Algorithm (DSA) In Python From Scratch appeared first on Sefik Ilkin Serengil.

]]>
In today’s digital landscape, where memes, emojis, and cat videos reign supreme, it’s easy to overlook the serious business of secure communication and data integrity. But fear not! We have a cryptographic superhero ready to save the day. Enter the Digital Signature Algorithm (DSA), the caped crusader of authenticity and integrity in the digital realm. DSA, the cool cousin of the ElGamal encryption scheme, takes cryptographic awesomeness to the next level. It’s like ElGamal’s bolder, more extroverted sibling, sporting a flashy suit of mathematical prowess. In this blog post, we’ll peel back the mask and reveal the inner workings of DSA, all while having a laugh or two along the way. So fasten your seatbelts, put on your Python coding cape, and get ready to embark on a thrilling adventure to master the art of creating and verifying digital signatures using the fantastic Digital Signature Algorithm (DSA). Let’s dive into the wonderful world of DSA and boldly go where no Python programmer has gone before!

Person Holding Black Android Smartphone by pexels

Vlog

The following video shows how to use DSA in python with just a few lines of code.

Dependent Prime Numbers

DSA requires generating two dependent prime numbers, p and q, where p minus 1 must be divisible by q without remainder.

(p – 1) mod q = 0

Generating large prime numbers is a costly operation. Luckily, the python pycryptodome package can generate these dependent prime pairs easily and relatively fast.

# !pip install pycryptodome
from Crypto.PublicKey import DSA
key = DSA.generate(3072)
p, q = key.p, key.q

Prime Modulo and Prime Divisor

This pair is going to be publicly known. Mostly, pre-generated prime numbers are used for p and q, so you do not always have to generate the pair yourself. We are going to use the following ones in this experiment. Here, p is a 3072-bit prime integer and q is a 256-bit prime integer.

# p, q = (283, 47)
# p, q = (1279, 71)

p = int ("\
368577123415647035185869509923454362988806654876528082212642441963073\
112178399735415809071414511788930869249295250430304540353013866431229\
965510814733779933506798365268561425933870873229773800684661220325186\
508845233129736449679530102708242450177182372322415658482081901982139\
935504459436526193127136706104380369832924830561868635645974615813718\
599034288471386879791087503489121436698353515121613823867525619537313\
836546517502082093400007321208415057847562620627644914725375992993318\
465393374569764496785505998125381607827118352697037326000376764847745\
255637988916261264753020692214535700561224725217079718071094435237402\
156088273408028838936890398130926616753252644546343571080376158118499\
400126944433056814392717271382689271187098581742948664096320444706415\
422463846704028520445125935059579157543820582424507879000158982185479\
411941493007828836744389091928984640165167590618063453847542820383591\
5397282804819083435616816897")

q = 65032841498903519040222055260781303700863228372896251521604890600319447022433

assert (p - 1) % q == 0
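The toy pairs in the comments above can also be found by hand: fix q and search for a prime of the form p = a·q + 1, which guarantees (p − 1) mod q = 0. This is a simplified sketch with a trial-division primality check, not the generator that pycryptodome actually uses:

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % i != 0 for i in range(2, int(n**0.5) + 1))

def find_dependent_prime(q, max_a=10_000):
    # search for a prime p of the form p = a*q + 1, so (p - 1) % q == 0
    for a in range(2, max_a):
        p = a * q + 1
        if is_prime(p):
            return p
    return None

p_toy = find_dependent_prime(47)
assert p_toy == 283 and (p_toy - 1) % 47 == 0
```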

Testing the primeness

A prime number is a number that is divisible only by 1 and itself, and it has no other factors. Once we set values for p and q, we should apply a primality test to them. If we cannot find any divisor from 2 up to the square root of the number, we can classify it as prime.

def is_prime(number):
    if number < 2:
        return False
    for i in range(2, int(number**0.5) + 1):
        if number % i == 0:
            return False
    return True

assert is_prime(p) is True
assert is_prime(q) is True

However, this will take a long time for large integers. On the other hand, the Miller-Rabin primality test can answer this question much faster. The sympy python package runs this algorithm behind a simple interface.

# !pip install sympy
from sympy import isprime

assert isprime(p) is True
assert isprime(q) is True
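For intuition, here is a minimal sketch of the Miller-Rabin test itself. Note that sympy's isprime combines this with other strategies, so treat this as illustrative rather than as sympy's implementation:

```python
import random

def miller_rabin(n, rounds=20):
    # probabilistic primality test: "composite" answers are certain,
    # "probably prime" answers are wrong with probability <= 4**(-rounds)
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # write n - 1 as 2^r * d with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

assert miller_rabin(283) and miller_rabin(47)
assert not miller_rabin(91)  # 7 * 13
```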

Generator

Similar to Diffie-Hellman, DSA requires a public generator value. As mentioned, the prime numbers p and q must be dependent: p minus 1 must be divisible by q without remainder, as

(p – 1) mod q = 0

This can be represented as

p – 1 = a * q

Now, we are going to generate a random integer h in the range [2, p-2] and find its a-th power in modulo p.

import random

a = int( (p - 1) // q )
h = random.randint(2, p-2)
g = pow(h, a, p)

assert g > 1

Notice that we used double slash characters while calculating a because p and q are large integers.

Fermat’s Little Theorem in DSA

We calculated generator g as

g = h^((p-1)/q) mod p

If we take the q-th power of both sides of the equation

g^q mod p = (h^((p-1)/q))^q mod p

The q terms on the right-hand side can be simplified because q appears in both the numerator and the denominator of the exponent.

g^q mod p = h^(p-1) mod p

Fermat’s little theorem states that the (p-1)-th power of anything must be equal to 1 in modulo p if p is prime.

g^q mod p = 1

This can be verified in the DSA scheme as

# Fermat's Little Theorem
assert pow(g, q, p) == 1

Generating private and public keys

Alice is going to generate a large random integer in the range [1, q-1].

# private key of Alice
x = random.randint(1, q-1)

Then, she is going to calculate her public key. This is going to be generator g to the power of her private key in modulo p.

# public key of Alice
y = pow(g, x, p)

Alice will keep her private key x secret and publish her public key y publicly. Meanwhile, arguments p, q, a and g will be publicly known as well.

Hashing

Alice first uses a hash function to digest the plain message. This hash function will be public and will be used by the recipient as well. Basically, the function digests a message and represents it as an integer.

import hashlib

def find_hash(m) -> int:
    
    if isinstance(m, int):
        m = str(m)
    
    # to bytes
    m = str.encode(m)
        
    hash_value = hashlib.sha1(m).digest()
    # Convert the hash value to an integer
    return int.from_bytes(hash_value, 'big') % q

message = "attack tomorrow!"
H = find_hash(message)

Signing a message

She is going to generate a random integer k for signing. Then, she will find r as generator g to the power of the random key k in modulo p and then modulo q; and s as the multiplicative inverse of the random key times the sum of the hash and her private key times r, in modulo q.

# random key
k = random.randint(1, q-1)

r = pow(g, k, p) % q
s = ( pow(k, -1, q) * (H + x * r) ) % q

assert r != 0 and s != 0

Please notice that the calculation of r requires finding generator g to the power of the random key k in modulo p and then in modulo q, respectively. We are going to rely on this nested modulo pattern while proving the DSA algorithm.

r = (g^k mod p) mod q

Finally, she is going to send plain message and (r, s) pair to Bob. Notice that her public key y, and public arguments p, q, a and g are known publicly as well.

Verification

Bob will use the same hash function to digest the incoming message. Secondly, he will find w as the multiplicative inverse of the s part of the signature in modulo q. Then, he will find u1 as the hash times w in modulo q, and u2 as the r part of the signature times w in modulo q. Thereafter, he calculates generator g to the power of u1 in modulo p, times Alice’s public key to the power of u2 in modulo p. Later, he reduces this result in modulo p and modulo q respectively. Finally, the result must be equal to the r part of the signature.

H = find_hash(message)
w = pow(s, -1, q)
u1 = (H * w) % q
u2 = (r * w) % q
v = ( ( pow(g, u1, p) * pow(y, u2, p) ) % p ) % q

assert v == r
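Putting the whole scheme together with the toy pair p = 283, q = 47 from the comments earlier (far too small for real security, but enough to watch signing and verification round-trip):

```python
import hashlib
import random

# toy parameters -- fine for illustration, hopelessly insecure in practice
p, q = 283, 47
assert (p - 1) % q == 0

# generator of the order-q subgroup
a = (p - 1) // q
g = 1
while g == 1:
    g = pow(random.randint(2, p - 2), a, p)

def find_hash(m: str) -> int:
    return int.from_bytes(hashlib.sha1(m.encode()).digest(), 'big') % q

# Alice's key pair
x = random.randint(1, q - 1)  # private key
y = pow(g, x, p)              # public key

# signing: retry until both signature parts are non-zero
message = "attack tomorrow!"
H = find_hash(message)
while True:
    k = random.randint(1, q - 1)
    r = pow(g, k, p) % q
    if r == 0:
        continue
    s = (pow(k, -1, q) * (H + x * r)) % q
    if s != 0:
        break

# verification on Bob's side
w = pow(s, -1, q)
u1, u2 = (H * w) % q, (r * w) % q
v = ((pow(g, u1, p) * pow(y, u2, p)) % p) % q
assert v == r
```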

Proof

Let’s remember how Alice calculated s

s = k^-1 * (H + x * r) mod q

If we find k from this equation

k = s^-1 * (H + x * r) mod q

Let’s distribute the multiplicative inverse of s over the terms in the parenthesis

k = H * s^-1 + x * r * s^-1 mod q

We represented multiplicative inverse of s as w

k = H * w + x * r * w mod q

We can lift this equation into the exponent of the same base g under modulo p. Please notice that we intentionally use modulo p here.

g^k mod p = g^(Hw + xrw mod q) mod p

According to the product rule of exponents, this can be re-arranged as

g^k mod p = (g^(Hw mod q) * g^(xrw mod q)) mod p

According to the power of power rule, this can be represented as

g^k mod p = (g^(Hw mod q) * (g^x)^(rw mod q)) mod p

Generator g to the power of Alice’s private key x is equal to her public key y

g^k mod p = (g^(Hw mod q) * y^(rw mod q)) mod p

Bob represented Hw mod q as u1 and rw mod q as u2

g^k mod p = (g^u1 * y^u2) mod p

Basically, Bob calculates generator g to the power of k in verification. Please notice that k is the random key generated by Alice and Bob does not know this value directly.

Meanwhile, Alice calculated r as generator g to the power of random key k in modulo p and modulo q respectively.

(g^k mod p) mod q = ((g^u1 * y^u2) mod p) mod q

r = ((g^u1 * y^u2) mod p) mod q

So, Bob’s calculation must always be equal to r!

Conclusion

In conclusion, we’ve embarked on a thrilling journey through the realm of cryptography, specifically exploring the Digital Signature Algorithm (DSA) in Python from scratch. We’ve witnessed the power of DSA as an extension of the ElGamal encryption scheme, providing us with the tools to create and verify digital signatures with confidence. Armed with our newfound knowledge, we can now navigate the digital landscape with a sense of security, knowing that our communications and data can be authenticated and protected from tampering. Remember, though, that this is just the beginning of your cryptographic adventures. There’s a whole world of algorithms and protocols waiting to be explored, each with its own unique quirks and applications. So, keep diving deeper, keep learning, and keep pushing the boundaries of your Python skills. The world of cryptography is ever-evolving, and you’re now equipped to be part of that exciting journey. May your digital signatures always be secure, and may your coding endeavors be filled with endless fun and discovery. Happy coding!

I pushed the source code of this study into GitHub. You can support this work if you star⭐ its repo.

The post Digital Signature Algorithm (DSA) In Python From Scratch appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2023/06/14/digital-signature-algorithm-dsa-in-python-from-scratch/feed/ 0 16589
Magic of Diffie-Hellman From A Programmer’s Perspective https://sefiks.com/2023/05/30/magic-of-diffie-hellman-from-a-programmers-perspective/ https://sefiks.com/2023/05/30/magic-of-diffie-hellman-from-a-programmers-perspective/#respond Tue, 30 May 2023 19:43:20 +0000 https://sefiks.com/?p=16544 Diffie-Hellman marked a pivotal moment in the history of cryptography. With the introduction of Diffie-Hellman, public key cryptography was born, … More

The post Magic of Diffie-Hellman From A Programmer’s Perspective appeared first on Sefik Ilkin Serengil.

]]>
Diffie-Hellman marked a pivotal moment in the history of cryptography. With the introduction of Diffie-Hellman, public key cryptography was born, forever changing the landscape of data security. This remarkable algorithm, devised by Whitfield Diffie and Martin Hellman in the 1970s, unlocked a world of possibilities for programmers and computer scientists, enabling secure communication, authentication, and data protection on an unprecedented scale. In this post, we will dive into the magic of Diffie-Hellman and explore its profound impact from a programmer’s perspective, implemented in the python programming language from scratch.

Monochrome Photography of Keys by pexels

Vlog

Join this video to see the magic of Diffie-Hellman key exchange algorithm from a python programmer’s perspective.

Public configuration of the algorithm

We will need a base generator g – possibly a small number, and a large prime modulus p – most likely around 1K bits. These values are going to be public for everyone. Let’s use the following values in this experiment.

# base generator
g = 17

# prime modulus
p = 158 * ( pow(2, 800) + 25 ) + 1

Generating private keys

Alice and Bob will each generate a large private key. I will use 1024-bit private key values in this experiment. As understood from the name, private keys must be kept secret.

# Alice's private key
a = random.getrandbits(1024)

# Bob's private key
b = random.getrandbits(1024)

Notice that a 1024-bit Diffie-Hellman key is roughly equivalent to a 160-bit elliptic curve cryptography key. You can adopt Elliptic Curve Diffie-Hellman to exchange keys faster with smaller keys!

Public keys

Alice and Bob will calculate the base generator g to the power of their private keys to find their public keys.

# Alice's public key
# ga = pow(g, a) % p
ga = pow(g, a, p)

# Bob's public key
# gb = pow(g, b) % p
gb = pow(g, b, p)

Notice that I commented out the lines where we first find the base generator g to the power of the private keys and then reduce the final value modulo p. Instead, I used python’s built-in pow function with 3 arguments, which are base, exponent and modulus respectively. Even though they give the same result, the built-in pow function is much faster.

Besides, g to the power of a means multiplying g by itself a times.

g^a = g * g * g * … * g

Here, if you reduce the result modulo p after each multiplication, you will get the same result faster.

g^a mod p = ( ( (g * g mod p) * g mod p ) * g mod p ) * …

This is the magic of Diffie-Hellman from the programmer’s perspective! Moreover, python’s pow function implements this with the binary exponentiation technique, which is even faster. This is another piece of magic from the programmer’s perspective!
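The binary exponentiation (square-and-multiply) technique mentioned above can be sketched as follows; it reduces modulo p at every step and needs only O(log exponent) multiplications, matching the built-in three-argument pow:

```python
def power_mod(base, exponent, modulus):
    # square-and-multiply: walk the exponent's bits from least significant,
    # squaring the base for each bit and reducing modulo the modulus each step
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:  # current bit is set -> multiply into the result
            result = (result * base) % modulus
        base = (base * base) % modulus  # square for the next bit
        exponent >>= 1
    return result

assert power_mod(17, 12345, 1000003) == pow(17, 12345, 1000003)
```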

Discrete Logarithm Problem

Calculating a public key is very easy and fast if the base generator g, the private key a or b, and the prime modulus p are known. This calculation takes just milliseconds. On the other hand, finding a is computationally hard if Q, g and p are known. This is called the discrete logarithm problem, and Diffie-Hellman depends on the hardness of this problem.

Q = g^a mod p

An attacker would have to run a brute force attack to find the exponent a from this equation. To be honest, the attacker might spend more than the age of the universe to find this value!

i = 0
while True:
    i += 1
    if pow(g, i, p) == ga:
        print(f"a is {i}")
        break

Key exchange

Once Alice and Bob have calculated their public keys, they will publish them. So, Alice will know Bob’s public key gb, but she does not know Bob’s private key b. Similarly, Bob will know Alice’s public key, but he will not know her private key.

They will calculate the other side’s public key to the power of their own private key to calculate the shared key. Notice that this shared key must be kept secret.

# Alice's shared key
sa = pow(gb, a, p)

# Bob's shared key
sb = pow(ga, b, p)

assert sa == sb

Now, Alice and Bob can use this shared key as a key of a symmetric key encryption algorithm such as DES or AES to encrypt & decrypt messages.
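One common way to turn the (very large) shared integer into a fixed-size symmetric key is to hash it down; the derive_key helper below is a hypothetical illustration of this step, not part of the original post, and real protocols would use a standardized KDF such as HKDF rather than a bare hash:

```python
import hashlib
import random

# same public configuration as above
g = 17
p = 158 * (pow(2, 800) + 25) + 1

# private and public keys
a = random.getrandbits(1024)
b = random.getrandbits(1024)
ga, gb = pow(g, a, p), pow(g, b, p)

# both sides compute the same shared secret
sa = pow(gb, a, p)  # Alice's view
sb = pow(ga, b, p)  # Bob's view

def derive_key(shared_secret: int) -> bytes:
    # hypothetical derivation: hash the big integer down to 32 bytes,
    # which could then serve as an AES-256 key
    raw = shared_secret.to_bytes((shared_secret.bit_length() + 7) // 8, 'big')
    return hashlib.sha256(raw).digest()

assert derive_key(sa) == derive_key(sb)
```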

Math behind the key exchange

Alice calculated Bob’s public key to the power of her private key. Similarly, Bob calculated Alice’s public key to the power of his private key. This must be equal to the generator g to the power of multiplication of a and b.

sa = (gb)^a mod p

sb = (ga)^b mod p

Bob calculated his public key as generator g to the power of his private key b, and Alice calculated her public key as generator g to the power of her private key a. Let’s replace these calculations in the shared key calculation.

sa = (g^b)^a mod p

sb = (g^a)^b mod p

These must be equal to each other according to the power rule of exponent.

sa = sb = (g^b)^a = (g^a)^b = g^(a*b)

Let’s see the correctness of this in the implementation.

g_to_a_times_b = pow(g, a*b, p)

assert g_to_a_times_b == sa
assert g_to_a_times_b == sb

Man in the middle attack

Carol knows the ga and gb values because these are published publicly. If she multiplies them, she will get generator g to the power of the sum of a and b, according to the product rule of exponents. This is not equal to the shared key Alice and Bob calculated, which was generator g to the power of the multiplication of a and b.

g^a x g^b = g^(a+b)

Let’s see the correctness of this in the implementation.

g_to_a_plus_b = ( ga * gb ) % p

# this must be different than the shared key
assert g_to_a_plus_b != g_to_a_times_b

assert g_to_a_plus_b == pow(g, a+b, p)

Conclusion

In conclusion, the Diffie-Hellman key exchange stands as a testament to the power of innovative thinking and the transformative potential of cryptography. From its humble beginnings, this algorithm has evolved into a cornerstone of modern cybersecurity, enabling secure transactions, encrypted communications, and secure data storage across countless applications and platforms. As programmers, we have the privilege of harnessing the magic of Diffie-Hellman to create robust and secure systems that safeguard the sensitive information of users around the globe. By understanding the principles and inner workings of Diffie-Hellman, we can continue to push the boundaries of secure communication, protect against cyber threats, and contribute to the ever-evolving landscape of digital security. As we bid farewell to this exploration of Diffie-Hellman, let us carry its spirit of innovation, collaboration, and resilience forward as we embark on the next chapter of securing the digital world.

I pushed the source code of this study into GitHub. You can support this work if you star⭐ its repo.

The post Magic of Diffie-Hellman From A Programmer’s Perspective appeared first on Sefik Ilkin Serengil.

]]>
https://sefiks.com/2023/05/30/magic-of-diffie-hellman-from-a-programmers-perspective/feed/ 0 16544