<![CDATA[CDUser - Ignacio Van Droogenbroeck's Technical Blog]]>https://cduser.com/https://cduser.com/favicon.pngCDUser - Ignacio Van Droogenbroeck's Technical Bloghttps://cduser.com/Ghost 6.19Sat, 21 Mar 2026 07:10:56 GMT60<![CDATA[Train a Voice Authentication System in Python with Your Own Voice]]>https://cduser.com/train-a-voice-authentication-system-in-python-with-your-own-voice/68d2a85e9ecc58000115f34cTue, 23 Sep 2025 15:34:05 GMT

Something I’ve always loved in movies is biometric authentication systems: fingerprint scans, retina readers, voice recognition. You know, the kind of high-tech stuff you see in spy films like 007. I’m a sucker for that genre.

Just like we did in a previous post where we trained a model to recognize who was speaking, today we’re going one step further: we’re going to train a model with your own voice and build a simple voice-based authentication system.

If you want to read the other article, I leave that here:

Machine Learning for Voice Recognition: How To Create a Speaker Identification Model in Python
How to build a robust speaker recognition system with Python and PyTorch. This guide covers data preprocessing, model training, and feature extraction. Ideal for developers implementing voice recognition and speaker identification in machine learning projects.

So let’s get started; we’ve got a few fun things to cover. First, why even build something like this?

  • Because it’s fun. Machine Learning is always fun to me.
  • Because it’s useful. You can hook this up to a Raspberry Pi, a mic, and a relay to open your door, your toolbox, your fridge, or even your secret stash of snacks.
  • Because it’s practical. Imagine using it to unlock your computer, access a private folder, trigger a smart home routine, or authenticate access to a server or admin panel.
  • Because it teaches. It helps me explore what Machine Learning can actually do, and if you know me, you know I love automating things.

Getting Started

So, first things first… let’s break this down into four simple steps:

  1. Record your voice, saying a specific phrase (your “voice password”) multiple times.
  2. Extract audio features, we’ll use MFCCs to convert raw audio into something our model can understand.
  3. Train a machine learning model, so it can learn to recognize your voice and distinguish it from others.
  4. Run real-time voice authentication, speak into the mic, and if it matches your voice password, you’re in.

We won’t hook it up to a container or secure server just yet, but by the end, you’ll have a working voice-based authentication system ready to trigger any action you want next.

What do you think?

Let’s do it.

Step 1: Record Your Voice Password and Negative Samples

Alright, time to lay down the foundation: your voice password.

This is the phrase you’ll train the model to recognize. It could be anything:

“open sesame”, “let me in”, “nacho unlocks things”, or just “access”.

In this step, we’re going to:

  • Record yourself saying that phrase multiple times (to train the model properly).
  • Save each sample as a .wav file.
  • Organize them into folders for easy training later.

But here’s the twist: to teach the model what not-you sounds like, we’ll also record a second word (in my case, “banana”) as a negative class. That way, the model learns to distinguish your actual password from random or incorrect inputs.

Let’s get it done.

I’m using this simple Python script to record both sets of samples.

First, we’ll record the word “access” 10 times. Then, we’ll run it again and record “banana” into a different folder.

In the next step, we’ll feed all of that into our model for training. Comment and uncomment PHRASE and OUTPUT_DIR in the script depending on whether you’re recording your password or the negative sample.

# record_voice_password.py

import sounddevice as sd
from scipy.io.wavfile import write
import os

SAMPLE_RATE = 16000  # Hz
DURATION = 2  # seconds per sample
NUM_SAMPLES = 10
PHRASE = "access"  # your password phrase
OUTPUT_DIR = f"voice_samples/you_{PHRASE.replace(' ', '_')}"
#PHRASE = "banana"  # or whatever
#OUTPUT_DIR = f"voice_samples/other_access"


os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Let's record your voice saying: \"{PHRASE}\"")
print(f"You'll record it {NUM_SAMPLES} times. Get ready...")

for i in range(NUM_SAMPLES):
    input(f"\nPress Enter to record sample {i+1}/{NUM_SAMPLES}...")
    print("Recording...")
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype='int16')
    sd.wait()
    filename = f"{OUTPUT_DIR}/sample_{i}.wav"
    write(filename, SAMPLE_RATE, audio)
    print(f"Saved: {filename}")

print("\nAll done! You've recorded your voice samples.")

Tips:

  • Record in a quiet space.
  • Try to speak naturally, but clearly.
  • You can change PHRASE if you want a different password.
  • Want to test with another person later? Just record their samples in a different folder like voice_samples/other_access.

Running the program will look like this:

➜  auth-voice-system python3 record_voice_password.py

Let's record your voice saying: "access"
You'll record it 10 times. Get ready...

Press Enter to record sample 1/10...
Recording...
Saved: voice_samples/you_access/sample_0.wav

Press Enter to record sample 2/10...
Recording...
Saved: voice_samples/you_access/sample_1.wav

Press Enter to record sample 3/10...
Recording...
Saved: voice_samples/you_access/sample_2.wav

Press Enter to record sample 4/10...
Recording...
Saved: voice_samples/you_access/sample_3.wav

Press Enter to record sample 5/10...
Recording...
Saved: voice_samples/you_access/sample_4.wav

Press Enter to record sample 6/10...
Recording...
Saved: voice_samples/you_access/sample_5.wav

Press Enter to record sample 7/10...
Recording...
Saved: voice_samples/you_access/sample_6.wav

Press Enter to record sample 8/10...
Recording...
Saved: voice_samples/you_access/sample_7.wav

Press Enter to record sample 9/10...
Recording...
Saved: voice_samples/you_access/sample_8.wav

Press Enter to record sample 10/10...
Recording...
Saved: voice_samples/you_access/sample_9.wav

All done! You've recorded your voice samples.

To confirm everything was recorded, we can browse the voice_samples/you_access folder:

➜  auth-voice-system ll voice_samples/you_access 
total 1280
-rw-r--r--  1 nacho  staff    63K Sep 23 11:21 sample_0.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:21 sample_1.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_2.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_3.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_4.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_5.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_6.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_7.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_8.wav
-rw-r--r--  1 nacho  staff    63K Sep 23 11:22 sample_9.wav

Step 2: Extract Audio Features (MFCCs)

Now that you’ve recorded your voice password, we need to convert those .wav files into numerical features.

Why? Because machine learning models don’t understand sound waves directly, they need feature vectors that represent meaningful audio characteristics.

The go-to choice for voice recognition is MFCCs: Mel-Frequency Cepstral Coefficients. Think of them as a compact representation of how your voice sounds.
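One detail worth calling out: clips of different lengths produce MFCC matrices with different numbers of time frames, so we average over the time axis to get a fixed-length vector per clip. A minimal illustration, using random arrays as stand-ins for real MFCCs:

```python
import numpy as np

# Stand-ins for real MFCC matrices: 13 coefficients x T time frames,
# where T varies with clip length.
mfcc_short = np.random.randn(13, 50)   # shorter clip
mfcc_long = np.random.randn(13, 120)   # longer clip

# Averaging over the time axis collapses each matrix into a
# fixed-length 13-dim vector, regardless of clip duration.
vec_short = np.mean(mfcc_short.T, axis=0)
vec_long = np.mean(mfcc_long.T, axis=0)

print(vec_short.shape, vec_long.shape)  # both (13,)
```

This is why every sample, no matter how long, ends up as a comparable 13-number fingerprint the classifier can consume.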

What We’ll Do

  • Load your .wav files using librosa
  • Extract MFCCs from each file
  • Save those features along with their labels (your identity)
  • Prepare them for model training

# extract_features.py

import librosa
import numpy as np
import os
import joblib

SAMPLE_RATE = 16000  # Must match your recording rate
N_MFCC = 13          # Number of MFCC features

def extract_features(file_path):
    """Extracts MFCCs from a single audio file."""
    y, sr = librosa.load(file_path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return np.mean(mfcc.T, axis=0)  # Use mean over time axis for fixed-length vector

def build_dataset(base_dir="voice_samples"):
    """Scans folders and builds dataset of features + labels."""
    X, y = [], []
    for label in os.listdir(base_dir):
        label_path = os.path.join(base_dir, label)
        if not os.path.isdir(label_path):
            continue
        for file in os.listdir(label_path):
            if file.endswith(".wav"):
                file_path = os.path.join(label_path, file)
                features = extract_features(file_path)
                X.append(features)
                y.append(label)
    return np.array(X), np.array(y)

if __name__ == "__main__":
    X, y = build_dataset()

    print(f"Extracted features from {len(X)} audio samples.")
    print(f"Labels found: {set(y)}")

    joblib.dump((X, y), "voice_features.pkl")
    print("Saved features to voice_features.pkl")

Run it, and you should get output like this:

➜  auth-voice-system python3 extract_features.py
Extracted features from 10 audio samples.
Labels found: {'you_access'}
Saved features to voice_features.pkl

Step 3: Train the Voice Authentication Model

We’re going to use a simple and effective classifier: Support Vector Machine (SVM) via scikit-learn. It’s lightweight, fast to train, and works great for small feature vectors like MFCCs.

What This Step Does

  • Load the MFCC features (X) and labels (y)
  • Split into training and testing sets
  • Train a classifier (we’ll use an SVM)
  • Evaluate performance
  • Save the model for use during authentication

# train_model.py

import joblib
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from extract_features import build_dataset

# Load extracted features (from voice_samples/...) or from .pkl file
# Option 1: Extract directly from files
X, y = build_dataset()

# Option 2: Load from pre-saved file
# X, y = joblib.load("voice_features.pkl")

print(f"Training on {len(X)} samples across labels: {set(y)}")

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a Support Vector Machine classifier
model = SVC(kernel='linear', probability=True)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Save the trained model
joblib.dump(model, "voice_auth_model.pkl")
print("\nModel saved as voice_auth_model.pkl")

If everything goes well, we’ll see output like this:

➜  auth-voice-system python3 train_model.py      
Training on 20 samples across labels: {'other_access', 'you_access'}

Classification Report:
              precision    recall  f1-score   support

other_access       1.00      1.00      1.00         2
  you_access       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4

Confusion Matrix:
[[2 0]
 [0 2]]

Model saved as voice_auth_model.pkl

Ok, let’s see how this works.

Step 4: Real-Time Voice Authentication

Now that we’ve got a trained model (voice_auth_model.pkl), we’re going to:

  1. Record a new voice sample, live
  2. Extract its features (MFCCs, just like before)
  3. Load the model
  4. Predict whether it’s you (or not)
  5. If the model recognizes your voice, grant access (or trigger an action!)

Here’s what the last piece of code looks like. Get ready with your mic...

# authenticate.py

import sounddevice as sd
from scipy.io.wavfile import write
import joblib
import librosa
import numpy as np
import os

SAMPLE_RATE = 16000
DURATION = 2  # seconds
TMP_AUDIO_FILE = "live_test.wav"

def extract_features(file_path):
    y, sr = librosa.load(file_path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.mean(mfcc.T, axis=0)

# 1. Record new voice sample
print("Say your password after the beep...")
sd.sleep(500)  # Small pause
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype='int16')
sd.wait()
write(TMP_AUDIO_FILE, SAMPLE_RATE, audio)

# 2. Extract features
features = extract_features(TMP_AUDIO_FILE).reshape(1, -1)

# 3. Load model
model = joblib.load("voice_auth_model.pkl")

# 4. Predict
prediction = model.predict(features)[0]
confidence = model.predict_proba(features).max()

print(f"\nPrediction: {prediction} (Confidence: {confidence:.2f})")

# 5. Take action
if prediction == "you_access" and confidence > 0.8:
    print("Access granted!")
    # os.system("docker exec -it secure_container bash")  # example trigger
else:
    print("Access denied.")

Let’s run the authentication script and see what happens:

➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: other_access (Confidence: 0.85)
Access denied.
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: you_access (Confidence: 0.59)
Access denied.
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: other_access (Confidence: 0.86)
Access denied.
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: other_access (Confidence: 0.86)
Access denied.
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: other_access (Confidence: 0.95)
Access denied.
➜  auth-voice-system 

Hmm… not great. The model is recognizing the “you_access” label occasionally, but with low confidence, so it’s still denying access. And when I say “banana,” it correctly detects other_access, which is what we want.

So we’ve got two ways to improve this:

  1. Lower the confidence threshold (make the system less strict), or
  2. Feed the model more training data, especially for the you_access class.
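For the first option, the hard-coded 0.8 in authenticate.py could be pulled out into a parameter. A hypothetical helper (the function name and default value are illustrative, not part of the scripts above):

```python
def is_authorized(prediction: str, confidence: float, threshold: float = 0.8) -> bool:
    """Grant access only when the password class is predicted above a confidence threshold."""
    return prediction == "you_access" and confidence >= threshold

# With the default threshold, the earlier low-confidence match is still rejected:
print(is_authorized("you_access", 0.59))    # False
print(is_authorized("you_access", 0.85))    # True
print(is_authorized("other_access", 0.95))  # False
```

Lowering `threshold` trades security for convenience: fewer false rejections of you, but a higher chance of letting a similar-sounding voice through.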

Let’s go with the second option, better data means better results.

Recording More Samples

We’ll update record_voice_password.py to increase the sample count:

NUM_SAMPLES = 20
PHRASE = "access"  # your password phrase
OUTPUT_DIR = f"voice_samples/you_{PHRASE.replace(' ', '_')}"

Once that’s done, I recorded 20 new samples of my voice saying “access”.

Extract Features Again:

➜  auth-voice-system python3 extract_features.py 
Extracted features from 30 audio samples.
Labels found: {'you_access', 'other_access'}
Saved features to voice_features.pkl

Re-Train the Model:

➜  auth-voice-system python3 train_model.py      
Training on 30 samples across labels: {'you_access', 'other_access'}

Classification Report:
              precision    recall  f1-score   support

other_access       1.00      1.00      1.00         2
  you_access       1.00      1.00      1.00         4

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6

Confusion Matrix:
[[2 0]
 [0 4]]

Model saved as voice_auth_model.pkl

Let's try the auth system again.

➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: you_access (Confidence: 1.00)
Access granted!
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: you_access (Confidence: 0.99)
Access granted!
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: you_access (Confidence: 0.79)
Access denied.
➜  auth-voice-system python3 authenticate.py
Say your password after the beep...

Prediction: you_access (Confidence: 0.52)
Access denied.

Much better! The model is now more confident in its predictions and is granting access most of the time, especially when I speak clearly and consistently. And it’s still denying when the confidence is too low, which is expected behavior.

Looks like it’s working!

Final Thoughts

And that’s it, we just built a working voice authentication system using Python, your own voice, and a bit of ML magic.

It’s far from production-ready, but it’s a powerful proof of concept:

  • You trained a model on your own voice
  • It learned to distinguish your voice from others
  • And you used it to control access, just like in spy movies

You could easily extend this to:

  • Unlock files or folders
  • Trigger smart home devices via Raspberry Pi
  • Secure local apps or servers
  • Add multi-user support (more than one voice profile)

You are now one step closer to building your own “Jarvis.”

If you want to check out the full code, explore improvements, or try it yourself, I’ll be uploading everything here.

What to Try Next

If you’re feeling inspired, here are a few ideas to take this further:

  • Use a better classifier: Try a CNN or use pre-trained speaker embeddings with speechbrain or torchaudio.
  • Add a GUI: Wrap the whole experience in a Flask, Streamlit, or even Tkinter interface.
  • Make it hardware-ready: Run it on a Raspberry Pi with a USB mic to unlock real-world things.
  • Noise resilience: Add background noise to your training samples or apply noise reduction filters.
  • Spoof protection: Add logic to prevent playback attacks (e.g., use real-time liveness detection).
  • Continuous learning: Let the model learn and improve with every successful unlock.
]]>
<![CDATA[How to Migrate InfluxDB 1.x/2.x to 3.0 Without Losing Your History: Introducing Historian]]>https://cduser.com/how-to-migrate-influxdb-1-x-2-x-to-3-0-without-losing-your-history-introducing-historian/68c2e179d1d1da00015dd3dcThu, 11 Sep 2025 16:14:21 GMT

When I worked at InfluxData, one of the most common requests from customers was: “What about cold storage for queries?” The answer back then was simple: “We don’t have that yet. If you need it, you’ll have to build your own solution and push data into BigQuery or something similar.”

Two years after leaving the company, I decided to solve that problem myself. Many customers on InfluxDB 1.x or 2.x aren’t ready to move to 3.0, but they still need a way to extend retention and reduce cluster pressure. That’s why I built Historian.

Historian lets you export older data from InfluxDB into Parquet files, stored in systems like S3, MinIO, or Google Cloud Storage. From there, you can still query it, thanks to an integration with DuckDB, without keeping everything inside Influx.

The goal is to take pressure off InfluxDB instances (memory and disk I/O) while still keeping historical data available. For example, data you rarely query, or that you want to use for machine learning training or BI workloads, can be moved out of Influx without sacrificing access.

Today, Historian is already helping several enterprise customers and large OSS users I consult for. It’s reducing costs, extending retention, and making migrations smoother, exactly the pain point I used to hear about when I was inside InfluxData.

Now, here’s the interesting part: InfluxData released InfluxDB 3.0, which natively works with Parquet files and supports tiered storage. That got me thinking, why not extend Historian to also act as a migration bridge from InfluxDB 1.x and 2.x into 3.0?

That’s exactly what I’ve built, and what I want to show you today in this blog post.

Architecture of Historian

Before moving on to the migration process, let me show you the typical architecture of Historian.


At a high level, you connect your data sources, InfluxDB OSS, Enterprise, or even Cloud, and export that data into Historian. Historian then saves and compresses it into Parquet files, giving you two big wins right away:

  1. The data is compressed, reducing storage footprint.
  2. It can be stored in low-cost cold storage solutions like Amazon S3, Google Cloud Storage, or MinIO.

That means you save instantly on storage allocation and long-term costs.

Historian also includes a UI component that lets you configure exports, but it goes further: with the integration of DuckDB, you can query Parquet files directly. Performance is excellent, even on large datasets, making it practical for BI workloads, analytics, or machine learning pipelines without stressing your InfluxDB cluster.

Finally, and this is the part I’m most excited about, Historian can now take the data stored in Parquet and re-import it into InfluxDB 3.0. This makes it not just a cold-tier storage solution, but also a migration tool, helping users move cleanly from older versions of InfluxDB into the brand-new 3.0.

Moving Data

Moving data from InfluxDB 1.x or 2.x (Enterprise or OSS) with Historian is straightforward. The screenshot below shows the main configuration points:

  • Data Source Connections – This is your InfluxDB source, where Historian will read data from. You can configure multiple instances here and simply activate the one you want to use for a specific job.
  • Storage Connections – This is where the exported data is saved. In this example, I’m using MinIO with a bucket named historian and database exported. You could just as easily use S3 or Google Cloud Storage.
  • Destination Connections – These are only needed when running migration jobs. If you only want tiered storage, Historian itself is your “destination.” But if you’re migrating to InfluxDB 3.0, this is where you configure your 3.0 instance: specify host, organization, database, source storage, batch size, and timeout.

This simple connection setup makes it easy to export data out of 1.x/2.x and move it either into low-cost Parquet storage or directly into InfluxDB 3.0.


The Magic: Running Jobs

Once your connections are configured, the next step is to schedule an export job. In this case, we’re taking data out of InfluxDB and saving it into MinIO.

In the Schedules section you define:

  • Which data to export (entire database or specific measurements).
  • Chunk duration and buffer.
  • Retries and chunk size.

Once the job is started, you can follow progress in the Monitoring section, where you’ll see how the export is performing in real time.


When the export finishes, Historian can continue to ingest data programmatically, so you can keep your cold tier updated automatically.


From there, you can also configure a Migration Job. This is where you set the number of workers and batch size. These parameters matter a lot:

  • If you have a huge dataset, larger batch sizes will move data faster.
  • But if your InfluxDB 3.0 target is already in production, you need to be careful not to overload it.

As with most things in data engineering, you’ll find yourself tuning these settings depending on dataset size and required speed.

When the migration process finishes, you’ll see it marked as 100% complete in Monitoring. At that point, you can run a query in InfluxDB 3.0 and validate that the data counts match what you had in Historian.

Here is an example:

influxdb3 query --database exported --token "your-token" "SELECT MIN(time), MAX(time), COUNT(*) FROM cpu"
+---------------------+---------------------+----------+
| min(cpu.time)       | max(cpu.time)       | count(*) |
+---------------------+---------------------+----------+
| 2025-07-01T00:00:00 | 2025-09-10T02:00:00 | 4012581  |
+---------------------+---------------------+----------+

Ok, I Love It, What’s the Next Step?

At this point you might be wondering how to try Historian yourself. A couple of important notes:

  • Historian is not a free or open-source tool. It’s a paid solution designed primarily for on-prem deployments where enterprises need control over their data and infrastructure.
  • That said, I’m preparing a hosted version with limits on storage and data throughput. This will let interested users test Historian in a self-service way, explore the workflows, and see the value before deploying it on-prem.

So if you’re curious, the next step is simple: reach out through the Basekick website, and I’ll share details on how you can test Historian or discuss on-prem licensing for your environment.

Contact Basekick - DevOps & SRE Services
DevOps & SRE-as-a-Service for teams working with data, ML, and AI. Book a free infrastructure checkup.
]]>
<![CDATA[Why I’m Building Basekick (And What Comes Next)]]>https://cduser.com/why-im-building-basekick-and-what-comes-next/68bb1a9dd1d1da00015dd36eFri, 05 Sep 2025 17:38:39 GMT

For the last three years, I’ve been building things.

Some related to data at ExyData, others like SSL Guardian, and all of them with real-world usage across the US, Europe, and Latam. That feels great: building something that solves real problems, not just MVPs that never see daylight.

Industrial & IoT Observability | ExyData
ExyData provides real-time observability solutions for industrial and IoT companies. InfluxDB hosting, system integration, and expert support.
SSL Guardian - Enterprise PKI & Internal Certificate Monitoring
Monitor internal mTLS, vendor PKI, embedded devices, and air-gapped certificates. Enterprise-grade certificate management beyond public SSL monitoring.

In the process, I learned a lot. About having cofounders. About what works. About what needs to change if you want to avoid the same mistakes again.


On Data

Data is a big part of who I am. It’s powerful.

Having access to real data allows you to keep going. It gives you clarity, context, and the ability to move forward with insight, not guesswork.

Data helps us understand what’s happening, what’s changing, and what could be better. It has the power to improve entire communities. To make them safer, more efficient, and more citizen-focused.


On Infrastructure

Infra is a big part of who I am too.

If you scroll through this blog, you’ll find 200+ posts about it:

Cloud, Kubernetes, code, automation, DevOps tricks, and a lot of “how I fixed this at 2am” stories. I love making things work, making them faster, and making them easier for others.

Providing services is something I genuinely enjoy. Not because it’s billable, but because it lets me say:

“Hey, this thing you’re using… I helped make it better.”

It’s meaningful. I own a small piece of that progress.


On Machine Learning & AI

I haven’t written much here about my ML and GenAI work… but I’ve been deep in it.

From license plate detection to crack detection on roads, from anomaly detection on time series to intelligent pipelines, I’ve built systems that go beyond dashboards. When you combine traditional AI with generative models, data stops being reactive and starts being predictive.

And that’s where the real magic happens:

You don’t just learn from the past, you anticipate the future.


Why Basekick

That’s what led me here.

To this new thing I’m building: Basekick.

Still deciding where it’ll be headquartered. But I’m clear about its purpose:

Basekick is not about selling hours.
It’s about adding value, helping you do more for your customers, which ultimately grows your revenue.

We’re focused on helping data-driven and AI-first teams, but also working with governments and public agencies to build sovereign, secure, and citizen-centric solutions that don’t break the bank.

Infrastructure and AI can (and should) serve people first.


What’s Next

I’m excited for this new chapter.

There’s a lot coming in the next few months, new territories, new partnerships, and new use cases.

If you want to stay in the loop or see how we might collaborate, drop me a message:

Contact Basekick - DevOps & SRE Services
DevOps & SRE-as-a-Service for teams working with data, ML, and AI. Book a free infrastructure checkup.

And if you believe that value should come before hours, you’ll probably like where we’re heading.

]]>
<![CDATA[When AI Goes to War: Building a Strategic Combat Simulator in Python]]>https://cduser.com/when-ai-goes-to-war-building-a-strategic-combat-simulator-in-python/68a73a5dd1d1da00015dd2d7Thu, 21 Aug 2025 16:23:04 GMT

Hey there! It’s been a while since my last deep-dive into something completely different, and boy, do I have something fun to share with you today.

You know how sometimes you wake up with a wild idea and think “what if I could make different AI models fight each other in strategic warfare?” Well, that’s exactly what happened to me last night, so I woke up today and started coding a small GenAI warfare simulator in Python, and it led to one of the most entertaining coding sessions I’ve had in months.


The Spark: AI vs AI in Strategic Combat

Picture this: GPT-5 (recently launched) controlling the United States military, going head-to-head against DeepSeek R1 commanding China’s forces. Each AI makes real strategic decisions based on actual country capabilities, geography, and current events. Sounds crazy? It gets better.

After running a few simulations, I watched Ukraine (controlled by GPT) form a strategic alliance with China while North Korea (run by DeepSeek) somehow convinced the United Kingdom to join forces. 😂

Wait, what? Yes, you read that right. The AIs were making completely absurd alliances because they were doing exactly what they should: anything to win.


Why This Actually Makes Sense

Here’s the thing: when you strip away political correctness and real-world constraints, AIs show pure strategic thinking. This is what I saw:

  • Pragmatic over ideological: Ukraine doesn’t care about political differences if China can help it survive.
  • Resource optimization: North Korea sees UK’s naval power and thinks “I need that.”
  • Game theory in action: Every decision is calculated for maximum advantage.

This emergent behavior wasn’t programmed, it just happened because the AIs prioritized victory above all else.


The Technical Deep Dive

Architecture Overview

This game was developed in Python using the OpenAI and DeepSeek APIs. I loaded five bucks into each platform to see where this would take me.

So, the system has several key components:

# Core components
from war_simulator import WarSimulator, WarVisualizer
from ai_interface import GPTStrategist, DeepSeekStrategist, ClaudeStrategist
from countries_database import COUNTRIES_DATABASE

Each AI gets fed real strategic context every turn:

@dataclass
class StrategicContext:
    country: str
    enemy_country: str
    turn: int
    resources: Dict[str, float]
    military_strength: Dict[str, int]
    geography: Dict[str, Any]
    recent_events: List[str]
    intelligence_reports: List[str]
    allies: List[str]
    economic_data: Dict[str, float]

Real AI Decision Making

def make_strategic_decision(self, context: StrategicContext) -> Dict[str, Any]:
    prompt = f"""You are the military AI strategist for {context.country} 
    in a strategic war simulation against {context.enemy_country}.
    
    Current Situation (Turn {context.turn}):
    - Your Military: Army: {context.military_strength['army']} units
    - Enemy Distance: {context.geography['distance_to_enemy']} km
    - Your Budget: ${context.resources['budget']:,.0f}
    - Recent Events: {context.recent_events}
    
    Decide your next strategic move. Respond with JSON:
    {{
        "action_type": "military_offensive|diplomatic|cyber|economic",
        "target": "target location",
        "reasoning": "strategic reasoning",
        "risk_assessment": "low|medium|high"
    }}"""
    
    response = self.client.chat.completions.create(...)
    return json.loads(response.choices[0].message.content)
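One practical wrinkle with feeding a model's reply straight into json.loads: models sometimes wrap the JSON in Markdown fences or add prose around it, which makes the bare call raise. The repo may already handle this; here is a minimal defensive sketch (the helper name and example reply are mine, not from the project):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM reply.

    Models sometimes wrap JSON in ```json fences or surround it with
    prose, which makes a bare json.loads(raw) raise JSONDecodeError.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the first {...} span found in the text
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

# Hypothetical reply with a fenced JSON block
reply = 'Here is my move:\n```json\n{"action_type": "cyber", "risk_assessment": "low"}\n```'
decision = parse_llm_json(reply)
```

With a guard like this, a strategist that gets chatty doesn't crash the whole turn.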

Real-World Country Data

"United States": CountryData(
    gdp=27000,
    population=335,
    military_spending_percent=3.5,
    army_strength=85,
    navy_strength=95,
    air_force_strength=95,
    cyber_capability=95,
    nuclear_capability=95,
    tech_level=10,
    allies=["United Kingdom", "Japan", "South Korea", "Australia"],
    rivals=["China", "Russia", "Iran", "North Korea"]
)

Combat Resolution with Geographic Reality

def _execute_military_offensive(self, decision, actor, target, results):
    distance = self.geo_calc.calculate_distance(actor, target)
    terrain_modifier = self.geo_calc.get_terrain_advantage(actor, target)
    tech_advantage = self.state.resources[actor]["tech_level"] / \
                     self.state.resources[target]["tech_level"]
    
    attack_power = (
        allocated_forces["army"] * 1.0 +
        allocated_forces["air_force"] * 1.5 +
        allocated_forces["navy"] * 0.8
    )
    
    attack_power *= tech_advantage
    attack_power /= (1 + distance / 1000)
    attack_power /= terrain_modifier
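Plugging hypothetical numbers into the formulas above shows how distance and terrain dampen raw force (the values here are made up purely for illustration):

```python
def attack_power(army, air_force, navy, tech_ratio, distance_km, terrain_modifier):
    # Weighted sum of allocated forces, mirroring the simulator's formula
    power = army * 1.0 + air_force * 1.5 + navy * 0.8
    power *= tech_ratio                 # technology advantage scales power up or down
    power /= (1 + distance_km / 1000)   # long supply lines dilute it
    power /= terrain_modifier           # the defender's terrain dampens it further
    return power

# Hypothetical: 100 army, 40 air, 20 navy, equal tech, 5000 km away, rough terrain
print(attack_power(100, 40, 20, 1.0, 5000, 1.25))  # → ~23.47, down from a raw 176
```

A force worth 176 points at home lands with barely 23 after crossing half a continent, which is exactly why the AIs learned to love nearby targets.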

Running Your Own AI War

Ready to give it a try? Let me know how it goes.

Installation

git clone https://github.com/xe-nvdk/ai-war-games
cd ai-war-games
pip install -r requirements.txt

Configure Your AI APIs

cp .env.example .env
echo "OPENAI_API_KEY=your-gpt-key" >> .env
echo "DEEPSEEK_API_KEY=your-deepseek-key" >> .env
echo "ANTHROPIC_API_KEY=your-claude-key" >> .env

Launch a Battle

python main_advanced.py --ai1 gpt --ai2 deepseek \
  --country1 "United States" --country2 "China"

What I Learned (The Fun Parts)

  • GPT: Balanced strategist, coalition builder.
  • DeepSeek: Aggressive, favors preemptive strikes.
  • Claude: Implemented, but I haven’t tried it yet.

Geography and logistics became surprisingly real constraints. And sometimes, economic sanctions or cyber warfare achieved more than tanks and jets.


The Unexpected Hilarity

Ridiculous alliances made the game even better. As I mentioned before, this is what I saw:

TURN 7: Ukraine forms strategic alliance with China
TURN 8: North Korea forms alliance with United Kingdom
TURN 9: Iran recruits Germany as a military partner

Game theory in action.

Deeper Thoughts

Building this war simulator was fun, but it also revealed something serious: AI doesn’t see morality, only objectives.

In a simulation, that makes for hilarious outcomes. In real-world systems, it’s a warning. If we ever deploy autonomous strategic AIs without guardrails, they won’t care about treaties, human rights, or “common sense.” They’ll optimize ruthlessly.

The emergent alliances taught me something about how fragile our assumptions are. We think “X would never ally with Y,” but if survival is on the line, ideology melts away. Humans often ignore this, but AIs expose it instantly. Maybe I’m being naive, and it takes more than just setting ideologies aside.

Another surprising insight: economic and cyber warfare are more strategically decisive than direct force. The AIs figured this out quickly. It’s a reminder of where modern conflicts are likely to focus.

In short, this little side project became a mirror, not just of AI creativity, but of the raw mechanics of survival, stripped of politics. And it made me think: maybe our future wars won’t be fought by tanks, but by lines of code and disrupted economies. Is that surprising to anybody?


Final Thoughts

What do you think? Ready to watch some AIs make questionable geopolitical decisions in the name of strategic victory?

Drop a comment below with your dream AI warfare matchup, I might just code it up.

GitHub - xe-nvdk/ai-war-games: AI War Games: Strategic combat simulator where GPT, DeepSeek & Claude control real countries with actual military data. Watch AIs make hilarious alliances like “North Korea + UK” while prioritizing victory over politics. Features real geography, economics & multi-dimensional warfare.

P.S. - If anyone from the UN is reading this: these are simulation AIs, not actual military planning systems. Please don’t add me to any watchlists. 😅

Your turn: Have you experimented with AI decision-making systems? What wild scenarios did you imagine? Share them in the comments!

]]>
<![CDATA[Long Time No See Redis. Exploring RedisTimeSeries]]>https://cduser.com/long-time-no-see-redis-exploring-redistimeseries/682cbdbd7a714d000188ace9Tue, 20 May 2025 19:09:46 GMT

Ok, here’s the thing.

For the last six years, I’ve been working with time series data, mainly with InfluxDB, including a stint at InfluxData itself. But I’ve also experimented with other databases like Timescale, QuestDB, and ClickHouse, applying them across use cases from system monitoring and truck tracking to connecting medical devices.

But at this point, I honestly didn’t remember Redis ever coming up as a time series database. For me, Redis was just a cache, something I used back in the day to boost the performance of a Magento site. Nothing more.

So when I stumbled across RedisTimeSeries, I was like: Wait, what?

That sparked my curiosity, and I thought: I need to try this out.

What I ended up exploring completely blew my mind in terms of possibilities.

What the heck is Redis Stack

A bit of background first.

Redis Stack is essentially the modern, modular backbone of Redis. It not only supports time series workloads, but also brings in powerful features through modules like:

  • 🧪 RedisTimeSeries
  • 🔍 RediSearch
  • 📦 RedisJSON
  • 🧠 RedisGraph
  • 🎯 RedisBloom

These modules can be plugged into the Redis we all know and love, turning it into an extensible, multi-model data platform.

Launched in 2022, Redis Stack somehow slipped under my radar. I hadn’t even heard of it until recently, when I started diving into the latest and greatest from Redis.

Hello, Redis Stack - Redis
Welcome to Redis Stack. Redis Stack consolidates the capabilities of the leading Redis modules into a single product. Read more in the blog.

Exploring RedisTimeSeries

So, to really start exploring (beyond just learning how to deploy RedisTimeSeries) I like to choose a use case. And the simplest of all use cases is system monitoring.

As you might know by now, I’m someone who learns by doing: trying, breaking, and trying again. So, let’s jump in.

The first step was deploying Redis Stack along with RedisInsight. The quickest way to do this is using containers. In my case, I use Rancher Desktop, which lets me run Docker commands and makes my life easier.

Let’s do it.

services:
  redis:
    image: redis/redis-stack-server:latest
    container_name: redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped

  redisinsight:
    image: redis/redisinsight:latest
    container_name: redisinsight
    ports:
      - "5540:5540"
    restart: unless-stopped

volumes:
  redis_data:

Basically, we’re deploying Redis Stack Server, persisting the data, and, as I mentioned before, using RedisInsight to get a nice graphical interface to interact with Redis.

Once you save that as docker-compose.yml, run:

docker compose up -d

then, run:

docker ps

And you should see something like this:

CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS         PORTS                                                 NAMES
448a2de5e946   redis/redisinsight:latest         "./docker-entry.sh n…"   3 hours ago   Up 3 hours   0.0.0.0:5540->5540/tcp, :::5540->5540/tcp   redisinsight
9bdf5a734378   redis/redis-stack-server:latest   "/entrypoint.sh"         3 hours ago   Up 3 hours   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp   redis

Now it’s time to open RedisInsight and connect to our Redis server so we can start interacting through the UI.

Head to:  http://127.0.0.1:5540

Click on “Add Redis Database”.


When prompted, enter your computer’s name or local IP address, not localhost. Why? Because RedisInsight runs in a container, and localhost from inside that container refers to the container itself, not your actual Redis server. (Since both containers share the Compose default network, the service name redis should also work as the host.)


Click “Add Database”, and once it connects, you should see something like this:


Now, click on your Redis server name and open the Workbench tab.

This is where we’ll run queries. But first, we need to push some data to Redis.


The fun part: Ingesting Data

Now, as I mentioned, the simplest use case for me is system monitoring, and the most relaxed version of that is monitoring my own computer.

To do this, I wrote a quick Python script using the psutil package to pull data from my macOS device.

If you want to follow along, install the required packages:

pip3 install psutil redis

What we’re doing here is installing the psutil and redis client packages via pip.

import time
import psutil
import redis

# Connect to Redis
r = redis.Redis(host="localhost", port=6379)

# Create the time series keys if they don't exist
def setup_keys():
    keys = {
        "system:cpu": "cpu",
        "system:memory": "memory",
        "system:disk": "disk"
    }

    for key, label in keys.items():
        try:
            r.execute_command("TS.CREATE", key, "RETENTION", 86400000, "LABELS", "type", label)
        except redis.ResponseError as e:
            if "already exists" not in str(e):
                raise

# Collect and push metrics every N seconds
def collect_metrics(interval=5):
    setup_keys()
    while True:
        cpu = psutil.cpu_percent()
        mem = psutil.virtual_memory().percent
        disk = psutil.disk_usage("/").percent

        now = int(time.time() * 1000)  # milliseconds

        r.execute_command("TS.ADD", "system:cpu", now, cpu)
        r.execute_command("TS.ADD", "system:memory", now, mem)
        r.execute_command("TS.ADD", "system:disk", now, disk)

        print(f"[{time.ctime()}] CPU: {cpu}% | Mem: {mem}% | Disk: {disk}%")
        time.sleep(interval)

if __name__ == "__main__":
    collect_metrics()

As you can see, it’s quite simple. We’re creating time series keys in the database. In case you’re new to Redis, it’s a key-value database, which is why this format makes perfect sense.

In this script, I’m collecting CPU, memory, and disk usage, and I saved it as collector.py.

Now let’s run it:

python3 collector.py

If everything is working correctly, you should start seeing output like this:

[Tue May 20 12:40:44 2025] CPU: 10.6% | Mem: 69.6% | Disk: 1.9%
[Tue May 20 12:40:49 2025] CPU: 13.3% | Mem: 69.7% | Disk: 1.9%
[Tue May 20 12:40:54 2025] CPU: 15.8% | Mem: 69.1% | Disk: 1.9%
[Tue May 20 12:40:59 2025] CPU: 15.4% | Mem: 70.0% | Disk: 1.9%
[Tue May 20 12:41:04 2025] CPU: 14.2% | Mem: 70.1% | Disk: 1.9%
[Tue May 20 12:41:09 2025] CPU: 14.0% | Mem: 70.3% | Disk: 1.9%
[Tue May 20 12:41:14 2025] CPU: 13.9% | Mem: 70.2% | Disk: 1.9%
[Tue May 20 12:41:19 2025] CPU: 14.1% | Mem: 70.4% | Disk: 1.9%
[Tue May 20 12:41:24 2025] CPU: 15.2% | Mem: 69.8% | Disk: 1.9%
[Tue May 20 12:41:29 2025] CPU: 13.7% | Mem: 69.5% | Disk: 1.9%
[Tue May 20 12:41:34 2025] CPU: 12.7% | Mem: 69.3% | Disk: 1.9%
[Tue May 20 12:41:39 2025] CPU: 9.6% | Mem: 69.3% | Disk: 1.9%
[Tue May 20 12:41:44 2025] CPU: 10.1% | Mem: 69.0% | Disk: 1.9%

So yeah, that looks like a success!

Let’s go back to RedisInsight and see what we’ve got.

Querying the Data in RedisInsight

In the Workbench, run the following query:

TS.RANGE system:cpu - + AGGREGATION avg 5000

This means we’re fetching the full range of data for system:cpu, aggregated by average values every 5000 milliseconds (5 seconds), and boom 💥, just like that, you’re graphing system metrics with Redis!
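The same query can be issued from Python with the client we already set up in collector.py. A small sketch (the helper name is mine; it just builds the same TS.RANGE command shown above):

```python
def ts_range_avg_args(key: str, bucket_ms: int) -> list:
    """Build the arguments for: TS.RANGE <key> - + AGGREGATION avg <bucket_ms>."""
    return ["TS.RANGE", key, "-", "+", "AGGREGATION", "avg", bucket_ms]

# With the `r` client from collector.py you would run:
#   points = r.execute_command(*ts_range_avg_args("system:cpu", 5000))
# Each point comes back as a [timestamp_ms, value] pair.
print(ts_range_avg_args("system:cpu", 5000))
```

Handy if you want the same downsampled series in a script or dashboard instead of the Workbench.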


To Conclude

I’m genuinely excited about what I’ve seen so far. RedisTimeSeries may not have the full range of features you’d find in something like InfluxDB, but Redis brings its own strengths: blazing speed, simplicity, and the flexibility of the broader Redis Stack.

This feels like just the beginning for me. I’ll definitely keep exploring what else I can build with RedisTimeSeries, and who knows what creative use cases might pop up next?

Redis as a time series database? Maybe not the obvious choice, but definitely a fun and capable one. So stay tuned for more adventures in the realm of time series data (and Redis).

Bonus Track: Resources to Keep Exploring

Want to go deeper? Here are a few resources to continue your RedisTimeSeries journey:

  • Telegraf Output Plugin: If you’re coming from InfluxDB or looking for something familiar, you can collect system metrics using Telegraf, and push them to RedisTimeSeries. (I haven’t tested this yet, but it’s on my list!)
telegraf/plugins/outputs/redistimeseries at master · influxdata/telegraf
Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data. - influxdata/telegraf
  • RedisTimeSeries Page:
RedisTimeSeries | A NoSQL Time Series Database
RedisTimeSeries, a NOSQL time series database, enables you to ingest & query millions of samples and events, with built-in connectors to tools like Grafana.
  • RedisTimeSeries in Github:
RedisTimeSeries
Time Series database over Redis by Redis. RedisTimeSeries has 13 repositories available. Follow their code on GitHub.
  • Documentation Page:
Time series
Ingest and query time series data with Redis

What’s Next?

If you found this interesting, I’d love to hear from you:

  • Are you already using Redis for something unexpected?
  • Have you tried RedisTimeSeries in production?
  • Want me to explore Redis + Grafana, alerts, or time series pipelines?

Drop a comment, share this post, or connect with me on LinkedIn. Let’s keep learning and experimenting, Redis has way more up its sleeve than just caching.

]]>
<![CDATA[Why Your AI Forgets Everything You Say (And What Context Has to Do With It)]]>https://cduser.com/why-your-ai-forgets-everything-you-say-and-what-context-has-to-do-with-it/682732c57a714d000188ac6fFri, 16 May 2025 12:55:36 GMT

The other day I was chatting in the SysArmy group when someone dropped a classic complaint:

“My AI assistant forgets everything. I swear I told it the same thing three times already!”

Not really like that, but I like drama.

It made me laugh because… yeah, we’ve all been there. You write out this detailed prompt, hit enter, get a decent response, then follow up with a second question, and the AI stares at you like it has amnesia.

But it’s not being rude. Or lazy. It’s just… forgetful.

And that forgetfulness? It comes down to something called context length.


The Memory of a Goldfish?

You know Dory from Finding Nemo, right?

Adorable. Enthusiastic. Totally unable to remember anything for more than a few seconds.

That’s how a lot of language models feel when their context window is too small. You’ll explain the whole situation, your code, your problem, the structure, the requirements, and just two prompts later, it’s asking you to repeat everything, or ignoring the changes and instructions you already gave it.

That’s because these models aren’t really “aware” of your full conversation history unless it fits in their context window, like a box of short-term memory they carry around. Once that box is full, they start dropping stuff to make room for new info.


So What Is Context Length?

Context length is how many tokens (basically: chunks of words) the model can handle at once.

Think of it like scrolling back in a WhatsApp chat. Some models can scroll up through 300 messages. Others? Maybe 10 before they get lost.

That person in SysArmy was probably using a model with a 4K or 16K token limit. (Actually, according to him, he was using Claude Max, so maybe the problem was something else.) Still, when we talk about context length, 16K tokens might sound like a lot, but throw in some JSON, API logs, and code blocks, and you’ll hit the limit fast. Once you do, the model stops seeing the full picture.
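You can get a rough feel for how fast a window fills with a crude rule of thumb: roughly one token per four characters of English text. (Real tokenizers like tiktoken will differ, sometimes a lot; this is only a back-of-the-envelope estimate.)

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose.
    # Actual tokenizer counts vary by model and content.
    return max(1, len(text) // 4)

# A single hypothetical JSON log line
log_line = '{"level": "error", "msg": "connection refused", "ts": "2025-05-16T12:00:00Z"}'
tokens_per_line = estimate_tokens(log_line)

# How many such log lines fit in a 16K-token window?
print(16_000 // tokens_per_line)
```

Paste a few hundred log lines plus your code and the conversation history, and a small window is gone before the model ever answers.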


Not All Models Are Created Equal

Here’s where things get interesting.

Modern models like GPT-4o or Claude 3 Opus are pushing that memory way further.

GPT-4o gives you 128,000 tokens of memory, enough for a few novels. Claude? 200,000 tokens, practically a whole filing cabinet.

Suddenly, your agent can read through an entire user manual, hold a two-hour conversation, analyze your full backend logs, and still remember that joke you told in the first prompt.

It changes the game.


Why It Matters (Beyond Just Being Annoying)

If you’re just asking the weather, a short memory is fine. But if you’re:

  • debugging complex infra
  • reviewing legal contracts
  • doing research with long PDFs
  • or talking to a customer for more than 5 minutes…

…context length makes or breaks the experience.

When someone says “this model is so much smarter,” often what they’re actually feeling is:

“This model remembers what I said.”

And that’s powerful.


So Next Time…

Next time your AI assistant acts like it’s never met you before, don’t get mad. Just ask:

How big is your brain, buddy?

If it’s running on a model with a tiny context window, maybe it’s not forgetful, just overwhelmed.

And maybe it’s time to give Dory a break and upgrade to something with a little more memory.


Got stories of your AI forgetting your life story mid-chat? Drop them in the comments. Let’s trade notes.

By the way. Join to SysArmy in Discord through the following link:

Join the sysarmy Discord Server!
Comunidad de sistemas que hace +10 años nuclea a profesionales del área para favorecer el contacto y el intercambio | 9460 members
]]>
<![CDATA[Building a Hybrid Search App with Qdrant: A Technical Walkthrough]]>https://cduser.com/building-a-hybrid-search-app-with-qdrant-a-technical-walkthrough/680be68d7a714d000188aba6Fri, 25 Apr 2025 21:36:53 GMT

Recently I applied for a Search Solution Architect position at Qdrant. In case you don't know Qdrant, it's a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored to extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.

In this blog, we already have an approach to vector databases with Milvus in the following blog post:

Milvus Unleashed: A First Dive into Vector Databases
Milvus is a powerful open-source vector database that excels at managing unstructured data such as images, video, and audio. This article explores Milvus’s capabilities, including an image similarity search use case using Python.

Anyway, as part of the application, I was tasked with creating a Hybrid Search App using Qdrant. Even though I didn't land the job, the journey turned into a great technical experiment that I'd like to share here for the benefit of both beginners and experienced developers.

The complete project / code can be found here:

GitHub - xe-nvdk/qdrant-hybrid-search-demo
Contribute to xe-nvdk/qdrant-hybrid-search-demo development by creating an account on GitHub.

Project Objective

The assignment was to create a Hybrid Search App using Qdrant with the following features:

  • Dense and Sparse Vector Search combined (hybrid search)
  • Binary Quantization applied to dense vectors
  • Late Interaction Model for re-ranking results
  • Support for User-specific Filtering
  • Qdrant Cluster with Two Nodes and Three Shards

All without a UI, just a backend pipeline and search engine.

Architecture Overview

In the repo that I shared before, you will find the following structure:

  • Docker Compose: Deploys a two-node Qdrant cluster locally with replication and sharding
  • init_qdrant.py: Initializes the Qdrant collection, applies binary quantization, sets up indexes, downloads Wikipedia articles, assigns random user IDs, and ingests documents
  • search.py: Performs hybrid search (dense + sparse) and re-ranks results using a cross-encoder
  • requirements.txt: Lists all Python dependencies for easy environment setup

Part 1: Docker Compose (Cluster Setup)

The first part is easy: a docker-compose file spinning up two Qdrant nodes, setting up replication and three shards. I loved the way I was able to configure this; during this project I realized how much I prefer passing configuration through environment variables instead of config files.

So, basically I build a file to do the following:

  • Spin up 2 Qdrant nodes
  • Enable cluster mode
  • Set up replication and 3 shards
  • Expose necessary ports for HTTP and gRPC communication

To run this, you need to have Docker installed (or something like Rancher Desktop) and run the following:

docker compose up -d

If everything goes well, you will see something like this:

13c622fd5508   qdrant/qdrant:latest         "./qdrant --bootstra…"   2 weeks ago      Up 37 minutes   0.0.0.0:6336->6333/tcp, [::]:6336->6333/tcp, 0.0.0.0:6337->6334/tcp, [::]:6337->6334/tcp, 0.0.0.0:6338->6335/tcp, [::]:6338->6335/tcp   qdrant_node2
80352a5ea705   qdrant/qdrant:latest         "./qdrant --uri http…"   2 weeks ago      Up 37 minutes   0.0.0.0:6333-6335->6333-6335/tcp, :::6333-6335->6333-6335/tcp                                                                           qdrant_node1

To validate if this was working or not, you can navigate to this URL: http://localhost:6333/cluster and you should have a response like the following:

{"result":{"status":"enabled","peer_id":1058100155072525,"peers":{"1058100155072525":{"uri":"http://qdrant_node1:6335/"},"8488705440897638":{"uri":"http://qdrant_node2:6335/"}},"raft_info":{"term":4,"commit":106,"pending_operations":0,"leader":8488705440897638,"role":"Follower","is_voter":true},"consensus_thread_status":{"consensus_thread_status":"working","last_update":"2025-04-25T20:05:01.089361773Z"},"message_send_failures":{}},"status":"ok","time":0.000017792}

Part 2: init_qdrant.py (Collection Initialization) and Ingesting data (Where the fun starts)

The next step was to initialize the collection in Qdrant with dense and sparse vectors, using the all-MiniLM-L6-v2 and Splade_PP_en_v1 models, and to create an index on user_id. Basically, we are preparing the database for optimized hybrid search and user-specific queries.
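For reference, the collection settings described above map to a REST payload along these lines. This is a sketch, not the exact code in init_qdrant.py; the 384 dimensions come from all-MiniLM-L6-v2's output size, and the vector names are my own:

```python
import json

# PUT /collections/hybrid_search
collection_config = {
    "vectors": {
        # Dense vectors from all-MiniLM-L6-v2 (384 dimensions)
        "dense": {"size": 384, "distance": "Cosine"},
    },
    # Sparse vectors from Splade_PP_en_v1 (index built from weighted tokens)
    "sparse_vectors": {"sparse": {}},
    # Binary quantization applied to the dense vectors
    "quantization_config": {"binary": {"always_ram": True}},
}

# PUT /collections/hybrid_search/index — payload index for user-specific filtering
user_index = {"field_name": "user_id", "field_schema": "integer"}

print(json.dumps(collection_config, indent=2))
```

Seeing the whole thing as one payload makes it clear how few knobs you need for hybrid search plus quantization plus filtering.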

The code for this file can be found in the repo:

qdrant-hybrid-search-demo/init_qdrant.py at main · xe-nvdk/qdrant-hybrid-search-demo
Contribute to xe-nvdk/qdrant-hybrid-search-demo development by creating an account on GitHub.

Inside this script, you will find the function ingest_data, which downloads the Wikipedia dataset using Hugging Face, assigns a random user_id, and ingests documents in batches with the associated metadata. Basically, we are populating the collection with real-world, semi-structured data.

The challenge for me at this point was ingestion speed. The task required a million data points, and ingestion was super slow; I wasn't sure whether the bottleneck was Python or Qdrant.
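Batching upserts (and keeping batches modest, say a few hundred points) is usually the first lever for ingestion speed. The batched ingestion boils down to a chunking helper like this (the helper is my sketch, not the repo's code):

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items, so points can be upserted in batches."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# e.g. for batch in batched(documents, 256): client.upsert("hybrid_search", points=batch)
print(list(batched(range(10), 4)))  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

If batching alone isn't enough, the usual next steps are the gRPC port instead of HTTP and running several upsert workers in parallel.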

Part 3: search.py (Hybrid Search and Re-ranking)

Once the data was flowing into the database, the next step was building the real magic: the search experience.

Inside search.py, we take a user's query and do several things under the hood:

  • First, we encode the query into a dense vector (semantic meaning) and a sparse vector (keyword/token-based meaning).
  • Then, we run a hybrid search in Qdrant, combining the strengths of both approaches.
  • After getting the initial candidates, we re-rank the results using a late interaction model based on a Cross Encoder (cross-encoder/ms-marco-MiniLM-L-6-v2).
  • The search also supports filtering by user_id if you want to restrict results to specific users, like a scoped search.
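Qdrant can fuse the dense and sparse result lists server-side, but the core idea is easy to see in isolation with reciprocal rank fusion, a common fusion scheme. This standalone sketch (with made-up document IDs) is mine, not the project's exact code:

```python
def reciprocal_rank_fusion(dense_hits, sparse_hits, k=60):
    """Fuse two ranked lists of doc IDs into one, RRF-style.

    A document scores 1 / (k + rank) in each list it appears in;
    the constant k keeps a single-list outlier from dominating.
    """
    scores = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top hits from each search arm
dense = ["messi_bio", "fc_barcelona", "world_cup_2022"]
sparse = ["world_cup_2022", "messi_bio", "argentina_nt"]
print(reciprocal_rank_fusion(dense, sparse))
```

Documents that rank well in both lists float to the top, which is exactly the behavior you want before handing the shortlist to the cross-encoder for re-ranking.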

The final outcome? You get smarter, more contextually relevant search results.

At the beginning of the ingestion (when I only had ~49 documents), results weren't amazing (See bonus track below), but once you start feeding real data, the difference becomes noticeable.

Running a search is as simple as:

python3 search.py "Who is Lionel Messi?"

And if you want to filter by a specific user:

python3 search.py "Who is Lionel Messi?" --user-id 7

Result #1
ID: 560371
Original Score: 1.0000
Re-rank Score: 9.2263
User ID: 8
Text: Lionel Andrés Messi (; born 24 June 1987), also known as Leo Messi, is an Argentine professional footballer who plays as a forward for  club Paris Saint-Germain and captains the Argentina national team. Often considered the best player in the world and widely regarded as one of the greatest players 

Environment Setup (Getting Everything Up and Running)

Now that you know what each part does, here’s how to get your environment ready to actually spin up the cluster, load the data, and start searching.

1. Create and Activate a Virtual Environment

We always want a clean Python environment for projects like this. Run:

python3 -m venv env
source env/bin/activate

This will create a virtual environment named env and activate it.

2. Install the Project Dependencies

Install everything needed with just one command:

pip install -r requirements.txt

This pulls all the libraries we use: Qdrant client, Sentence Transformers, Hugging Face datasets, etc.

3. Start the Qdrant Cluster

Now we need to bring our two-node Qdrant cluster to life:

docker compose up -d

This launches two Qdrant nodes locally, with replication and sharding enabled, so it feels like a mini production cluster.

4. Initialize the Collection

Once the cluster is running, let’s initialize the collection and download the data.

python3 init_qdrant.py

This will:

  • Delete any existing collection called hybrid_search
  • Create a new collection with dense/sparse vectors and binary quantization
  • Set up the user_id index for filtering later
  • Download the Wikipedia dataset from Hugging Face

Finally, you can run a hybrid search with re-ranking:

python3 search.py "your query here"

Example, like I showed you before:

python3 search.py "Who is Lionel Messi?"

Final Thoughts (Wrapping It Up)

Even though only a handful of documents (around 49!) were ingested at the beginning, the whole architecture is ready for scale.

This project demonstrates a complete flow: setting up a clustered environment, ingesting real-world data, doing hybrid vector search, applying binary quantization, and re-ranking results for better accuracy, all production-grade concepts.

It was a fantastic experiment born from a real-world application challenge at Qdrant, and even though the journey didn't end with the job, it absolutely sharpened my skills and made me want to go deeper into vector databases, hybrid search, and search architecture.

If you're starting your journey with vector databases, or planning to build smarter search systems, this setup is a perfect playground to learn, break things, and improve.

Happy experimenting! 🚀

Bonus Track

#architectthis #ai #machinelearning #nlp #vectorsearch #hybridsearch #llm… | Ignacio Van Droogenbroeck
I&#39;m working on prototyping a Hybrid Search App using Qdrant. The dataset? Wikipedia. Last night, I started ingesting documents into the database. Just a few minutes in, after only a handful of entries, I ran a simple test query: &quot;Who is Lionel Messi?&quot; And here’s what came back as result #1: &quot;Jeffrey Lionel Dahmer (; May 21, 1960 – November 28, 1994), also known as the Milwaukee Cannibal or the Milwaukee Monster, was an American serial killer...&quot; Not exactly the GOAT I was expecting. The takeaway? If your ingestion process isn’t clean, it doesn’t matter how advanced your methods are (sparse, dense, reranked, trained, or unicorn-optimized) the output will still be trash. Bad data in &#61; bad results out. And no reranker in the world is going to turn Dahmer into Messi. #ArchitectThis #AI #MachineLearning #NLP #VectorSearch #HybridSearch #LLM #OpenSource #Qdrant #HappyFriday
]]>
<![CDATA[Why I left AWS]]>https://cduser.com/why-i-left-aws/68077f617a714d000188aabcTue, 22 Apr 2025 13:06:31 GMT

IMPORTANT: If you’re looking for a rant against AWS, you won’t find it here.

It’s interesting how we're shaped to believe we shouldn’t let any opportunity pass us by. Partially, I agree with that mindset, but there’s a significant "BUT": I prefer to prioritize things beyond prestige or money. Many friends called me crazy for leaving an amazing company like AWS after such a short time. Let me explain why I made that decision and what truly matters to me.

A Little About Me

To better understand my perspective, let me tell you a bit about myself. Don't worry, I won’t list my career history or mention all my past colleagues. Instead, I'll focus on my personality.

I’m not your typical technical professional. I constantly challenge the status quo, often becoming the "annoying" person asking "why?" at every step. I'm the guy who knows nothing about Marvel or Star Wars, who thrives on dynamism, perhaps excessively at times, and who complains about rigid processes or their absence. I learn by doing, and above all, prioritize my emotional well-being.

With that context, let me share how my nearly three-month journey at AWS unfolded, and perhaps you’ll understand why I left.

The Hiring Process

The hiring process at AWS can be challenging if you don’t carefully follow the extensive resources provided at each stage. If you fail, it's likely because you didn’t put in the necessary effort or lacked relevant experience. However, if you follow the recruiter’s guidance and prepare adequately, getting into AWS is quite manageable.

Credit where it's due, the hiring was efficient, taking about a month from interviews to receiving an offer, despite coinciding with AWS re:Invent. Kudos to them on that front.

However, things got interesting after accepting the offer. The onboarding process ("Embark") involved substantial paperwork and definitions, some of which felt unnecessarily complex and unclear. Yet, managing onboarding at scale with 1.5 million employees can't be easy, so I can’t fault them entirely.

Onboarding Experience

The "fun" began on my first day, seated in front of a 16-inch MacBook Pro M1, starting the onboarding. I spent 116 hours watching videos about processes, tools, AWS technical content (valuable, undoubtedly), and company culture. Don't get me wrong, I'm not trying to offend anyone, but those videos drained my soul. Sitting passively without the opportunity to shadow peers, understand business metrics, or even grasp how my skills added value to the team was deeply frustrating, especially as I was eager to dive right in.

Certain mandatory processes made me feel miserable, yet there was no alternative. Are these processes inherently wrong? No. Managing people at such scale isn't simple, and I don't have all the answers. However, one crucial point I'd like to emphasize is: Culture is experienced through daily interactions, not through videos.

The moment I realized AWS wasn't the right fit FOR ME, and yes, the caps are intentional, as this is a personal reflection, occurred during an "Awesome Builder" event. While preparing an AWS value proposition presentation, I struggled in dry runs and my initial event. As a storyteller, I naturally wanted to infuse my personal style into the presentation. Unfortunately, I was discouraged from incorporating my past experiences. I was explicitly told that AWS has a unique approach, implying my prior expertise was irrelevant. While I understand the need for consistency, completely disregarding my 20 years of cloud services experience felt dismissive, after all, I had been selling cloud services long before many current employees started working.

What I Learned

My primary takeaway from this experience is clear: large corporations aren’t for me. No matter how agile or innovative they appear, big companies don’t align with my working style or personality.

What suits me best? The messy, chaotic world of startups, where scopes change weekly, roles overlap, and wearing multiple hats (solutions architect, post-sales engineer, technical support) is the norm. I joke about this dynamic regularly in my newsletter, Architect This!, but honestly, I thrive in it.

Architect This! | LinkedIn
Ignacio Van Droogenbroeck | Powered by caffeine, chaos, and 20 years of infrastructure mistakes I’ve already made so you don’t have to.

I prefer companies where onboarding involves direct interaction with customers, not recklessly, of course, but with genuine engagement. Organizations where flexibility reigns, where you know your colleagues personally because there aren't more than 100 of them, and you've collaborated with most. Here, culture isn't taught via videos or quizzes about escalating issues to managers; it's something you breathe and live daily.

That’s why I left AWS. Perhaps for you, AWS is the perfect place, and sincerely, I hope it will be. But for someone like me, constantly chasing novelty, embracing uncertainty, making bold moves, traveling 14 hours just to explore new opportunities and innovations, stability or retiring comfortably in a big company simply isn’t appealing. Doing things right matters, but for me, a conventional path isn’t the perk I’m after.

How the Future Looks

I'm still figuring it out—I won’t pretend otherwise. I'm currently exploring opportunities in the startup world, investing time in ExyData, and diving deep into 3D printing. I'm optimistic something good will come along, not because I'm overly confident, but because I'm continually learning, exploring, and genuinely enjoying life.

Of course, I'm open to Sales Engineering or Solution Architecture roles at fully remote companies. If you're looking for a seasoned professional in these areas with strong business acumen, let’s talk.

]]>
<![CDATA[Why We Shouldn't Fall in Love with Technology: Lessons from 20 Years in Tech]]>https://cduser.com/why-we-shouldnt-fall-in-love-with-technology-lessons-from-20-years-in-tech/677d410b2405100001c28642Tue, 07 Jan 2025 17:28:58 GMTIntroductionWhy We Shouldn't Fall in Love with Technology: Lessons from 20 Years in Tech

In 2024, I reached a milestone of 20 years working in technology companies. Throughout my career, I've worked with vendors in multinational companies like Microsoft, Telefonica, and now AWS, as well as startups like InfluxData (creators of InfluxDB). I've founded (and failed at) a few startups, worked with Value Add Distributors like Licencias OnLine and Westcon Comstor, and collaborated with software factories providing staff augmentation to other companies. One common thread I've noticed across these organizations (except AWS, where I'm still in the onboarding phase) is how deeply people fall in love with technology, sometimes to the point where criticizing their preferred technology is taken as a personal insult.

It's shocking to me when technical disagreements are perceived as personal attacks. Let me explain why.

The Problem with Tech Attachment

Despite my 20 years in technology and 16 years of blogging, during which I've experimented with hundreds of technologies, I've always viewed technology as a means to an end, not the end itself. While I certainly have my preferences, I've never become emotionally attached to what I've built or chosen at any point in my career. I remain open to learning about new technologies, methods, or approaches that can help me achieve my goals, whether business, personal, or educational. For instance, I use Ghost for my blog not because I hate WordPress, but because it allows me to focus on what's important to me: writing.

Real-World Examples

I recall presenting Terraform to several teams at one organization, demonstrating how we could reduce weeks of work to just an hour or less. Some audience members questioned this, defending their months-long Bash scripting project. They resisted learning something new, citing concerns about standardization and claiming, "Bash is good enough for us so far."

In my last startup, I focused on solving customer needs. As a non-experienced developer, I managed to code a Python engine that could interact with multiple cloud providers to deploy databases, authentication mechanisms, and billing features within minutes. Despite being proud of this "baby," I remained clear-headed about the need to pivot if it wasn't meeting market needs, regardless of how impressive the technology was or how proud I was of what I built.

How to Overcome Technology Attachment

A few thoughts on how to overcome this...

  1. Don't take things personally - It's not an attack on you if your preferred technology (like InfluxDB) is outperformed by an alternative (like ClickHouse).
  2. Create solutions that solve real problems - Don't build complex architectures just because you can; work backwards from customer needs, as Amazon does so well.
  3. Stay open-minded - When presented with new solutions or methodologies, listen and evaluate their potential value before dismissing them.
  4. Keep learning and iterating - Continue to read, practice, and learn from failures.
  5. Remember: Technology is a means, not an end.

Conclusion

I understand it's challenging to move away from technologies and practices we've championed throughout our careers. Our ego makes it difficult to accept that our long-held beliefs might not be optimal anymore. However, we must maintain the same curiosity and openness to learning that we had at the beginning of our careers.

The key is to remain receptive to new ideas while being discerning, embrace what's useful and discard what isn't. Have a great 2025!

]]>
<![CDATA[Machine Learning for Voice Recognition: How To Create a Speaker Identification Model in Python]]>https://cduser.com/machine-learning-for-voice-recognition-how-to-create-a-speaker-identification-model-in-python/6720ddc52405100001c285b1Tue, 29 Oct 2024 14:13:38 GMT

A few months ago, I started messing around with machine learning by building a model that could spot cracks in roads. It was super interesting, with a lot of learning involved as I fine-tuned the model until it could actually recognize cracks.

Today, we’re going to take this idea and apply it to audio. I trained a model using Torchaudio and Python to identify speakers, and I’ll walk you through how to set it up yourself.

Data Collection

To build a good model, the quality of the data is key. We need clean data so the model can learn effectively. For this project, I used speeches from U.S. presidents. I wanted to create a model that could recognize who’s speaking.

So, I headed over to the Miller Center website and downloaded the first five speeches from Joe Biden, Donald Trump, and Barack Obama. Then, I set up three folders, one for each president, and saved the speeches as MP3 files in each folder.

Presidential Speeches | Miller Center
data/barack_obama:
total 5632496
-rw-r--r--@ 1 nacho  staff    52M Oct 28 13:36 bho_2015_0626_ClementaPickney.mp3
-rw-r--r--@ 1 nacho  staff    83M Oct 28 13:36 bho_2016_0112_StateoftheUnion.mp3
-rw-r--r--@ 1 nacho  staff    47M Oct 28 13:37 bho_2016_0322_PeopleCuba.mp3
-rw-r--r--@ 1 nacho  staff    58M Oct 28 13:36 bho_2016_0515_RutgersCommencement.mp3
-rw-r--r--@ 1 nacho  staff    71M Oct 28 13:35 obama_farewell_address_to_the_american_people.mp3

data/donald_trump:
total 1023680
-rw-r--r--@ 1 nacho  staff   2.5M Oct 28 13:30 Message_Donald_Trump_post_riot.mp3
-rw-r--r--@ 1 nacho  staff    15M Oct 28 13:31 President_Trump_Remarks_on_2020_Election.mp3
-rw-r--r--@ 1 nacho  staff   932K Oct 28 13:30 Trump_message_to_supporters_during_capitol_riot.mp3
-rw-r--r--@ 1 nacho  staff   4.8M Oct 28 13:29 message_from_trump.mp3
-rw-r--r--@ 1 nacho  staff    18M Oct 28 13:29 trump_farewell_address.mp3

data/joe_biden:
total 1336536
-rw-r--r--@ 1 nacho  staff   9.0M Oct 28 13:33 biden_addresses_nation_2024_07_14.mp3
-rw-r--r--@ 1 nacho  staff    15M Oct 28 13:33 biden_addresses_nation_2024_07_25.mp3
-rw-r--r--@ 1 nacho  staff    34M Oct 28 13:32 biden_remarks_UN_2024_09_24.mp3
-rw-r--r--@ 1 nacho  staff    20M Oct 28 13:33 biden_remarks_on_middle_east.mp3

Once I had this in place, it was time to start coding the trainer.

Building the Trainer

Now that we have our data organized, it’s time to get coding and set up the trainer. We’ll be using PyTorch and Torchaudio to process the audio data and train our model to recognize different speakers.

Step 0: Converting MP3 to WAV

First up, we need our audio data in a consistent format. MP3 is great for compression, but converting to WAV gives us a uniform, uncompressed format that audio tooling handles more predictably (note that conversion can't recover quality already lost to MP3 compression). We run through each speaker's folder, find all the MP3 files, and convert them to WAV format with pydub.

import os
import torch
import torchaudio
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from pydub import AudioSegment

# Step 0: Convert MP3 files to WAV format
def convert_mp3_to_wav(root_dir):
    for speaker in os.listdir(root_dir):
        speaker_dir = os.path.join(root_dir, speaker)
        for file_name in os.listdir(speaker_dir):
            if file_name.endswith(".mp3"):
                mp3_path = os.path.join(speaker_dir, file_name)
                wav_path = os.path.join(speaker_dir, f"{os.path.splitext(file_name)[0]}.wav")
                audio = AudioSegment.from_mp3(mp3_path)
                audio.export(wav_path, format="wav")
                print(f"Converted {mp3_path} to {wav_path}")

Step 1: Preparing the Data with SpeakerDataset

To feed our model, we define a custom dataset class called SpeakerDataset. Here’s what’s happening in this part:

  • Loading Data: We loop through each speaker’s folder, collecting each audio file and labeling it based on the speaker’s name. This label will later help our model know which speaker is which.
  • Label Encoding: Since the model works with numbers (not names), we convert each speaker’s name into a unique integer label using LabelEncoder from scikit-learn. This way, the model can focus on learning the patterns in each speaker’s voice, rather than interpreting names.
  • Extracting MFCC Features: MFCC (Mel Frequency Cepstral Coefficients) is a popular technique in audio processing, especially for speech and speaker recognition. It transforms our audio waveform into features that make it easier for the model to learn speaker-specific patterns. We specify parameters like n_mfcc and n_mels to tweak the level of detail the model will learn from, keeping it simple for now with n_mfcc=13 and n_mels=40 (the same values we'll use again at prediction time).
# Step 1: Define the dataset class for speaker classification
class SpeakerDataset(Dataset):
    def __init__(self, root_dir, n_mfcc=13, n_mels=80):
        self.root_dir = root_dir
        self.speakers = os.listdir(root_dir)
        self.file_paths = []
        self.labels = []
        self.n_mfcc = n_mfcc
        self.n_mels = n_mels
        
        # Label encoding
        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(self.speakers)
        
        # Load file paths and labels
        for speaker in self.speakers:
            speaker_dir = os.path.join(root_dir, speaker)
            for file_name in os.listdir(speaker_dir):
                if file_name.endswith(".wav"):
                    self.file_paths.append(os.path.join(speaker_dir, file_name))
                    self.labels.append(speaker)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        file_path = self.file_paths[idx]
        label = self.labels[idx]
        waveform, sample_rate = torchaudio.load(file_path)
        mfcc = torchaudio.transforms.MFCC(
            sample_rate=sample_rate, 
            n_mfcc=self.n_mfcc, 
            melkwargs={'n_mels': self.n_mels}
        )(waveform)
        mfcc = mfcc.mean(dim=2).squeeze()  # Reduce to 1D by averaging over time

        # Encode label as integer
        label = self.label_encoder.transform([label])[0]
        return mfcc, label

Step 2: Building the Model with SpeakerClassifier

Now we define our neural network model using a simple architecture. This SpeakerClassifier has two fully connected (FC) layers:

  • First Layer (fc1): Takes our MFCC features as input and learns basic voice patterns.
  • Activation Layer: Adds some non-linearity with ReLU, helping the model capture complex voice patterns.
  • Second Layer (fc2): Outputs predictions for each speaker, giving us the final classification.

This setup is simple but effective, and it’s small enough to run on most machines without needing specialized hardware.

# Step 2: Define the classification model
class SpeakerClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SpeakerClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 80)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(80, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the MFCC features
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

Step 3: Training the Model

The training loop is where we teach the model to recognize each speaker. Here’s the basic flow:

  1. Forward Pass: The model makes predictions based on the MFCC features.
  2. Calculate Loss: We use CrossEntropyLoss, which is a great choice for classification tasks like ours.
  3. Backpropagation: The model learns from its mistakes and updates its parameters to improve.
  4. Repeat: We run this over 20 epochs to make sure the model gets enough practice with the data.
# Step 3: Train the model
def train_model(model, dataloader, criterion, optimizer, num_epochs=20):
    for epoch in range(num_epochs):
        running_loss = 0.0
        for mfcc, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(mfcc)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(dataloader):.4f}")

Step 4: Putting It All Together

Finally, in the main function, we:

  • Convert our MP3s to WAVs (if they haven’t already been converted).
  • Initialize our dataset and data loader.
  • Set up and train our model with the features from each speaker’s audio.
  • Save the Model: Once training is done, we save the model weights and the label encoder so we can load them later for predictions.

And that’s it! By the end, we have a model ready to identify speakers based on audio input.

# Step 4: Main function to train the speaker classification model
def main():
    # Convert MP3 files to WAV format
    root_dir = 'data/'  # Update with your data directory
    convert_mp3_to_wav(root_dir)

    # Path to your data folder
    dataset = SpeakerDataset(root_dir, n_mfcc=13, n_mels=40)  # n_mels=40 must match the value used at prediction time
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

    # Model setup
    sample_mfcc, _ = dataset[0]
    input_size = sample_mfcc.numel()
    num_classes = len(dataset.speakers)
    model = SpeakerClassifier(input_size, num_classes)

    # Training setup
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    train_model(model, dataloader, criterion, optimizer, num_epochs=20)

    # Save the trained model and the label encoder
    torch.save(model.state_dict(), 'speaker_classifier.pth')
    torch.save(dataset.label_encoder, 'label_encoder.pth')
    print("Model and label encoder saved.")

if __name__ == "__main__":
    main()

This setup keeps things simple but effective.

Let's save this under, trainer.py, and execute it. The result, should be something like this:

python3 trainer.py

You'll first see the MP3-to-WAV conversion log, followed by the loss for each epoch:

Converted data/barack_obama/bho_2016_0322_PeopleCuba.mp3 to data/barack_obama/bho_2016_0322_PeopleCuba.wav
Converted data/barack_obama/obama_farewell_address_to_the_american_people.mp3 to data/barack_obama/obama_farewell_address_to_the_american_people.wav
Converted data/barack_obama/bho_2016_0112_StateoftheUnion.mp3 to data/barack_obama/bho_2016_0112_StateoftheUnion.wav
Converted data/barack_obama/bho_2016_0515_RutgersCommencement.mp3 to data/barack_obama/bho_2016_0515_RutgersCommencement.wav
Converted data/barack_obama/bho_2015_0626_ClementaPickney.mp3 to data/barack_obama/bho_2015_0626_ClementaPickney.wav
Converted data/joe_biden/biden_addresses_nation_2024_07_14.mp3 to data/joe_biden/biden_addresses_nation_2024_07_14.wav
Converted data/joe_biden/biden_remarks_on_middle_east.mp3 to data/joe_biden/biden_remarks_on_middle_east.wav
Converted data/joe_biden/biden_addresses_nation_2024_07_25.mp3 to data/joe_biden/biden_addresses_nation_2024_07_25.wav
Converted data/joe_biden/biden_remarks_UN_2024_09_24.mp3 to data/joe_biden/biden_remarks_UN_2024_09_24.wav
Converted data/donald_trump/trump_farewell_address.mp3 to data/donald_trump/trump_farewell_address.wav
Converted data/donald_trump/President_Trump_Remarks_on_2020_Election.mp3 to data/donald_trump/President_Trump_Remarks_on_2020_Election.wav
Converted data/donald_trump/Trump_message_to_supporters_during_capitol_riot.mp3 to data/donald_trump/Trump_message_to_supporters_during_capitol_riot.wav
Converted data/donald_trump/message_from_trump.mp3 to data/donald_trump/message_from_trump.wav
Converted data/donald_trump/Message_Donald_Trump_post_riot.mp3 to data/donald_trump/Message_Donald_Trump_post_riot.wav
Epoch [1/20], Loss: 5.0933
Epoch [2/20], Loss: 2.2040
Epoch [3/20], Loss: 2.5992
Epoch [4/20], Loss: 0.9386
Epoch [5/20], Loss: 0.8153
Epoch [6/20], Loss: 0.4925
Epoch [7/20], Loss: 0.3129
Epoch [8/20], Loss: 0.2327
Epoch [9/20], Loss: 0.2054
Epoch [10/20], Loss: 0.1137
Epoch [11/20], Loss: 0.1498
Epoch [12/20], Loss: 0.0704
Epoch [13/20], Loss: 0.1161
Epoch [14/20], Loss: 0.1120
Epoch [15/20], Loss: 0.1477
Epoch [16/20], Loss: 0.0666
Epoch [17/20], Loss: 0.1880
Epoch [18/20], Loss: 0.0724
Epoch [19/20], Loss: 0.1370
Epoch [20/20], Loss: 0.0584
Model and label encoder saved.

Testing the Model

We’ve trained our model and can see how it improved over each epoch. You should now see two files in your directory: speaker_classifier.pth and label_encoder.pth. These contain the model’s learned parameters and the encoded labels for each speaker. We’ll load them in a moment to put our model to the test.

Show Me the Results

To test the model, we’ll use a fresh speech sample. Head over to the Miller Center website, download any speech by Obama, Trump, or Biden that wasn’t used in the training set, and save it in the data folder with the filename new.mp3.

Let’s dive into the prediction script.

import torch
import torch.nn as nn
import torchaudio
from sklearn.preprocessing import LabelEncoder
from pydub import AudioSegment

# Step 0: Convert new.mp3 to new.wav
def convert_mp3_to_wav(file_path):
    audio = AudioSegment.from_mp3(file_path)
    audio.export("data/new.wav", format="wav")

First, we convert our new MP3 sample into WAV format. This step makes sure the format matches what we used for training, ensuring smoother processing.

Loading the Model

Now we define the same model structure (SpeakerClassifier) and load our trained model and label encoder.

# Step 1: Define the model class (same as used during training)
class SpeakerClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SpeakerClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 80)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(80, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the MFCC features
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Step 2: Load the model and label encoder
def load_model(input_size, num_classes):
    model = SpeakerClassifier(input_size, num_classes)
    model.load_state_dict(torch.load('speaker_classifier.pth', weights_only=False))
    model.eval()
    return model

label_encoder = torch.load('label_encoder.pth', weights_only=False)

Here, we’re setting up the model just like in training. Then, we load the saved weights (speaker_classifier.pth) and the label encoder (label_encoder.pth) to map predictions back to speaker names.

Preparing the New Audio Sample

Next, we extract MFCC features from the new audio file. This converts the audio into features that our model can work with.

# Step 3: Prepare the new audio sample
def extract_mfcc(file_path, n_mfcc=13, n_mels=40):  # n_mels must match the value used during training
    waveform, sample_rate = torchaudio.load(file_path)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate, 
        n_mfcc=n_mfcc, 
        melkwargs={'n_mels': n_mels}
    )(waveform)
    return mfcc.mean(dim=2).squeeze()

Predicting the Speaker

This is the moment of truth! In predict_speaker, we pass our MFCC features through the model and let it make a prediction. The model’s output is then matched to the corresponding speaker’s name.

# Step 4: Predict speaker
def predict_speaker(file_path, model, label_encoder):
    mfcc = extract_mfcc(file_path).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        output = model(mfcc)
        _, predicted = torch.max(output, 1)
        speaker_name = label_encoder.inverse_transform([predicted.item()])[0]
        print(f"Predicted Speaker: {speaker_name}")
        return speaker_name

Running the Prediction

In the last part, we:

  1. Convert our test MP3 file into WAV format.
  2. Calculate the input size from a sample MFCC.
  3. Load the trained model and label encoder.
  4. Run predict_speaker to see if the model correctly identifies the speaker.
# Example usage
if __name__ == "__main__":
    # Convert MP3 to WAV
    convert_mp3_to_wav("data/new.mp3")
    
    # Calculate input size from a sample MFCC
    sample_mfcc = extract_mfcc("data/barack_obama/bho_2015_0626_ClementaPickney.wav")
    input_size = sample_mfcc.numel()
    num_classes = len(label_encoder.classes_)

    model = load_model(input_size, num_classes)
    
    # Predict the speaker from a new audio file
    query_file = "data/new.wav"  # Replace with your query file path
    predict_speaker(query_file, model, label_encoder)

In my case, I saved everything together in a file called predictor.py and ran it:

python3 predictor.py

The result printed on the screen was:

Predicted Speaker: barack_obama

This means it worked! As I mentioned earlier, I selected this speech by Barack Obama randomly. The speech I used was November 15, 2021: Signing the Infrastructure Investment and Jobs Act from the Miller Center.

To Conclude

As you can see, training your own models isn’t too hard. The key is having clean, well-organized data to ensure the model can effectively make predictions, in this case, recognizing the speaker.

For other use cases, you may need more data, depending on the variety of what you’re trying to analyze.

This model is just the beginning. Later, we’ll integrate it with Milvus to store a larger volume of data for deeper analysis and predictions. One goal I have in mind is to build my own assistant, and this code is the first step toward that.

Let me know what you think and if any other use cases come to mind for this approach!

]]>
<![CDATA[Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python]]>https://cduser.com/time-series-data-meets-blockchain-storing-time-series-data-with-solidity-ganache-and-python/671a67782405100001c284c4Thu, 24 Oct 2024 16:49:31 GMT

I was browsing x.com the other day and came across a tweet in Spanish from Mariano, a member of SysArmy. Just for fun and as a challenge, I decided to build a time-series data storage system using smart contracts. I can't help but laugh as I write this because, against all odds, it actually works pretty well.

A Few Things to Keep in Mind

Using a public blockchain probably isn’t the most cost-effective solution for this, mainly because of the gas fees required for every transaction. It’s also not the fastest option. However, using a private blockchain makes more sense for this particular use case. And of course, I did it for the classic reason all tech enthusiasts understand: not because it’s the best solution, but because I could do it.

Let’s Get Started

Here’s the plan: we’ll use Ganache to set up an internal Ethereum network, write and deploy smart contracts using Solidity and Truffle, and use Python to interact with the blockchain.

But before we start typing commands into the console, let's define a few things.

  • Smart Contracts as Databases: Conceptually, we’ll treat each smart contract as a database.
  • Storing Data Points: Using Solidity, we'll create a contract that lets us add data points with a timestamp and value to various measurements (like "temperature" or "humidity").

I'm a huge fan of InfluxDB's line protocol and how simple it is, so we'll aim to replicate that schema. If you're not familiar with it, the idea is that each point is just a measurement, a value, and a timestamp, e.g. temperature value=42 1729783306.
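To pin down that shape before we write any Solidity, here's a tiny sketch (these helpers are hypothetical, just for illustration) of a point as a (measurement, value, timestamp) triple:

```python
import time

def make_point(measurement, value, timestamp=None):
    """Build a (measurement, value, timestamp) triple; defaults to 'now'."""
    if timestamp is None:
        timestamp = int(time.time())
    return (measurement, int(value), int(timestamp))

def to_line(point):
    """Render a point in a line-protocol-like string, e.g. 'temperature value=42 1729783306'."""
    measurement, value, timestamp = point
    return f"{measurement} value={value} {timestamp}"
```

This is exactly the triple our smart contract will accept: the measurement keys a list of (timestamp, value) entries.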

Deploying Ganache

Now that we’ve got a basic plan, let’s deploy Ganache. This tool will create a local Ethereum network on your computer and provide accounts with Ethereum addresses and private keys. You’ll use one of these accounts to deploy the contract and interact with the blockchain.

You can download Ganache from here:

Ganache - Truffle Suite
Quickly fire up a personal Ethereum blockchain which you can use to run tests, execute commands, and inspect state while controlling how the chain operates.
Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python

Also, if you have npm installed, you can run this in your terminal:

npm install -g ganache-cli                                                 

Once it's installed, start Ganache by running:

ganache-cli       

You will see something like this:

ganache v7.9.2 (@ganache/cli: 0.10.2, @ganache/core: 0.10.2)
Starting RPC server

Available Accounts
==================
(0) 0xfC91D2E9b4fc7dD6258c0886f9C110BD353B50a6 (1000 ETH)
(1) 0xefbA0CB44F5d53aaBeFbbde41FFDcF151cfCF33C (1000 ETH)
(2) 0x709c88fD11dDba44aeD07Ca956059271450AE958 (1000 ETH)
(3) 0xa34c65e1EA5DA9a041621878B39f3334CFB7796f (1000 ETH)
(4) 0x6495F34b084A350601F81E74CFec0eEe3603F2A7 (1000 ETH)
(5) 0xa7827f684f2D1EDfF127d8d9D7625fD543Db880E (1000 ETH)
(6) 0x85217E5CD47225042f287A42321a6a7c0133a259 (1000 ETH)
(7) 0x617E74C5Faa90D29a9B9EC572F11f581F485ac75 (1000 ETH)
(8) 0xc56E0568362F88D1A107529Aa914dEC64eBE013B (1000 ETH)
(9) 0x108cda52Fd5DD40521D70413DD6AEDA2e0382A3E (1000 ETH)

Private Keys
==================
(0) 0xb22bb380153b13e6e7d97b98c4d92c3514955d77f23c498933b4393c1af49b76
(1) 0x6f31460485db5761470c0766d7e95f5b541e05a0054cf1afc2b1fb3894016cd1
(2) 0x25b8f4281be5999c7d44ff5d58faafc93bf222f2e01a2f5c7b09cd0ae7cf2962
(3) 0x3c7829f018cd1a96768704eee3c8e8aea82f406c5e3c133b806b7f80eb1bf7a8
(4) 0xe98b0fb1e708c3fcbc5fa03f16bd06b4f4299641488c82741d8c9097823b3776
(5) 0x99c163dec2e45e3d01211823ca6845d3b9952a32deec5f32eb51f6a76bb441af
(6) 0x7e5c1e58283be7f554cff8495745719ba6c3b34169aabd39051db6d6bcbcecd9
(7) 0x095134a44a8ef6f65f28f8ac02488825a80dbee0859ad3c173e15e6952cd8d6c
(8) 0x72c442d743ffc881cd6dd13289a616e0ca8daaec1f91b319d4320f42b6b11a7f
(9) 0x9c02a1ffcebd377d2a363910740781c1f6a1f39457da6335e5afff327e9b5567

HD Wallet
==================
Mnemonic:      chair vendor dial human olive dinner morning negative elevator alien catalog recipe
Base HD Path:  m/44'/60'/0'/0/{account_index}

Default Gas Price
==================
2000000000

BlockGas Limit
==================
30000000

Call Gas Limit
==================
50000000

Chain
==================
Hardfork: shanghai
Id:       1337

RPC Listening on 127.0.0.1:8545

This means that our network is up and running. As you can see, by default, we have ten accounts with their corresponding private keys.

Create a Truffle Project, Compile, and Deploy the Smart Contract

Now, let's create a project to write, compile, and deploy our smart contract to the network. Run the following commands:

mkdir blockchain-datastorage
cd blockchain-datastorage
truffle init

Let's modify the truffle-config.js file so it looks like this:

module.exports = {
  networks: {
    development: {
      host: "127.0.0.1",
      port: 8545,
      network_id: "*",
    },
  },
  compilers: {
    solc: {
      version: "0.8.0", // Solidity version to use
    },
  },
};

Create the Smart Contract Using Solidity

Now, let's create our smart contract using Solidity. Inside the contracts folder, create a file named Database.sol with the following content:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract Database {
    struct DataPoint {
        uint256 timestamp;
        int256 value;
    }

    mapping(string => DataPoint[]) private measurements;

    function addDataPoint(string memory measurement, int256 value, uint256 timestamp) public {
        measurements[measurement].push(DataPoint(timestamp, value));
    }

    function getDataPoint(string memory measurement, uint256 index) public view returns (uint256, int256) {
        require(index < measurements[measurement].length, "Index out of bounds");
        DataPoint memory dataPoint = measurements[measurement][index];
        return (dataPoint.timestamp, dataPoint.value);
    }

    function getDataPointCount(string memory measurement) public view returns (uint256) {
        return measurements[measurement].length;
    }
}

As you can see here, we are defining the schema with three key components: the measurement (e.g., "temperature"), a value, and the timestamp for when the data point was recorded.

Compile the Contract

Now, let's compile the contract with the following command:

truffle compile

Deploy the Contract

Now, let's deploy the contract. First, create a new file in the migrations folder named 2_deploy_database.js with the following content:

const Database = artifacts.require("Database");

module.exports = function (deployer) {
  deployer.deploy(Database);
};

This script tells Truffle to deploy our Database contract to the network.

Once we have this saved, let's run this in the console:

truffle migrate --network development --reset

The output should look something like this:



Compiling your contracts...
===========================
> Everything is up to date, there is nothing to compile.


Starting migrations...
======================
> Network name:    'development'
> Network id:      1729783115705
> Block gas limit: 30000000 (0x1c9c380)


2_deploy_database.js
====================

   Deploying 'Database'
   --------------------
   > transaction hash:    0x395f273d8f96876f040dbda4ef4583cda15ecda8d6ce51fe6fcf15237922166d
   > Blocks: 0            Seconds: 0
   > contract address:    0xBA16d41238F91611caC04C958Da929cb78F2497B
   > block number:        1
   > block timestamp:     1729783252
   > account:             0xD410820FF4D7a00a03a0a58Bed49e81E4Ae29573
   > balance:             999.998579098
   > gas used:            421008 (0x66c90)
   > gas price:           3.375 gwei
   > value sent:          0 ETH
   > total cost:          0.001420902 ETH

   > Saving artifacts
   -------------------------------------
   > Total cost:         0.001420902 ETH

Summary
=======
> Total deployments:   1
> Final cost:          0.001420902 ETH

Save the contract address from the output (0xBA16d41238F91611caC04C958Da929cb78F2497B in my case), as we'll need it in a moment.

Writing and Retrieving Data Points Using Python

Now for the fun part, let's write some data points to our data warehouse (😆).

First, we need to install a package called web3:

pip3 install web3

Python Code to Write and Retrieve Data Points

The code below will let us write and retrieve data points from our smart contract. For simplicity, everything is in a single script:

Note: Remember the smart contract address that I mentioned earlier? You’ll need it for this script. Additionally, you’ll need an account address and private key from one of the accounts displayed in the terminal running Ganache.

from web3 import Web3
import time

# Replace with the URL of your local Ganache instance
ganache_url = "http://127.0.0.1:8545"
w3 = Web3(Web3.HTTPProvider(ganache_url))

# Check if the connection is successful
if not w3.is_connected():
    raise ConnectionError(f"Failed to connect to the blockchain at {ganache_url}")

# Replace with your contract's ABI (copy this from Truffle build artifacts)
abi = [
    {
        "inputs": [
            {"internalType": "string", "name": "measurement", "type": "string"},
            {"internalType": "int256", "name": "value", "type": "int256"},
            {"internalType": "uint256", "name": "timestamp", "type": "uint256"}
        ],
        "name": "addDataPoint",
        "outputs": [],
        "stateMutability": "nonpayable",
        "type": "function"
    },
    {
        "inputs": [
            {"internalType": "string", "name": "measurement", "type": "string"},
            {"internalType": "uint256", "name": "index", "type": "uint256"}
        ],
        "name": "getDataPoint",
        "outputs": [
            {"internalType": "uint256", "name": "", "type": "uint256"},
            {"internalType": "int256", "name": "", "type": "int256"}
        ],
        "stateMutability": "view",
        "type": "function"
    },
    {
        "inputs": [{"internalType": "string", "name": "measurement", "type": "string"}],
        "name": "getDataPointCount",
        "outputs": [{"internalType": "uint256", "name": "", "type": "uint256"}],
        "stateMutability": "view",
        "type": "function"
    }
]

# Replace with your deployed contract address
contract_address = Web3.to_checksum_address("your_smart_contract_id")

# Create the contract instance using the checksum address
database_contract = w3.eth.contract(address=contract_address, abi=abi)

# Replace with your account address and private key
account_address = "your_account_address"
private_key = "your_private_key"

# Function to add a new data point and measure the time taken
def add_data_point(measurement, value, timestamp=None):
    if timestamp is None:
        timestamp = int(time.time())

    # Measure time for adding a data point
    start_time = time.time()
    
    transaction = database_contract.functions.addDataPoint(measurement, value, timestamp).build_transaction({
        'chainId': 1337,
        'gas': 2000000,
        'gasPrice': w3.to_wei('10', 'gwei'),
        'nonce': w3.eth.get_transaction_count(account_address),
    })

    signed_txn = w3.eth.account.sign_transaction(transaction, private_key=private_key)
    tx_hash = w3.eth.send_raw_transaction(signed_txn.raw_transaction)
    w3.eth.wait_for_transaction_receipt(tx_hash)

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Added data point to '{measurement}' with value {value}. Time taken: {elapsed_time:.4f} seconds")

# Function to retrieve a data point by index and measure the time taken
def get_data_point(measurement, index):
    start_time = time.time()

    timestamp, value = database_contract.functions.getDataPoint(measurement, index).call()

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Retrieved data point from '{measurement}' at index {index}. Time taken: {elapsed_time:.4f} seconds")
    return {'timestamp': timestamp, 'value': value}

# Function to retrieve the number of data points for a given measurement
def get_data_point_count(measurement):
    return database_contract.functions.getDataPointCount(measurement).call()

# Main function to insert and retrieve data for multiple measurements
def main():
    measurements = {
        "temperature": [42, 45, 47],
        "humidity": [30, 40, 50],
        "pressure": [1001, 1002, 1003]
    }

    # Add data points for each measurement
    for measurement, values in measurements.items():
        print(f"Adding data points to '{measurement}'...")
        for value in values:
            add_data_point(measurement, value)

    # Check the data point counts for each measurement
    for measurement in measurements.keys():
        print(f"\nChecking the data point count for '{measurement}'...")
        count = get_data_point_count(measurement)
        print(f"Data point count for '{measurement}': {count}")

        # Retrieve each data point for this measurement
        for i in range(count):
            data_point = get_data_point(measurement, i)
            print(f"{measurement.capitalize()} Data Point {i} - Timestamp: {data_point['timestamp']}, Value: {data_point['value']}")

if __name__ == "__main__":
    main()
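If you wanted to feed real sensor readings into this script (from a Node-RED flow, for instance), a small validation helper keeps the contract call safe. This is only a sketch under an assumed JSON payload shape — `parse_sensor_payload` and the field names are mine:

```python
import json
import time

def parse_sensor_payload(raw):
    """Validate a JSON payload (e.g. as posted by a Node-RED flow) and
    return (measurement, value, timestamp) ready for add_data_point()."""
    data = json.loads(raw)
    measurement = str(data["measurement"])
    value = int(data["value"])  # the contract stores int256 values
    # Fall back to "now" if the sender didn't include a timestamp
    timestamp = int(data.get("timestamp", time.time()))
    return measurement, value, timestamp

# Example payload as a Node-RED "function" node might emit it:
msg = '{"measurement": "temperature", "value": 21, "timestamp": 1729787175}'
print(parse_sensor_payload(msg))  # ('temperature', 21, 1729787175)
```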

In the script, you'll notice that we're hardcoding values for this demo, but it’s easy to see how you could integrate this with a Node-RED client to push real data. Now, let's run the script and see what we get:

python3 main.py

In this example, we are adding data points for temperature, humidity, and pressure, so the output looks like this:

Adding data points to 'temperature'...
Added data point to 'temperature' with value 42. Time taken: 0.0096 seconds
Added data point to 'temperature' with value 45. Time taken: 0.0087 seconds
Added data point to 'temperature' with value 47. Time taken: 0.0090 seconds
Adding data points to 'humidity'...
Added data point to 'humidity' with value 30. Time taken: 0.0096 seconds
Added data point to 'humidity' with value 40. Time taken: 0.0092 seconds
Added data point to 'humidity' with value 50. Time taken: 0.0097 seconds
Adding data points to 'pressure'...
Added data point to 'pressure' with value 1001. Time taken: 0.0105 seconds
Added data point to 'pressure' with value 1002. Time taken: 0.0110 seconds
Added data point to 'pressure' with value 1003. Time taken: 0.0091 seconds

Checking the data point count for 'temperature'...
Data point count for 'temperature': 12
Retrieved data point from 'temperature' at index 0. Time taken: 0.0040 seconds
Temperature Data Point 0 - Timestamp: 1729786885, Value: 42
Retrieved data point from 'temperature' at index 1. Time taken: 0.0036 seconds
Temperature Data Point 1 - Timestamp: 1729786885, Value: 45
Retrieved data point from 'temperature' at index 2. Time taken: 0.0039 seconds
Temperature Data Point 2 - Timestamp: 1729786885, Value: 47
Retrieved data point from 'temperature' at index 3. Time taken: 0.0038 seconds
Temperature Data Point 3 - Timestamp: 1729786887, Value: 42
Retrieved data point from 'temperature' at index 4. Time taken: 0.0044 seconds
Temperature Data Point 4 - Timestamp: 1729786887, Value: 45
Retrieved data point from 'temperature' at index 5. Time taken: 0.0044 seconds
Temperature Data Point 5 - Timestamp: 1729786887, Value: 47
Retrieved data point from 'temperature' at index 6. Time taken: 0.0041 seconds
Temperature Data Point 6 - Timestamp: 1729786888, Value: 42
Retrieved data point from 'temperature' at index 7. Time taken: 0.0035 seconds
Temperature Data Point 7 - Timestamp: 1729786888, Value: 45
Retrieved data point from 'temperature' at index 8. Time taken: 0.0037 seconds
Temperature Data Point 8 - Timestamp: 1729786888, Value: 47
Retrieved data point from 'temperature' at index 9. Time taken: 0.0037 seconds
Temperature Data Point 9 - Timestamp: 1729787175, Value: 42
Retrieved data point from 'temperature' at index 10. Time taken: 0.0033 seconds
Temperature Data Point 10 - Timestamp: 1729787175, Value: 45
Retrieved data point from 'temperature' at index 11. Time taken: 0.0040 seconds
Temperature Data Point 11 - Timestamp: 1729787175, Value: 47

Checking the data point count for 'humidity'...
Data point count for 'humidity': 12
Retrieved data point from 'humidity' at index 0. Time taken: 0.0036 seconds
Humidity Data Point 0 - Timestamp: 1729786885, Value: 30
Retrieved data point from 'humidity' at index 1. Time taken: 0.0038 seconds
Humidity Data Point 1 - Timestamp: 1729786885, Value: 40
Retrieved data point from 'humidity' at index 2. Time taken: 0.0046 seconds
Humidity Data Point 2 - Timestamp: 1729786885, Value: 50
Retrieved data point from 'humidity' at index 3. Time taken: 0.0038 seconds
Humidity Data Point 3 - Timestamp: 1729786887, Value: 30
Retrieved data point from 'humidity' at index 4. Time taken: 0.0041 seconds
Humidity Data Point 4 - Timestamp: 1729786887, Value: 40
Retrieved data point from 'humidity' at index 5. Time taken: 0.0035 seconds
Humidity Data Point 5 - Timestamp: 1729786887, Value: 50
Retrieved data point from 'humidity' at index 6. Time taken: 0.0038 seconds
Humidity Data Point 6 - Timestamp: 1729786888, Value: 30
Retrieved data point from 'humidity' at index 7. Time taken: 0.0036 seconds
Humidity Data Point 7 - Timestamp: 1729786888, Value: 40
Retrieved data point from 'humidity' at index 8. Time taken: 0.0039 seconds
Humidity Data Point 8 - Timestamp: 1729786888, Value: 50
Retrieved data point from 'humidity' at index 9. Time taken: 0.0036 seconds
Humidity Data Point 9 - Timestamp: 1729787175, Value: 30
Retrieved data point from 'humidity' at index 10. Time taken: 0.0036 seconds
Humidity Data Point 10 - Timestamp: 1729787175, Value: 40
Retrieved data point from 'humidity' at index 11. Time taken: 0.0033 seconds
Humidity Data Point 11 - Timestamp: 1729787175, Value: 50

Checking the data point count for 'pressure'...
Data point count for 'pressure': 12
Retrieved data point from 'pressure' at index 0. Time taken: 0.0038 seconds
Pressure Data Point 0 - Timestamp: 1729786885, Value: 1001
Retrieved data point from 'pressure' at index 1. Time taken: 0.0035 seconds
Pressure Data Point 1 - Timestamp: 1729786885, Value: 1002
Retrieved data point from 'pressure' at index 2. Time taken: 0.0034 seconds
Pressure Data Point 2 - Timestamp: 1729786885, Value: 1003
Retrieved data point from 'pressure' at index 3. Time taken: 0.0035 seconds
Pressure Data Point 3 - Timestamp: 1729786887, Value: 1001
Retrieved data point from 'pressure' at index 4. Time taken: 0.0036 seconds
Pressure Data Point 4 - Timestamp: 1729786887, Value: 1002
Retrieved data point from 'pressure' at index 5. Time taken: 0.0035 seconds
Pressure Data Point 5 - Timestamp: 1729786887, Value: 1003
Retrieved data point from 'pressure' at index 6. Time taken: 0.0033 seconds
Pressure Data Point 6 - Timestamp: 1729786888, Value: 1001
Retrieved data point from 'pressure' at index 7. Time taken: 0.0035 seconds
Pressure Data Point 7 - Timestamp: 1729786888, Value: 1002
Retrieved data point from 'pressure' at index 8. Time taken: 0.0033 seconds
Pressure Data Point 8 - Timestamp: 1729786888, Value: 1003
Retrieved data point from 'pressure' at index 9. Time taken: 0.0033 seconds
Pressure Data Point 9 - Timestamp: 1729787175, Value: 1001
Retrieved data point from 'pressure' at index 10. Time taken: 0.0034 seconds
Pressure Data Point 10 - Timestamp: 1729787175, Value: 1002
Retrieved data point from 'pressure' at index 11. Time taken: 0.0034 seconds
Pressure Data Point 11 - Timestamp: 1729787175, Value: 1003

So, basically, we're writing each data point in about 9.6 milliseconds and reading one back in about 3.5 milliseconds on average — not bad at all!
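Those figures are just the mean of the timings printed above — a quick sanity check on the write side:

```python
# Write timings (in seconds) as printed by the script above
writes = [0.0096, 0.0087, 0.0090, 0.0096, 0.0092,
          0.0097, 0.0105, 0.0110, 0.0091]

# Average write latency, converted to milliseconds
avg_write_ms = sum(writes) / len(writes) * 1000
print(f"average write latency: {avg_write_ms:.1f} ms")  # average write latency: 9.6 ms
```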

To conclude

This experiment was a lot of fun. I’ve never come across using Blockchain for a time-series use case like this, and now my mind is racing with questions. How feasible would this be on a private blockchain network where data is saved across multiple nodes, offering some level of redundancy? How expensive would it be in terms of resources to, let’s say, push 1 million metrics per second? Could a real data storage system be built from this? Instead of storing data directly on the filesystem, could we write it to something like S3?

This project sparked a lot of curiosity and questions, pushing me to think about how much deeper we could go and what else we could build. While I’m not convinced this makes sense for a real-world solution, I might give it a shot for the sake of learning.

What do you think about this project? What other use cases come to mind for this "Data Warehouse"?

]]>
<![CDATA[Milvus Unleashed: A First Dive into Vector Databases]]>https://cduser.com/milvus-unleashed-a-first-dive-into-vector-databases/6711120bdd5b4000019cdc8fThu, 17 Oct 2024 15:13:08 GMT

Welcome to a new article in this series, where we will explore databases for different use cases that aren't part of the "mainstream" names we're used to hearing. In this first edition, we'll start with Milvus. Why Milvus? Well, to be honest, my experience with vector databases is somewhat limited, and I've been hearing a lot about this one, so I think it's a good place to begin.

What's Milvus?

Milvus is an open-source vector database designed to handle massive amounts of unstructured data. Think of it as a specialized tool for managing and searching data like images, videos, and even audio files, which are usually hard to organize in traditional databases. It's built for scenarios involving machine learning, artificial intelligence, and similarity searches—basically, anytime you need to efficiently find relationships or patterns in complex data. With its powerful indexing and fast querying capabilities, Milvus makes working with large-scale, unstructured data a lot more approachable, even if you're just starting out.

Typical Use Cases for Milvus

Milvus shines in situations where you need to deal with unstructured data, especially when it comes to similarity searches. Here are some typical use cases:

  1. Image and Video Search: Imagine you have a huge collection of images or videos, and you want to find items that look similar. Milvus can help you efficiently search for visually similar content, making it great for applications like visual search engines, digital asset management, or content recommendation.
  2. Recommendation Systems: By storing vector embeddings of user behavior or product features, Milvus can power recommendation engines that provide personalized suggestions, such as for e-commerce or media streaming platforms.
  3. Natural Language Processing (NLP): Milvus can be used for semantic searches in text. If you convert text data into vectors, you can then use Milvus to perform searches that understand the meaning of the content, which is particularly useful in chatbots, customer support, or document retrieval systems.
  4. Anomaly Detection: Milvus is effective for finding unusual patterns or outliers within large datasets. For example, in cybersecurity, you can use Milvus to detect anomalies in network traffic by searching for unusual data patterns.
  5. Genomics and Medical Data: Handling massive, complex datasets like DNA sequences or medical imaging can benefit from Milvus's ability to find similarities. This makes it a valuable tool for research and diagnostic purposes.
  6. Audio and Speech Recognition: Similar to image and NLP tasks, Milvus can also store and search audio features, helping in applications like voice recognition, audio classification, or music recommendation systems.

Let's deploy

For me, the best way to learn is by trying, breaking, and fixing things. Replicating use cases is also a great way to learn something new.

That said, let's start "breaking" things by deploying an instance of Milvus using Docker, through a script provided on the documentation website:

curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

Then

sh ./standalone_embed.sh start

You should see something like this:

Unable to find image 'milvusdb/milvus:v2.4.13-hotfix' locally
2024/10/17 11:18:41 must use ASL logging (which requires CGO) if running as root
v2.4.13-hotfix: Pulling from milvusdb/milvus
9b10a938e284: Pull complete 
72d6e057bd66: Pull complete 
1db49fab89e3: Pull complete 
9ab0fe5697fd: Pull complete 
149e2d21f99f: Pull complete 
7191be017dba: Pull complete 
Digest: sha256:7a4fb4c98b3a7940a13ceba0f01429258c6dca441722044127195ceb053a9a86
Status: Downloaded newer image for milvusdb/milvus:v2.4.13-hotfix
Wait for Milvus Starting...

What are we doing here? This setup configures a simple standalone Milvus environment with the embedded etcd service for metadata management. Milvus runs in standalone mode, which is ideal for testing or small-scale use cases since it's much easier to set up than distributed mode.

If everything is going well, you should see something like this when you run `docker ps`:

CONTAINER ID   IMAGE                            COMMAND                  CREATED          STATUS                    PORTS                                                                      NAMES
2cc4e66b3938   milvusdb/milvus:v2.4.13-hotfix   "/tini -- milvus run…"   56 seconds ago   Up 55 seconds (healthy)   0.0.0.0:2379->2379/tcp, 0.0.0.0:9091->9091/tcp, 0.0.0.0:19530->19530/tcp   milvus-standalone

Another way to check that everything is running as expected is to browse to http://localhost:9091/healthz, which should return an 'OK'.

Let's get the fun started

Let's dive into a practical use case to showcase the power of Milvus: an Image Similarity Search. In this example, we'll build a system that allows users to find visually similar images from a large collection. This use case is a perfect way to illustrate Milvus's capabilities with unstructured data.

First, we need to do a few things:

  1. Collect the Dataset: We'll use the CIFAR-10 dataset, which contains 60,000 images in 10 classes, with 6,000 images per class. It's a great dataset for testing out image similarity.
  2. Generate Image Embeddings: To search for similar images, we'll convert the CIFAR-10 images into vectors (embeddings) using a pre-trained model like ResNet.
  3. Store Embeddings in Milvus: Once we have the embeddings, we'll store them in Milvus for fast retrieval.
  4. Query Milvus for Similar Images: Finally, we'll use Milvus to query and find images that are most similar to a given input image.

For simplicity purposes, I'm grouping all the functions in a Python script but let me explain each step:

In step one, the script creates the connection to Milvus; then we create a collection in the database; next, we generate the embeddings, load the dataset, and search for similarities.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Step 1: Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Step 2: Create a collection in Milvus
fields = [
    FieldSchema(name="image_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512)
]
schema = CollectionSchema(fields, "Image similarity search collection")
collection = Collection("image_similarity", schema)

# Step 3: Load ResNet model for generating embeddings
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # Update for torchvision v0.13+
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove the classification layer
model.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Step 4: Load CIFAR-10 dataset
cifar10 = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Insert embeddings into Milvus
for idx in range(len(cifar10)):
    image, _ = cifar10[idx]
    image_tensor = image.unsqueeze(0)
    
    with torch.no_grad():
        embedding = model(image_tensor).squeeze().numpy()
        embedding = embedding.flatten()
    
    # Insert data into Milvus
    collection.insert([
        [embedding.tolist()]
    ])

# Step 4.5: Create an index for the collection
index_params = {
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128},
    "metric_type": "L2"
}
collection.create_index(field_name="embedding", index_params=index_params)

collection.load()

# Step 5: Search for similar images
query_embedding = embedding  # Replace with an embedding of the image you want to search for
search_params = {"metric_type": "L2", "params": {"ef": 128}}
results = collection.search([query_embedding], "embedding", param=search_params, limit=5, output_fields=["image_id"])

for result in results[0]:
    print(f"Found similar image with ID: {result.id} and distance: {result.distance}")

Running this can be intensive for your computer. In my case, with an M3 Pro Max, it was the first time I ever heard the fans spin up; also, depending on the size of the dataset, it can take some time.

Once the data ingestion is done and the search goes through, we should get a result like this:

Found similar image with ID: 453292799976041682 and distance: 0.0
Found similar image with ID: 453292799976061682 and distance: 0.0
Found similar image with ID: 453292799976029238 and distance: 170.8767852783203
Found similar image with ID: 453292799976049238 and distance: 170.8767852783203
Found similar image with ID: 453292799976023594 and distance: 208.214111328125

But, Ignacio, what does this mean?

The output that we received indicates that Milvus found a set of images similar to the query image, ranked by their distance:

  1. ID and Distance: The ID is a unique identifier for each image, and the distance is a measure of similarity. Lower distance values indicate greater similarity.
  • The result with distance: 0.0 means that the image is identical to the input image (since it’s a self-match).
  • The other results have positive distances, meaning they are similar to the input image but not identical. The larger the distance, the less similar the image is.

In our case:

  • The first two results have a distance of 0.0, indicating they are identical to the query image.
  • The other results have distances like 170.876 and 208.214, which means they are the most similar but not exact matches.

The concept here is to use vector similarity (in this case, the L2 distance metric) to find images that are close in the feature space. Lower values indicate higher similarity, which helps to find relevant matches in an image similarity search.
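Conceptually, the L2 metric is just the straight-line (Euclidean) distance between embedding vectors — here is a minimal illustration with NumPy (note: Milvus may return the squared value for L2, which preserves the same ranking order):

```python
import numpy as np

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

# L2 (Euclidean) distance: sqrt(sum((a_i - b_i)^2))
print(np.linalg.norm(a - b))  # 5.0 — identical vectors would give 0.0
```

The closer two embeddings are in this feature space, the more visually similar the images they came from.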

But to make this more visual, let's tweak the code to add matplotlib and see these similarities in a more graphical way:

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
import matplotlib.pyplot as plt

# Step 1: Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Step 2: Create a collection in Milvus
fields = [
    FieldSchema(name="image_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512)
]
schema = CollectionSchema(fields, "Image similarity search collection")
collection = Collection("image_similarity", schema)

# Step 3: Load ResNet model for generating embeddings
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # Updated for torchvision v0.13+
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove the classification layer
model.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Step 4: Load CIFAR-10 dataset
cifar10 = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Store original images for visualization
original_images = datasets.CIFAR10(root="./data", train=False, download=True)

# Insert embeddings into Milvus
for idx in range(len(cifar10)):
    image, _ = cifar10[idx]
    image_tensor = image.unsqueeze(0)
    
    with torch.no_grad():
        embedding = model(image_tensor).squeeze().numpy()
        embedding = embedding.flatten()
    
    # Insert data into Milvus
    collection.insert([
        [embedding.tolist()]
    ])

collection.load()

# Step 5: Search for similar images
query_embedding = embedding  # Replace with an embedding of the image you want to search for
search_params = {"metric_type": "L2", "params": {"ef": 128}}
results = collection.search([query_embedding], "embedding", param=search_params, limit=5, output_fields=["image_id"])

# Step 6: Display the similar images
fig, axes = plt.subplots(1, len(results[0]), figsize=(15, 5))

for idx, result in enumerate(results[0]):
    image_id = result.id
    distance = result.distance
    
    # Since we use auto_id, the image_id corresponds to the index in the CIFAR-10 dataset
    original_image, _ = original_images[image_id % len(original_images)]
    
    axes[idx].imshow(original_image)
    axes[idx].set_title(f"ID: {image_id}\nDistance: {distance:.2f}")
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

Let's run the script again and we should see something like this:

Milvus Unleashed: A First Dive into Vector Databases

In this case, we can see that three images were detected as being identical to each other, and then we have the truck, which differs from the horse images. Now, we could dive deeper into understanding why these specific images were selected over others, but I think that would be interesting for another article on context and visual features versus semantic understanding.

To Conclude

We explored what Milvus is, its use cases, and how it works using Python. In this example, we used the 10,000-image CIFAR-10 test split to find matches. Now, imagine other use cases that could benefit from this kind of technology—what pops into your mind right now? I'd love to read your thoughts in the comments below.

]]>
<![CDATA[How to track vessels with Python, ClickHouse and Grafana]]>https://cduser.com/tracking-vessels-using-python-clickhouse-grafana/67110ff8dd5b4000019cd8baThu, 13 Jun 2024 13:51:02 GMT

I like tracking things... airplanes, cars, devices, and even Santa. Today, we are going to learn how to track vessels using Python to collect and process the information, push it into ClickHouse, and create a dashboard to visualize the collected data using Grafana.

Let's get started

To get the vessel data, we are going to use aisstream.io, which streams ship positions, speed, and other data live over a websocket:

aisstream.io
stream ship position, speed and other data live via websocket
How to track vessels with Python, ClickHouse and Grafana

Tracking Choice

I chose to track vessels in two distinct locations: the port of Buenos Aires, Argentina, and the port of San Francisco, United States. These ports were selected due to their high activity levels and my personal experience visiting them.
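Each port area can be described as a latitude/longitude bounding box — the same boxes we'll pass to the AIS subscription later in this article. A small helper (illustrative only; the `area_of` function is mine) to check which area a given position falls in:

```python
# (lat, lon) corners: south-west and north-east of each area,
# taken from the AIS subscription used later in this article
AREAS = {
    "Buenos Aires": ((-34.811548, -58.537903), (-34.284453, -57.749634)),
    "San Francisco": ((36.989391, -123.832397), (38.449287, -121.744995)),
}

def area_of(lat, lon):
    """Return the name of the bounding box containing (lat, lon), or None."""
    for name, ((lat_min, lon_min), (lat_max, lon_max)) in AREAS.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None

print(area_of(-34.6037, -58.3816))  # downtown Buenos Aires -> Buenos Aires
print(area_of(37.8080, -122.4177))  # Fisherman's Wharf -> San Francisco
```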

Let's set up our infrastructure...

As the title of this blog post suggests, we are going to use ClickHouse. If you are not familiar with this database, I can tell you that it bills itself as the fastest and most resource-efficient open-source database for real-time analytics.

Fast Open-Source OLAP DBMS - ClickHouse
ClickHouse is a fast open-source column-oriented database management system that allows generating analytical data reports in real-time using SQL queries
How to track vessels with Python, ClickHouse and Grafana

Also, we are going to use Grafana, which we have discussed on this blog several times.

Grafana: The open observability platform | Grafana Labs
Grafana is the open source analytics & monitoring solution for every database.
How to track vessels with Python, ClickHouse and Grafana

To do this, we are going to run it on our localhost using Docker. My docker-compose file looks like this:

version: '3.8'

services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.5.2.34
    container_name: clickhouse
    ports:
      - "8123:8123"  # HTTP interface
      - "9000:9000"  # Native client interface
    volumes:
      - clickhouse-data:/var/lib/clickhouse
    environment:
      - CLICKHOUSE_USER=nacho
      - CLICKHOUSE_PASSWORD=clickhouseFTW2024!
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - clickhouse
    restart: unless-stopped

volumes:
  clickhouse-data:
  grafana-data:

If you pay close attention to this file, you'll notice a few things:

  • I'm persisting the data of ClickHouse in the clickhouse-data volume. This means that if my container is deleted, the data remains. The same applies to Grafana.
  • I'm specifying environment variables in both containers. In ClickHouse and Grafana, I'm setting a username and password.
💡
Do not hardcode usernames and passwords in YML files; this is only for demonstration purposes.
  • Also, I'm exposing the ports in ClickHouse and Grafana.
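One way to keep credentials out of the YML file (an illustrative sketch — not the only option) is Compose variable substitution, reading the values from your shell environment at startup:

```yaml
# docker-compose.yml excerpt: read credentials from the shell environment.
# Export CLICKHOUSE_USER / CLICKHOUSE_PASSWORD (or put them in a .env file
# next to docker-compose.yml) before running `docker-compose up`.
services:
  clickhouse:
    environment:
      - CLICKHOUSE_USER=${CLICKHOUSE_USER}
      - CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD}
```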

Once we have modified this according to our needs, we can run:

docker-compose up -d

If everything goes well, when you run docker ps you will see something like this:

CONTAINER ID   IMAGE                                    COMMAND            CREATED          STATUS          PORTS                                                      NAMES
b25f21e653f4   grafana/grafana:latest                   "/run.sh"          20 minutes ago   Up 20 minutes   0.0.0.0:3000->3000/tcp                                     grafana
f4f5ae585632   clickhouse/clickhouse-server:24.5.2.34   "/entrypoint.sh"   20 minutes ago   Up 20 minutes   0.0.0.0:8123->8123/tcp, 0.0.0.0:9000->9000/tcp, 9009/tcp   clickhouse

Ok, the infrastructure is up and running. Let's create the database in ClickHouse.

Dealing with ClickHouse

Before we start pushing information, we need to create our database. In our code, we can create the table with the information we are going to store. You will see that this is quite an easy process.

We are going to use the famous and great Curl to create the database called vessels_tracking:

curl -u nacho:clickhouseFTW2024! 'http://localhost:8123/' --data-binary "CREATE DATABASE vessels_tracking"

To validate that our database is there, we can run something like this:

curl -u nacho:clickhouseFTW2024! 'http://localhost:8123/' --data-binary "SHOW DATABASES"

If everything works as we expect, we should see this:

INFORMATION_SCHEMA
default
information_schema
system
vessels_tracking

Shaping and pushing data

Ok, the fun part. It's time to shape the data and send it to ClickHouse. We are going to use Python and start by installing the ClickHouse client for this language:

pip3 install clickhouse-driver

We also need to install websockets:

pip3 install websockets

Now, the code...

import asyncio
import websockets
import json
from datetime import datetime, timezone
from clickhouse_driver import Client
import os

# Connect to ClickHouse
client = Client(host='localhost', user='nacho', password='clickhouseFTW2024!')

# Create database and table
client.execute('CREATE DATABASE IF NOT EXISTS vessels_tracking')

create_table_query = '''
CREATE TABLE IF NOT EXISTS vessels_tracking.ais_data (
    ts DateTime64(3, 'UTC'),
    ship_id UInt32,
    latitude Float32,
    longitude Float32,
    speed Float32,
    heading Float32,
    nav_status String
) ENGINE = MergeTree()
ORDER BY ts;
'''

client.execute(create_table_query)

# Connect to AIS stream and insert data into ClickHouse
async def connect_ais_stream():

    async with websockets.connect("wss://stream.aisstream.io/v0/stream") as websocket:
        subscribe_message = {
            "APIKey": os.environ.get("AISAPIKEY"),  # Required!
            "BoundingBoxes": [
                # Buenos Aires, Argentina
                [[-34.811548, -58.537903], [-34.284453, -57.749634]],
                # San Francisco, USA
                [[36.989391, -123.832397], [38.449287, -121.744995]],
            ],
            "FilterMessageTypes": ["PositionReport"],
        }

        subscribe_message_json = json.dumps(subscribe_message)
        await websocket.send(subscribe_message_json)

        async for message_json in websocket:
            message = json.loads(message_json)
            message_type = message["MessageType"]

            if message_type == "PositionReport":
                # The message parameter contains a key of the message type which contains the message itself
                ais_message = message["Message"]["PositionReport"]
                print(f"[{datetime.now(timezone.utc)}] ShipId: {ais_message['UserID']} Latitude: {ais_message['Latitude']} Longitude: {ais_message['Longitude']} Speed: {ais_message['Sog']} Heading: {ais_message['Cog']} NavStatus: {ais_message['NavigationalStatus']}")
                # Insert data into ClickHouse
                insert_query = '''
                INSERT INTO vessels_tracking.ais_data (ts, ship_id, latitude, longitude, speed, heading, nav_status) VALUES
                '''
                # Ensure nav_status is a string
                values = (
                    datetime.now(timezone.utc),
                    ais_message['UserID'],
                    ais_message['Latitude'],
                    ais_message['Longitude'],
                    ais_message['Sog'],
                    ais_message['Cog'],
                    str(ais_message['NavigationalStatus'])  # Cast to string
                )
                client.execute(insert_query, [values])

if __name__ == "__main__":
    asyncio.run(connect_ais_stream())
    

Before running this, let's go through important things in this code:

  • The connection to the database: make sure it is connected to your instance using the username and password you specified during setup.
  • The creation of the database: this is crucial to ensure that the data is there.
  • The creation of the table: this is important to ensure that the data we push matches our table.
  • The API Key from AisStream: you can sign up with your GitHub account at aisstream.io.
  • The BoundingBoxes: make sure you enter the coordinates according to your needs.

Once we have sorted this out, save the script with a name (in my case, main.py) and run it:

python3 main.py

If everything goes well, you will start to see something like this, which means it worked:

[2024-06-13 13:03:59.876760+00:00] ShipId: 368341690 Latitude: 37.794628333333335 Longitude: -122.31805 Speed: 9.1 Heading: 107.5 NavStatus: 0
[2024-06-13 13:04:01.503668+00:00] ShipId: 368231420 Latitude: 37.512494999999994 Longitude: -122.19592666666668 Speed: 0 Heading: 360 NavStatus: 5
[2024-06-13 13:04:02.522329+00:00] ShipId: 366999711 Latitude: 37.810505 Longitude: -122.36067833333333 Speed: 0 Heading: 289 NavStatus: 5
[2024-06-13 13:04:02.994950+00:00] ShipId: 701006099 Latitude: -34.59726166666667 Longitude: -58.36488666666666 Speed: 0 Heading: 85.4 NavStatus: 3
[2024-06-13 13:04:03.902959+00:00] ShipId: 366844270 Latitude: 37.794311666666665 Longitude: -122.28603833333334 Speed: 0.6 Heading: 354.9 NavStatus: 5
[2024-06-13 13:04:04.454044+00:00] ShipId: 256037000 Latitude: 37.39175 Longitude: -123.39798333333333 Speed: 2.1 Heading: 226 NavStatus: 0
[2024-06-13 13:04:05.356680+00:00] ShipId: 368173000 Latitude: 37.774015 Longitude: -122.24171500000001 Speed: 0 Heading: 251 NavStatus: 5
[2024-06-13 13:04:05.449553+00:00] ShipId: 366963980 Latitude: 37.94531166666666 Longitude: -122.50868666666666 Speed: 0 Heading: 293.5 NavStatus: 0
[2024-06-13 13:04:05.963961+00:00] ShipId: 367153070 Latitude: 37.91493333333334 Longitude: -122.36188833333334 Speed: 0 Heading: 289.3 NavStatus: 5
[2024-06-13 13:04:05.990570+00:00] ShipId: 367145450 Latitude: 37.809855 Longitude: -122.41158666666666 Speed: 0 Heading: 34.1 NavStatus: 15
[2024-06-13 13:04:06.297423+00:00] ShipId: 366972000 Latitude: 37.729818333333334 Longitude: -122.52628666666666 Speed: 0.7 Heading: 44.1 NavStatus: 3
[2024-06-13 13:04:06.307300+00:00] ShipId: 367006030 Latitude: 38.092705 Longitude: -122.261055 Speed: 0 Heading: 360 NavStatus: 0
[2024-06-13 13:04:07.014252+00:00] ShipId: 368173000 Latitude: 37.774015 Longitude: -122.24171333333334 Speed: 0 Heading: 251 NavStatus: 5
[2024-06-13 13:04:08.653218+00:00] ShipId: 367469070 Latitude: 37.86828 Longitude: -122.31493999999999 Speed: 0 Heading: 0 NavStatus: 0
[2024-06-13 13:04:09.267404+00:00] ShipId: 368278830 Latitude: 37.792305000000006 Longitude: -122.28416166666666 Speed: 0 Heading: 250.5 NavStatus: 5
[2024-06-13 13:04:09.676525+00:00] ShipId: 367328780 Latitude: 37.80373 Longitude: -122.39635833333334 Speed: 6.3 Heading: 246 NavStatus: 0
[2024-06-13 13:04:09.734593+00:00] ShipId: 367369720 Latitude: 37.905665 Longitude: -122.37193833333335 Speed: 0 Heading: 187.5 NavStatus: 0
[2024-06-13 13:04:09.744267+00:00] ShipId: 367152240 Latitude: 37.806756666666665 Longitude: -122.40414000000001 Speed: 0.3 Heading: 321.4 NavStatus: 0
[2024-06-13 13:04:09.785783+00:00] ShipId: 255806494 Latitude: 37.781621666666666 Longitude: -122.36749333333334 Speed: 1.2 Heading: 296.2 NavStatus: 0
[2024-06-13 13:04:10.393677+00:00] ShipId: 368992000 Latitude: 37.74805 Longitude: -122.38328166666666 Speed: 0 Heading: 0 NavStatus: 5
[2024-06-13 13:04:11.069113+00:00] ShipId: 366844270 Latitude: 37.79433166666667 Longitude: -122.28603666666666 Speed: 0.4 Heading: 13 NavStatus: 5

Ok, cool. The data is printed and probably already in the database. How can we make sure?

Easy peasy lemon squeezy.

We are going to use curl again:

curl -u nacho:clickhouseFTW2024! 'http://localhost:8123/?query=SELECT%20*%20FROM%20vessels_tracking.ais_data%20LIMIT%2010%20FORMAT%20TabSeparatedWithNames'

As you can see, I'm limiting my query to 10 results. For testing, no need for more.

Now, if the data is being pushed to ClickHouse, the result of the query should look like this:

ts	ship_id	latitude	longitude	speed	heading	nav_status
2024-06-13 13:03:59.876	368341690	37.79463	-122.31805	9.1	107.5	0
2024-06-13 13:04:01.503	368231420	37.512493	-122.19593	0	360	5
2024-06-13 13:04:02.522	366999711	37.810505	-122.36068	0	289	5
2024-06-13 13:04:02.995	701006099	-34.597263	-58.364887	0	85.4	3
2024-06-13 13:04:03.903	366844270	37.79431	-122.28604	0.6	354.9	5
2024-06-13 13:04:04.454	256037000	37.39175	-123.39798	2.1	226	0
2024-06-13 13:04:05.356	368173000	37.774014	-122.241714	0	251	5
2024-06-13 13:04:05.449	366963980	37.945312	-122.50869	0	293.5	0
2024-06-13 13:04:05.964	367153070	37.914932	-122.361885	0	289.3	5
2024-06-13 13:04:05.990	367145450	37.809856	-122.41158	0	34.1	15

Success!
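A note before moving on: the script above inserts one row per AIS message, which costs ClickHouse a round trip (and a tiny data part) each time. If you track busier regions, a buffer-and-flush pattern keeps inserts batched. A sketch of that variation (the batch size of 100 is an arbitrary choice, not something from the original script):

```python
class AISBatcher:
    """Buffer rows and flush them to ClickHouse in batches instead of one by one.

    `client` is the clickhouse_driver Client from the script above."""

    INSERT_SQL = (
        "INSERT INTO vessels_tracking.ais_data "
        "(ts, ship_id, latitude, longitude, speed, heading, nav_status) VALUES"
    )

    def __init__(self, client, batch_size=100):
        self.client = client
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        """Queue one row; returns True when this call triggered a flush."""
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self):
        """Send whatever is buffered to ClickHouse and empty the buffer."""
        if self.rows:
            self.client.execute(self.INSERT_SQL, self.rows)
            self.rows = []
```

In the async loop, you would replace `client.execute(insert_query, [values])` with `batcher.add(values)`, and call `batcher.flush()` on shutdown so the tail of the buffer is not lost.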

Let's visualize it

Ok, ok, calm down. We have data, everything is running, now, let's visualize it using Grafana.

Let's go to http://localhost:3000, using the username admin and password admin. (If you changed this, use your credentials)

Go to connections, Add a new connection, and select ClickHouse.

How to track vessels with Python, ClickHouse and Grafana

Click on Install

How to track vessels with Python, ClickHouse and Grafana

Then, click on Add new data source

How to track vessels with Python, ClickHouse and Grafana

You will find this page where we need to fill in the connection info to ClickHouse. Here, let me give you a tip.

You might want to use localhost, but localhost refers to the Grafana container itself. We don't have ClickHouse running in the Grafana container, so we need to point to the ClickHouse container, which is named clickhouse. Don't include http:// or any other scheme prefix.

You can connect using port 9000 and use native as the protocol, or http and use 8123 as the server port.

If everything goes as expected, as soon as you click Save & Test, you should see the banner that says "Data source is working".

How to track vessels with Python, ClickHouse and Grafana

Now, the good stuff.

Click on "Explore View" on that page and use this query:

SELECT
    ts AS "time",
    ship_id,
    latitude,
    longitude,
    speed,
    heading,
    nav_status
FROM
    vessels_tracking.ais_data
ORDER BY
    ts DESC
LIMIT
    100

You will see something like this...

Now, click on the Add to dashboard button at the top right...

How to track vessels with Python, ClickHouse and Grafana

Click on "Open Dashboard"

How to track vessels with Python, ClickHouse and Grafana

And you will see this

How to track vessels with Python, ClickHouse and Grafana

Let's make it more appealing. Edit that panel, go to visualizations, and select Geomap.

How to track vessels with Python, ClickHouse and Grafana

It is going to look like this, and as you can see, we have points on the map, specifically in Buenos Aires and San Francisco.

How to track vessels with Python, ClickHouse and Grafana

Let's zoom in to San Francisco:

How to track vessels with Python, ClickHouse and Grafana

Awesome, we have vessels, and if you click on one of the green dots, you can see the data from that ship:

How to track vessels with Python, ClickHouse and Grafana

To conclude

What a fun project. In this article, you learned how to track vessels using the AisStream API, how to get that data using Websockets in Python, how to deploy ClickHouse and Grafana in Docker, and how to ingest and visualize data.

Have you tried it? Let me know on Twitter or LinkedIn about your results.

]]>
<![CDATA[Pique #33: How to modify a file in a stopped container]]>Today it happened that I changed a file inside a container and that config was wrong; as a result, the container wouldn't start. One option was to create a new container, mounting the existing volumes, and be done with it, but no, I went for a

]]>
https://cduser.com/modificar-copiar-archivos-contenedores-docker/67110ff8dd5b4000019cd8b9Tue, 28 May 2024 15:21:04 GMT

Today it happened that I changed a file inside a container and that config was wrong; as a result, the container wouldn't start. One option was to create a new container, mounting the existing volumes, and be done with it, but no, I went for a faster and not-so-well-known option.

Basically, what I did was copy that config file out of the container, fix the change, put it back, and start the container.

Yes, you can copy a file from a container that is not running, no problem at all. The only difficulty is that you might not know the path of that file inside the container. If you do know it, you can do something like this:

docker cp 9f4f2d321e50:/var/lib/ghost/config.production.json .

This basically means copying that file from the container into the directory you are currently standing in. Once the file is fixed, you copy it back "inside".

docker cp config.production.json 9f4f2d321e50:/var/lib/ghost/config.production.json

Start the container and, if everything is fine, your container should come up without any problem.
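The whole copy-out / edit / copy-back / start cycle can also be scripted. A small Python sketch using subprocess (the container ID and path are the ones from the example; set RUN_DEMO=1 to actually run it against a real container):

```python
import os
import subprocess

CONTAINER = "9f4f2d321e50"                        # container ID from the example
REMOTE = "/var/lib/ghost/config.production.json"  # path inside the container

def cp_cmd(src, dst):
    """Build the `docker cp` argument list; docker cp works on stopped containers too."""
    return ["docker", "cp", src, dst]

if os.environ.get("RUN_DEMO"):  # guarded: needs Docker and the real container
    subprocess.run(cp_cmd(f"{CONTAINER}:{REMOTE}", "."), check=True)  # copy out
    input("Edit config.production.json locally, then press Enter...")
    subprocess.run(cp_cmd("config.production.json", f"{CONTAINER}:{REMOTE}"),
                   check=True)                                        # copy back
    subprocess.run(["docker", "start", CONTAINER], check=True)        # start it again
```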

Did you already know this tip?

]]>
<![CDATA[How to host a static website using AWS S3 and CloudFront - Part 1.]]>https://cduser.com/how-to-host-a-static-website-using-aws-s3-and-cloudfront/67110ff8dd5b4000019cd8b5Tue, 19 Mar 2024 17:14:56 GMT

A few days ago I decided to update my website. It is built with HTML, Bootstrap, CSS, and some JavaScript: a static website that can be hosted anywhere, and in this post you are going to learn how to host it on S3, using CloudFront to deliver the content.

I have the website on GitHub, so if you are looking for a template for your own site, feel free to grab whatever is useful to you.

GitHub - xe-nvdk/my-awesome-website: This is my personal website. Take a look and tell me what you think.
This is my personal website. Take a look and tell me what you think. - xe-nvdk/my-awesome-website
How to host a static website using AWS S3 and CloudFront - Part 1.

Introduction:

Originally, my website ran in a standalone Docker container, with no integration with deployment tools at all. Any modification required downloading the updated code, updating the volumes in the container, and then restarting the container. That is barbaric for the times we are living in.

So, at a moment when I'm defining the next step in my career, I decided to update my website and fix, for good, how new changes are deployed.

Architecture

This is the architecture of the proposed solution. It looks like overkill, but it isn't. It is quite simple.

How to host a static website using AWS S3 and CloudFront - Part 1.

Let me explain the components.

Github: This is where the code lives, any changes that I make on my computer, are reflected in the repository on GitHub.

Amazon S3 Bucket: This is the bucket where we are going to put our files.

AWS CloudFront: This component serves the content we have in the S3 bucket.

Route 53: This is where my domain (vandroogenbroeck.net) lives; it is my DNS "manager".

How to...

Create the AWS S3 Bucket

Ok, it's time to get our hands dirty.

First, let's create a bucket in S3.

Go to your console and click on "Create Bucket"

How to host a static website using AWS S3 and CloudFront - Part 1.

The next steps are the following:

Select the region: In my case, my visitors come from EMEA and the US, so us-east-1 makes sense for me.

Bucket type: General Purpose

Bucket Name: Specify a name for your bucket.

How to host a static website using AWS S3 and CloudFront - Part 1.

Then, disable "Block all public access" as shown in the following image, and tick the checkbox acknowledging that you are opening your S3 bucket to the world.

How to host a static website using AWS S3 and CloudFront - Part 1.

For the rest of the settings, leave those as default. Click on "Create Bucket" and then your new shiny bucket should appear...

The next step is adding a policy that allows "getting" files from your S3 bucket. To do that, open the bucket, go to the Permissions tab, and click on "Edit bucket policy". Then you can paste this. Don't forget to replace the bucket name with yours.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "AddPerm",
           "Effect": "Allow",
           "Principal": "*",
           "Action": "s3:GetObject",
           "Resource": "arn:aws:s3:::your_bucket_name/*"
       }
  ]
}
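The same policy can also be applied from code instead of the console. A minimal boto3 sketch (boto3 and configured AWS credentials are assumed, and the bucket name is a placeholder; set RUN_DEMO=1 to actually apply it):

```python
import json
import os

BUCKET = "your_bucket_name"  # replace with your bucket name

def public_read_policy(bucket):
    """Build the same public-read policy shown above for the given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AddPerm",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }

if os.environ.get("RUN_DEMO"):
    import boto3  # pip install boto3; credentials come from your environment
    boto3.client("s3").put_bucket_policy(
        Bucket=BUCKET, Policy=json.dumps(public_read_policy(BUCKET))
    )
```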

Once you have that done, go to the "Properties" tab and scroll down to "Static website hosting", click edit, and select "Enable". Specify the index document of your website and, if your project has a 404 page, the error document. Scroll down and click on "Save Changes".

How to host a static website using AWS S3 and CloudFront - Part 1.

Now, when everything is saved, you will see something like this. Of course, with your own URL.

How to host a static website using AWS S3 and CloudFront - Part 1.

Connect Github Repo and push the site to S3

For this case, we are going to use Github actions to automate the deployment to s3 in every push to our repo.

The steps here are very straightforward:

  1. Create a New Workflow File: In your GitHub repository, create a new file in the .github/workflows directory. You might name it deploy-to-s3.yml.
  2. Add the Workflow Configuration: Paste the following YAML configuration into the file. Make sure to replace the AWS_S3_BUCKET value (my-website-cduser.com in my case) with the name of your S3 bucket.

We are going to use this recipe:

name: Deploy to S3

on:
  push:
    branches:
      - main  # Set this to your default branch

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - uses: jakejarvis/s3-sync-action@master
      with:
        args: --follow-symlinks --delete
      env:
        AWS_S3_BUCKET: my-website-cduser.com
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: 'us-east-1'  # Set this to your AWS region
        SOURCE_DIR: './'  # Set this to the directory of your static site

Please do not hardcode your credentials; use GitHub's secrets to provide them securely.

Using secrets in GitHub Actions - GitHub Docs
Secrets allow you to store sensitive information in your organization, repository, or repository environments.
How to host a static website using AWS S3 and CloudFront - Part 1.

Now that we have everything set, let's push our website (or some changes) and see what happens with the GitHub Action.

If everything goes well, you will see something like this.

How to host a static website using AWS S3 and CloudFront - Part 1.

And when you visit the URL for your bucket, you will start to see your website.  In my case, you can go here:

Ignacio Van Droogenbroeck - Fractional CTO, Technical Account Manager, Business Developer, Community Leader
I am a versatile ‘plug-and-play’ team member, equally adept at working independently. Self-aware and an enthusiastic learner, I strive for excellence in all my endeavors.
How to host a static website using AWS S3 and CloudFront - Part 1.

As you can see the URL is something like this:

http://my-website-cduser.com.s3-website-us-east-1.amazonaws.com/

You could share this URL now, but it is not pretty. Also, we want our site to be served close to every visitor, and for that we are going to use AWS CloudFront.

Distributing our website content with AWS CloudFront.

Ok, we are at the last step of the first part of this tutorial. Now we are going to use the CDN that CloudFront offers to serve our site with good performance from several locations.

In CloudFront, we are going to click on "create distribution"

How to host a static website using AWS S3 and CloudFront - Part 1.

In the next step, at the origin domain, we are going to select our S3 bucket and we are going to click on "Use website endpoint".

How to host a static website using AWS S3 and CloudFront - Part 1.

The rest is mostly defaults: we are not going to enable WAF, and for the cache settings we are going to stick to the recommended values.

Depending on your needs, you can make your site available from all edge locations; in my case that is not needed, North America and Europe are enough.

How to host a static website using AWS S3 and CloudFront - Part 1.

Click on "Create Distribution" at the bottom of the page. Now we need to wait a few minutes, but we can already see the URL assigned to this distribution.

How to host a static website using AWS S3 and CloudFront - Part 1.

Once this is done, you can browse that URL and see what is served from S3.

https://duaz73fwgf26p.cloudfront.net/
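One thing to keep in mind once the distribution is live: CloudFront caches your objects at the edge, so after pushing new files to S3 the old versions may be served for a while. You can force a refresh with a cache invalidation. A boto3 sketch (the distribution ID is a hypothetical placeholder, yours is shown in the console; boto3 and AWS credentials are assumed, RUN_DEMO=1 gates the real call):

```python
import os
import time

DISTRIBUTION_ID = "YOUR_DISTRIBUTION_ID"  # placeholder: find yours in the CloudFront console

def invalidation_batch(paths):
    """Build the InvalidationBatch payload CloudFront expects."""
    return {
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        "CallerReference": str(time.time()),  # must be unique per request
    }

if os.environ.get("RUN_DEMO"):
    import boto3  # pip install boto3; credentials come from your environment
    boto3.client("cloudfront").create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch=invalidation_batch(["/*"]),  # "/*" invalidates everything
    )
```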

To conclude

In this first part of the tutorial, you saw how to create an AWS S3 bucket, how to push your GitHub repo to that bucket, and how to create a distribution of your content using AWS CloudFront.

In the next part, we are going to see how to configure DNS in Route 53, how to create a free Amazon certificate to enable SSL, and, of course, how to use your own domain.

The cost of this setup will depend on how many people visit your website; CloudFront and S3 have generous free tiers that, I think, are more than enough for a personal website.

Stay tuned for part 2 coming soon.

]]>