<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://deffro.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://deffro.github.io/" rel="alternate" type="text/html" /><updated>2025-09-11T09:27:49+00:00</updated><id>https://deffro.github.io/feed.xml</id><title type="html">Dimitris Effrosynidis</title><subtitle>Data Science Portfolio</subtitle><author><name>Dimitris Effrosynidis</name></author><entry><title type="html">How to Do the “Retrieval” in Retrieval-Augmented Generation (RAG)</title><link href="https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag/" rel="alternate" type="text/html" title="How to Do the “Retrieval” in Retrieval-Augmented Generation (RAG)" /><published>2025-01-29T00:00:00+00:00</published><updated>2025-01-29T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/how-to-do-the-retrieval-in-retrieval-augmented-generation-rag-c96c0faea086">Towards AI</a></strong>.</p>

<p>Efficient and accurate text retrieval is a cornerstone of modern information systems, powering applications like search engines, chatbots, and knowledge bases.</p>

<p>It is the first step in RAG (Retrieval-Augmented Generation) systems.</p>

<p>RAG systems first use text retrieval to find documents relevant to our query and then use an LLM to generate the answer. RAG allows us to “chat with our data”.</p>

<p>In this article, we explore the integration of dense retrieval, BM25 lexical search, and transformer-based reranking to create a robust and scalable text retrieval system.</p>

<p>The project leverages the strengths of each technique:</p>

<ul>
  <li><strong>Dense Retrieval</strong>: Captures semantic meaning by embedding text into high-dimensional vector spaces, enabling similarity-based search.</li>
  <li><strong>BM25 Lexical Search</strong>: Performs efficient keyword matching to quickly narrow down relevant results.</li>
  <li><strong>Transformer-Based Reranking</strong>: Uses Hugging Face cross-encoders to evaluate and rank query-document pairs based on semantic relevance, ensuring precision in the final output.</li>
</ul>

<p>This hybrid approach optimizes both computational efficiency and retrieval accuracy, making it well-suited for use cases where context, relevance, and speed are critical.</p>
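<p>As a rough, self-contained sketch of this two-stage idea (the tokenization and embedding vectors below are toy assumptions; in the actual system the dense vectors come from an embedding model and the final re-ranking from a Hugging Face cross-encoder), BM25 builds a cheap lexical shortlist and a semantic score re-orders it:</p>

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Stage 1: BM25 lexical scores of tokenized docs against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs          # average document length
    df = {}                                             # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        s = 0.0
        for term in query:
            if term not in df:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            tf = doc.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def hybrid_search(query, query_vec, docs, doc_vecs, shortlist=10):
    """Stage 2: re-rank the BM25 shortlist by dense similarity.
    A cross-encoder would replace `cosine` in the full system."""
    lex = bm25_scores(query, docs)
    top = sorted(range(len(docs)), key=lambda i: lex[i], reverse=True)[:shortlist]
    return sorted(top, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
```

<p>With three tiny documents, <code>hybrid_search</code> first drops the document containing no query terms and then lets the dense score decide the final order.</p>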

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/how-to-do-the-retrieval-in-retrieval-augmented-generation-rag-c96c0faea086">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Optimizing results with minimal effort]]></summary></entry><entry><title type="html">Fine-Tuning a Pre-trained LLM for Sentiment Classification</title><link href="https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification/" rel="alternate" type="text/html" title="Fine-Tuning a Pre-trained LLM for Sentiment Classification" /><published>2025-01-15T00:00:00+00:00</published><updated>2025-01-15T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/fine-tuning-a-pre-trained-llm-for-sentiment-classification-394ea9217bdb">Towards AI</a></strong>.</p>

<p>In this tutorial, we will fine-tune the Task-Specific Sentiment Model (juliensimon/reviews-sentiment-analysis) and see whether its 0.79 accuracy improves.</p>

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/fine-tuning-a-pre-trained-llm-for-sentiment-classification-394ea9217bdb">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Optimizing results with minimal effort]]></summary></entry><entry><title type="html">Mastering LLM Interactions</title><link href="https://deffro.github.io/generative%20ai/mastering-llm-interactions/" rel="alternate" type="text/html" title="Mastering LLM Interactions" /><published>2025-01-09T00:00:00+00:00</published><updated>2025-01-09T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/mastering-llm-interactions</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/mastering-llm-interactions/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/mastering-llm-interactions-4becd2887b0d">Towards AI</a></strong>.</p>

<p>Generative AI transforms industries by enabling machines to produce human-like text, generate structured outputs, and solve complex problems creatively.</p>

<p>However, the magic happens when you combine cutting-edge models with advanced techniques like prompt engineering, quantized model optimization, and grammar-constrained sampling.</p>

<p>In this project, I explore the forefront of these techniques, demonstrating how to harness the capabilities of state-of-the-art language models to achieve specific and efficient results.</p>

<p>From crafting intricate prompts to controlling the randomness and structure of outputs, this work reflects the fusion of creativity, technical expertise, and a deep understanding of model behavior.</p>
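<p>To make one of these techniques concrete, here is a minimal, hypothetical sketch of grammar-constrained sampling: a finite-state “grammar” lists which tokens are legal in each state, and decoding picks the best-scoring legal token. Real implementations (e.g. GBNF grammars in llama.cpp) apply the same idea by masking the model’s logits before sampling:</p>

```python
# Toy finite-state grammar: maps the current state to the tokens allowed next
# and the state each token leads to. All names here are illustrative.
GRAMMAR = {
    "start": {"yes": "end", "no": "end", "maybe": "qualifier"},
    "qualifier": {"yes": "end", "no": "end"},
}

def constrained_greedy(logits_per_step, state="start"):
    """Greedy decoding restricted to grammar-legal tokens at every step."""
    out = []
    for logits in logits_per_step:
        if state == "end":
            break
        allowed = GRAMMAR[state]
        # Tokens outside the grammar get -inf, so only legal tokens can win,
        # even when an illegal token has the highest raw score.
        token = max(allowed, key=lambda t: logits.get(t, float("-inf")))
        out.append(token)
        state = allowed[token]
    return out
```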

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/mastering-llm-interactions-4becd2887b0d">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Prompt engineering, quantized model optimization, and grammar-constrained sampling]]></summary></entry><entry><title type="html">Traditional vs. Generative AI for Sentiment Classification</title><link href="https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification/" rel="alternate" type="text/html" title="Traditional vs. Generative AI for Sentiment Classification" /><published>2025-01-07T00:00:00+00:00</published><updated>2025-01-07T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/traditional-vs-generative-ai-for-sentiment-classification-0195b7cc0a0d">Towards AI</a></strong>.</p>

<p>This article focuses on the sentiment analysis of product reviews from the Flipkart Customer Review dataset.</p>

<p>Sentiment analysis is a crucial task in Natural Language Processing (NLP) that aims to classify text into positive, negative, or neutral sentiments. It enables businesses to gain insights from customer feedback.</p>

<p>Our objective is to explore and compare multiple methods for binary sentiment classification, evaluating their performance and computational efficiency.</p>

<p>The methods range from traditional approaches to advanced techniques leveraging embeddings and generative models.</p>

<p>Some methods require labeled data, while others require none at all.</p>
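<p>As a concrete example of a method that needs no labeled data at all, the simplest zero-shot baseline is a lexicon classifier (the word lists below are purely illustrative; the article’s label-free methods rely on embeddings and generative models instead):</p>

```python
# Tiny illustrative sentiment lexicons.
POSITIVE = {"good", "great", "excellent", "love", "perfect"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}

def lexicon_sentiment(review: str) -> str:
    """Classify a review with no training data by counting sentiment words."""
    tokens = review.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"
```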

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/traditional-vs-generative-ai-for-sentiment-classification-0195b7cc0a0d">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="Machine Learning" /><category term="Feature Engineering" /><category term="transformers" /><category term="sklearn" /><summary type="html"><![CDATA[5 ways to classify text (even without train data)]]></summary></entry><entry><title type="html">The Transformer Architecture From a Top View</title><link href="https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view/" rel="alternate" type="text/html" title="The Transformer Architecture From a Top View" /><published>2024-02-20T00:00:00+00:00</published><updated>2024-02-20T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/the-transformer-architecture-from-a-top-view-e8079c96b473">Towards AI</a></strong>.</p>

<p>Recurrent Neural Networks (RNNs), among other architectures, used to be the state of the art in Natural Language Processing (NLP).</p>

<p>And then came Transformers.</p>

<p>The Transformer architecture significantly improved performance on natural language tasks compared to earlier RNNs.</p>

<p>Developed by Vaswani et al. in their 2017 paper “Attention is All You Need,” Transformers revolutionized NLP by leveraging self-attention mechanisms, allowing the model to learn the relevance and context of all words in a sentence.</p>

<p>Unlike RNNs, which process data sequentially, Transformers analyze all parts of the sentence simultaneously. This parallel processing allows Transformers to learn the context and relevance of each word with respect to every other word in a sentence or document, overcoming the long-term dependency and computational efficiency limitations of RNNs.</p>
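<p>The mechanism can be sketched in a few lines. In the toy implementation below, the learned Q/K/V projection matrices are omitted (treated as identity, an assumption made for brevity), so it only shows how every token scores its relevance against every other token in parallel:</p>

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors."""
    d_k = len(tokens[0])
    output = []
    for query in tokens:                         # every token attends to every token
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in tokens]
        weights = softmax(scores)                # relevance of each word to this one
        output.append([sum(w * value[j] for w, value in zip(weights, tokens))
                       for j in range(d_k)])
    return output
```

<p>Each output vector is a weighted mix of all token vectors, which is exactly the “context of all words” described above.</p>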

<p>But let’s explore the architecture step by step.</p>

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/the-transformer-architecture-from-a-top-view-e8079c96b473">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><category term="BERT" /><category term="GPT" /><summary type="html"><![CDATA[Exploring the encoder-decoder magic in NLP behind LLMs]]></summary></entry><entry><title type="html">End to End ML Project</title><link href="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project/" rel="alternate" type="text/html" title="End to End ML Project" /><published>2023-03-01T00:00:00+00:00</published><updated>2023-03-01T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/end-to-end-ML-project">GitHub</a></strong></p>

<p>This project aims to apply the best software engineering practices in a Machine Learning project in order to deploy the model.</p>

<p>We are developing a model to predict if a Data Scientist is willing to leave his/her current job.
We are not interested in the accuracy of the model (which is 77%), but rather to transition from the research environment to production code, packaging, and finally deployment of the model.</p>

<h2> Research Code ➙ Production Code ➙ Deployment </h2>

<p><a href="https://www.python.org/" target="_blank"><img alt="Python" src="https://img.shields.io/badge/-Python-4B8BBE?style=flat-square&amp;logo=python&amp;logoColor=white" height="27" /></a>
<a href="https://scikit-learn.org/stable/index.html" target="_blank"><img alt="Scikit-learn" src="https://img.shields.io/badge/-Sklearn-fa9c3c?style=flat-square&amp;logo=scikitlearn&amp;logoColor=white" height="27" /></a>
<a href="https://jupyter.org/" target="_blank"><img alt="Jupyter" src="https://img.shields.io/badge/-Jupyter-eb6c2d?style=flat-square&amp;logo=jupyter&amp;logoColor=white" height="27" /></a>
<a href="https://www.anaconda.com/" target="_blank"><img alt="Anaconda" src="https://img.shields.io/badge/-Anaconda-3EB049?style=flat-square&amp;logo=anaconda&amp;logoColor=white" height="27" /></a>
<a href="https://www.jetbrains.com/pycharm/" target="_blank"><img alt="PyCharm" src="https://img.shields.io/badge/-PyCharm-41c473?style=flat-square&amp;logo=pycharm&amp;logoColor=white" height="27" /></a>
<a href="https://www.docker.com/" target="_blank"><img alt="Docker" src="https://img.shields.io/badge/-Docker-0db7ed?style=flat-square&amp;logo=docker&amp;logoColor=white" height="27" /></a>
<a href="https://git-scm.com/" target="_blank"><img alt="Git" src="https://img.shields.io/badge/-Git-F1502F?style=flat-square&amp;logo=git&amp;logoColor=white" height="27" /></a>
<a href="https://www.heroku.com/" target="_blank"><img alt="Heroku" src="https://img.shields.io/badge/-Heroku-430098?style=flat-square&amp;logo=heroku&amp;logoColor=white" height="27" /></a>
<a href="https://fastapi.tiangolo.com/" target="_blank"><img alt="FastAPI" src="https://img.shields.io/badge/-FastAPI-35a691?style=flat-square&amp;logo=fastapi&amp;logoColor=white" height="27" /></a>
<a href="https://docs.pytest.org/en/7.0.x/" target="_blank"><img alt="Pytest" src="https://img.shields.io/badge/-Pytest-ffd43b?style=flat-square&amp;logo=pytest&amp;logoColor=white" height="27" /></a>
<a href="https://docs.pytest.org/en/7.0.x/" target="_blank"><img alt="tox" src="https://img.shields.io/badge/-tox-9dab35" height="27" /></a></p>

<h3 id="project-structure">Project Structure</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>end-to-end-ML-project
│   README.md
│   MANIFEST.in    
│   mypy.ini
│   pyproject.toml
│   setup.py
│   .gitignore
│   tox.ini
│   Dockerfile
│
└───notebooks
│   │   1. Data Analysis.ipynb
│   │   2. Feature Engineering.ipynb
│   │   3. Feature Engineering Pipeline.ipynb
│   │   4. Machine Learning.ipynb
│   │   preprocess.py
│   
└───requirements
│   │   requirements.txt
│   │   research-env.txt
│   │   production.txt
│   │   deployment.txt  
│
└───src
│   │   VERSION
│   │   __init__.py
│   │   config.yml
│   │   pipeline.py
│   │   train_pipeline.py
│   │   predict.py
│   │
│   └───config 
│   │   │   __init__.py
│   │   │   core.py
│   │ 
│   └───data 
│   │   │   __init__.py
│   │   │   train.csv
│   │   │   test.csv
│   │     
│   └───processing 
│   │   │   __init__.py
│   │   │   data_manager.py
│   │   │   features.py
│   │  
│   └───trained_models 
│   │   │   __init__.py  
│
└───app-fastapi
│   ...
</code></pre></div></div>

<h3 id="steps-in-an-end-to-end-ml-project">Steps in An End-to-end ML Project</h3>

<ol>
  <li>Start with jupyter notebooks and finalize a model.</li>
  <li>Transform research code to production code.</li>
  <li>Make the project a package.</li>
  <li>Serve it via a REST API.</li>
  <li>Dockerize it and deploy it.</li>
</ol>

<h3 id="1-start-with-jupyter-notebooks-and-finalize-a-model">1. Start with jupyter notebooks and finalize a model</h3>

<p>The <code class="language-plaintext highlighter-rouge">notebooks</code> folder contains the research work, which is typically done by a Data Scientist.</p>

<p>Usually, a Data Analysis notebook for EDA and data understanding is the first step.
Then, features are created in a pipeline. Here, scikit-learn and feature-engine were used.
Finally, the ML model is placed at the end of the pipeline.</p>

<p>Research can be very time-consuming. Here, a simple pipeline is created, 
because the creation of a 95% accuracy model is out of the scope of this work.</p>

<h3 id="2-transform-research-code-to-production-code">2. Transform research code to production code</h3>

<p>The <code class="language-plaintext highlighter-rouge">src</code> folder contains the transformation of the Jupyter notebooks into a Python project.</p>

<p>Some good practices:</p>
<ul>
  <li>Create a <code class="language-plaintext highlighter-rouge">config.yml</code> file that contains all the constants and configurations derived from the notebooks. Accompany it with a .py file to parse it (Here it is the <code class="language-plaintext highlighter-rouge">src/config/core.py</code>).</li>
  <li>Tidy all the extra functions you have written and place them in a <code class="language-plaintext highlighter-rouge">processing</code> folder. For example, <code class="language-plaintext highlighter-rouge">src/processing/data_manager.py</code> contains functions to load the data and to save, load, and remove the pipeline.</li>
  <li>Keep separate files for <code class="language-plaintext highlighter-rouge">train_pipeline.py</code> and <code class="language-plaintext highlighter-rouge">predict.py</code>.</li>
  <li>Always write small functions; they are easier to test and keep the code readable.</li>
  <li>Create a <code class="language-plaintext highlighter-rouge">trained_models</code> folder to deposit the models.</li>
  <li>Have a <code class="language-plaintext highlighter-rouge">VERSION</code> file, to track the version of the project, e.g. 0.0.4</li>
  <li>Write <code class="language-plaintext highlighter-rouge">tests</code>. Now write more tests.</li>
  <li>Add a <code class="language-plaintext highlighter-rouge">tox.ini</code> file to make life easier: run tests faster and automate styling, type checks, linting, and PEP 8 compliance.</li>
</ul>

<p>Note: To import your Python modules as packages from other files, add the project’s root directory to the <code class="language-plaintext highlighter-rouge">PYTHONPATH</code> environment variable (or to <code class="language-plaintext highlighter-rouge">sys.path</code> at runtime).</p>
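<p>A common runtime alternative is to prepend the project root to <code class="language-plaintext highlighter-rouge">sys.path</code> from the entry-point script. The helper below is illustrative, not taken from the repository:</p>

```python
import sys
from pathlib import Path

def add_project_root(anchor: str, levels_up: int = 1) -> str:
    """Insert the directory `levels_up` parents above `anchor` at the front of
    sys.path, so `import src` works no matter where the script is launched from."""
    root = str(Path(anchor).resolve().parents[levels_up - 1])
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

<p>A script inside the repository would then call <code class="language-plaintext highlighter-rouge">add_project_root(__file__)</code> before importing <code class="language-plaintext highlighter-rouge">src</code>.</p>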

<h3 id="3-make-the-project-a-package">3. Make the project a package</h3>

<p>We need 3 files in the root of the project:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">MANIFEST.in</code>: Define which files to include in and exclude from the package.</li>
  <li><code class="language-plaintext highlighter-rouge">pyproject.toml</code>: Specify basic dependencies and configure tooling.</li>
  <li><code class="language-plaintext highlighter-rouge">setup.py</code>: Package metadata, version, requirements, how to create the package.</li>
</ol>

<p>From the project directory: <code class="language-plaintext highlighter-rouge">python -m build</code></p>

<p>Then, make an account to PyPI. Install twine: <code class="language-plaintext highlighter-rouge">pip install twine</code></p>

<p>Upload: <code class="language-plaintext highlighter-rouge">twine upload dist/end_to_end_ML_project-0.0.4-py3-none-any.whl</code></p>

<p>Now the package can be installed like any other package with <code class="language-plaintext highlighter-rouge">pip install end-to-end-ML-project</code></p>

<p>It can be imported like: <code class="language-plaintext highlighter-rouge">import src</code></p>

<h3 id="4-serve-it-via-a-rest-api">4. Serve it via a REST API</h3>

<p>The API should be a different repository or at least a different folder. Here it is located in the folder <code class="language-plaintext highlighter-rouge">app-fastapi</code>.</p>

<p>The first thing to note is the <code class="language-plaintext highlighter-rouge">requirements.txt</code>, where we declare the <code class="language-plaintext highlighter-rouge">end-to-end-ML-project</code> package,
which we published earlier, as a dependency.</p>

<p>The three key files of the API are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">config.py</code>: Specifies API metadata and logging settings.</li>
  <li><code class="language-plaintext highlighter-rouge">main.py</code>: Defines the main app and the index page router.</li>
  <li><code class="language-plaintext highlighter-rouge">api.py</code>: Defines a health and a predict endpoint.</li>
</ul>

<p>We define some <code class="language-plaintext highlighter-rouge">schemas</code> for automatic validation of variable types.</p>
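<p>What such a schema buys us can be sketched in plain Python (the field names below are hypothetical; the real schemas are pydantic models, which FastAPI validates automatically on every request):</p>

```python
# Hypothetical request schema: field name -> expected type.
SCHEMA = {"city_development_index": float, "training_hours": int}

def validate(payload: dict, schema: dict = SCHEMA) -> dict:
    """Reject payloads with missing fields or wrong variable types."""
    for name, expected in schema.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    return payload
```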

<p>We also define <code class="language-plaintext highlighter-rouge">tests</code> with predefined input data to predict.</p>

<p>We also use <code class="language-plaintext highlighter-rouge">logging</code> and the package <code class="language-plaintext highlighter-rouge">loguru</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">Procfile</code> and <code class="language-plaintext highlighter-rouge">runtime.txt</code> are necessary files to deploy on Heroku.</p>

<h3 id="5-dockerize-it-and-deploy-it">5. Dockerize it and deploy it</h3>

<p>We create a <code class="language-plaintext highlighter-rouge">Dockerfile</code> and build the image:</p>

<p><code class="language-plaintext highlighter-rouge">docker build -t end-to-end-ML-project:latest .</code></p>

<p>We run the image:</p>

<p><code class="language-plaintext highlighter-rouge">docker run -p 8001:8001 -e PORT=8001 end-to-end-ml-project</code></p>

<p>We can see the output on localhost:8001/</p>

<p>Now to deploy on Heroku, create a <code class="language-plaintext highlighter-rouge">heroku.yml</code> file.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku login
heroku container:login
heroku container:push web --app end-to-end-ml-project
heroku container:release web --app end-to-end-ml-project
heroku open --app end-to-end-ml-project
</code></pre></div></div>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Deployment" /><category term="MLOps" /><category term="Feature Engineering" /><category term="fastapi" /><category term="sklearn" /><category term="machine learning" /><category term="docker" /><category term="aws" /><summary type="html"><![CDATA[This project aims to apply the best software engineering practices in a Machine Learning project in order to deploy the model.]]></summary></entry><entry><title type="html">MLOps Lifecycle with MLFlow, Airflow, Amazon S3, PostgreSQL</title><link href="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow/" rel="alternate" type="text/html" title="MLOps Lifecycle with MLFlow, Airflow, Amazon S3, PostgreSQL" /><published>2023-01-01T00:00:00+00:00</published><updated>2023-01-01T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/MLOps">GitHub</a></strong></p>

<h1 id="mlops-pipeline-with-------">MLOps Pipeline with <a href="https://mlflow.org/" target="_blank"><img alt="MLflow" src="https://img.shields.io/badge/-MLflow-0194E2?style=flat-square&amp;logo=mlflow&amp;logoColor=white" height="27" /></a> + <a href="https://www.postgresql.org/" target="_blank"><img alt="PostgreSQL" src="https://img.shields.io/badge/-PostgreSQL-4169E1?style=flat-square&amp;logo=postgresql&amp;logoColor=white" height="27" /></a> + <a href="https://aws.amazon.com/s3/" target="_blank"><img alt="Amazon S3" src="https://img.shields.io/badge/-Amazon S3-569A31?style=flat-square&amp;logo=amazons3&amp;logoColor=white" height="27" /></a> + <a href="https://airflow.apache.org/" target="_blank"><img alt="Apache Airflow" src="https://img.shields.io/badge/-Apache Airflow-017CEE?style=flat-square&amp;logo=apacheairflow&amp;logoColor=white" height="27" /></a></h1>

<p>A complete Machine Learning lifecycle. The pipeline is as follows:</p>

<p><code class="language-plaintext highlighter-rouge">1. Read Data</code>➙<code class="language-plaintext highlighter-rouge">2. Split train-test</code>➙<code class="language-plaintext highlighter-rouge">3. Preprocess Data</code>➙<code class="language-plaintext highlighter-rouge">4. Train Model</code>➙<br />
      ➙ <code class="language-plaintext highlighter-rouge">5.1 Register Model</code><br />
      ➙ <code class="language-plaintext highlighter-rouge">5.2 Update Registered Model</code><br /></p>

<p>Telco Customer Churn dataset from <a href="https://www.kaggle.com/datasets/blastchar/telco-customer-churn" target="_blank">Kaggle</a>.</p>

<h2 id="tech-stack">Tech Stack</h2>
<p><a href="https://mlflow.org/" target="_blank"><img alt="MLflow" src="https://img.shields.io/badge/-MLflow-0194E2?style=flat-square&amp;logo=mlflow&amp;logoColor=white" height="20" /></a>: For experiment tracking and model registration<br />
<a href="https://www.postgresql.org/" target="_blank"><img alt="PostgreSQL" src="https://img.shields.io/badge/-PostgreSQL-4169E1?style=flat-square&amp;logo=postgresql&amp;logoColor=white" height="20" /></a>: Store the MLflow tracking<br />
<a href="https://aws.amazon.com/s3/" target="_blank"><img alt="Amazon S3" src="https://img.shields.io/badge/-Amazon S3-569A31?style=flat-square&amp;logo=amazons3&amp;logoColor=white" height="20" /></a>: Store the registered MLflow models and artifacts<br />
<a href="https://airflow.apache.org/" target="_blank"><img alt="Apache Airflow" src="https://img.shields.io/badge/-Apache Airflow-017CEE?style=flat-square&amp;logo=apacheairflow&amp;logoColor=white" height="20" /></a>: Orchestrate the MLOps pipeline<br />
<a href="https://scikit-learn.org/stable/index.html" target="_blank"><img alt="Scikit-learn" src="https://img.shields.io/badge/-Sklearn-fa9c3c?style=flat-square&amp;logo=scikitlearn&amp;logoColor=white" height="20" /></a>: Machine Learning<br />
<a href="https://jupyter.org/" target="_blank"><img alt="Jupyter" src="https://img.shields.io/badge/-Jupyter-eb6c2d?style=flat-square&amp;logo=jupyter&amp;logoColor=white" height="20" /></a>: R&amp;D<br />
<a href="https://www.python.org/" target="_blank"><img alt="Python" src="https://img.shields.io/badge/-Python-4B8BBE?style=flat-square&amp;logo=python&amp;logoColor=white" height="20" /></a>
<a href="https://www.anaconda.com/" target="_blank"><img alt="Anaconda" src="https://img.shields.io/badge/-Anaconda-3EB049?style=flat-square&amp;logo=anaconda&amp;logoColor=white" height="20" /></a>
<a href="https://www.jetbrains.com/pycharm/" target="_blank"><img alt="PyCharm" src="https://img.shields.io/badge/-PyCharm-41c473?style=flat-square&amp;logo=pycharm&amp;logoColor=white" height="20" /></a>
<a href="https://www.docker.com/" target="_blank"><img alt="Docker" src="https://img.shields.io/badge/-Docker Compose-0db7ed?style=flat-square&amp;logo=docker&amp;logoColor=white" height="20" /></a>
<a href="https://git-scm.com/" target="_blank"><img alt="Git" src="https://img.shields.io/badge/-Git-F1502F?style=flat-square&amp;logo=git&amp;logoColor=white" height="20" /></a></p>

<h2 id="how-to-reproduce">How to reproduce</h2>

<ol>
  <li>Have <a href="https://docs.docker.com/get-docker/" target="_blank">Docker</a> installed and running.</li>
</ol>

<p>Make sure <code class="language-plaintext highlighter-rouge">docker-compose</code> is installed:</p>
<pre><code class="language-commandline">pip install docker-compose
</code></pre>

<ol start="2">
  <li>Clone the repository to your machine.
    <pre><code class="language-commandline">git clone https://github.com/Deffro/MLOps.git
</code></pre>
  </li>
  <li>Rename <code class="language-plaintext highlighter-rouge">.env_sample</code> to <code class="language-plaintext highlighter-rouge">.env</code> and change the following variables:
    <ul>
      <li>AWS_ACCESS_KEY_ID</li>
      <li>AWS_SECRET_ACCESS_KEY</li>
      <li>AWS_REGION</li>
      <li>AWS_BUCKET_NAME</li>
    </ul>
  </li>
  <li>Run the docker-compose file</li>
</ol>

<pre><code class="language-commandline">docker-compose up --build -d
</code></pre>

<h2 id="urls-to-access">Urls to access</h2>

<ul>
  <li><a href="http://localhost:8080" target="_blank">http://localhost:8080</a> for <code class="language-plaintext highlighter-rouge">Airflow</code>. Use credentials: airflow/airflow</li>
  <li><a href="http://localhost:5000" target="_blank">http://localhost:5000</a> for <code class="language-plaintext highlighter-rouge">MLflow</code>.</li>
  <li><a href="http://localhost:8893" target="_blank">http://localhost:8893</a> for <code class="language-plaintext highlighter-rouge">Jupyter Lab</code>. Use token: mlops</li>
</ul>

<h2 id="cleanup">Cleanup</h2>
<p>Run the following to stop all running docker containers through docker compose</p>
<pre><code class="language-commandline">docker-compose stop
</code></pre>

<p>or run the following to stop and delete all running docker containers through docker</p>
<pre><code class="language-commandline">docker stop $(docker ps -q)
docker rm $(docker ps -aq)
</code></pre>

<p>Finally, run the following to delete all (named) volumes</p>
<pre><code class="language-commandline">docker volume rm $(docker volume ls -q)
</code></pre>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Deployment" /><category term="MLOps" /><category term="Feature Engineering" /><category term="mlflow" /><category term="airflow" /><category term="sklearn" /><category term="machine learning" /><category term="docker" /><category term="aws" /><category term="sql" /><summary type="html"><![CDATA[Full Machine Learning Lifecycle using Airflow, MLflow, and AWS S3.]]></summary></entry><entry><title type="html">Ensemble Feature Selection for Machine Learning</title><link href="https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning/" rel="alternate" type="text/html" title="Ensemble Feature Selection for Machine Learning" /><published>2022-11-02T00:00:00+00:00</published><updated>2022-11-02T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning/"><![CDATA[<p><strong>The project is available online on <a href="https://towardsdatascience.com/ensemble-feature-selection-for-machine-learning-c0df77b970f9">Towards Data Science</a></strong>.</p>

<p>Most of the content of this article is from my paper entitled
“An Evaluation of Feature Selection Methods for Environmental Data,” which is available <a href="https://www.sciencedirect.com/science/article/abs/pii/S1574954121000157">here</a> for anyone interested.</p>

<p>In my previous <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">article</a>, I presented 12 individual feature selection methods. This article serves as the next step to feature selection.</p>

<p>If you are here, I suppose you are already familiar with the well-established practice of ensembling classification algorithms, which provides better results and more robustness than employing single algorithms. The same principle can be applied to feature selection algorithms.</p>

<h2 id="ensemble-feature-selection">Ensemble Feature Selection</h2>
<p>The idea behind ensemble feature selection is to combine multiple different feature selection methods, taking advantage of their individual strengths, to create an optimal subset.
In general, it produces a better feature space and reduces the risk of selecting an unstable subset.
Besides, a single method may result in a subset that is a local optimum, while an ensemble can provide more stable results.</p>
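<p>One simple aggregation scheme is mean-rank voting; the sketch below is illustrative (the paper evaluates several combination strategies). Each method ranks the same features, and the features with the lowest average position form the ensemble subset:</p>

```python
def ensemble_select(rankings, top_k):
    """Combine feature rankings from several selection methods by mean rank
    (position 0 = most important) and keep the top_k features overall."""
    features = rankings[0]
    mean_rank = {f: sum(r.index(f) for r in rankings) / len(rankings)
                 for f in features}
    return sorted(features, key=lambda f: mean_rank[f])[:top_k]
```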

<p>Continue your read on <a href="https://towardsdatascience.com/ensemble-feature-selection-for-machine-learning-c0df77b970f9">Towards Data Science</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Selection" /><category term="features" /><category term="machine learning" /><summary type="html"><![CDATA[Investigate if ensemble feature selection methods are superior of individual in machine learning classification problems.]]></summary></entry><entry><title type="html">YouTube Video Resolution Downgrade Classification</title><link href="https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification/" rel="alternate" type="text/html" title="YouTube Video Resolution Downgrade Classification" /><published>2022-04-14T00:00:00+00:00</published><updated>2022-04-14T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/Data-Science-Portfolio/tree/master/Notebooks/YouTube%20Video%20Resolution%20Downgrade%20Classification">GitHub</a></strong></p>

<ol>
  <li>Binary classification on data with a varying number of features</li>
  <li>Visualize, analyze, and understand the data</li>
  <li>Engineer new features</li>
  <li>Try 8 approaches to create a model for classification</li>
  <li>Perform Ensemble Feature Selection</li>
  <li>Explain feature importance with SHAP</li>
</ol>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Engineering" /><category term="Feature Selection" /><category term="Data Processing" /><category term="Exploratory Data Analysis" /><category term="pandas" /><category term="visualization" /><category term="features" /><category term="sklearn" /><category term="machine learning" /><category term="EDA" /><summary type="html"><![CDATA[Feature engineering, selection, visualization, model development, feature explainability.]]></summary></entry><entry><title type="html">Feature Selection for Machine Learning: 3 Categories and 12 Methods</title><link href="https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning/" rel="alternate" type="text/html" title="Feature Selection for Machine Learning: 3 Categories and 12 Methods" /><published>2021-06-09T00:00:00+00:00</published><updated>2021-06-09T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning/"><![CDATA[<p><strong>The project is available online on <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">Towards Data Science</a></strong>.</p>

<p>Most of the content of this article comes from my recent paper,
“An Evaluation of Feature Selection Methods for Environmental Data”, available <a href="https://www.sciencedirect.com/science/article/abs/pii/S1574954121000157">here</a> for anyone interested.</p>

<h2 id="the-2-approaches-for-dimensionality-reduction">The 2 approaches for Dimensionality Reduction</h2>
<p>There are two ways to reduce the number of features, a process otherwise known as dimensionality reduction.</p>

<p>The first way is called feature extraction, and it aims to transform the features and create entirely new ones based on combinations of the raw/given ones.
The most popular approaches are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multidimensional Scaling. However, the new feature space can hardly provide us with useful information about the original features.
The new higher-level features are not easily understood by humans, because we cannot link them directly to the initial ones, making it difficult to draw conclusions and explain the variables.</p>
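<p>A minimal scikit-learn sketch, on hypothetical toy data, makes this concrete: each principal component is a weighted mix of all original features, so it has no direct interpretation.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 original features

pca = PCA(n_components=2).fit(X)
X_new = pca.transform(X)       # entirely new features (principal components)

# Each component holds weights over all 5 original features,
# so no single original feature explains a component
print(pca.components_.shape)   # (2, 5)
print(X_new.shape)             # (100, 2)
```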

<p>The second way to achieve dimensionality reduction is feature selection.
It can be considered a pre-processing step and does not create any new features, but instead selects a subset of the raw ones, providing better interpretability.
Finding the best features from a large initial number can help us extract valuable information and discover new knowledge.
In classification problems, the significance of features is evaluated by their ability to discriminate between distinct classes.
The property that estimates each feature’s usefulness in discriminating the distinct classes is called feature relevance.</p>
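<p>As a minimal illustration with a univariate filter (assuming scikit-learn; the article itself covers 12 methods), relevance scores are computed per feature and only the highest-scoring raw features are kept:</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy classification data: 8 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, n_redundant=0, random_state=0)

# Score each feature's relevance to the classes (ANOVA F-value),
# then keep the 3 highest-scoring raw features; no new features are created
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected raw features
print(selector.transform(X).shape)         # (200, 3)
```

<p>Because the selected columns are original features, their meaning is preserved, which is exactly the interpretability advantage over feature extraction.</p>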

<p>Continue reading on <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">Towards Data Science</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Selection" /><category term="features" /><category term="machine learning" /><summary type="html"><![CDATA[Learn basic theory about the 3 types of feature selection in machine learning namely filters, wrappers, and embedders.]]></summary></entry></feed>