<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://deffro.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://deffro.github.io/" rel="alternate" type="text/html" /><updated>2025-09-11T09:27:49+00:00</updated><id>https://deffro.github.io/feed.xml</id><title type="html">Dimitris Effrosynidis</title><subtitle>Data Science Portfolio</subtitle><author><name>Dimitris Effrosynidis</name></author><entry><title type="html">How to Do the “Retrieval” in Retrieval-Augmented Generation (RAG)</title><link href="https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag/" rel="alternate" type="text/html" title="How to Do the “Retrieval” in Retrieval-Augmented Generation (RAG)" /><published>2025-01-29T00:00:00+00:00</published><updated>2025-01-29T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/how-to-do-the-retrieval-in-rag/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/how-to-do-the-retrieval-in-retrieval-augmented-generation-rag-c96c0faea086">Towards AI</a></strong>.</p>

<p>Efficient and accurate text retrieval is a cornerstone of modern information systems, powering applications like search engines, chatbots, and knowledge bases.</p>

<p>It is the first step in RAG (Retrieval-Augmented Generation) systems.</p>

<p>RAG systems first use text retrieval to find documents relevant to our query and then use an LLM to generate the answer. RAG allows us to “chat with our data”.</p>

<p>In this article, we explore the integration of dense retrieval, BM25 lexical search, and transformer-based reranking to create a robust and scalable text retrieval system.</p>

<p>The project leverages the strengths of each technique:</p>

<ul>
  <li><strong>Dense Retrieval</strong>: Captures semantic meaning by embedding text into high-dimensional vector spaces, enabling similarity-based search.</li>
  <li><strong>BM25 Lexical Search</strong>: Performs efficient keyword matching to quickly narrow down relevant results.</li>
  <li><strong>Transformer-Based Reranking</strong>: Uses Hugging Face cross-encoders to evaluate and rank query-document pairs based on semantic relevance, ensuring precision in the final output.</li>
</ul>

<p>This hybrid approach optimizes both computational efficiency and retrieval accuracy, making it well-suited for use cases where context, relevance, and speed are critical.</p>
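<p>As a rough, self-contained sketch of this two-stage idea (the tokenization and embedding vectors below are toy assumptions; in the actual system the dense vectors come from an embedding model and the final re-ranking from a Hugging Face cross-encoder), BM25 builds a cheap lexical shortlist and a semantic score re-orders it:</p>

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Stage 1: BM25 lexical scores of tokenized docs against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs          # average document length
    df = {}                                             # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        s = 0.0
        for term in query:
            if term not in df:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            tf = doc.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def hybrid_search(query, query_vec, docs, doc_vecs, shortlist=10):
    """Stage 2: re-rank the BM25 shortlist by dense similarity.
    A cross-encoder would replace `cosine` in the full system."""
    lex = bm25_scores(query, docs)
    top = sorted(range(len(docs)), key=lambda i: lex[i], reverse=True)[:shortlist]
    return sorted(top, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
```

<p>With three tiny documents, <code>hybrid_search</code> first drops the document containing no query terms and then lets the dense score decide the final order.</p>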

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/how-to-do-the-retrieval-in-retrieval-augmented-generation-rag-c96c0faea086">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Optimizing results with minimal effort]]></summary></entry><entry><title type="html">Fine-Tuning a Pre-trained LLM for Sentiment Classification</title><link href="https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification/" rel="alternate" type="text/html" title="Fine-Tuning a Pre-trained LLM for Sentiment Classification" /><published>2025-01-15T00:00:00+00:00</published><updated>2025-01-15T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/fine-tuning-a-pre-trained-llm-for-sentiment-classification/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/fine-tuning-a-pre-trained-llm-for-sentiment-classification-394ea9217bdb">Towards AI</a></strong>.</p>

<p>In this tutorial, we will fine-tune the Task-Specific Sentiment Model (juliensimon/reviews-sentiment-analysis) and see whether its 0.79 accuracy improves.</p>

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/fine-tuning-a-pre-trained-llm-for-sentiment-classification-394ea9217bdb">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Optimizing results with minimal effort]]></summary></entry><entry><title type="html">Mastering LLM Interactions</title><link href="https://deffro.github.io/generative%20ai/mastering-llm-interactions/" rel="alternate" type="text/html" title="Mastering LLM Interactions" /><published>2025-01-09T00:00:00+00:00</published><updated>2025-01-09T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/mastering-llm-interactions</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/mastering-llm-interactions/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/mastering-llm-interactions-4becd2887b0d">Towards AI</a></strong>.</p>

<p>Generative AI transforms industries by enabling machines to produce human-like text, generate structured outputs, and solve complex problems creatively.</p>

<p>However, the magic happens when you combine cutting-edge models with advanced techniques like prompt engineering, quantized model optimization, and grammar-constrained sampling.</p>

<p>In this project, I explore the forefront of these techniques, demonstrating how to harness the capabilities of state-of-the-art language models to achieve specific and efficient results.</p>

<p>From crafting intricate prompts to controlling the randomness and structure of outputs, this work reflects the fusion of creativity, technical expertise, and a deep understanding of model behavior.</p>
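<p>To make one of these techniques concrete, here is a minimal, hypothetical sketch of grammar-constrained sampling: a finite-state “grammar” lists which tokens are legal in each state, and decoding picks the best-scoring legal token. Real implementations (e.g. GBNF grammars in llama.cpp) apply the same idea by masking the model’s logits before sampling:</p>

```python
# Toy finite-state grammar: maps the current state to the tokens allowed next
# and the state each token leads to. All names here are illustrative.
GRAMMAR = {
    "start": {"yes": "end", "no": "end", "maybe": "qualifier"},
    "qualifier": {"yes": "end", "no": "end"},
}

def constrained_greedy(logits_per_step, state="start"):
    """Greedy decoding restricted to grammar-legal tokens at every step."""
    out = []
    for logits in logits_per_step:
        if state == "end":
            break
        allowed = GRAMMAR[state]
        # Tokens outside the grammar get -inf, so only legal tokens can win,
        # even when an illegal token has the highest raw score.
        token = max(allowed, key=lambda t: logits.get(t, float("-inf")))
        out.append(token)
        state = allowed[token]
    return out
```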

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/mastering-llm-interactions-4becd2887b0d">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><summary type="html"><![CDATA[Prompt engineering, quantized model optimization, and grammar-constrained sampling]]></summary></entry><entry><title type="html">Traditional vs. Generative AI for Sentiment Classification</title><link href="https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification/" rel="alternate" type="text/html" title="Traditional vs. Generative AI for Sentiment Classification" /><published>2025-01-07T00:00:00+00:00</published><updated>2025-01-07T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/machine%20learning/feature%20engineering/tradition-vs-generative-ai-for-sentiment-classification/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/traditional-vs-generative-ai-for-sentiment-classification-0195b7cc0a0d">Towards AI</a></strong>.</p>

<p>This article focuses on the sentiment analysis of product reviews from the Flipkart Customer Review dataset.</p>

<p>Sentiment analysis is a crucial task in Natural Language Processing (NLP) that aims to classify text into positive, negative, or neutral sentiments. It enables businesses to gain insights from customer feedback.</p>

<p>Our objective is to explore and compare multiple methods for binary sentiment classification, evaluating their performance and computational efficiency.</p>

<p>The methods range from traditional approaches to advanced techniques leveraging embeddings and generative models.</p>

<p>Some methods require labeled data, while others require none at all.</p>
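<p>As a concrete example of a method that needs no labeled data at all, the simplest zero-shot baseline is a lexicon classifier (the word lists below are purely illustrative; the article’s label-free methods rely on embeddings and generative models instead):</p>

```python
# Tiny illustrative sentiment lexicons.
POSITIVE = {"good", "great", "excellent", "love", "perfect"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}

def lexicon_sentiment(review: str) -> str:
    """Classify a review with no training data by counting sentiment words."""
    tokens = review.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"
```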

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/traditional-vs-generative-ai-for-sentiment-classification-0195b7cc0a0d">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="Machine Learning" /><category term="Feature Engineering" /><category term="transformers" /><category term="sklearn" /><summary type="html"><![CDATA[5 ways to classify text (even without train data)]]></summary></entry><entry><title type="html">The Transformer Architecture From a Top View</title><link href="https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view/" rel="alternate" type="text/html" title="The Transformer Architecture From a Top View" /><published>2024-02-20T00:00:00+00:00</published><updated>2024-02-20T00:00:00+00:00</updated><id>https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view</id><content type="html" xml:base="https://deffro.github.io/generative%20ai/the-transformer-architecture-from-a-top-view/"><![CDATA[<p><strong>The project is available online on <a href="https://medium.com/towards-artificial-intelligence/the-transformer-architecture-from-a-top-view-e8079c96b473">Towards AI</a></strong>.</p>

<p>Recurrent Neural Networks (RNNs), among other architectures, used to be the state of the art in Natural Language Processing (NLP).</p>

<p>And then came Transformers.</p>

<p>The Transformer architecture significantly improved performance on natural language tasks compared to earlier RNNs.</p>

<p>Developed by Vaswani et al. in their 2017 paper “Attention is All You Need,” Transformers revolutionized NLP by leveraging self-attention mechanisms, allowing the model to learn the relevance and context of all words in a sentence.</p>

<p>Unlike RNNs, which process data sequentially, Transformers analyze all parts of the sentence simultaneously. This parallel processing allows Transformers to learn the context and relevance of each word with respect to every other word in a sentence or document, overcoming the long-term dependency and computational efficiency limitations of RNNs.</p>
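<p>The mechanism can be sketched in a few lines. In the toy implementation below, the learned Q/K/V projection matrices are omitted (treated as identity, an assumption made for brevity), so it only shows how every token scores its relevance against every other token in parallel:</p>

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors."""
    d_k = len(tokens[0])
    output = []
    for query in tokens:                         # every token attends to every token
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in tokens]
        weights = softmax(scores)                # relevance of each word to this one
        output.append([sum(w * value[j] for w, value in zip(weights, tokens))
                       for j in range(d_k)])
    return output
```

<p>Each output vector is a weighted mix of all token vectors, which is exactly the “context of all words” described above.</p>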

<p>But let’s explore the architecture step by step.</p>

<p>Continue your read on <a href="https://medium.com/towards-artificial-intelligence/the-transformer-architecture-from-a-top-view-e8079c96b473">Towards AI</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Generative AI" /><category term="transformers" /><category term="BERT" /><category term="GPT" /><summary type="html"><![CDATA[Exploring the encoder-decoder magic in NLP behind LLMs]]></summary></entry><entry><title type="html">End to End ML Project</title><link href="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project/" rel="alternate" type="text/html" title="End to End ML Project" /><published>2023-03-01T00:00:00+00:00</published><updated>2023-03-01T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/end-to-end-ml-project/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/end-to-end-ML-project">GitHub</a></strong></p>

<p>This project aims to apply the best software engineering practices in a Machine Learning project in order to deploy the model.</p>

<p>We are developing a model to predict if a Data Scientist is willing to leave his/her current job.
We are not interested in the accuracy of the model (which is 77%), but rather to transition from the research environment to production code, packaging, and finally deployment of the model.</p>

<h2> Research Code ➙ Production Code ➙ Deployment </h2>

<p><a href="https://www.python.org/" target="_blank"><img alt="Python" src="https://img.shields.io/badge/-Python-4B8BBE?style=flat-square&amp;logo=python&amp;logoColor=white" height="27" /></a>
<a href="https://scikit-learn.org/stable/index.html" target="_blank"><img alt="Scikit-learn" src="https://img.shields.io/badge/-Sklearn-fa9c3c?style=flat-square&amp;logo=scikitlearn&amp;logoColor=white" height="27" /></a>
<a href="https://jupyter.org/" target="_blank"><img alt="Jupyter" src="https://img.shields.io/badge/-Jupyter-eb6c2d?style=flat-square&amp;logo=jupyter&amp;logoColor=white" height="27" /></a>
<a href="https://www.anaconda.com/" target="_blank"><img alt="Anaconda" src="https://img.shields.io/badge/-Anaconda-3EB049?style=flat-square&amp;logo=anaconda&amp;logoColor=white" height="27" /></a>
<a href="https://www.jetbrains.com/pycharm/" target="_blank"><img alt="PyCharm" src="https://img.shields.io/badge/-PyCharm-41c473?style=flat-square&amp;logo=pycharm&amp;logoColor=white" height="27" /></a>
<a href="https://www.docker.com/" target="_blank"><img alt="Docker" src="https://img.shields.io/badge/-Docker-0db7ed?style=flat-square&amp;logo=docker&amp;logoColor=white" height="27" /></a>
<a href="https://git-scm.com/" target="_blank"><img alt="Git" src="https://img.shields.io/badge/-Git-F1502F?style=flat-square&amp;logo=git&amp;logoColor=white" height="27" /></a>
<a href="https://www.heroku.com/" target="_blank"><img alt="Heroku" src="https://img.shields.io/badge/-Heroku-430098?style=flat-square&amp;logo=heroku&amp;logoColor=white" height="27" /></a>
<a href="https://fastapi.tiangolo.com/" target="_blank"><img alt="FastAPI" src="https://img.shields.io/badge/-FastAPI-35a691?style=flat-square&amp;logo=fastapi&amp;logoColor=white" height="27" /></a>
<a href="https://docs.pytest.org/en/7.0.x/" target="_blank"><img alt="Pytest" src="https://img.shields.io/badge/-Pytest-ffd43b?style=flat-square&amp;logo=pytest&amp;logoColor=white" height="27" /></a>
<a href="https://docs.pytest.org/en/7.0.x/" target="_blank"><img alt="tox" src="https://img.shields.io/badge/-tox-9dab35" height="27" /></a></p>

<h3 id="project-structure">Project Structure</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>end-to-end-ML-project
│   README.md
│   MANIFEST.in    
│   mypy.ini
│   pyproject.toml
│   setup.py
│   .gitignore
│   tox.ini
│   Dockerfile
│
└───notebooks
│   │   1. Data Analysis.ipynb
│   │   2. Feature Engineering.ipynb
│   │   3. Feature Engineering Pipeline.ipynb
│   │   4. Machine Learning.ipynb
│   │   preprocess.py
│   
└───requirements
│   │   requirements.txt
│   │   research-env.txt
│   │   production.txt
│   │   deployment.txt  
│
└───src
│   │   VERSION
│   │   __init__.py
│   │   config.yml
│   │   pipeline.py
│   │   train_pipeline.py
│   │   predict.py
│   │
│   └───config 
│   │   │   __init__.py
│   │   │   core.py
│   │ 
│   └───data 
│   │   │   __init__.py
│   │   │   train.csv
│   │   │   test.csv
│   │     
│   └───processing 
│   │   │   __init__.py
│   │   │   data_manager.py
│   │   │   features.py
│   │  
│   └───trained_models 
│   │   │   __init__.py  
│
└───app-fastapi
│   ...
</code></pre></div></div>

<h3 id="steps-in-an-end-to-end-ml-project">Steps in An End-to-end ML Project</h3>

<ol>
  <li>Start with jupyter notebooks and finalize a model.</li>
  <li>Transform research code to production code.</li>
  <li>Make the project a package.</li>
  <li>Serve it via a REST API.</li>
  <li>Dockerize it and deploy it.</li>
</ol>

<h3 id="1-start-with-jupyter-notebooks-and-finalize-a-model">1. Start with jupyter notebooks and finalize a model</h3>

<p>The <code class="language-plaintext highlighter-rouge">notebooks</code> folder contains the research work, which is typically done by a Data Scientist.</p>

<p>Usually, a Data Analysis notebook for EDA and data understanding is the first step.
Then, features are created in a pipeline. Here, scikit-learn and feature-engine were used.
Finally, the ML model is placed at the end of the pipeline.</p>

<p>Research can be very time-consuming. Here, a simple pipeline is created, 
because the creation of a 95% accuracy model is out of the scope of this work.</p>

<h3 id="2-transform-research-code-to-production-code">2. Transform research code to production code</h3>

<p>The <code class="language-plaintext highlighter-rouge">src</code> folder contains the transformation of the Jupyter notebooks into a Python project.</p>

<p>Some good practices:</p>
<ul>
  <li>Create a <code class="language-plaintext highlighter-rouge">config.yml</code> file that contains all the constants and configurations derived from the notebooks. Accompany it with a .py file to parse it (Here it is the <code class="language-plaintext highlighter-rouge">src/config/core.py</code>).</li>
  <li>Tidy all the extra functions you have written and place them in a <code class="language-plaintext highlighter-rouge">processing</code> folder. For example, <code class="language-plaintext highlighter-rouge">src/processing/data_manager.py</code> contains functions to load the data and to save, load, and remove the pipeline.</li>
  <li>Keep separate files for <code class="language-plaintext highlighter-rouge">train_pipeline.py</code> and <code class="language-plaintext highlighter-rouge">predict.py</code>.</li>
  <li>Always write small functions; they are easier to test and keep the code readable.</li>
  <li>Create a <code class="language-plaintext highlighter-rouge">trained_models</code> folder to deposit the models.</li>
  <li>Have a <code class="language-plaintext highlighter-rouge">VERSION</code> file, to track the version of the project, e.g. 0.0.4</li>
  <li>Write <code class="language-plaintext highlighter-rouge">tests</code>. Now write more tests.</li>
  <li>Add a <code class="language-plaintext highlighter-rouge">tox.ini</code> file to make life easier: run tests faster and automate styling, type checks, linting, and PEP 8 compliance.</li>
</ul>

<p>Note: To import your Python modules as packages from other files, add the project’s root directory to the <code class="language-plaintext highlighter-rouge">PYTHONPATH</code> environment variable (or to <code class="language-plaintext highlighter-rouge">sys.path</code> at runtime).</p>
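<p>A common runtime alternative is to prepend the project root to <code class="language-plaintext highlighter-rouge">sys.path</code> from the entry-point script. The helper below is illustrative, not taken from the repository:</p>

```python
import sys
from pathlib import Path

def add_project_root(anchor: str, levels_up: int = 1) -> str:
    """Insert the directory `levels_up` parents above `anchor` at the front of
    sys.path, so `import src` works no matter where the script is launched from."""
    root = str(Path(anchor).resolve().parents[levels_up - 1])
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

<p>A script inside the repository would then call <code class="language-plaintext highlighter-rouge">add_project_root(__file__)</code> before importing <code class="language-plaintext highlighter-rouge">src</code>.</p>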

<h3 id="3-make-the-project-a-package">3. Make the project a package</h3>

<p>We need 3 files in the root of the project:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">MANIFEST.in</code>: Define which files to include in and exclude from the package.</li>
  <li><code class="language-plaintext highlighter-rouge">pyproject.toml</code>: Specify basic dependencies and configure tooling.</li>
  <li><code class="language-plaintext highlighter-rouge">setup.py</code>: Package metadata, version, requirements, how to create the package.</li>
</ol>

<p>From the project directory: <code class="language-plaintext highlighter-rouge">python -m build</code></p>

<p>Then, make an account to PyPI. Install twine: <code class="language-plaintext highlighter-rouge">pip install twine</code></p>

<p>Upload: <code class="language-plaintext highlighter-rouge">twine upload dist/end_to_end_ML_project-0.0.4-py3-none-any.whl</code></p>

<p>Now the package can be installed like any other package with <code class="language-plaintext highlighter-rouge">pip install end-to-end-ML-project</code></p>

<p>It can be imported like: <code class="language-plaintext highlighter-rouge">import src</code></p>

<h3 id="4-serve-it-via-a-rest-api">4. Serve it via a REST API</h3>

<p>The API should be a different repository or at least a different folder. Here it is located in the folder <code class="language-plaintext highlighter-rouge">app-fastapi</code>.</p>

<p>The first thing to note is the <code class="language-plaintext highlighter-rouge">requirements.txt</code>, where we declare the <code class="language-plaintext highlighter-rouge">end-to-end-ML-project</code> package,
which we published earlier, as a dependency.</p>

<p>The three key files of the API are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">config.py</code>: Specifies API metadata and logging settings.</li>
  <li><code class="language-plaintext highlighter-rouge">main.py</code>: Defines the main app and the index page router.</li>
  <li><code class="language-plaintext highlighter-rouge">api.py</code>: Defines a health and a predict endpoint.</li>
</ul>

<p>We define some <code class="language-plaintext highlighter-rouge">schemas</code> for automatic validation of variable types.</p>
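<p>What such a schema buys us can be sketched in plain Python (the field names below are hypothetical; the real schemas are pydantic models, which FastAPI validates automatically on every request):</p>

```python
# Hypothetical request schema: field name -> expected type.
SCHEMA = {"city_development_index": float, "training_hours": int}

def validate(payload: dict, schema: dict = SCHEMA) -> dict:
    """Reject payloads with missing fields or wrong variable types."""
    for name, expected in schema.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    return payload
```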

<p>We also define <code class="language-plaintext highlighter-rouge">tests</code> with predefined input data to predict.</p>

<p>We also use <code class="language-plaintext highlighter-rouge">logging</code> and the package <code class="language-plaintext highlighter-rouge">loguru</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">Procfile</code> and <code class="language-plaintext highlighter-rouge">runtime.txt</code> are necessary files to deploy on Heroku.</p>

<h3 id="5-dockerize-it-and-deploy-it">5. Dockerize it and deploy it</h3>

<p>We create a <code class="language-plaintext highlighter-rouge">Dockerfile</code> and build the image:</p>

<p><code class="language-plaintext highlighter-rouge">docker build -t end-to-end-ML-project:latest .</code></p>

<p>We run the image:</p>

<p><code class="language-plaintext highlighter-rouge">docker run -p 8001:8001 -e PORT=8001 end-to-end-ml-project</code></p>

<p>We can see the output on localhost:8001/</p>

<p>Now to deploy on Heroku, create a <code class="language-plaintext highlighter-rouge">heroku.yml</code> file.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku login
heroku container:login
heroku container:push web --app end-to-end-ml-project
heroku container:release web --app end-to-end-ml-project
heroku open --app end-to-end-ml-project
</code></pre></div></div>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Deployment" /><category term="MLOps" /><category term="Feature Engineering" /><category term="fastapi" /><category term="sklearn" /><category term="machine learning" /><category term="docker" /><category term="aws" /><summary type="html"><![CDATA[This project aims to apply the best software engineering practices in a Machine Learning project in order to deploy the model.]]></summary></entry><entry><title type="html">MLOps Lifecycle with MLFlow, Airflow, Amazon S3, PostgreSQL</title><link href="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow/" rel="alternate" type="text/html" title="MLOps Lifecycle with MLFlow, Airflow, Amazon S3, PostgreSQL" /><published>2023-01-01T00:00:00+00:00</published><updated>2023-01-01T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/deployment/mlops/feature%20engineering/mlops-pipeline-with-mlflow-postgresSQL-amazonS3-airflow/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/MLOps">GitHub</a></strong></p>

<h1 id="mlops-pipeline-with-------">MLOps Pipeline with <a href="https://mlflow.org/" target="_blank"><img alt="MLflow" src="https://img.shields.io/badge/-MLflow-0194E2?style=flat-square&amp;logo=mlflow&amp;logoColor=white" height="27" /></a> + <a href="https://www.postgresql.org/" target="_blank"><img alt="PostgreSQL" src="https://img.shields.io/badge/-PostgreSQL-4169E1?style=flat-square&amp;logo=postgresql&amp;logoColor=white" height="27" /></a> + <a href="https://aws.amazon.com/s3/" target="_blank"><img alt="Amazon S3" src="https://img.shields.io/badge/-Amazon S3-569A31?style=flat-square&amp;logo=amazons3&amp;logoColor=white" height="27" /></a> + <a href="https://airflow.apache.org/" target="_blank"><img alt="Apache Airflow" src="https://img.shields.io/badge/-Apache Airflow-017CEE?style=flat-square&amp;logo=apacheairflow&amp;logoColor=white" height="27" /></a></h1>

<p>A complete Machine Learning lifecycle. The pipeline is as follows:</p>

<p><code class="language-plaintext highlighter-rouge">1. Read Data</code>➙<code class="language-plaintext highlighter-rouge">2. Split train-test</code>➙<code class="language-plaintext highlighter-rouge">3. Preprocess Data</code>➙<code class="language-plaintext highlighter-rouge">4. Train Model</code>➙<br />
      ➙ <code class="language-plaintext highlighter-rouge">5.1 Register Model</code><br />
      ➙ <code class="language-plaintext highlighter-rouge">5.2 Update Registered Model</code><br /></p>

<p>Telco Customer Churn dataset from <a href="https://www.kaggle.com/datasets/blastchar/telco-customer-churn" target="_blank">Kaggle</a>.</p>

<h2 id="tech-stack">Tech Stack</h2>
<p><a href="https://mlflow.org/" target="_blank"><img alt="MLflow" src="https://img.shields.io/badge/-MLflow-0194E2?style=flat-square&amp;logo=mlflow&amp;logoColor=white" height="20" /></a>: For experiment tracking and model registration<br />
<a href="https://www.postgresql.org/" target="_blank"><img alt="PostgreSQL" src="https://img.shields.io/badge/-PostgreSQL-4169E1?style=flat-square&amp;logo=postgresql&amp;logoColor=white" height="20" /></a>: Store the MLflow tracking<br />
<a href="https://aws.amazon.com/s3/" target="_blank"><img alt="Amazon S3" src="https://img.shields.io/badge/-Amazon S3-569A31?style=flat-square&amp;logo=amazons3&amp;logoColor=white" height="20" /></a>: Store the registered MLflow models and artifacts<br />
<a href="https://airflow.apache.org/" target="_blank"><img alt="Apache Airflow" src="https://img.shields.io/badge/-Apache Airflow-017CEE?style=flat-square&amp;logo=apacheairflow&amp;logoColor=white" height="20" /></a>: Orchestrate the MLOps pipeline<br />
<a href="https://scikit-learn.org/stable/index.html" target="_blank"><img alt="Scikit-learn" src="https://img.shields.io/badge/-Sklearn-fa9c3c?style=flat-square&amp;logo=scikitlearn&amp;logoColor=white" height="20" /></a>: Machine Learning<br />
<a href="https://jupyter.org/" target="_blank"><img alt="Jupyter" src="https://img.shields.io/badge/-Jupyter-eb6c2d?style=flat-square&amp;logo=jupyter&amp;logoColor=white" height="20" /></a>: R&amp;D<br />
<a href="https://www.python.org/" target="_blank"><img alt="Python" src="https://img.shields.io/badge/-Python-4B8BBE?style=flat-square&amp;logo=python&amp;logoColor=white" height="20" /></a>
<a href="https://www.anaconda.com/" target="_blank"><img alt="Anaconda" src="https://img.shields.io/badge/-Anaconda-3EB049?style=flat-square&amp;logo=anaconda&amp;logoColor=white" height="20" /></a>
<a href="https://www.jetbrains.com/pycharm/" target="_blank"><img alt="PyCharm" src="https://img.shields.io/badge/-PyCharm-41c473?style=flat-square&amp;logo=pycharm&amp;logoColor=white" height="20" /></a>
<a href="https://www.docker.com/" target="_blank"><img alt="Docker" src="https://img.shields.io/badge/-Docker Compose-0db7ed?style=flat-square&amp;logo=docker&amp;logoColor=white" height="20" /></a>
<a href="https://git-scm.com/" target="_blank"><img alt="Git" src="https://img.shields.io/badge/-Git-F1502F?style=flat-square&amp;logo=git&amp;logoColor=white" height="20" /></a></p>

<h2 id="how-to-reproduce">How to reproduce</h2>

<ol>
  <li>Have <a href="https://docs.docker.com/get-docker/" target="_blank">Docker</a> installed and running.</li>
</ol>

<p>Make sure <code class="language-plaintext highlighter-rouge">docker-compose</code> is installed:</p>
<pre><code class="language-commandline">pip install docker-compose
</code></pre>

<ol start="2">
  <li>Clone the repository to your machine.
    <pre><code class="language-commandline">git clone https://github.com/Deffro/MLOps.git
</code></pre>
  </li>
  <li>Rename <code class="language-plaintext highlighter-rouge">.env_sample</code> to <code class="language-plaintext highlighter-rouge">.env</code> and change the following variables:
    <ul>
      <li>AWS_ACCESS_KEY_ID</li>
      <li>AWS_SECRET_ACCESS_KEY</li>
      <li>AWS_REGION</li>
      <li>AWS_BUCKET_NAME</li>
    </ul>
  </li>
  <li>Run the docker-compose file</li>
</ol>

<pre><code class="language-commandline">docker-compose up --build -d
</code></pre>

<h2 id="urls-to-access">Urls to access</h2>

<ul>
  <li><a href="http://localhost:8080" target="_blank">http://localhost:8080</a> for <code class="language-plaintext highlighter-rouge">Airflow</code>. Use credentials: airflow/airflow</li>
  <li><a href="http://localhost:5000" target="_blank">http://localhost:5000</a> for <code class="language-plaintext highlighter-rouge">MLflow</code>.</li>
  <li><a href="http://localhost:8893" target="_blank">http://localhost:8893</a> for <code class="language-plaintext highlighter-rouge">Jupyter Lab</code>. Use token: mlops</li>
</ul>

<h2 id="cleanup">Cleanup</h2>
<p>Run the following to stop all running docker containers through docker compose</p>
<pre><code class="language-commandline">docker-compose stop
</code></pre>

<p>or run the following to stop and delete all running docker containers through docker</p>
<pre><code class="language-commandline">docker stop $(docker ps -q)
docker rm $(docker ps -aq)
</code></pre>

<p>Finally, run the following to delete all (named) volumes</p>
<pre><code class="language-commandline">docker volume rm $(docker volume ls -q)
</code></pre>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Deployment" /><category term="MLOps" /><category term="Feature Engineering" /><category term="mlflow" /><category term="airflow" /><category term="sklearn" /><category term="machine learning" /><category term="docker" /><category term="aws" /><category term="sql" /><summary type="html"><![CDATA[Full Machine Learning Lifecycle using Airflow, MLflow, and AWS S3.]]></summary></entry><entry><title type="html">Ensemble Feature Selection for Machine Learning</title><link href="https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning/" rel="alternate" type="text/html" title="Ensemble Feature Selection for Machine Learning" /><published>2022-11-02T00:00:00+00:00</published><updated>2022-11-02T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20selection/ensemble-feature-selection-for-machine-learning/"><![CDATA[<p><strong>The project is available online on <a href="https://towardsdatascience.com/ensemble-feature-selection-for-machine-learning-c0df77b970f9">Towards Data Science</a></strong>.</p>

<p>Most of the content of this article is from my paper entitled
“An Evaluation of Feature Selection Methods for Environmental Data,” which is available <a href="https://www.sciencedirect.com/science/article/abs/pii/S1574954121000157">here</a> for anyone interested.</p>

<p>In my previous <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">article</a>, I presented 12 individual feature selection methods. This article serves as the next step to feature selection.</p>

<p>If you are here, I suppose you are already familiar with the well-established practice of ensembling classification algorithms, which provides better results and more robustness than employing single algorithms. The same principle can be applied to feature selection algorithms.</p>

<h2 id="ensemble-feature-selection">Ensemble Feature Selection</h2>
<p>The idea behind ensemble feature selection is to combine multiple different feature selection methods, taking advantage of their individual strengths, to create an optimal subset.
In general, it produces a better feature space and reduces the risk of selecting an unstable subset.
Besides, a single method may result in a subset that is a local optimum, while an ensemble can provide more stable results.</p>
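<p>One simple aggregation scheme is mean-rank voting; the sketch below is illustrative (the paper evaluates several combination strategies). Each method ranks the same features, and the features with the lowest average position form the ensemble subset:</p>

```python
def ensemble_select(rankings, top_k):
    """Combine feature rankings from several selection methods by mean rank
    (position 0 = most important) and keep the top_k features overall."""
    features = rankings[0]
    mean_rank = {f: sum(r.index(f) for r in rankings) / len(rankings)
                 for f in features}
    return sorted(features, key=lambda f: mean_rank[f])[:top_k]
```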

<p>Continue your read on <a href="https://towardsdatascience.com/ensemble-feature-selection-for-machine-learning-c0df77b970f9">Towards Data Science</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Selection" /><category term="features" /><category term="machine learning" /><summary type="html"><![CDATA[Investigate if ensemble feature selection methods are superior of individual in machine learning classification problems.]]></summary></entry><entry><title type="html">YouTube Video Resolution Downgrade Classification</title><link href="https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification/" rel="alternate" type="text/html" title="YouTube Video Resolution Downgrade Classification" /><published>2022-04-14T00:00:00+00:00</published><updated>2022-04-14T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20engineering/feature%20selection/data%20processing/exploratory%20data%20analysis/youtube-resolution-downgrade-classification/"><![CDATA[<p><strong>The project is available on my <a href="https://github.com/Deffro/Data-Science-Portfolio/tree/master/Notebooks/YouTube%20Video%20Resolution%20Downgrade%20Classification">GitHub</a></strong></p>

<ol>
  <li>Binary classification on data with a varying number of features</li>
  <li>Visualize, analyze, and understand the data</li>
  <li>Engineer new features</li>
  <li>Try 8 approaches to create a model for classification</li>
  <li>Perform Ensemble Feature Selection</li>
  <li>Explain feature importance with SHAP</li>
</ol>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Engineering" /><category term="Feature Selection" /><category term="Data Processing" /><category term="Exploratory Data Analysis" /><category term="pandas" /><category term="visualization" /><category term="features" /><category term="sklearn" /><category term="machine learning" /><category term="EDA" /><summary type="html"><![CDATA[Feature engineering, selection, visualization, model development, feature explainability.]]></summary></entry><entry><title type="html">Feature Selection for Machine Learning: 3 Categories and 12 Methods</title><link href="https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning/" rel="alternate" type="text/html" title="Feature Selection for Machine Learning: 3 Categories and 12 Methods" /><published>2021-06-09T00:00:00+00:00</published><updated>2021-06-09T00:00:00+00:00</updated><id>https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning</id><content type="html" xml:base="https://deffro.github.io/machine%20learning/feature%20selection/feature-selection-for-machine-learning/"><![CDATA[<p><strong>The project is available online on <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">Towards Data Science</a></strong>.</p>

<p>Most of the content of this article comes from my recent paper,
“An Evaluation of Feature Selection Methods for Environmental Data”, available <a href="https://www.sciencedirect.com/science/article/abs/pii/S1574954121000157">here</a> for anyone interested.</p>

<h2 id="the-2-approaches-for-dimensionality-reduction">The 2 approaches for Dimensionality Reduction</h2>
<p>There are two ways to reduce the number of features, a process otherwise known as dimensionality reduction.</p>

<p>The first way is called feature extraction, and it aims to transform the features and create entirely new ones based on combinations of the raw/given ones.
The most popular approaches are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multidimensional Scaling. However, the new feature space can hardly provide us with useful information about the original features.
The new higher-level features are not easily understood by humans, because we cannot link them directly to the initial ones, making it difficult to draw conclusions and explain the variables.</p>
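<p>A minimal scikit-learn sketch, on hypothetical toy data, makes this concrete: each principal component is a weighted mix of all original features, so it has no direct interpretation.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 original features

pca = PCA(n_components=2).fit(X)
X_new = pca.transform(X)       # entirely new features (principal components)

# Each component holds weights over all 5 original features,
# so no single original feature explains a component
print(pca.components_.shape)   # (2, 5)
print(X_new.shape)             # (100, 2)
```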

<p>The second way to achieve dimensionality reduction is feature selection.
It can be considered a pre-processing step and does not create any new features, but instead selects a subset of the raw ones, providing better interpretability.
Finding the best features from a large initial number can help us extract valuable information and discover new knowledge.
In classification problems, the significance of features is evaluated by their ability to discriminate between distinct classes.
The property that estimates each feature’s usefulness in discriminating the distinct classes is called feature relevance.</p>
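<p>As a minimal illustration with a univariate filter (assuming scikit-learn; the article itself covers 12 methods), relevance scores are computed per feature and only the highest-scoring raw features are kept:</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy classification data: 8 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, n_redundant=0, random_state=0)

# Score each feature's relevance to the classes (ANOVA F-value),
# then keep the 3 highest-scoring raw features; no new features are created
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected raw features
print(selector.transform(X).shape)         # (200, 3)
```

<p>Because the selected columns are original features, their meaning is preserved, which is exactly the interpretability advantage over feature extraction.</p>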

<p>Continue reading on <a href="https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543">Towards Data Science</a>.</p>]]></content><author><name>Dimitris Effrosynidis</name></author><category term="Machine Learning" /><category term="Feature Selection" /><category term="features" /><category term="machine learning" /><summary type="html"><![CDATA[Learn basic theory about the 3 types of feature selection in machine learning namely filters, wrappers, and embedders.]]></summary></entry></feed>