DEV Community: Clelia (Astra) Bertelli. The latest articles on DEV Community by Clelia (Astra) Bertelli (@astrabert). https://dev.to/astrabert Build Code-RAGent, an agent for your codebase Clelia (Astra) Bertelli Tue, 29 Apr 2025 20:20:28 +0000 https://dev.to/astrabert/build-code-ragent-an-agent-for-your-codebase-5dh8 <h2> Introduction </h2> <p>Recently, I've been hooked on automating data ingestion into vector databases, and I came up with <a href="https://github.com/AstraBert/ingest-anything" rel="noopener noreferrer"><code>ingest-anything</code></a>, which I talked about in my <a href="https://dev.to/astrabert/ingest-almost-any-non-pdf-document-in-a-vector-database-effortlessly-547c">last post</a>.<br> After <a href="https://chonkie.ai" rel="noopener noreferrer">Chonkie</a> released <a href="https://chonkie.mintlify.app/chunkers/code-chunker" rel="noopener noreferrer"><code>CodeChunker</code></a>, I decided to include code ingestion within <code>ingest-anything</code>, and you can read about it in the LinkedIn post where I announced the new release:</p> <p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7322236934569775104/" rel="noopener noreferrer"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5raouf12c67d3jb7cfr.png" alt="image" width="800" height="991"></a></p> <p>The only thing left to do was to build something that could showcase the power of code ingestion
within a vector database, and it immediately clicked in my mind: "Why don't I ingest my entire codebase of solved Go exercises from <a href="https://exercism.org" rel="noopener noreferrer">Exercism</a>?"<br> That's how I created <a href="https://github.com/AstraBert/code-ragent" rel="noopener noreferrer">Code-RAGent</a>, your friendly coding assistant based on your personal codebases and grounded in web search. It is built on top of GPT-4.1 and powered by <a href="https://openai.com" rel="noopener noreferrer">OpenAI</a>, <a href="https://linkup.so" rel="noopener noreferrer">LinkUp</a>, <a href="https://www.llamaindex.ai" rel="noopener noreferrer">LlamaIndex</a>, <a href="https://qdrant.tech" rel="noopener noreferrer">Qdrant</a>, <a href="https://fastapi.tiangolo.com" rel="noopener noreferrer">FastAPI</a> and <a href="https://streamlit.io" rel="noopener noreferrer">Streamlit</a>. <br> This project aims to provide a reproducible and adaptable agent that people can customize based on their needs, and building it was composed of three phases:</p> <ul> <li>Environment setup</li> <li>Data preparation and ingestion</li> <li>Agent workflow design</li> </ul> <h2> Environment Setup </h2> <p>I personally like setting up my environment using <a href="https://docs.conda.io/projects/conda/en/latest/index.html" rel="noopener noreferrer">conda</a>, also because it's easily dockerizable, so we'll follow this path:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>conda create -y -n code-ragent python=3.11  # you don't necessarily need to specify 3.11, it's for reproducibility purposes
conda activate code-ragent
</code></pre> </div> <p>Now let's install all the needed packages within our
environment:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>python3 -m pip install ingest-anything streamlit
</code></pre> </div> <p><code>ingest-anything</code> already wraps all the packages we need to get our Code-RAGent up and running; we just need to add <code>streamlit</code>, which we'll use to create the frontend.</p> <p>Let's also get a local Qdrant instance, as our vector database, using Docker:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:latest
</code></pre> </div> <h2> Data ingestion </h2> <p>The starting data, as I said earlier, will be my <a href="https://github.com/AstraBert/learning-go" rel="noopener noreferrer">learning-go</a> repository, which contains solved Go exercises from Exercism. We can get the repository by cloning it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git clone https://github.com/AstraBert/learning-go
</code></pre> </div> <p>Now we can collect all the Go files it contains in our Python script, as follows:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code>import os

files = []
for root, _, fls in os.walk("./learning-go"):
    for f in fls:
        if f.endswith(".go"):
            files.append(os.path.join(root, f))
</code></pre> </div> <p>Now let's ingest all the files with <code>ingest-anything</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code>from ingest_anything.ingestion import IngestCode
from qdrant_client import QdrantClient, AsyncQdrantClient

client = QdrantClient("http://localhost:6333")
aclient = AsyncQdrantClient("http://localhost:6333")

ingestor = IngestCode(
    qdrant_client=client,
    async_qdrant_client=aclient,
    collection_name="go-code",
    hybrid_search=True,
)

vector_index = ingestor.ingest(
    files=files,
    embedding_model="Shuu12121/CodeSearch-ModernBERT-Owl",
    language="go",
)
</code></pre> </div> <p>And this is it: the collection <code>go-code</code> is now set up and available for search within Qdrant, so we can now get our hands on agent workflow design.</p> <h2> Agent workflow design </h2> <p>This is a visualization of the Code-RAGent workflow:</p> <p><a href="https://github.com/AstraBert/code-ragent" rel="noopener noreferrer"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5npo17mpuz5lh112fint.png" alt="workflow" width="800" height="450"></a><br> We won't go into the details of the code here, just the high-level concepts, but you can find everything in <a href="https://github.com/AstraBert/code-ragent/blob/main/agent.py" rel="noopener noreferrer">the GitHub repo</a>.</p> <h3> 1. 
Tools </h3> <p>We need three main tools:</p> <ul> <li> <strong><code>vector_search_tool</code></strong>, which searches the vector database using a <a href="https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/" rel="noopener noreferrer">LlamaIndex Query Engine</a> that first produces a hypothetical document embedding (<a href="https://medium.aiplanet.com/advanced-rag-improving-retrieval-using-hypothetical-document-embeddings-hyde-1421a8ec075a" rel="noopener noreferrer">HyDE</a>) and then matches it against the database using hybrid retrieval, producing a final summarized response.</li> <li> <strong><code>web_search_tool</code></strong>, which grounds solutions in web search: we exploit <a href="https://linkup.so" rel="noopener noreferrer">Linkup</a>, and we format the search results so that the tool always produces a code explanation and, when necessary, a code snippet.</li> <li> <strong><code>evaluate_response</code></strong>, which assigns correctness, faithfulness and relevancy scores to the agent's final response, based on the original user query and on the retrieved context (either from the web or from vector search). For this purpose, we use <a href="https://docs.llamaindex.ai/en/stable/optimizing/evaluation/evaluation/" rel="noopener noreferrer">LlamaIndex evaluators</a>.</li> </ul> <h3> 2. 
Designing and serving the agent </h3> <p>We use a simple and straightforward <a href="https://docs.llamaindex.ai/en/stable/examples/agent/agent_workflow_basic/" rel="noopener noreferrer">Function Calling Agent</a> within the Agent Workflow module in LlamaIndex, and we give the agent access to all the tools designed in point (1).</p> <p>Now it's just a matter of deploying the agent on an API endpoint, making it available to the frontend portion of our application: we do this via <a href="https://fastapi.tiangolo.com" rel="noopener noreferrer">FastAPI</a>, serving the agent under the <code>/chat</code> POST endpoint. </p> <h3> 3. User Interface </h3> <p>The UI, written with Streamlit, can be set up like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code>import streamlit as st
import requests as rq
from pydantic import BaseModel


class ApiInput(BaseModel):
    prompt: str


def get_chat(prompt: str):
    response = rq.post("http://backend:8000/chat/", json=ApiInput(prompt=prompt).model_dump())
    actual_res = response.json()["response"]
    actual_proc = response.json()["proces"]
    return actual_res, actual_proc


st.title("Code RAGent💻")

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What is up?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        stream, proc = get_chat(
            prompt=st.session_state.messages[-1]["content"],
        )
        response = st.write(stream)
        st.session_state.messages.append({"role": "assistant", "content": stream})
        with st.expander("See Agentic Process"):
            st.write(proc)
</code></pre> </div> <p>And it will result in something like this:</p> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug1tnjfwi5h2ezglr7fk.png" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug1tnjfwi5h2ezglr7fk.png" alt="UI" width="800" height="923"></a></p> <p>Clean and simple!</p> <h2> Conclusion </h2> <p>To wrap this article up, let me 
just highlight three main points that have the potential to make Code-RAGent a very good codebase assistant:</p> <ul> <li>The codebase is ingested with a dedicated pipeline, using code-aware chunking as well as a dense embedding model fine-tuned for code retrieval</li> <li>The agent can fall back on web search whenever the information you ask for is outside the scope of your ingested codebase</li> <li>It evaluates the responses it produces</li> </ul> <p>That being said, this is just a tutorial-ready agentic system, far from perfect, so if you have any feedback or suggestions, just let me know! ✨</p> ai python vectordatabase rag Ingest (almost) any non-PDF document in a vector database, effortlessly Clelia (Astra) Bertelli Fri, 25 Apr 2025 17:32:51 +0000 https://dev.to/astrabert/ingest-almost-any-non-pdf-document-in-a-vector-database-effortlessly-547c <p>One of my recent areas of focus has been the development of a universal, zero-effort way of converting text-based documents (and even images) into PDF files, so that they could fit into my RAG pipelines, which are optimized for that format. In the end, after almost 30 "This is the last <code>git commit</code>" moments, I came up with <a href="https://github.com/AstraBert/PdfItDown" rel="noopener noreferrer">PdfItDown</a>, a Python package capable of transforming the most commonly used file formats into PDF, and it can do so with single or multiple files (and even entire folders!). <br> After that, though, I wasn't satisfied: converting files to PDF is fine, but the converted files are still unplugged from the main <em>ingest-into-DB</em> pipeline, which might still take a lot of effort to design and optimize. 
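<p>To make the "single or multiple files (and even entire folders)" behavior concrete, here is a minimal, self-contained sketch of the kind of batch collection such a converter has to do. This is illustrative pure Python, not PdfItDown's actual API: the <code>SUPPORTED</code> set and the function name are hypothetical.</p>

```python
from pathlib import Path

# Hypothetical subset of text-based formats a PDF converter might accept
SUPPORTED = {".docx", ".md", ".csv", ".json", ".xml", ".html", ".txt"}

def collect_convertible(files_or_dir):
    """Accept a single path, a list of paths, or a directory,
    and return the files whose extension is supported."""
    if isinstance(files_or_dir, (str, Path)):
        p = Path(files_or_dir)
        paths = list(p.rglob("*")) if p.is_dir() else [p]
    else:
        paths = [Path(f) for f in files_or_dir]
    return sorted(p for p in paths if p.suffix.lower() in SUPPORTED)
```

<p>The same single-path/list/directory dispatch is what lets a tool like this feel "zero-effort" from the caller's side.</p>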
Then came the idea: why not create a standardized, simple yet powerful, fully automated procedure to go from a non-PDF file to vector data loaded into a database?<br> The tools were already there:</p> <ul> <li>PdfItDown can handle file transformation</li> <li> <a href="https://www.llamaindex.ai">LlamaIndex</a> has the readers to turn PDFs into text</li> <li> <a href="https://docs.chonkie.ai" rel="noopener noreferrer">Chonkie</a> offers a versatile and mighty chunking toolbox</li> <li> <a href="https://sbert.net" rel="noopener noreferrer">Sentence Transformers</a> is a widely used embeddings library that can provide text encoders</li> <li> <a href="https://qdrant.tech" rel="noopener noreferrer">Qdrant</a> is an easy-to-set-up, high-performing and scalable vector database that offers numerous features (including hybrid search and metadata filtering).</li> </ul> <blockquote> <p><em>What's even better? 
All these tools are open source!🎉</em></p> </blockquote> <p>So it was just a matter of combining them - and that's how <a href="https://github.com/AstraBert/ingest-anything" rel="noopener noreferrer">ingest-anything</a> came to life:</p> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9r2un44yzcrnykpszc5.png" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9r2un44yzcrnykpszc5.png" alt="ingest-anything workflow" width="800" height="436"></a></p> <p>Simple, elegant and all-in-one!</p> <p>Let's see how we can use it to ingest files:</p> <ol> <li>We install it: </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>pip install ingest-anything
# or, if you prefer a faster installation
uv pip install ingest-anything
</code></pre> </div> <ol start="2"> <li>We set up a local Qdrant instance with Docker: </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:latest
</code></pre> </div> <ol start="3"> <li>We initialize the ingestor: </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight python"><code>from ingest_anything.ingestion import IngestAnything, QdrantClient, AsyncQdrantClient

ingestor = IngestAnything(
    qdrant_client=QdrantClient("http://localhost:6333"),
    async_qdrant_client=AsyncQdrantClient("http://localhost:6333"),
    collection_name="flowers",
    hybrid_search=True,
)
</code></pre> </div> <ol start="4"> <li>We ingest our files...: </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight python"><code>ingestor.ingest(
    chunker="late",
    files_or_dir=[
        'tests/data/test.docx',
        'tests/data/test0.png',
        'tests/data/test1.csv',
        'tests/data/test2.json',
        'tests/data/test3.md',
        'tests/data/test4.xml',
        'tests/data/test5.zip',
    ],
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
</code></pre> </div> <ol start="5"> <li>...or an entire directory! </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight python"><code># with a directory
ingestor.ingest(
    chunker="token",
    files_or_dir="tests/data",
    tokenizer="gpt2",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
</code></pre> </div> <p>And we're done! 
In three lines of code we've ingested all our non-PDF files (from a list or from a directory) into a Qdrant collection, which we can now query for RAG purposes!</p> <p>As you can see, you can act on several levels: you can customize the embedding model, the chunking method (check out the <a href="https://docs.chonkie.ai/chunkers/overview" rel="noopener noreferrer">Chonkie docs</a> for this), the tokenizer (when necessary) and a lot of chunking parameters (which you can set or leave as defaults), and you can also turn hybrid search on and off, optionally choosing the sparse model among the ones available through FastEmbed.</p> <p>So what are you waiting for? Grab your PC and try this out: I can guarantee that speedrunning effortlessly through document ingestion into a vector DB is highly satisfying (and addictive)!</p> python ai opensource vectordatabase 1minDocker #14 - Deploy an AI app with Docker on cloud Clelia (Astra) Bertelli Thu, 06 Mar 2025 22:59:09 +0000 https://dev.to/astrabert/1mindocker-14-deploy-an-ai-app-with-docker-on-cloud-h27 <p>In the <a href="https://dev.to/astrabert/1mindocker-13-push-build-and-dockerize-with-github-actions-52gb">last article</a> we dove into the world of continuous integration with GitHub Actions. Now it's time to take a step forward and talk about deploying a Docker application and making it available to everyone. 
To do this, we could exploit a local server, but local servers are usually costly to set up, initialize and maintain (in the long run): cloud solutions are, on the other hand, simpler and faster to boot and set up, especially the one we are going to use for this tutorial, <a href="https://www.linode.com/" rel="noopener noreferrer"><strong>Linode</strong></a>.</p> <h2> Step 1: your Linode instance </h2> <p>Setting up a Linode instance couldn't be easier: you just need to sign up or log in to <a href="https://www.linode.com/" rel="noopener noreferrer">Linode</a>.</p> <p>Once you land on your dashboard, you just need to click on <code>Create</code> (the green button in the top left corner) -&gt; <code>Linode</code>. You will then be prompted to select the settings of your instance (operating system, region, name, root password and, optionally, an SSH key). The setup is extremely intuitive and, for our application, I'd suggest:</p> <ul> <li>Choose Ubuntu 22.04 as the OS</li> <li>Choose a 2 GB RAM / 1 vCPU hardware plan</li> <li>Choose the region closest to you</li> <li>Choose a strong password for your root user</li> </ul> <p>Once your instance is booted and running, you can connect to it from your terminal. Whether you are on Windows, macOS or Linux (although I prefer the last one), you can simply use the SSH protocol and authenticate with the root password. 
To do so, you need to get the public IP address of your Linode (which will also be useful later) - you can comfortably find it in your dashboard.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>ssh root@&lt;PUBLIC-IP-ADDRESS&gt; </code></pre> </div> <p>You'll be prompted to input the password and, after that, you'll finally be inside your Linode's terminal!</p> <h2> Step 2: preparing your Linode for the application </h2> <p>Since we want to deploy our application with Docker, we need to install it within our Linode virtual machine.</p> <p>If you followed my advice and created an Ubuntu 22.04 machine, you can simply run these commands, which you can also find on the official <a href="proxy.php?url=https://docs.docker.com/engine/install/ubuntu/" rel="noopener noreferrer">Docker installation page</a>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="c"># Add Docker's official GPG key:</span> <span class="nb">sudo </span>apt-get update <span class="nb">sudo </span>apt-get <span class="nb">install </span>ca-certificates curl <span class="nb">sudo install</span> <span class="nt">-m</span> 0755 <span class="nt">-d</span> /etc/apt/keyrings <span class="nb">sudo </span>curl <span class="nt">-fsSL</span> https://download.docker.com/linux/ubuntu/gpg <span class="nt">-o</span> /etc/apt/keyrings/docker.asc <span class="nb">sudo chmod </span>a+r /etc/apt/keyrings/docker.asc <span class="c"># Add the repository to Apt sources:</span> <span class="nb">echo</span> <span class="se">\</span> <span class="s2">"deb [arch=</span><span class="si">$(</span>dpkg <span class="nt">--print-architecture</span><span class="si">)</span><span class="s2"> signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu </span><span class="se">\</span><span class="s2"> </span><span class="si">$(</span><span class="nb">.</span> /etc/os-release <span class="o">&amp;&amp;</span> <span
class="nb">echo</span> <span class="s2">"</span><span class="k">${</span><span class="nv">UBUNTU_CODENAME</span><span class="k">:-</span><span class="nv">$VERSION_CODENAME</span><span class="k">}</span><span class="s2">"</span><span class="si">)</span><span class="s2"> stable"</span> | <span class="se">\</span> <span class="nb">sudo tee</span> /etc/apt/sources.list.d/docker.list <span class="o">&gt;</span> /dev/null <span class="nb">sudo </span>apt-get update </code></pre> </div> <p>After this, run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin </code></pre> </div> <p>Test the successful installation with:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">sudo </span>docker run hello-world </code></pre> </div> <p>And BAM! You installed Docker🐋</p> <h2> Step 3: Get the application </h2> <p>For this tutorial, I already prepared an application for you: it's called <strong>SciNewsBot</strong> and it's a BlueSky bot which publishes daily science news from trusted publishers. </p> <p>SciNewsBot uses <a href="proxy.php?url=https://mistral.ai/" rel="noopener noreferrer">Mistral AI</a> to summarize the titles and content of news from Google News publishers labelled as trustworthy by Media Bias/Fact Check into effective, catchy headlines. The news items span four domains (Science, Environment, Energy and Technology), and are scraped and published 4 times a day, with a pause of 3 hours in between and a pause of 12 hours from the last news report of one day to the first of the following day. 
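</p> <p>That publishing cadence can be sketched as a tiny pure function returning the pause (in seconds) to observe after each of the day's posting rounds. This is an illustrative sketch only, not SciNewsBot's actual code, and <code>publish_round</code> is a hypothetical placeholder:</p>

```python
def round_pauses(posts_per_day: int = 4,
                 between: int = 3 * 3600,
                 overnight: int = 12 * 3600) -> list:
    """Pauses after each posting round: 3 hours between
    rounds, then 12 hours after the last round of the day."""
    return [between] * (posts_per_day - 1) + [overnight]

# The bot's main loop would then look roughly like:
# while True:
#     for pause in round_pauses():
#         publish_round()  # hypothetical: scrape, summarize with Mistral, post to BlueSky
#         time.sleep(pause)
```

<p>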
You can see the bot working on <a href="proxy.php?url=https://bsky.app/profile/sci-news-bot.bsky.social" rel="noopener noreferrer">this page</a>.</p> <p>So, from within your Linode instance (which you connected to via SSH in the previous steps), clone the application from GitHub:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git clone https://github.com/AstraBert/SciNewsBot.git <span class="nb">cd </span>SciNewsBot/ </code></pre> </div> <p>Now the only things you need to do are:</p> <ol> <li>Get a <a href="proxy.php?url=https://console.mistral.ai/api-keys" rel="noopener noreferrer">Mistral AI API key</a> (you can create one for free)</li> <li>Create a BlueSky user for your bot, which you can do <a href="proxy.php?url=https://bsky.app/" rel="noopener noreferrer">here</a> </li> <li>Edit your <code>.env.example</code> file, filling in the Mistral API key, the BlueSky username and password </li> <li>Rename the <code>.env.example</code> file to <code>.env</code> with the following command: </li> </ol> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">mv</span> .env.example .env </code></pre> </div> <h2> Step 4: Deploy! </h2> <p>Now we're just one step away from deployment, and that step is launching our application through Docker. 
Let's take a look at the <code>compose.yaml</code> file that we have in the SciNewsBot folder:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">name</span><span class="pi">:</span> <span class="s">news-sci-bot</span> <span class="na">services</span><span class="pi">:</span> <span class="na">bot</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="na">context</span><span class="pi">:</span> <span class="s">./docker/</span> <span class="na">dockerfile</span><span class="pi">:</span> <span class="s">Dockerfile</span> <span class="na">secrets</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">mistral_key</span> <span class="pi">-</span> <span class="s">bsky_usr</span> <span class="pi">-</span> <span class="s">bsky_psw</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">mynet</span> <span class="na">secrets</span><span class="pi">:</span> <span class="na">mistral_key</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s">mistral_api_key</span> <span class="na">bsky_usr</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s">bsky_username</span> <span class="na">bsky_psw</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s">bsky_password</span> <span class="na">networks</span><span class="pi">:</span> <span class="na">mynet</span><span class="pi">:</span> <span class="na">driver</span><span class="pi">:</span> <span class="s">bridge</span> </code></pre> </div> <p>This file creates a container from the <code>docker</code> subfolder we have, mounting within it three environment-derived secrets, i.e. the Mistral AI API key, the BlueSky username and the password. 
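</p> <p>Inside the container, the scripts can read each mounted secret from <code>/run/secrets</code>. A minimal sketch of such a reader in Python (the helper name is hypothetical; the actual SciNewsBot code may differ):</p>

```python
import os

def read_secret(name: str, base: str = "/run/secrets") -> str:
    """Return the contents of a Docker secret file,
    e.g. read_secret('mistral_key') inside the container."""
    with open(os.path.join(base, name), encoding="utf-8") as f:
        return f.read().strip()
```

<p>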
It then attaches the container to a network named <code>mynet</code>.</p> <p>Each of these secrets is accessible through the path <code>/run/secrets/&lt;secret_name&gt;</code>, and that's how we access them in our Python scripts (we read these files).</p> <p>Now, to deploy we just need to run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose up <span class="nt">-d</span> </code></pre> </div> <blockquote> <p><em>Don't forget to put the <code>-d</code> option, otherwise you will kill the container execution once you exit from the Linode terminal: the <code>-d</code> option detaches the container execution from the main terminal, allowing you to close it without stopping the container.</em></p> </blockquote> <p>Congrats! You just deployed your first Docker application in the cloud!🎉</p> <p>We will stop here for this article, but in the next (and last) article we will see more complex deployment use cases and wrap up the 1minDocker journey: stay tuned and have fun!🥰</p> devops cloud ai tutorial 1minDocker #13 - Push, build and dockerize with GitHub Actions Clelia (Astra) Bertelli Thu, 23 Jan 2025 23:24:11 +0000 https://dev.to/astrabert/1mindocker-13-push-build-and-dockerize-with-github-actions-52gb https://dev.to/astrabert/1mindocker-13-push-build-and-dockerize-with-github-actions-52gb <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-12-what-is-cicd-2ap6">last article</a> we talked about CI/CD: but how do we put a CI/CD pipeline into practice in a hassle-free and very simple framework?</p> <p>Well, say no more! <strong>GitHub Actions</strong> has all that you need to set up a perfect environment to commit -&gt; build -&gt; dockerize your app. 
</p> <p>Let's dive in!</p> <blockquote> <p><em>Find the tutorial repo for this blog post <a href="proxy.php?url=https://github.com/AstraBert/hello-world-github-docker" rel="noopener noreferrer">here</a></em></p> </blockquote> <h2> Setting up - GitHub </h2> <p>We do not take anything for granted, so let's assume you don't have a GitHub account and you just want to start from scratch:</p> <ul> <li>Head over to <a href="proxy.php?url=https://github.com/signup" rel="noopener noreferrer">GitHub Signup</a> and register there with your email and password. You will also be asked to create a username</li> <li>Activating two-factor authentication (<a href="proxy.php?url=https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication" rel="noopener noreferrer">2FA</a>) is optional, but recommended</li> <li>Once you are signed in and set up, it's time to create your first repository!</li> </ul> <p>To create a repository, you generally have to click on a <code>New</code> or a <code>Create new repository</code> green button: you will be prompted to choose the visibility (whether the repo is public or private) and the name of the repository. I suggest that, for this tutorial, you create a <strong>public repository called <code>hello-world-github-docker</code></strong>. </p> <h2> Setting up - Application </h2> <p>Now let's build an application: we'll be using Python, a versatile programming language that you can use for lots of things, from building apps to data analysis, from creating websites to machine learning. </p> <p>Let's first of all clone the GitHub repository (i.e. 
make a local copy of it) with <code>git</code> (see how to install it <a href="proxy.php?url=https://git-scm.com/book/en/v2/Getting-Started-Installing-Git" rel="noopener noreferrer">here</a> if you don't have it)<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git clone https://github.com/username/hello-world-github-docker <span class="c"># Remember to change 'username' to your actual username!</span> <span class="nb">cd </span>hello-world-github-docker/ </code></pre> </div> <p>And now, let's create and start editing our <code>app.py</code> file (<code>.py</code> is the extension that Python scripts have):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">touch </span>app.py code app.py </code></pre> </div> <p>The <code>touch</code> command creates the file, whereas the <code>code</code> command opens the file within Visual Studio Code (an IDE, <em>Integrated Development Environment</em>) to modify it. You can obviously use different IDEs: no IDE is better than another one, as long as it works for you.</p> <p>Generally, a "hello world" application prints "Hello world!" 
on the terminal, like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Hello world!</span><span class="sh">"</span><span class="p">)</span> </code></pre> </div> <p>But we want to do something more: we don't like vanilla white text on our terminal, we want some 🌈color🌈.</p> <p>To do this, we simply need to install the <a href="proxy.php?url=https://pypi.org/project/termcolor/" rel="noopener noreferrer"><code>termcolor</code></a> package:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>pip <span class="nb">install </span>termcolor </code></pre> </div> <p>Now we just import and use it in our script:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="kn">from</span> <span class="n">termcolor</span> <span class="kn">import</span> <span class="n">cprint</span> <span class="c1"># cprint stands for "colored print", and allows us to print colored text </span> <span class="nf">cprint</span><span class="p">(</span><span class="sh">"</span><span class="s">Hello world!</span><span class="sh">"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="sh">"</span><span class="s">red</span><span class="sh">"</span><span class="p">)</span> </code></pre> </div> <p>Now the printed text will be red, but let's add some more spice, and let the program choose a random color to use in printing the string to the terminal:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="kn">from</span> <span class="n">termcolor</span> <span class="kn">import</span> <span class="n">cprint</span> <span class="kn">import</span> <span class="n">random</span> <span class="c1"># random is the library that allows us to extract random items from a list </span><span class="n">colors</span> <span
class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">red</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">green</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">blue</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">magenta</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">yellow</span><span class="sh">"</span><span class="p">]</span> <span class="n">color</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="nf">choice</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="nf">cprint</span><span class="p">(</span><span class="sh">"</span><span class="s">Hello world!</span><span class="sh">"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">)</span> </code></pre> </div> <p>Now our script will randomly choose one of the colors and print the "Hello world!" in that color :)</p> <blockquote> <p><em><strong>NOTE</strong>: <code>random</code> is a built-in library in Python, which means it comes with the language: that's why we did not need to install it.</em></p> </blockquote> <h2> Setting up - Docker </h2> <p>Ok, now we have our application: how can we dockerize it?</p> <p>Our first option could be to write down a Dockerfile, build an image from that and push it to Docker Hub. That's a viable solution, but we would need to re-build the image on our local machine and push it to Docker Hub every time we make a change to the app. While our "Hello world" app is relatively small and that would not be a burden for our computer, we would surely want to avoid this with bigger applications. 
</p> <p>Another solution could be, as we said, writing the Dockerfile, and then uploading our application to GitHub and letting the platform take care of building and pushing the image to <code>ghcr</code>, i.e. <em>GitHub Container Registry</em>, a registry where the Docker images built on GitHub (also known as <em>packages</em>) are stored. </p> <p>Since our goal is to exploit GitHub Actions, let's do that!</p> <p>The first thing we have to do is to create a <code>requirements.txt</code> file that lists all the necessary dependencies to install inside our Docker image. In our case, we only need <code>termcolor</code>, so we can just do it like this:</p> <ul> <li>Create and open the file for editing: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">touch </span>requirements.txt code requirements.txt </code></pre> </div> <ul> <li>Write the <code>termcolor</code> package in the file: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>termcolor </code></pre> </div> <p>Now let's build our <code>Dockerfile</code>. 
We want our image to contain <code>python</code>, and we want it also to install the needed dependencies, so let's define it like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="c"># Define your python version</span> <span class="k">ARG</span><span class="s"> PY_VERSION="3.11.9-slim-bookworm"</span> <span class="c"># Base image</span> <span class="k">FROM</span><span class="s"> python:${PY_VERSION}</span> <span class="c"># Define your working directory</span> <span class="k">WORKDIR</span><span class="s"> /app/</span> <span class="c"># Copy your local file system into the working directory</span> <span class="k">COPY</span><span class="s"> ./ /app/</span> <span class="c"># Install the necessary dependencies</span> <span class="k">RUN </span>pip cache purge <span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--no-cache-dir</span> <span class="nt">-r</span> requirements.txt <span class="c"># Run the application as an entrypoint</span> <span class="k">ENTRYPOINT</span><span class="s"> python3 app.py</span> </code></pre> </div> <p>Now our local folder structure will look like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>. |__ app.py |__ requirements.txt |__ Dockerfile </code></pre> </div> <p>The only thing that we need is to configure a workflow file that will trigger GitHub Actions and tell them to build and push the image. 
For this, we can use the pre-built template offered by GitHub.</p> <p>First of all, we need to create the file:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="c"># The file is placed in a special directory, .github/workflows</span> <span class="nb">mkdir</span> <span class="nt">-p</span> .github/workflows <span class="nb">touch</span> .github/workflows/docker-publish.yml code .github/workflows/docker-publish.yml </code></pre> </div> <p>As you can see, the workflow file is in YAML format, a human-readable data serialization format that allows you to specify all the steps you want in the build. Copy and paste the text below into the file you are now editing:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Docker</span> <span class="c1"># This workflow uses actions that are not certified by GitHub.</span> <span class="c1"># They are provided by a third-party and are governed by</span> <span class="c1"># separate terms of service, privacy policy, and support</span> <span class="c1"># documentation.</span> <span class="na">on</span><span class="pi">:</span> <span class="na">schedule</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s1">'</span><span class="s">40</span><span class="nv"> </span><span class="s">8</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span> <span class="na">push</span><span class="pi">:</span> <span class="na">branches</span><span class="pi">:</span> <span class="pi">[</span> <span class="s2">"</span><span class="s">main"</span> <span class="pi">]</span> <span class="c1"># Publish semver tags as releases.</span> <span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span> <span class="s1">'</span><span
class="s">v*.*.*'</span> <span class="pi">]</span> <span class="na">pull_request</span><span class="pi">:</span> <span class="na">branches</span><span class="pi">:</span> <span class="pi">[</span> <span class="s2">"</span><span class="s">main"</span> <span class="pi">]</span> <span class="na">env</span><span class="pi">:</span> <span class="c1"># Use docker.io for Docker Hub if empty</span> <span class="na">REGISTRY</span><span class="pi">:</span> <span class="s">ghcr.io</span> <span class="c1"># github.repository as &lt;account&gt;/&lt;repo&gt;</span> <span class="na">IMAGE_NAME</span><span class="pi">:</span> <span class="s">${{ github.repository }}</span> <span class="na">jobs</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span> <span class="na">permissions</span><span class="pi">:</span> <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span> <span class="na">packages</span><span class="pi">:</span> <span class="s">write</span> <span class="c1"># This is used to complete the identity challenge</span> <span class="c1"># with sigstore/fulcio when running outside of PRs.</span> <span class="na">id-token</span><span class="pi">:</span> <span class="s">write</span> <span class="na">steps</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout repository</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span> <span class="c1"># Install the cosign tool except on PR</span> <span class="c1"># https://github.com/sigstore/cosign-installer</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install cosign</span> <span class="na">if</span><span class="pi">:</span> <span class="s">github.event_name != 'pull_request'</span> <span 
class="na">uses</span><span class="pi">:</span> <span class="s">sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20</span> <span class="c1">#v3.5.0</span> <span class="na">with</span><span class="pi">:</span> <span class="na">cosign-release</span><span class="pi">:</span> <span class="s1">'</span><span class="s">v2.2.4'</span> <span class="c1"># Set up BuildKit Docker container builder to be able to build</span> <span class="c1"># multi-platform images and export cache</span> <span class="c1"># https://github.com/docker/setup-buildx-action</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Docker Buildx</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">docker/setup-buildx-action@f95db51fddba0c2d1ec667646a06c2ce06100226</span> <span class="c1"># v3.0.0</span> <span class="c1"># Login against a Docker registry except on PR</span> <span class="c1"># https://github.com/docker/login-action</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Log into registry ${{ env.REGISTRY }}</span> <span class="na">if</span><span class="pi">:</span> <span class="s">github.event_name != 'pull_request'</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">docker/login-action@343f7c4344506bcbf9b4de18042ae17996df046d</span> <span class="c1"># v3.0.0</span> <span class="na">with</span><span class="pi">:</span> <span class="na">registry</span><span class="pi">:</span> <span class="s">${{ env.REGISTRY }}</span> <span class="na">username</span><span class="pi">:</span> <span class="s">${{ github.actor }}</span> <span class="na">password</span><span class="pi">:</span> <span class="s">${{ secrets.GITHUB_TOKEN }}</span> <span class="c1"># Extract metadata (tags, labels) for Docker</span> <span class="c1"># https://github.com/docker/metadata-action</span> <span class="pi">-</span> <span class="na">name</span><span 
class="pi">:</span> <span class="s">Extract Docker metadata</span> <span class="na">id</span><span class="pi">:</span> <span class="s">meta</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">docker/metadata-action@96383f45573cb7f253c731d3b3ab81c87ef81934</span> <span class="c1"># v5.0.0</span> <span class="na">with</span><span class="pi">:</span> <span class="na">images</span><span class="pi">:</span> <span class="s">${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}</span> <span class="c1"># Build and push Docker image with Buildx (don't push on PR)</span> <span class="c1"># https://github.com/docker/build-push-action</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build and push Docker image</span> <span class="na">id</span><span class="pi">:</span> <span class="s">build-and-push</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">docker/build-push-action@0565240e2d4ab88bba5387d719585280857ece09</span> <span class="c1"># v5.0.0</span> <span class="na">with</span><span class="pi">:</span> <span class="na">context</span><span class="pi">:</span> <span class="s">.</span> <span class="na">push</span><span class="pi">:</span> <span class="s">${{ github.event_name != 'pull_request' }}</span> <span class="na">tags</span><span class="pi">:</span> <span class="s">${{ steps.meta.outputs.tags }}</span> <span class="na">labels</span><span class="pi">:</span> <span class="s">${{ steps.meta.outputs.labels }}</span> <span class="na">cache-from</span><span class="pi">:</span> <span class="s">type=gha</span> <span class="na">cache-to</span><span class="pi">:</span> <span class="s">type=gha,mode=max</span> <span class="c1"># Sign the resulting Docker image digest except on PRs.</span> <span class="c1"># This will only write to the public Rekor transparency log when the Docker</span> <span class="c1"># repository is public to avoid leaking data. 
If you would like to publish</span> <span class="c1"># transparency data even for private images, pass --force to cosign below.</span> <span class="c1"># https://github.com/sigstore/cosign</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Sign the published Docker image</span> <span class="na">if</span><span class="pi">:</span> <span class="s">${{ github.event_name != 'pull_request' }}</span> <span class="na">env</span><span class="pi">:</span> <span class="c1"># https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-an-intermediate-environment-variable</span> <span class="na">TAGS</span><span class="pi">:</span> <span class="s">${{ steps.meta.outputs.tags }}</span> <span class="na">DIGEST</span><span class="pi">:</span> <span class="s">${{ steps.build-and-push.outputs.digest }}</span> <span class="c1"># This step uses the identity token to provision an ephemeral certificate</span> <span class="c1"># against the sigstore community Fulcio instance.</span> <span class="na">run</span><span class="pi">:</span> <span class="s">echo "${TAGS}" | xargs -I {} cosign sign --yes {}@${DIGEST}</span> </code></pre> </div> <p>It's now time to test this workflow!</p> <h2> Testing - First push </h2> <p>Let's first of all add, commit and push all the local changes to the online repository:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git add <span class="nb">.</span> git commit <span class="nt">-m</span> <span class="s2">"first commit"</span> git branch <span class="nt">-M</span> main git push <span class="nt">-u</span> origin main </code></pre> </div> <p>You will be prompted to enter your GitHub username and password. As a password, you should use a GitHub access token, which you can create <a href="proxy.php?url=https://github.com/settings/tokens" rel="noopener noreferrer">here</a>. 
</p> <p>Once the change is pushed, you will see, on your online GitHub repo, a yellow dot: it means that the workflow we triggered with our push is now running and will build and push the image. If the workflow run is successful, you will see a green tick at the end of it, whereas if it fails, you will see a red cross. </p> <p>Make sure that, once the package is created, it is <strong>public</strong>: if not, change its visibility to public, otherwise you won't be able to download it. </p> <h2> Testing - Trying our app </h2> <p>As we said, the Docker image will be published as a package under <code>ghcr.io</code>; we can then pull and run the image like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker pull ghcr.io/username/hello-world-github-docker:main docker run <span class="nt">-t</span> ghcr.io/username/hello-world-github-docker:main </code></pre> </div> <p>If Docker says that it can't pull the image because you are not logged in to the GitHub Container Registry, you can simply run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>docker login ghcr.io -u username -p GITHUB-ACCESS-TOKEN </code></pre> </div> <p>And this should do the magic!</p> <p>When we run our application, we should see a colored "Hello world!" printed on the terminal.</p> <h2> Modifying our application </h2> <p>Now, let's say that we want to let the user choose the color they want "Hello world" to be printed with. 
We can modify our <code>app.py</code> like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="kn">from</span> <span class="n">termcolor</span> <span class="kn">import</span> <span class="n">cprint</span> <span class="n">colors</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">red</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">green</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">blue</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">magenta</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">yellow</span><span class="sh">"</span><span class="p">]</span> <span class="c1"># Tell the user the instructions </span><span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hello user! 
What color would you like </span><span class="sh">'</span><span class="s">Hello world</span><span class="sh">'</span><span class="s"> to be printed with?</span><span class="se">\n</span><span class="s">Choose among: </span><span class="si">{</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span> <span class="c1"># Take the input from the user </span><span class="n">color</span> <span class="o">=</span> <span class="nf">input</span><span class="p">(</span><span class="sh">"</span><span class="s">--&gt;</span><span class="sh">"</span><span class="p">)</span> <span class="c1"># Check if the input is in the available colors: if not, tell the user that it is not available </span><span class="k">if</span> <span class="n">color</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">colors</span><span class="p">:</span> <span class="nf">cprint</span><span class="p">(</span><span class="sh">"</span><span class="s">Hello world!</span><span class="sh">"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">color</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">ERROR! 
The color you chose is not among the available colors :(</span><span class="sh">"</span><span class="p">)</span> </code></pre> </div> <p>To modify the Docker image, it is now sufficient to add, commit and push the local changes to the online repo:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git add <span class="nb">.</span> git commit <span class="nt">-m</span> <span class="s2">"user defined color"</span> git push origin main </code></pre> </div> <p>If the workflow run is successful again, we should be able to pull and run the new image with updated features:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker pull ghcr.io/username/hello-world-github-docker:main docker run <span class="nt">-it</span> ghcr.io/username/hello-world-github-docker:main </code></pre> </div> <p>We will stop here for this article, but in the next one we will see, hands-on, how to use Docker for on-cloud deployments: stay tuned and have fun!🥰</p> docker devops beginners tutorial 1minDocker #12 - What is CI/CD? Clelia (Astra) Bertelli Wed, 22 Jan 2025 20:52:33 +0000 https://dev.to/astrabert/1mindocker-12-what-is-cicd-2ap6 https://dev.to/astrabert/1mindocker-12-what-is-cicd-2ap6 <p>Hello world and Happy 2025!🐳✨</p> <p>It's time we resume our 1minDocker series, starting our last learning block, consisting of 4 articles, which will introduce us to <strong>CI/CD practices with Docker</strong>.</p> <p>In order to get started with this last learning block, though, we need to understand what CI/CD is: let's dive in!</p> <h2> What does CI/CD mean? </h2> <p>CI/CD is short for <strong>Continuous Integration/Continuous Delivery</strong> or, sometimes, <strong>Continuous Deployment</strong>. Let's break it down:</p> <ul> <li>Continuous <em>Integration</em>: the CI piece of CI/CD means that everything, even a small modification, gets integrated into the main code, constantly. 
This is a key feature when you need to fix small bugs and/or technicalities that would otherwise require you to package a new, standalone patch release, which would then have to be manually integrated user-side through updates. With continuous integration, small and big changes/fixes are immediately available. </li> <li>Continuous <em>Delivery</em>: the CD part is the consequence of continuous integration. Every time we integrate a new modification of our source code, we trigger a certain number of steps that, in the end, lead to the delivery/deployment of our code into production. This happens constantly, and allows companies with wise source control and orchestration to ship new features into production within <strong>minutes</strong> of their request. Obviously, this might be an edge case applicable to huge tech companies with big developer teams working 24/7 on their products, but continuous delivery also ensures high speed for smaller companies, which can deploy new features within hours instead of days or weeks.</li> </ul> <h2> What are the main steps of CI/CD? 
</h2> <p>As this image shows:</p> <p><a href="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.blackduck.com%2Fglossary%2Fwhat-is-cicd%2F_jcr_content%2Froot%2Fsynopsyscontainer%2Fcolumn_1946395452_co%2FcolRight%2Fimage_copy.coreimg.svg%2F1727199377195%2Fcicd.svg" class="article-body-image-wrapper"><img src="proxy.php?url=https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.blackduck.com%2Fglossary%2Fwhat-is-cicd%2F_jcr_content%2Froot%2Fsynopsyscontainer%2Fcolumn_1946395452_co%2FcolRight%2Fimage_copy.coreimg.svg%2F1727199377195%2Fcicd.svg" alt="CI/CD" width="1000" height="500"></a></p> <p><em>Image from <a href="proxy.php?url=https://www.blackduck.com/glossary/what-is-cicd.html" rel="noopener noreferrer"><strong>Black Duck</strong>: What is CI/CD</a></em></p> <p>The CI/CD pipeline contains several really important steps, inserted in an "infinite" loop (that's the main idea behind the <em>continuous</em> thing):</p> <ul> <li>We start with the <strong>code</strong>: developers all around the world write their code in their comfortable IDE on their computers, and then, once they are done, push the changes they made into a version control system. 
The most widespread is <a href="proxy.php?url=https://git-scm.com/" rel="noopener noreferrer">Git</a>, and the most used services on this end are <a href="proxy.php?url=https://github.com" rel="noopener noreferrer">GitHub</a> and <a href="proxy.php?url=https://about.gitlab.com/" rel="noopener noreferrer">GitLab</a> </li> <li>The code, once in the code management system, <strong>needs to be built</strong>: builds are generally automated and check whether there are errors at this stage that would result in breakage and/or other issues further down the road.</li> <li>After the build is complete and the code is deemed clean on this end, we can proceed with <strong>testing its actual capabilities</strong>. There are several possible tests; most of them depend on the use case that your code is tackling. A widespread strategy is <strong>integration testing</strong>, which verifies that the code is compatible with the requirements and the standards of its use case.</li> <li>If all the tests pass, we can finally <strong>release</strong> our code out in the wild</li> <li>Following the release there's often the <strong>deployment</strong>: the code is pushed into production and is finally ready to be used</li> <li> <strong>Operating</strong> and <strong>Monitoring</strong> are then the last two phases before starting to modify the code again: we see how users interact with it, how well our products perform and we collect feedback. With this information, we can start fixing bugs, creating new features and making our products shine :)</li> </ul> <h2> Why CI/CD and Docker? </h2> <p>As you can see, CI/CD requires several environments: a build environment, a test environment and a deployment environment. Docker is a perfect solution, as we can manage everything through different containers, just using simple commands like <code>docker build</code> and <code>docker run</code>. 
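</p> <p>As a concrete sketch of how the loop above can be automated around Docker, here is what a minimal GitHub Actions workflow covering the build, test and release steps could look like (the repository path and the test script name are illustrative, not from this series):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code># .github/workflows/ci.yaml -- hypothetical example
name: ci
on:
  push:
    branches: ["main"]
jobs:
  build-test-release:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      # Build: fail fast if the image cannot be assembled
      - run: docker build -t ghcr.io/username/my-app:main .
      # Test: run the test suite inside the freshly built image
      - run: docker run --rm ghcr.io/username/my-app:main ./run-tests.sh
      # Release: log in and publish the image so it can be deployed
      - run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - run: docker push ghcr.io/username/my-app:main
</code></pre> </div> <p>Every push to <code>main</code> then walks the code through build, test and release automatically, which is exactly the <em>continuous</em> part of CI/CD.</p> <p>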
Docker is also perfect for deployment, as it does not require complex environment set-up from scratch: you can build a portable image with all the dependencies and simply deploy it by building from the <code>Dockerfile</code> and running it. </p> <p>Docker handles secrets, takes care of data transfer from the local context to the container and manages networks.</p> <p>With Docker, you can even manage all of this with <strong>one command</strong>: you simply need to create a <code>compose.yaml</code> file and define your services, their sources (whether pre-existing images or on-the-fly builds) and their specs (secrets, volumes, networks). After that, you simply need <code>docker compose up</code>. It's really simple, isn't it?🐳</p> <h2> Sources </h2> <ul> <li><a href="proxy.php?url=https://youtu.be/M4CXOocovZ4?feature=shared" rel="noopener noreferrer"><strong>Akamai Developer</strong>: CI/CD Explained | How DevOps Use Pipelines for Automation</a></li> <li><a href="proxy.php?url=https://youtu.be/OPwU3UWCxhw?feature=shared" rel="noopener noreferrer"><strong>Be a Better Dev</strong>: The IDEAL and Practical CI/CD Pipeline - Concepts Overview</a></li> </ul> docker devops beginners tutorial 1minDocker #11 - Advanced compose example Clelia (Astra) Bertelli Sun, 29 Dec 2024 00:07:11 +0000 https://dev.to/astrabert/1mindocker-11-advanced-compose-example-a4b https://dev.to/astrabert/1mindocker-11-advanced-compose-example-a4b <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-10-the-compose-file-4ihf">last article</a>, we introduced the <code>compose</code> file reference, and we went through numerous top-level elements and their attributes. In this article, we will present an advanced example using <code>docker compose</code> to deploy a multi-container application on the cloud. 
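</p> <p>As a quick refresher before diving into the advanced example, a minimal <code>compose</code> file has this shape (the service and image names below are purely illustrative):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code># compose.yaml -- minimal illustrative example
services:
  app:
    build: .          # build the image on the fly from the local Dockerfile
    ports:
      - "8000:8000"
  db:
    image: postgres   # pull a pre-existing image from the hub
</code></pre> </div> <p>The file we will work with below follows the same structure, just with more services and more elements per service.</p> <p>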
</p> <p>We will refer, for this tutorial, to a slightly modified version of the <code>compose.yaml</code> file proposed for ElasticSearch-Logstash-Kibana by the <a href="proxy.php?url=https://github.com/docker/awesome-compose" rel="noopener noreferrer">awesome-compose</a> repository by Docker on GitHub. </p> <h2> Background </h2> <p>Why would we build a multi-container environment with ElasticSearch, Logstash and Kibana? Well, the idea is that this stack (also known as the <strong>ELK stack</strong>) can help us monitor logs and data sent by third party services (Logstash), store and index them for fast search/retrieval (ElasticSearch) and visualize statistics in real time (Kibana). </p> <p>This is optimal when we have real-time data flows (such as with IoT devices, social media or web traffic), when we want to track and analyze logs from big servers, and/or when we want to power search in data-heavy applications, such as e-commerce ones.</p> <p>If you want to know more, please refer to:</p> <ul> <li><a href="proxy.php?url=https://www.elastic.co/docs" rel="noopener noreferrer">ElasticSearch docs</a></li> <li><a href="proxy.php?url=https://www.elastic.co/guide/en/logstash/current/introduction.html" rel="noopener noreferrer">Logstash reference</a></li> <li><a href="proxy.php?url=https://www.elastic.co/guide/en/kibana/current/index.html" rel="noopener noreferrer">Kibana guide</a></li> </ul> <p>As you can see, these three services are all provided by <a href="proxy.php?url=https://www.elastic.co/" rel="noopener noreferrer">Elastic</a>.</p> <h2> Set-up </h2> <h3> Getting the needed files </h3> <p>First of all, we clone the <code>awesome-compose</code> repository:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>git clone https://github.com/docker/awesome-compose.git </code></pre> </div> <p>And we head over to our folder of interest:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">cd 
</span>awesome-compose/elasticsearch-logstash-kibana/ </code></pre> </div> <blockquote> <p><em><strong>TIP</strong>💡: you can use all the folders in this repository to experiment with different <code>compose</code> settings</em></p> </blockquote> <p>And we can take a look at the structure of the repository:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="c"># You might need to install tree before using it</span> tree <span class="nb">.</span> <span class="nt">-L</span> 2 </code></pre> </div> <p>We will get this structure:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>. |__ logstash/ | |__nginx.log | |__pipeline/ |__compose.yaml </code></pre> </div> <p>The <code>logstash</code> folder contains a <code>pipeline</code> subfolder and a <code>nginx.log</code> file: their exact content is not important for our purposes, but we have to keep in mind that they are there. </p> <h3> Adding some modifications </h3> <p>To showcase more of the <code>compose</code> file elements, we introduce a <code>.env</code> file, which we can create in this way:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">touch</span> .env </code></pre> </div> <p>And then modify it with our favorite text editor (for me it's VSCode):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>code .env </code></pre> </div> <p>In the <code>.env</code> file, let's create the following keys and values:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nv">JAVA_OPTS</span><span class="o">=</span><span class="s2">"-Xms512m -Xmx512m"</span> <span class="nv">DISCOVERY_SEED_HOSTS</span><span class="o">=</span><span class="s2">"logstash"</span> <span class="nv">API_TOKEN</span><span class="o">=</span><span class="s2">"super-secret-token"</span> </code></pre> </div> <p>We will use them in the 
<code>compose</code> file (see below).</p> <h2> The <code>compose</code> file </h2> <p>Let's take a look at the compose file, which is slightly different from the one proposed by <code>awesome-compose</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">elasticsearch</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">elasticsearch:7.16.1</span> <span class="na">container_name</span><span class="pi">:</span> <span class="s">es</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">discovery.type</span><span class="pi">:</span> <span class="s">single-node</span> <span class="na">ES_JAVA_OPTS</span><span class="pi">:</span> <span class="s">$JAVA_OPTS</span> <span class="c1"># as we mentioned in the last article, compose can access the env variables we set in our .env file</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">9200:9200"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">9300:9300"</span> <span class="na">healthcheck</span><span class="pi">:</span> <span class="na">test</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">CMD-SHELL"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">curl</span><span class="nv"> </span><span class="s">--silent</span><span class="nv"> </span><span class="s">--fail</span><span class="nv"> </span><span class="s">localhost:9200/_cluster/health</span><span class="nv"> </span><span class="s">||</span><span class="nv"> </span><span class="s">exit</span><span class="nv"> </span><span class="s">1"</span><span class="pi">]</span> <span class="na">interval</span><span class="pi">:</span> <span class="s">10s</span> <span class="na">timeout</span><span class="pi">:</span> <span 
class="s">10s</span> <span class="na">retries</span><span class="pi">:</span> <span class="m">3</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">elastic</span> <span class="na">logstash</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">logstash:7.16.1</span> <span class="na">container_name</span><span class="pi">:</span> <span class="s">log</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">discovery.seed_hosts</span><span class="pi">:</span> <span class="s">$DISCOVERY_SEED_HOSTS</span> <span class="na">LS_JAVA_OPTS</span><span class="pi">:</span> <span class="s">$JAVA_OPTS</span> <span class="na">secrets</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">api_token</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">./logstash/pipeline/logstash-nginx.config:/usr/share/logstash/pipeline/logstash-nginx.config</span> <span class="pi">-</span> <span class="s">./logstash/nginx.log:/home/nginx.log</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">5000:5000/tcp"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">5000:5000/udp"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">5044:5044"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">9600:9600"</span> <span class="na">depends_on</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">elasticsearch</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">elastic</span> <span class="na">command</span><span class="pi">:</span> <span class="s">logstash -f /usr/share/logstash/pipeline/logstash-nginx.config</span> <span class="na">kibana</span><span class="pi">:</span> <span 
class="na">image</span><span class="pi">:</span> <span class="s">kibana:7.16.1</span> <span class="na">container_name</span><span class="pi">:</span> <span class="s">kib</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">kibana-cache:/etc/logs/cache</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">5601:5601"</span> <span class="na">depends_on</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">elasticsearch</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">elastic</span> <span class="na">networks</span><span class="pi">:</span> <span class="na">elastic</span><span class="pi">:</span> <span class="na">driver</span><span class="pi">:</span> <span class="s">bridge</span> <span class="na">secrets</span><span class="pi">:</span> <span class="na">api_token</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s2">"</span><span class="s">API_TOKEN"</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">kibana-cache</span><span class="pi">:</span> <span class="na">external</span><span class="pi">:</span> <span class="kc">true</span> <span class="na">name</span><span class="pi">:</span> <span class="s">kibana_cache_volume</span> </code></pre> </div> <p>We will now break this <code>compose</code> file down service by service. </p> <h3> ElasticSearch </h3> <p>ElasticSearch is based on the Docker image <strong>elasticsearch:7.16.1</strong>, and the container that is launched within the service is named <strong>es</strong>. </p> <p>It accesses two environment variables: <strong>ES_JAVA_OPTS</strong>, that is read from the <code>.env</code> file, and <strong>discovery.type</strong>, which is set in-line. 
</p> <p>It is connected to two ports: <strong>9200 and 9300</strong>, accessible to the local device host on the same port addresses.</p> <p>There is a <strong>healthcheck</strong>, which is performed a maximum of three times with 10s intervals and a timeout (maximum time of execution of the test command) of 10s. </p> <p>The service is bound to the <strong>elastic</strong> network, whose driver is the most common: <strong>bridge</strong> (simply connects all the containers together, <em>bridging</em> among them). </p> <p>This container does not depend on anything, so it is <strong>the first to be started</strong>. </p> <h3> Logstash </h3> <p>Logstash is based on the Docker image <strong>logstash:7.16.1</strong>, and the container that is launched within the service is named <strong>log</strong>. </p> <p>It accesses two environment variables: <strong>LS_JAVA_OPTS</strong>, which is read from the <code>.env</code> file, and <strong>discovery.seed_hosts</strong>, which is also read from the <code>.env</code> file.</p> <p>It also has access to a secret, <strong>api_token</strong>, which is likewise read from the environment file, as the <strong>API_TOKEN</strong> variable. </p> <p>Inside the container, two volumes are mounted from the local file system (the <code>logstash-nginx.config</code> file from the <code>pipeline</code> subfolder and the <code>nginx.log</code> file) into the container's file system. </p> <p>It is connected to three ports: <strong>5000 (both TCP and UDP), 5044 and 9600</strong>, accessible to the local device host on the same port addresses.</p> <p>There is <strong>no healthcheck</strong>, but the service depends on ElasticSearch so it is the second one to be started: as soon as the container is started, the command <strong>logstash -f /usr/share/logstash/pipeline/logstash-nginx.config</strong> is executed, overriding any <code>CMD</code> entries in the original Dockerfile from the logstash image. 
</p> <p>The service is bound to the <strong>elastic</strong> network.</p> <h3> Kibana </h3> <p>Kibana is based on the Docker image <strong>kibana:7.16.1</strong>, and the container that is launched within the service is named <strong>kib</strong>. </p> <p>It does not access environment variables or secrets.</p> <p>Inside the container, there is one volume mounted, <strong>kibana-cache</strong>, which maps to a volume whose life cycle is externally managed, and which is named <strong>kibana_cache_volume</strong>.</p> <p>It is connected to one port: <strong>5601</strong>, accessible to the local device host on the same port address.</p> <p>There is <strong>no healthcheck</strong>, but the service depends on ElasticSearch so it is the second one to be started, along with Logstash.</p> <p>The service is bound to the <strong>elastic</strong> network.</p> <h2> Launch and stop the service </h2> <p>Now, we can just launch our multi-container application with:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose up </code></pre> </div> <p>Or, if we do not want the services' logs to be displayed on (and potentially swamp) our terminal:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose up <span class="nt">--detach</span> </code></pre> </div> <p>If we want to see what is currently running, we can use:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose ps </code></pre> </div> <p>And if we want to stop the services and take down the <code>compose</code> app, we can simply run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose stop </code></pre> </div> <p>Or, to stop and remove all the containers:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose down </code></pre> </div> <p>We will stop here for this article, and in the next one we will explore continuous-integration/continuous-deployment solutions for Docker images. 
See you in 2025!🥰</p> devops docker tutorial beginners 1minDocker #10 - The Compose File Clelia (Astra) Bertelli Wed, 18 Dec 2024 11:37:54 +0000 https://dev.to/astrabert/1mindocker-10-the-compose-file-4ihf https://dev.to/astrabert/1mindocker-10-the-compose-file-4ihf <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-9-introduction-to-compose-11di">last article</a> we introduced <code>compose</code>, a popular Docker plugin to build multi-container applications and to manage complex environments in an easy and sharable way.</p> <p>In this post we will focus on the <code>compose</code> file, i.e. the YAML file that contains the instructions that <code>docker compose</code> reads and runs when it is launched.</p> <p>As we saw for the Dockerfile, the compose file also has keywords: these keywords are named <em>elements</em>, and the most important of them are known as <em>top-level</em> elements. We will learn about them in the following paragraphs.</p> <h2> <code>name</code> and <code>version</code> </h2> <p>The <code>version</code> top-level element is obsolete, and it is kept only for backward compatibility with older versions of Compose, where the program actually validated the YAML file against a precise schema known as the <em>Specification</em>. Newer versions of <code>compose</code> no longer validate their input file against a versioned schema: if they encounter a malformed/unknown field, <code>compose</code> simply throws an error.</p> <p>The <code>name</code> top-level element is set to give a name to the project you are launching with <code>compose</code>, and overrides the default one. 
</p> <p>For example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">name</span><span class="pi">:</span> <span class="s">new_app</span> <span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">foo/bar</span> <span class="na">command</span><span class="pi">:</span> <span class="s">echo "I'm running ${COMPOSE_PROJECT_NAME}"</span> </code></pre> </div> <h2> <code>services</code> </h2> <p>The <code>services</code> element defines the various containers your <code>compose</code> project will run, with several potential configurations and specifications.</p> <p>There are numerous elements linked to the <code>services</code> one; we will go through the most used (excluding the ones referenced in the next sections):</p> <h3> image </h3> <p>Specifies the image that the container is running on: if the image has not already been pulled locally, it is pulled from the hub on the fly when the service is started.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">node:18-alpine</span> <span class="s">...</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> <span class="s">...</span> </code></pre> </div> <h3> build </h3> <p>If you want your container to run on a custom image you configured through a Dockerfile, you can use the <code>build</code> element, which will build the image on the fly based on the context provided (you can also specify the Dockerfile name):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span 
class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">.</span> <span class="na">dockerfile</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Dockerfile.node"</span> <span class="s">...</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> <span class="s">...</span> </code></pre> </div> <h3> env_file and environment </h3> <p><code>compose</code> by default can read the environment variables you set in a <code>.env</code> file that is placed in the same directory in which the <code>compose.yml</code> file is situated. Nevertheless, you can specify your environment file through the <code>env_file</code> element:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">env_file</span><span class="pi">:</span> <span class="s2">"</span><span class="s">./envs_config/.raw.env"</span> </code></pre> </div> <p>You can also specify the format of the <code>env_file</code> (more on Docker docs <a href="proxy.php?url=https://docs.docker.com/reference/compose-file/services/#format" rel="noopener noreferrer">here</a>) and whether it is required or not (more on Docker docs <a href="proxy.php?url=https://docs.docker.com/reference/compose-file/services/#required" rel="noopener noreferrer">here</a>).</p> <p>For example, you could write:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">env_file</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">path</span><span class="pi">:</span> <span class="s">./default.env</span> <span class="na">required</span><span class="pi">:</span> <span class="kc">true</span> <span class="c1"># default</span> <span class="na">format</span><span class="pi">:</span> <span class="s">raw</span> <span class="pi">-</span> <span 
class="na">path</span><span class="pi">:</span> <span class="s">./override.env</span> <span class="na">required</span><span class="pi">:</span> <span class="kc">false</span> </code></pre> </div> <p>The <code>environment</code> element works as an env file, but from inside the compose file:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">environment</span><span class="pi">:</span> <span class="na">PG_USR</span><span class="pi">:</span> <span class="s">user</span> <span class="na">PG_PSW</span><span class="pi">:</span> <span class="s">password</span> <span class="na">PG_DATABASE</span><span class="pi">:</span> <span class="s">postgres</span> </code></pre> </div> <p>If the <code>environment</code> element is used in the <code>compose</code> file, it has priority over the same variables used in the env file.</p> <h3> depends_on </h3> <p>The <code>depends_on</code> element is useful when it comes to setting the order in which several services are started.</p> <p>For example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">.</span> <span class="na">depends_on</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">db</span> <span class="pi">-</span> <span class="s">redis</span> <span class="na">redis</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">redis</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> </code></pre> </div> <p>In this case, the <code>app</code> container is started only after the <code>db</code> and <code>redis</code> ones are ready. 
</p> <p>This can be accompanied by <code>conditions</code> on how to actually control the starting of a container:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">.</span> <span class="na">depends_on</span><span class="pi">:</span> <span class="na">db</span><span class="pi">:</span> <span class="na">condition</span><span class="pi">:</span> <span class="s">service_healthy</span> <span class="na">restart</span><span class="pi">:</span> <span class="kc">true</span> <span class="na">redis</span><span class="pi">:</span> <span class="na">condition</span><span class="pi">:</span> <span class="s">service_started</span> <span class="na">redis</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">redis</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> </code></pre> </div> <p>In this case, the <code>app</code> container is started only if the <code>db</code> container passes its health check (see below) and when the <code>redis</code> container is started (no need for a health check).</p> <h3> command and entrypoint </h3> <p>The <code>command</code> element specifies a command that overrides the default command (the <code>CMD</code> instruction) from the service's Docker image.</p> <p>For example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">command</span><span class="pi">:</span> <span class="s">bundle exec thin -p </span><span class="m">3000</span> </code></pre> </div> <p><code>entrypoint</code>, on the other hand, overrides the <code>ENTRYPOINT</code> set for the service's Docker image:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight 
yaml"><code><span class="na">entrypoint</span><span class="pi">:</span> <span class="s">bash /app/post_create_command.sh</span> </code></pre> </div> <h3> ports </h3> <p>Ports associated with the service and exposed from the container:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">semantic_db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">qdrant/qdrant:latest</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">./qdrant_storage:/qdrant/storage"</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">6333:6333"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">6334:6334"</span> </code></pre> </div> <p>In this case, the <code>semantic_db</code> container will expose ports 6333 and 6334, which will be accessible to the user on their <code>localhost</code> under the same port numbers.</p> <h3> restart </h3> <p>It can happen that a container fails to start or terminates its execution abruptly/prematurely. 
<code>restart</code> takes care of this, defining the policy to apply when a container terminates:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">restart</span><span class="pi">:</span> <span class="s2">"</span><span class="s">no"</span> <span class="c1"># no restarting whatsoever </span> <span class="na">restart</span><span class="pi">:</span> <span class="s">always</span> <span class="c1"># always restart upon termination</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">on-failure</span> <span class="c1"># restart only if the container produced an error</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">on-failure:3</span> <span class="c1"># restart max 3 times on failure</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span> <span class="c1"># restart only if the container wasn't stopped or removed externally</span> </code></pre> </div> <h3> healthcheck </h3> <p>The <code>healthcheck</code> element tests that the service it is associated with is working correctly. 
It generally uses this syntax:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">healthcheck</span><span class="pi">:</span> <span class="na">test</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">CMD"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">curl"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">-f"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">http://localhost"</span><span class="pi">]</span> <span class="na">interval</span><span class="pi">:</span> <span class="s">1m10s</span> <span class="na">timeout</span><span class="pi">:</span> <span class="s">30s</span> <span class="na">retries</span><span class="pi">:</span> <span class="m">5</span> <span class="na">start_period</span><span class="pi">:</span> <span class="s">30s</span> <span class="na">start_interval</span><span class="pi">:</span> <span class="s">5s</span> </code></pre> </div> <ul> <li> <strong>test</strong> is the command to be executed during the check. <code>CMD</code> runs the command directly, while <code>CMD-SHELL</code> runs it in the container's default shell (generally <code>/bin/sh</code> on Linux).</li> <li> <strong>interval</strong> is the time between two consecutive health checks.</li> <li> <strong>timeout</strong> is the maximum duration of a health check before it is considered failed</li> <li> <strong>retries</strong> sets the maximum number of consecutive failures before the container is considered unhealthy </li> <li> <strong>start_period</strong> is the "protected time" in which health checks occur during the start of a container that needs bootstrap. 
If these health checks fail, they do not count toward the maximum number of retries, whereas if they pass the container is considered started</li> <li> <strong>start_interval</strong> works as interval but for health checks during the start time </li> </ul> <h3> container_name </h3> <p>Set the name of the container, in order to make it easier to detect it when you run <code>docker ps -a</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">node:alpine-18</span> <span class="na">container_name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">reactjs_app"</span> <span class="s">...</span> </code></pre> </div> <h2> <code>volumes</code> </h2> <p><code>volumes</code> is a top-level element that ensures that data from the local file system are injected into the container. <code>volumes</code> is both a top level element and an attribute for <code>services</code> elements, and to ensure that a service has access to the volume you need to explicitly specify it inside the service specification itself.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11.9-slim-bookworm</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">app-data:/app/data/</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">app-data</span><span class="pi">:</span> </code></pre> </div> <p>If you don't need to mount anything inside the <code>/app/data</code> path, you can simply leave the <code>app-data</code> field under <code>volumes</code> blank. 
Otherwise, you simply have to specify the path to your data in the local file system.</p> <p>A volume, like a network (see below), can have a <code>driver</code> (whose options are specified through <code>driver_opts</code>) and can be managed outside the container (<code>external: true</code> is specified).</p> <p>Let's see a complex example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11.9-slim-bookworm</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">app-data:/app/data/</span> <span class="pi">-</span> <span class="s">app-cache:/etc/logs/cache</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">app-data</span><span class="pi">:</span> <span class="na">driver</span><span class="pi">:</span> <span class="s">local</span> <span class="na">driver_opts</span><span class="pi">:</span> <span class="na">type</span><span class="pi">:</span> <span class="s">none</span> <span class="na">device</span><span class="pi">:</span> <span class="s">/data/db_data</span> <span class="na">o</span><span class="pi">:</span> <span class="s">bind</span> <span class="na">app-cache</span><span class="pi">:</span> <span class="na">external</span><span class="pi">:</span> <span class="kc">true</span> <span class="na">name</span><span class="pi">:</span> <span class="s">appcache_vol</span> </code></pre> </div> <h2> <code>networks</code> </h2> <p><code>networks</code> are a very important element for a <code>compose</code> file, because they allow the different services to communicate with each other, instead of isolating them. 
<code>compose</code> by default sets a single network for your app, but this is not always optimal: we may want some networks to be accessible only to specific services, and that's why we should specify the networks attached to each of our services.</p> <p>For example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">frontend</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">user/webapp</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">foo</span> <span class="pi">-</span> <span class="s">bar</span> <span class="na">networks</span><span class="pi">:</span> <span class="na">foo</span><span class="pi">:</span> <span class="na">bar</span><span class="pi">:</span> </code></pre> </div> <p>Networks can obviously be configured, so let's see two important elements in their configuration:</p> <ul> <li> <strong>driver</strong>: this attribute provides information about the driver used to build the network and provide its core functionalities: <code>bridge</code> is the default one (ensure communication among containers in your app), but you can also use <code>host</code> (exploit directly the host networking, removing network isolation between the container and the Docker host), <code>overlay</code> (allow connectivity across different Docker daemons, networking across nodes for Swarm services), <code>ipvlan</code> (gives user control over IPv4 or IPv6 addressing and may be used for underlay network integration), <code>macvlan</code> (assigns your MAC address to a container, making it a visible device in your network) or <code>none</code> (completely isolate the container's network). 
Drivers can be configured through <strong>driver_opts</strong> </li> <li> <strong>internal</strong>/<strong>external</strong>: specified if the network is managed outside or inside the application. By default, every network is internal and, if set external, <code>compose</code> throws an error if it is not able to connect to it.</li> </ul> <p>Let's see a complete example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">proxy</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">./proxy</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">frontend</span> <span class="pi">-</span> <span class="s">outside</span> <span class="na">app</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">./app</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">frontend</span> <span class="pi">-</span> <span class="s">backend</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> <span class="na">networks</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">backend</span> <span class="na">networks</span><span class="pi">:</span> <span class="na">frontend</span><span class="pi">:</span> <span class="na">driver</span><span class="pi">:</span> <span class="s">bridge</span> <span class="na">driver_opts</span><span class="pi">:</span> <span class="na">com.docker.network.bridge.host_binding_ipv4</span><span class="pi">:</span> <span class="s2">"</span><span class="s">127.0.0.1"</span> <span class="na">backend</span><span class="pi">:</span> <span class="na">driver</span><span class="pi">:</span> <span class="s">bridge</span> <span class="na">outside</span><span 
class="pi">:</span> <span class="na">external</span><span class="pi">:</span> <span class="kc">true</span> </code></pre> </div> <h2> <code>configs</code> </h2> <p><code>configs</code> are specific configurations that can be accessed by services (if explicitly declared under the <code>configs</code> attribute) and that modify a Docker image without having to build it from scratch.</p> <p>Configs are by default owned by the user who is running the services and generally have world-readable permissions (that can be overridden by the services if they are configured to do so).</p> <p>Configs have the following attributes:</p> <ul> <li> <strong>file</strong>: the configuration file for the container (provided as a path referring to the local file system)</li> <li> <strong>environment</strong>: the configuration is set as an environment variable</li> <li> <strong>content</strong>: configuration is passed in-line inside the <code>compose</code> file</li> <li> <strong>external</strong>: the config was already created and its lifecycle is externally managed</li> <li> <strong>name</strong>: the name of the configuration (by default is <code>&lt;project_name&gt;_config_key</code>)</li> </ul> <p>Let's see an example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">foo/bar</span> <span class="na">configs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">app_config</span> <span class="na">http_server</span><span class="pi">:</span> <span class="na">build</span><span class="pi">:</span> <span class="s">./http_server/</span> <span class="na">configs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">http_config</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span 
class="pi">:</span> <span class="s">postgres</span> <span class="na">configs</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">db_config</span> <span class="na">configs</span><span class="pi">:</span> <span class="na">http_config</span><span class="pi">:</span> <span class="na">file</span><span class="pi">:</span> <span class="s">./httpd.conf</span> <span class="na">app_config</span><span class="pi">:</span> <span class="na">content</span><span class="pi">:</span> <span class="pi">|</span> <span class="s">debug=${DEBUG}</span> <span class="s">spring.application.admin.enabled=${DEBUG}</span> <span class="s">spring.application.name=${COMPOSE_PROJECT_NAME}</span> <span class="na">db_config</span><span class="pi">:</span> <span class="na">external</span><span class="pi">:</span> <span class="kc">true</span> </code></pre> </div> <h2> <code>secrets</code> </h2> <p>A secret can be specified as a file or an environment variable and, to be accessed by a service, it has to be listed under the service's <code>secrets</code> attribute. 
Here is an example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">frontend</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">example/webapp</span> <span class="na">secrets</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">server-certificate</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> <span class="na">secrets</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">postgres_psw</span> <span class="pi">-</span> <span class="s">postgres_user</span> <span class="pi">-</span> <span class="s">postgres_db</span> <span class="na">secrets</span><span class="pi">:</span> <span class="na">server-certificate</span><span class="pi">:</span> <span class="na">file</span><span class="pi">:</span> <span class="s">./server.cert</span> <span class="na">postgres_psw</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s2">"</span><span class="s">POSTGRES_PSW"</span> <span class="na">postgres_user</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s2">"</span><span class="s">POSTGRES_USER"</span> <span class="na">postgres_db</span><span class="pi">:</span> <span class="na">environment</span><span class="pi">:</span> <span class="s2">"</span><span class="s">POSTGRES_DB"</span> </code></pre> </div> <p>We will stop here for this article, but in the next one we will explore an advanced <code>compose</code> example, in which we will see the nuances of this powerful plugin🥰</p> <blockquote> <p><em>The content for this article is mainly based on <a href="proxy.php?url=https://docs.docker.com/reference/compose-file/" rel="noopener noreferrer"><code>docker compose</code> file 
reference documentation</a>: make sure to visit it to learn more!</em></p> </blockquote> docker devops tutorial beginners 1minDocker #9 - Introduction to Compose Clelia (Astra) Bertelli Thu, 12 Dec 2024 19:23:52 +0000 https://dev.to/astrabert/1mindocker-9-introduction-to-compose-11di <p>In <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-8-advanced-concepts-for-buildx-2olc">the last article</a> we finished our deep dive on <code>docker buildx</code>, a popular plugin aimed at easing and automating the building process.</p> <p>From this article on, we will talk about another plugin, <code>docker compose</code>, which has numerous fields of application and a high potential for deployment and development automation.</p> <h2> What is <code>compose</code>? </h2> <p><code>compose</code> is a technology that allows users to run one or more containers in an easy and reproducible way. </p> <p>Through <code>compose</code> you can control the entire tech stack and environment needed for your application, using simple but elegant YAML code in an input file.</p> <p><code>compose</code> also provides a very simple and intuitive CLI that, with a few commands, lets you run, inspect, interact with and stop the containers you defined as services inside your YAML input file.</p> <h2> Why <code>compose</code>? </h2> <p>You might choose <code>compose</code> for a variety of reasons:</p> <ul> <li>It's <strong>simple</strong>: it only needs a few keywords to work correctly, it leverages intuitive CLI commands and does not require the complex configuration that is needed when running containers directly with <code>docker run</code> </li> <li>It's <strong>compact</strong>: everything you need (images, volumes, networks...) 
is in one file</li> <li>It's <strong>the easiest way to set up a working environment</strong>: imagine managing multiple databases, switching among various backend and frontend stacks and juggling several different API services: this would be very difficult to implement natively but, with <code>compose</code>, you can easily combine several different Docker images and just run them all together as a perfectly harmonious orchestra </li> <li>It's easily <strong>sharable</strong>: you don't have to transfer entire codebases or deal with conflicts and local machine versioning problems when giving your compose YAML file to people from your team or from other teams. This enhances <strong>reproducibility</strong> and fosters <strong>collaboration</strong> </li> </ul> <h2> Getting started </h2> <p>Getting started with <code>compose</code> is simple. We just need to have it installed (see <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-2-get-docker-kh">the second article of this series</a>) and we can then proceed with creating our first <code>compose</code> YAML file, which we will call <code>compose.yaml</code> (the name the Docker docs suggest over <code>compose.yml</code> and <code>docker-compose.yaml</code>, which can still be used). </p> <p>Let's say we want to build a React.js application interfaced with a Postgres database, whose status we also want to monitor through Adminer. 
We can exploit the <code>node:18-alpine</code> image to build an environment where we can install and run our local application mounted as a volume, the <code>postgres</code> image to get a PostgreSQL DB instance up and running on port 5432 and the <code>adminer</code> image to start Adminer on port 8080.</p> <p>Let's see how the compose file will look like:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight yaml"><code><span class="na">services</span><span class="pi">:</span> <span class="na">db</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">postgres</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">always</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">5432:5432"</span> <span class="na">environment</span><span class="pi">:</span> <span class="na">POSTGRES_DB</span><span class="pi">:</span> <span class="s">$PG_DB</span> <span class="na">POSTGRES_USER</span><span class="pi">:</span> <span class="s">$PG_USER</span> <span class="na">POSTGRES_PASSWORD</span><span class="pi">:</span> <span class="s">$PG_PASSWORD</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">pgdata:/var/lib/postgresql/data</span> <span class="na">app</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">node:18-alpine</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">always</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">3000:3000"</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">appsrc:/app/src"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">apppublic:/app/public"</span> <span 
class="pi">-</span> <span class="s2">"</span><span class="s">./package.json:/app/"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">./.env:/app/"</span> <span class="na">entrypoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">cd</span><span class="nv"> </span><span class="s">/app</span><span class="nv"> </span><span class="s">&amp;&amp;</span><span class="nv"> </span><span class="s">npm</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">&amp;&amp;</span><span class="nv"> </span><span class="s">npm</span><span class="nv"> </span><span class="s">start"</span> <span class="na">adminer</span><span class="pi">:</span> <span class="na">image</span><span class="pi">:</span> <span class="s">adminer</span> <span class="na">restart</span><span class="pi">:</span> <span class="s">always</span> <span class="na">ports</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">8080:8080"</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">pgdata</span><span class="pi">:</span> <span class="na">appsrc</span><span class="pi">:</span> <span class="s2">"</span><span class="s">./src/"</span> <span class="na">apppublic</span><span class="pi">:</span> <span class="s2">"</span><span class="s">./public/"</span> </code></pre> </div> <blockquote> <p><em>Notice that we use PG_DB, PG_USER and PG_PASSWORD as environment variables: this means that you should have set them in a <code>.env</code> file</em></p> </blockquote> <p>In this case, all three of our services are available at once: the <code>app</code> (exposed on port 3000), which is injected from the local file system into the container and built on the fly every time the service is started, and the <code>db</code> (exposed on port 5432), which is accessible with its user, password and database name through <code>adminer</code> (exposed on port 8080).</p> <p>To start 
everything, we just need to go to the directory in which our <code>compose</code> file is stored and run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose up </code></pre> </div> <p>And, if we want to stop them, we can simply run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker compose down </code></pre> </div> <p>We will stop here for this article, but in the next one we will dive into <code>compose</code> files and how to get the best out of them!🥰</p> devops docker beginners tutorial 1MinDocker #8 - Advanced concepts for buildx Clelia (Astra) Bertelli Sun, 01 Dec 2024 19:56:00 +0000 https://dev.to/astrabert/1mindocker-8-advanced-concepts-for-buildx-2olc <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-7-superpower-your-builds-with-buildx-123m">last article</a>, we started using <code>buildx</code> to add more building capacity to our Docker core.</p> <p>In this article, we will dive deep into <code>buildx</code>'s subcommands.</p> <h2> <code>docker buildx bake</code> </h2> <p><code>bake</code> is a high-level command for <code>buildx</code>. <br> It is able to automate the build of multiple images at once, taking as reference a JSON, compose or HCL (HashiCorp Configuration Language) file.</p> <p>At a small scale, <code>bake</code> is no different from <code>build</code>: if we only have one image to build, there is no performance gap, and:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>docker build . 
-t user/name:tag </code></pre> </div> <p>Is the same as building the following HCL file:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">target</span> <span class="s2">"image"</span> <span class="p">{</span> <span class="nx">dockerfile</span> <span class="p">=</span> <span class="s2">"Dockerfile"</span> <span class="nx">tags</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"user/image:tag"</span><span class="p">]</span> <span class="p">}</span> </code></pre> </div> <p>And then running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>docker buildx bake image </code></pre> </div> <p>Things change when we have multiple images to build together.</p> <h4> The bake file </h4> <p>Let's nevertheless take a step back and ask ourselves: how do we write a bake file? We will explore the HCL format, because it is the easiest and most intuitive to use.<br> The file structure resembles that of a JSON file, and has the following three main keywords:</p> <ul> <li> <code>target</code>: objects specified under this key are images that should be built. Target objects generally contain information on the context in which we are building the Docker image and on the tags to assign.</li> <li> <code>group</code>: a list of targets is put under this keyword, so that every time we want to build all the images together we can just bake the group name instead of calling the targets one by one</li> <li> <code>variable</code>: works as an ARG or an ENV in a Dockerfile. 
It sets a variable that can be used downstream in the HCL file</li> </ul> <p>Let's look at an example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight hcl"><code><span class="nx">group</span> <span class="s2">"all"</span> <span class="p">{</span> <span class="nx">targets</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"backend"</span><span class="p">,</span> <span class="s2">"db"</span><span class="p">]</span> <span class="p">}</span> <span class="nx">variable</span> <span class="s2">"PYTHON_TAG"</span> <span class="p">{</span> <span class="nx">default</span> <span class="p">=</span> <span class="s2">"3.11.9-slim-bookworm"</span> <span class="p">}</span> <span class="nx">target</span> <span class="s2">"backend"</span> <span class="p">{</span> <span class="nx">dockerfile</span> <span class="p">=</span> <span class="s2">"Dockerfile.backend"</span> <span class="nx">tags</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"user/python-backend:prod"</span><span class="p">,</span> <span class="s2">"user/python-backend:latest"</span><span class="p">]</span> <span class="nx">args</span> <span class="p">=</span> <span class="p">{</span> <span class="nx">PYTHON_VERSION</span> <span class="p">=</span> <span class="nx">PYTHON_TAG</span> <span class="p">}</span> <span class="p">}</span> <span class="nx">target</span> <span class="s2">"db"</span> <span class="p">{</span> <span class="nx">dockerfile</span> <span class="p">=</span> <span class="s2">"Dockerfile.postgres"</span> <span class="nx">tags</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"user/postgres-db:prod"</span><span class="p">,</span> <span class="s2">"user/postgres-db:latest"</span><span class="p">]</span> <span class="nx">no-cache</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">platforms</span> <span class="p">=</span> <span
class="p">[</span><span class="s2">"linux/amd64"</span><span class="p">,</span> <span class="s2">"linux/arm64"</span><span class="p">]</span> <span class="p">}</span> </code></pre> </div> <p>Now, we could run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="se">\</span> <span class="nt">-f</span> Dockerfile.backend <span class="se">\</span> <span class="nt">-t</span> user/python-backend:prod <span class="se">\</span> <span class="nt">-t</span> user/python-backend:latest docker build <span class="nb">.</span> <span class="se">\</span> <span class="nt">-f</span> Dockerfile.postgres <span class="se">\</span> <span class="nt">-t</span> user/postgres-db:prod <span class="se">\</span> <span class="nt">-t</span> user/postgres-db:latest </code></pre> </div> <p>Or we could also run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx bake backend db </code></pre> </div> <p>But the easiest way to do this is to leverage the group of targets that we specified:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx bake all </code></pre> </div> <p>Bake will take care of the two builds at the same time.</p> <h2> <code>docker buildx create</code> </h2> <p>The <code>create</code> subcommand will create a new build environment instance. 
You can append some context to it as a node:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx create node-0 </code></pre> </div> <p>This will produce an environment with a name which will be returned on your terminal (let's say <code>happy_euclid</code>).</p> <p>You can use this name to append a new node to the environment.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx create <span class="nt">--name</span> happy_euclid <span class="nt">--append</span> node-1 </code></pre> </div> <p>You can use the <code>--node</code> flag with the node name to modify or create a node.</p> <p><code>create</code> should be provided with a daemon configuration file through the <code>--buildkitd-config</code> flag (if not, it defaults to the <code>buildkitd.default.toml</code> file contained in the config directory of <code>buildx</code>). You can find an example of a complete configuration file in <a href="proxy.php?url=https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md" rel="noopener noreferrer"><code>buildkit</code> official documentation</a> on GitHub.</p> <p>If you nevertheless want to specify some BuildKit configuration flags for your builder instances, overriding the ones in the config file, you can do it by adding the <code>--buildkitd-flags</code> option:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx create node-0 <span class="nt">--buildkitd-config</span> ./buildkitdconfig.local.toml <span class="nt">--buildkitd-flags</span> <span class="s1">'--debug --debugaddr 0.0.0.0:6666'</span> </code></pre> </div> <p>You should also specify the driver (see <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-7-superpower-your-builds-with-buildx-123m">last article</a>) for your builder instances with the <code>--driver</code> option: the default one is <code>docker</code> (your local Docker), but you can also choose 
<code>docker-container</code> (runs locally but based on a Docker image), <code>kubernetes</code> (a Kubernetes pod) and <code>remote</code> (a remote environment to which you're connected).</p> <p>If you want to specify the platform(s) for which a builder is intended, you can do that passing the <code>--platform</code> option (like <code>--platform linux/amd64</code> or <code>--platform darwin/amd64,linux/arm64</code>).</p> <p>Deleting a node is also very simple: you just add the <code>--leave</code> flag followed by the name of the node you want to eliminate (specifying the name of the builder and the name of the node):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx create <span class="nt">--name</span> kitty_builds <span class="nt">--node</span> kitty0 <span class="nt">--leave</span> </code></pre> </div> <h2> <code>docker buildx build</code> </h2> <p>The <code>build</code> subcommand, as one might expect, has lots of options. Let's focus on the most important ones:</p> <h4> <code>--build-arg</code> </h4> <p>This option passes arguments for the build as in the following example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx build <span class="nt">--build-arg</span> <span class="nv">HTTP_PROXY</span><span class="o">=</span>http://10.20.30.2:1234 <span class="nt">--build-arg</span> <span class="nv">FTP_PROXY</span><span class="o">=</span>http://40.50.60.5:4567 <span class="nb">.</span> </code></pre> </div> <p>Arguments here are passed only at build-time (so not exposed while running the image) and can only modify non-persistent arguments in a Dockerfile set with the <code>ARG</code> keyword.</p> <h4> <code>--build-context</code> </h4> <p>This option sets additional building context for our build operation. 
For example you can specify an additional Docker image or stage that can be accessed through the Dockerfile using the <code>FROM</code> keyword or the <code>--from</code> flag:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx build <span class="nt">--build-context</span> <span class="nv">myimage</span><span class="o">=</span>docker-image://myimage@sha256:0123456789 <span class="nb">.</span> </code></pre> </div> <p>The argument can also be a local directory or a remote Git repository:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx build <span class="nt">--build-context</span> <span class="nv">project</span><span class="o">=</span>path/to/project/source <span class="nb">.</span> docker buildx build <span class="nt">--build-context</span> <span class="nv">gitproject</span><span class="o">=</span>https://github.com/myuser/project.git <span class="nb">.</span> </code></pre> </div> <p>You can access all these values in the Dockerfile:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> myimage</span> <span class="k">COPY</span><span class="s"> --from=project node_modules/* /app/node_modules/</span> <span class="k">COPY</span><span class="s"> --from=gitproject src/* /app/src/</span> </code></pre> </div> <h4> <code>--cache-from</code> </h4> <p>With <code>--cache-from</code>, you can import a previously created cache for your build from a local folder, a GitHub Actions cache, a Docker registry cache or an S3 bucket.</p> <p>Here are some examples of how to use this option:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="c">## IMAGE REGISTRY - 1</span> docker buildx build <span class="nt">--cache-from</span><span class="o">=</span>user/image:cache <span class="nb">.</span> <span class="c">## IMAGE REGISTRY - 2</span> docker buildx build <span
class="nt">--cache-from</span><span class="o">=</span>user/image <span class="nb">.</span> <span class="c">## IMAGE REGISTRY - 3</span> docker buildx build <span class="nt">--cache-from</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span>registry,ref<span class="o">=</span>ghcr.io/user/image <span class="nb">.</span> <span class="c">## LOCAL</span> docker buildx build <span class="nt">--cache-from</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="nb">local</span>,src<span class="o">=</span>path/to/cache <span class="nb">.</span> <span class="c">## GITHUB ACTIONS CACHE</span> docker buildx build <span class="nt">--cache-from</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span>gha <span class="nb">.</span> <span class="c">## S3 BUCKET</span> docker buildx build <span class="nt">--cache-from</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span>s3,region<span class="o">=</span>eu-west-1,bucket<span class="o">=</span>mybucket <span class="nb">.</span> </code></pre> </div> <h4> <code>-f,--file</code> </h4> <p>Specify the Dockerfile for your build.</p> <h4> <code>--load</code> </h4> <p>Load the image resulting from the build into the local Docker image store. This flag is the same as setting <code>--output=type=docker</code>.</p> <h4> <code>--push</code> </h4> <p>Push the image resulting from the build directly to a registry.
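</p> <p>As a quick illustration (the registry and image name below are placeholders, not taken from the article), a one-shot build that goes straight to a registry could look like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code># hypothetical example: build the image and push it directly to a registry
docker buildx build -t ghcr.io/youruser/yourimage:1.0 --push .
</code></pre> </div> <p>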
This flag is the same as setting: <code>--output=type=registry</code></p> <h4> <code>--platform</code> </h4> <p>Specify the target platform for which you are building the image.</p> <p>The platform specification should follow the <code>os/arch</code> or <code>os/arch/variant</code> syntax and can also be a list of comma-separated platforms, but only if you are not using <code>docker</code> as a driver.</p> <p>You can also configure the platform as <code>local</code>, which makes <code>buildx</code> pick the local platform on which BuildKit is running.</p> <h4> <code>--secret</code> </h4> <p>You can expose a secret during a build: you pass it on the command line from a file (<code>type=file</code>) or from an environment variable (<code>type=env</code>), and mount it inside your Dockerfile. </p> <p>If you're using a file-based secret, you should specify the file origin:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx build <span class="nt">--secret</span> <span class="nb">type</span><span class="o">=</span>file,id<span class="o">=</span>hf_token,src<span class="o">=</span><span class="nv">$HOME</span>/.gitcredentials/HF_TOKEN <span class="nb">.</span> </code></pre> </div> <p>And you can use it inside your Dockerfile like this (the secret is mounted as a file at the <code>target</code> path, so we read it from there):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code>FROM python:3.11.9-slim-bookworm
RUN pip install huggingface_hub
RUN --mount=type=secret,id=hf_token,target=/root/.gitcredentials/HF_TOKEN \
    huggingface-cli login --token $(cat /root/.gitcredentials/HF_TOKEN)
</code></pre> </div> <p>Using <code>type=env</code> instead loads the secret from an
environment variable. </p> <p>You can set it like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">export </span><span class="nv">SECRET_TOKEN</span><span class="o">=</span>token docker buildx build <span class="nt">--secret</span> <span class="nb">id</span><span class="o">=</span>SECRET_TOKEN <span class="nb">.</span> </code></pre> </div> <p>As long as the ID matches the name of the environment variable, you don't have to specify <code>type=env</code>.</p> <p>You can import it in your Dockerfile with:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="c"># syntax=docker/dockerfile:1</span> <span class="k">FROM</span><span class="s"> node:alpine</span> <span class="k">RUN </span><span class="nt">--mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="nb">bind</span>,target<span class="o">=</span><span class="nb">.</span> <span class="se">\ </span> <span class="nt">--mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span>secret,id<span class="o">=</span>SECRET_TOKEN,env<span class="o">=</span>SECRET_TOKEN <span class="se">\ </span> yarn run <span class="nb">test</span> </code></pre> </div> <p>You can also use the <code>src</code>/<code>source</code> key, but then you need to specify <code>type=env</code>; otherwise <code>buildx</code> will look for a file with the name you passed to <code>src</code>/<code>source</code>:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="nb">export </span><span class="nv">API_KEY</span><span class="o">=</span>sk-your-supersecret-key-api docker buildx build <span class="nt">--secret</span> <span class="nb">type</span><span class="o">=</span><span class="nb">env</span>,id<span class="o">=</span>api,src<span class="o">=</span>API_KEY <span class="nb">.</span> </code></pre> </div> <p>This might
be useful when you don't want your secret's ID to match the name of the environment variable.</p> <h4> <code>-t, --tag</code> </h4> <p>Used to specify the name and tag of an image for the build.</p> <h2> Minor commands </h2> <ul> <li> <code>docker buildx imagetools</code>: it helps manage registry-based images, creating new ones from lists of manifests and/or inspecting existing manifests, for instance to check multi-platform attestations.</li> <li> <code>docker buildx use</code>: Changes the current builder instance to the specified one.</li> <li> <code>docker buildx rm</code>: Removes the specified builder instance(s).</li> <li> <code>docker buildx prune</code>: Removes data from a builder cache, giving you precise control over what gets deleted.</li> <li> <code>docker buildx stop</code>: Stops the specified builder instance (it can be restarted later); the exact behavior is driver-dependent.</li> </ul> <p>We will stop here for this article, but in the next one we will dive into <code>compose</code>, another popular Docker plugin🥰</p> <blockquote> <p><em>The content for this article is mainly based on the <a href="proxy.php?url=https://docs.docker.com/reference/cli/docker/buildx/" rel="noopener noreferrer"><code>docker buildx</code> command documentation</a>: make sure to visit it to get to know more!</em></p> </blockquote> docker devops tutorial beginners 1MinDocker #7 - Superpower your builds with buildx Clelia (Astra) Bertelli Fri, 22 Nov 2024 20:15:36 +0000 https://dev.to/astrabert/1mindocker-7-superpower-your-builds-with-buildx-123m https://dev.to/astrabert/1mindocker-7-superpower-your-builds-with-buildx-123m <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-6-building-further-39al">last article</a> we talked about the possibility of expanding our build capacity with multi-staged builds and if-else statements: in this article, we'll see how to superpower our builds with <code>buildx</code>, a popular Docker plugin that is
intended to replace the legacy <code>docker build</code> command. </p> <h3> Getting <code>buildx</code> </h3> <p>If you correctly installed Docker Desktop for Windows or macOS (see our <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-2-get-docker-kh">second article</a>), <code>buildx</code> should already be included. </p> <p>If you are on Linux and running <code>docker buildx --version</code> returns an error because the plugin wasn't installed, you should follow the instructions you can find in <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-2-get-docker-kh">1minDocker #2</a> and/or on <a href="proxy.php?url=https://docs.docker.com/engine/install/" rel="noopener noreferrer">Docker official documentation</a>.</p> <p>Once you have <code>buildx</code>, you can set it as the default builder by running:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx <span class="nb">install</span> </code></pre> </div> <p>This will dismiss the legacy builder (<code>docker build</code>) and default to the plugin's one (<code>docker buildx build</code>).</p> <h3> What <code>buildx</code> can do that <code>build</code> can't </h3> <p><code>buildx</code> has multiple features that the legacy builder does not provide:</p> <h4> 1.
Drivers </h4> <p>You can choose the environment where the build runs: this environment is called <em>driver</em> and by default is set to the same as the normal builder (the <code>docker</code> driver), but it can also exploit <a href="proxy.php?url=https://docs.docker.com/build/builders/drivers/docker-container/" rel="noopener noreferrer"><code>docker-container</code></a> (a containerized environment for the build), <a href="proxy.php?url=https://docs.docker.com/build/builders/drivers/kubernetes/" rel="noopener noreferrer"><code>kubernetes</code></a> (that connects local environments to Kubernetes clusters) or <a href="proxy.php?url=https://docs.docker.com/build/builders/drivers/remote/" rel="noopener noreferrer"><code>remote</code></a> (allows access to an externally managed building environment).</p> <h4> 2. Isolated builder instances </h4> <p>You can create multiple isolated builder instances, assigning them to different nodes through <code>buildx create</code> (and there are a handful of commands to manage those instances). There is also the possibility to give your builder instances a default template with the <code>buildx context</code> command.</p> <h4> 3. Multi-platform builds </h4> <p>You can specify the platform for which you're building through the <code>--platform</code> flag (for example <code>linux/amd64</code>, <code>linux/arm64</code> or <code>darwin/amd64</code>). When you're backed by <code>docker-container</code> or <code>kubernetes</code>, you can actually do a multi-platform build at once, using different strategies:</p> <ul> <li>Specifying stages in the Dockerfile that can cross-compile for different platforms</li> <li>Using different builder instances that compile for different architectures</li> <li>Using kernel emulation through QEMU (the easiest solution)</li> </ul> <p>As for kernel emulation, if it is enabled on your node, the builder automatically recognizes the secondary available architectures and builds for them as well.
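</p> <p>Before choosing a strategy, it can help to check which platforms a given builder can already target by inspecting it (the builder name below is just an example):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code># hypothetical example: start the builder and show its capabilities
docker buildx inspect mybuild --bootstrap
# the "Platforms:" line of the output lists every architecture this builder can target
</code></pre> </div> <p>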
QEMU can be installed with Docker as simply as running this command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run <span class="nt">--privileged</span> <span class="nt">--rm</span> tonistiigi/binfmt <span class="nt">--install</span> all </code></pre> </div> <p>And the builder instances will be able to use it. </p> <p>You can also encounter more complicated cases where QEMU is not sufficient. In those cases you can either build on multiple nodes, like in this example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker buildx create <span class="nt">--use</span> <span class="nt">--name</span> mybuild node-amd64 docker buildx create <span class="nt">--append</span> <span class="nt">--name</span> mybuild node-arm64 docker buildx build <span class="nt">--platform</span> linux/amd64,linux/arm64 <span class="nb">.</span> </code></pre> </div> <p>Or you can use cross-compilation stages in your Dockerfile:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="c"># syntax=docker/dockerfile:1</span> <span class="k">FROM</span><span class="w"> </span><span class="s">--platform=$BUILDPLATFORM golang:alpine</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s">build</span> <span class="k">ARG</span><span class="s"> TARGETPLATFORM</span> <span class="k">ARG</span><span class="s"> BUILDPLATFORM</span> <span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"I am running on </span><span class="nv">$BUILDPLATFORM</span><span class="s2">, building for </span><span class="nv">$TARGETPLATFORM</span><span class="s2">"</span> <span class="o">&gt;</span> /log <span class="k">FROM</span><span class="s"> alpine</span> <span class="k">COPY</span><span class="s"> --from=build /log /log</span> </code></pre> </div> <p>We will stop here for this article, but in the next one we will go through common
<code>buildx</code> commands and how they work🥰.</p> <blockquote> <p><em>The content for this article is mainly based on <a href="proxy.php?url=https://github.com/docker/buildx" rel="noopener noreferrer"><code>docker/buildx</code></a> GitHub repo: make sure to visit them and give them a star!⭐</em></p> </blockquote> docker devops beginners tutorial 1MinDocker #6 - Building further Clelia (Astra) Bertelli Tue, 12 Nov 2024 02:35:10 +0000 https://dev.to/astrabert/1mindocker-6-building-further-39al https://dev.to/astrabert/1mindocker-6-building-further-39al <p>In <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-5-build-and-push-a-docker-image-1kpm">the last article</a> we saw how to build an image from scratch and we introduced several keywords to work with Dockerfiles. </p> <p>We will now try to understand how to take our building capacity to the next level, adding more complexity and more layers to our images.</p> <h3> Case study </h3> <p>Imagine that we want to build an image to run our data analysis pipelines written in python and R.</p> <p>To manage python and R dependencies separately we can wrap them inside <a href="proxy.php?url=https://docs.conda.io/projects/conda/en/latest/index.html" rel="noopener noreferrer">conda</a> environments.</p> <p>Conda is a great tool for environment management, but is often outpaced by <a href="proxy.php?url=https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html" rel="noopener noreferrer">mamba</a> in some operations such as environment creation and installation.</p> <p>We will then use conda to organize and run the environments, while mamba will create them and install what's needed.</p> <p>Let's say we need the following packages for python data analysis:</p> <ul> <li><a href="proxy.php?url=https://pandas.pydata.org/" rel="noopener noreferrer">pandas</a></li> <li><a href="proxy.php?url=https://pola.rs/" rel="noopener noreferrer">polars</a></li> <li><a href="proxy.php?url=https://numpy.org/" 
rel="noopener noreferrer">numpy</a></li> <li><a href="proxy.php?url=https://scikit-learn.org/stable/" rel="noopener noreferrer">scikit-learn</a></li> <li><a href="proxy.php?url=https://scipy.org" rel="noopener noreferrer">scipy</a></li> <li><a href="proxy.php?url=https://matplotlib.org/" rel="noopener noreferrer">matplotlib</a></li> <li><a href="proxy.php?url=https://seaborn.pydata.org/" rel="noopener noreferrer">seaborn</a></li> <li><a href="proxy.php?url=https://plotly.com/python/" rel="noopener noreferrer">plotly</a></li> </ul> <p>And we need the following for our R data analysis:</p> <ul> <li><a href="proxy.php?url=https://dplyr.tidyverse.org/" rel="noopener noreferrer">dplyr</a></li> <li><a href="proxy.php?url=https://ggplot2.tidyverse.org/" rel="noopener noreferrer">ggplot2</a></li> <li><a href="proxy.php?url=https://tidyr.tidyverse.org/" rel="noopener noreferrer">tidyr</a></li> <li><a href="proxy.php?url=https://topepo.github.io/caret/" rel="noopener noreferrer">caret</a></li> <li><a href="proxy.php?url=https://purrr.tidyverse.org/" rel="noopener noreferrer">purrr</a></li> <li><a href="proxy.php?url=https://lubridate.tidyverse.org/" rel="noopener noreferrer">lubridate</a></li> </ul> <p>We store the environment creation and the installation of everything in this file called <code>conda_deps_1.sh</code> (find all the code for this article <a href="proxy.php?url=https://github.com/AstraBert/1minDocker/tree/master/code_snippets/build_an_image_2" rel="noopener noreferrer">here</a>); note that <code>micromamba create</code> needs the <code>-n</code> flag to name the new environment:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>eval "$(conda shell.bash hook)"

micromamba create \
    -n python_deps \
    -y \
    -c conda-forge \
    -c bioconda \
    python=3.10

conda activate python_deps

micromamba install \
    -y \
    -c bioconda \
    -c conda-forge \
    -c anaconda \
    -c plotly \
    pandas polars numpy scikit-learn scipy matplotlib seaborn plotly

conda deactivate

micromamba create \
    -n R \
    -y \
    -c conda-forge \
    r-base

conda activate R

micromamba install \
    -y \
    -c conda-forge \
    -c r \
    r-dplyr r-lubridate r-tidyr r-purrr r-ggplot2 r-caret

conda deactivate
</code></pre> </div> <p>From these premises, we will build our data science Docker image. </p> <h3> Building on top of the building </h3> <p>We are very lucky with mamba and conda, because they both provide a Docker image for their smaller, lightweight versions, <a href="proxy.php?url=https://hub.docker.com/r/mambaorg/micromamba" rel="noopener noreferrer">micromamba</a> and <a href="proxy.php?url=https://hub.docker.com/r/conda/miniconda3/" rel="noopener noreferrer">miniconda</a>. </p> <p>We then want to combine micromamba with miniconda, but how? We can exploit a feature in Docker builds, which is basically the same as "building on top of a building": we start with an image as base, we copy the most important things from there to our actual image and then we continue building on top of it.
</p> <p>The syntax may be as follows:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">FROM</span><span class="w"> </span><span class="s">author/image1:tag</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">base</span> <span class="k">FROM</span><span class="s"> author/image2:tag</span> <span class="k">COPY</span><span class="s"> --from=base /usr/local/bin/* /usr/local/bin/</span> </code></pre> </div> <p>This means that, from <code>image1</code> (aliased <code>base</code>), we take only the files stored under <code>/usr/local/bin</code> and place them in <code>image2</code>. </p> <p>In our case, it would be:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> CONDA_VER=latest</span> <span class="k">ARG</span><span class="s"> MAMBA_VER=latest</span> <span class="k">FROM</span><span class="w"> </span><span class="s">mambaorg/micromamba:${MAMBA_VER}</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">mambabase</span> <span class="k">FROM</span><span class="s"> conda/miniconda3:${CONDA_VER} </span> <span class="k">COPY</span><span class="s"> --from=mambabase /usr/bin/micromamba /usr/bin/</span> </code></pre> </div> <p>We copied <code>micromamba</code> from its original location into our image.</p> <h3> Install environments </h3> <p>We can now copy <code>conda_deps_1.sh</code> into our build and execute it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">WORKDIR</span><span class="s"> /data_science/</span> <span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> /data_science/installations/ <span class="k">COPY</span><span class="s"> ./conda_deps_1.sh /data_science/installations/</span> <span class="k">RUN </span>bash /data_science/installations/conda_deps_1.sh
</code></pre> </div> <p>But let's say we also want to provide our image with an environment for AI development that we only want to add to our build if the user specifies it at build time.</p> <p>In this case, we can use <code>if...else</code> conditional statements in our Dockerfile!</p> <p>We will create another file, <code>conda_deps_2.sh</code>, with a Python environment for AI development in which we will put some base packages such as:</p> <ul> <li><a href="proxy.php?url=https://huggingface.co/docs/transformers/en/index" rel="noopener noreferrer">transformers</a></li> <li><a href="proxy.php?url=https://pytorch.org/" rel="noopener noreferrer">pytorch</a></li> <li><a href="proxy.php?url=https://www.tensorflow.org/learn" rel="noopener noreferrer">tensorflow</a></li> <li> <a href="proxy.php?url=https://www.langchain.com/" rel="noopener noreferrer">langchain</a>, langchain-community, langchain-core</li> <li> <a href="proxy.php?url=https://gradio.app" rel="noopener noreferrer">gradio</a> </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>eval "$(conda shell.bash hook)"

micromamba create \
    -n python_ai \
    -y \
    -c conda-forge \
    -c bioconda \
    python=3.11

conda activate python_ai

micromamba install \
    -y \
    -c conda-forge \
    -c pytorch \
    transformers pytorch tensorflow langchain langchain-core langchain-community gradio

conda deactivate
</code></pre> </div> <p>Now we just add a condition to our
Dockerfile:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> BUILD_AI="False"</span> <span class="k">RUN if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$BUILD_AI</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"True"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then </span>bash /data_science/installations/conda_deps_2.sh<span class="p">;</span> <span class="se">\ </span> <span class="k">elif</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$BUILD_AI</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"False"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then </span><span class="nb">echo</span> <span class="s2">"No AI environment will be built"</span><span class="p">;</span> <span class="se">\ </span> <span class="k">else </span><span class="nb">echo</span> <span class="s2">"BUILD_AI should be either True or False: you passed an invalid value, thus no AI environment will be built"</span><span class="p">;</span> <span class="k">fi</span> </code></pre> </div> <h3> Building and its options </h3> <p>Now let's take a look at the complete Dockerfile:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> CONDA_VER=latest</span> <span class="k">ARG</span><span class="s"> MAMBA_VER=latest</span> <span class="k">FROM</span><span class="w"> </span><span class="s">mambaorg/micromamba:${MAMBA_VER}</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">mambabase</span> <span class="k">FROM</span><span class="s"> conda/miniconda3:${CONDA_VER} </span> <span class="k">COPY</span><span class="s"> --from=mambabase /usr/bin/micromamba /usr/bin/</span> <span class="k">WORKDIR</span><span class="s"> /data_science/</span> <span class="k">RUN </span><span
class="nb">mkdir</span> <span class="nt">-p</span> /data_science/installations/ <span class="k">COPY</span><span class="s"> ./conda_deps_?.sh /data_science/installations/</span> <span class="k">RUN </span>bash /data_science/installations/conda_deps_1.sh <span class="k">ARG</span><span class="s"> BUILD_AI="False"</span> <span class="k">RUN if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$BUILD_AI</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"True"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then </span>bash /data_science/installations/conda_deps_2.sh<span class="p">;</span> <span class="se">\ </span> <span class="k">elif</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$BUILD_AI</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"False"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then </span><span class="nb">echo</span> <span class="s2">"No AI environment will be built"</span><span class="p">;</span> <span class="se">\ </span> <span class="k">else </span><span class="nb">echo</span> <span class="s2">"BUILD_AI should be either True or False: you passed an invalid value, thus no AI environment will be built"</span><span class="p">;</span> <span class="k">fi</span> <span class="k">CMD</span><span class="s"> ["/bin/bash"]</span> </code></pre> </div> <p>We can build our image tweaking the <code>--build-arg</code> values as we please:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code><span class="c"># BUILD THE IMAGE AS-IS</span> docker build <span class="nb">.</span> <span class="se">\</span> <span class="nt">-t</span> YOUR-USERNAME/data-science:latest-noai <span class="c"># BUILD THE IMAGE WITH AI ENV</span> docker build <span class="nb">.</span> <span class="se">\</span> <span class="nt">--build-arg</span> <span class="nv">BUILD_AI</span><span class="o">=</span><span class="s2">"True"</span>
<span class="se">\</span> <span class="nt">-t</span> YOUR-USERNAME/data-science:latest-ai <span class="c"># BUILD THE IMAGE WITH A DIFFERENT VERSION OF MICROMAMBA</span> docker build <span class="nb">.</span> <span class="se">\</span> <span class="nt">--build-arg</span> <span class="nv">MAMBA_VER</span><span class="o">=</span><span class="s2">"cuda12.1.1-ubuntu22.04"</span> <span class="se">\</span> <span class="nt">-t</span> YOUR-USERNAME/data-science:mamba-versioned </code></pre> </div> <p>Then you can proceed and push the image to Docker Hub or to another registry as we saw in the last article.</p> <p>You can now run your image interactively, also loading your pipelines as a volume, and activate all the environments as you please:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker run <span class="se">\</span> <span class="nt">-i</span> <span class="se">\</span> <span class="nt">-t</span> <span class="se">\</span> <span class="nt">-v</span> /home/user/datascience/pipelines/:/app/pipelines/ <span class="se">\</span> YOUR-USERNAME/data-science:latest-noai <span class="se">\</span> <span class="s2">"/bin/bash"</span> <span class="c"># execute the following commands inside the container</span> <span class="nb">source </span>activate python_deps conda deactivate <span class="nb">source </span>activate R conda deactivate </code></pre> </div> <p>We will stop here for this article, but in the next one we will dive into how to use the <code>buildx</code> plugin!🥰 </p> docker devops beginners tutorial 1minDocker #5 - Build and push a Docker image Clelia (Astra) Bertelli Mon, 04 Nov 2024 21:22:11 +0000 https://dev.to/astrabert/1mindocker-5-build-and-push-a-docker-image-1kpm https://dev.to/astrabert/1mindocker-5-build-and-push-a-docker-image-1kpm <p>In the <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-4-docker-cli-essentials-33pl">last article</a> we saw how we can pull an image, run it inside a container, list images
and containers and remove them: now it's time to build, so we'll create our first simple Docker image.</p> <h3> The Dockerfile </h3> <p>As we already said in our <a href="proxy.php?url=https://dev.to/astrabert/1mindocker-3-fundamental-concepts-55ph">conceptual introduction to Docker</a>, a Dockerfile is a sort of recipe: it contains all the instructions to collect the ingredients (the <em>image</em>) that will make the cake (the <em>container</em>). </p> <p>But what exactly can a Dockerfile contain? We will see, in our example (that you can find <a href="proxy.php?url=https://github.com/AstraBert/1minDocker/tree/master/code_snippets/build_an_image_1" rel="noopener noreferrer">here</a>), the following base keywords:</p> <ul> <li> <code>FROM</code>: this keyword is fundamental. It specifies the base image from which we mount our environment</li> <li> <code>RUN</code>: with this keyword you can specify a command (like <code>RUN python3 -m pip install --no-cache-dir -r requirements.txt</code>) that will be executed during <em>build time</em> (only once) and will be stored in an image layer</li> <li> <code>WORKDIR</code>: you can specify the working directory that will be the base for your Docker image (for example <code>WORKDIR /app/</code>)</li> <li> <code>COPY</code> or <code>ADD</code>: These two keywords are very similar. Both copy local paths into a destination directory inside the image (like <code>COPY src/ /app/</code> or <code>ADD . /app/</code>), but <code>ADD</code> can additionally fetch remote URLs and automatically extract local tar archives</li> <li> <code>EXPOSE</code>: it declares the port on which the container listens (<code>EXPOSE 3000</code>)</li> <li> <code>ENTRYPOINT</code>: this keyword specifies the default executable that should be run when the image is launched in a container (<code>ENTRYPOINT ["npm", "start"]</code>).
It should be specified only once, at the end of your Dockerfile (otherwise the last <code>ENTRYPOINT</code> instance will override the previous ones). Although the <code>ENTRYPOINT</code> executable cannot be overridden by positional arguments provided through the CLI when we run the container (only by the dedicated <code>--entrypoint</code> flag), its arguments can be changed from the CLI upon container start.</li> <li> <code>CMD</code>: similar to <code>ENTRYPOINT</code>, this key word specifies a command that runs every time the image is started inside a container. Differently from <code>ENTRYPOINT</code>, though, it can be completely overridden and is generally used as a set of extra arguments for <code>ENTRYPOINT</code>, like here: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ENTRYPOINT</span><span class="s"> [ "streamlit", "run" ]</span> <span class="k">CMD</span><span class="s"> [ "scripts/app.py" ]</span> </code></pre> </div> <p>In this case, every time we start the container we will run a Streamlit app, but we can choose the path of the app by providing it to the container from the <code>docker run</code> command line.</p> <ul> <li> <code>ARG</code>: this key word is used to set build arguments, which are local variables that can be overridden by others specified at build time with the <code>docker build</code> CLI. They're especially useful if you use a value more than once in your Dockerfile and don't want to repeat it: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> NODE_VERSION="20"</span> <span class="k">ARG</span><span class="s"> ALPINE_VERSION="3.20"</span> <span class="k">FROM</span><span class="s"> node:${NODE_VERSION}-alpine${ALPINE_VERSION}</span> </code></pre> </div> <p>This can be easily overridden by:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>docker build . 
--build-arg NODE_VERSION="18" </code></pre> </div> <ul> <li> <code>ENV</code>: this key word, as the name suggests, sets an <em>environment</em> variable. Differently from build arguments, environment variables cannot be overridden from the <code>docker build</code> CLI, and they persist in the final image: they are useful when we want a variable to be accessible to all image build stages and to the running container.</li> </ul> <h3> Let's build a Dockerfile </h3> <p>To build a Dockerfile, we need to know what application we are going to ship through the image we're about to set up.</p> <p>In this tutorial, we will build a very simple Python application with <a href="proxy.php?url=https://gradio.app" rel="noopener noreferrer">Gradio</a>, a popular framework to build elegant and beautiful frontends for AI/ML Python apps.</p> <p>Our folder will look like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>build_an_image_1/ |__ app.py |__ Dockerfile </code></pre> </div> <p>To fill up <code>app.py</code>, we will use a template that <a href="proxy.php?url=https://huggingface.com" rel="noopener noreferrer">Hugging Face</a> itself provides for Gradio ChatBot Spaces:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="kn">import</span> <span class="n">gradio</span> <span class="k">as</span> <span class="n">gr</span> <span class="k">def</span> <span class="nf">respond</span><span class="p">(</span> <span class="n">message</span><span class="p">,</span> <span class="n">history</span><span class="p">):</span> <span class="n">message_back</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Your message is: </span><span class="si">{</span><span class="n">message</span><span class="si">}</span><span class="sh">"</span> <span class="n">response</span> <span class="o">=</span> <span class="sh">""</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">message_back</span><span class="p">:</span> <span
class="n">response</span> <span class="o">+=</span> <span class="n">m</span> <span class="k">yield</span> <span class="n">response</span> <span class="n">demo</span> <span class="o">=</span> <span class="n">gr</span><span class="p">.</span><span class="nc">ChatInterface</span><span class="p">(</span> <span class="n">respond</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="sh">"</span><span class="s">Echo Bot</span><span class="sh">"</span><span class="p">,</span> <span class="p">)</span> <span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span> <span class="n">demo</span><span class="p">.</span><span class="nf">launch</span><span class="p">(</span><span class="n">server_name</span><span class="o">=</span><span class="sh">"</span><span class="s">0.0.0.0</span><span class="sh">"</span><span class="p">,</span> <span class="n">server_port</span><span class="o">=</span><span class="mi">7860</span><span class="p">)</span> </code></pre> </div> <p>This is a simple bot that echoes every message we send. <br> We will just copy this code into our main script, <code>app.py</code>.</p> <p>Now we're ready to build our Docker image, starting with modifying our Dockerfile.</p> <h4> 1. 
The base image </h4> <p>For our environment we need Python 3, so we will need to find a suitable base image for that.</p> <p>Luckily, Python itself provides official images (Debian-based by default, with lighter variants based on Alpine, a minimal Linux distro), so we will just use <code>python:3.11.9</code>.</p> <p>We then just need to specify:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> PYTHON_VERSION="3.11.9"</span> <span class="k">FROM</span><span class="s"> python:${PYTHON_VERSION}</span> </code></pre> </div> <p>At the very beginning of our Dockerfile.</p> <p>As we said, if we want a different Python version, we just need to run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">--build-arg</span> <span class="nv">PYTHON_VERSION</span><span class="o">=</span><span class="s2">"3.10.14"</span> </code></pre> </div> <h4> 2. Get the needed dependencies </h4> <p>Our app depends exclusively on <code>gradio</code>, so we can do a quick <code>pip install</code> for that!</p> <p>We also set the version (5.4.0) as an ARG and ENV:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> GRADIO_V="5.4.0"</span> <span class="k">ENV</span><span class="s"> GRADIO_VERSION=${GRADIO_V}</span> <span class="k">RUN </span>python3 <span class="nt">-m</span> pip cache purge <span class="k">RUN </span>python3 <span class="nt">-m</span> pip <span class="nb">install </span><span class="nv">gradio</span><span class="o">==</span><span class="k">${</span><span class="nv">GRADIO_VERSION</span><span class="k">}</span> </code></pre> </div> <p>You cannot change <code>GRADIO_VERSION</code> directly, but you can pass <code>GRADIO_V</code> as a build argument, and the <code>ENV</code> value will be updated accordingly!<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker build 
<span class="nb">.</span> <span class="nt">--build-arg</span> <span class="nv">GRADIO_V</span><span class="o">=</span><span class="s2">"5.1.0"</span> </code></pre> </div> <h4> 3. Start the application </h4> <p>We need to start the application, something that we would normally do as <code>python3 app.py</code>.</p> <p>But our <code>app.py</code> file is locally stored, not available to the Docker image, so we need to copy it into our Docker working directory:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">WORKDIR</span><span class="s"> /app/</span> <span class="k">COPY</span><span class="s"> ./app.py /app/</span> </code></pre> </div> <p>Since our application runs on <a href="proxy.php?url=http://0.0.0.0:7860" rel="noopener noreferrer">http://0.0.0.0:7860</a>, we need to expose port 7860:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">EXPOSE</span><span class="s"> 7860</span> </code></pre> </div> <p>Keep in mind that <code>EXPOSE</code> only documents the port: to reach the app from the host you will still need to publish it with <code>-p 7860:7860</code> when running the container.</p> <p>Now we can make our application run:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ENTRYPOINT</span><span class="s"> ["python3"]</span> <span class="k">CMD</span><span class="s"> ["/app/app.py"]</span> </code></pre> </div> <p>We will not be able to change the base executable (<code>python3</code>) at runtime, short of using the <code>--entrypoint</code> flag, but we will be able to override the <code>CMD</code> instance by specifying another path at runtime (for example if we mount a volume while running the container).</p> <h4> 4. 
Full Dockerfile </h4> <p>Our full Dockerfile will look like this:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight docker"><code><span class="k">ARG</span><span class="s"> PYTHON_VERSION="3.11.9"</span> <span class="k">FROM</span><span class="s"> python:${PYTHON_VERSION}</span> <span class="k">WORKDIR</span><span class="s"> /app/</span> <span class="k">COPY</span><span class="s"> ./app.py /app/</span> <span class="k">ARG</span><span class="s"> GRADIO_V="5.4.0"</span> <span class="k">ENV</span><span class="s"> GRADIO_VERSION=${GRADIO_V}</span> <span class="k">RUN </span>python3 <span class="nt">-m</span> pip cache purge <span class="k">RUN </span>python3 <span class="nt">-m</span> pip <span class="nb">install </span><span class="nv">gradio</span><span class="o">==</span><span class="k">${</span><span class="nv">GRADIO_VERSION</span><span class="k">}</span> <span class="k">EXPOSE</span><span class="s"> 7860</span> <span class="k">ENTRYPOINT</span><span class="s"> ["python3"]</span> <span class="k">CMD</span><span class="s"> ["/app/app.py"]</span> </code></pre> </div> <p>Now we just need to build the image!</p> <h3> Build and push the image </h3> <p>When we build the image, we need to specify the <em>context</em>, meaning the directory whose contents are sent to the Docker daemon for the build (by default, the one containing our Dockerfile). 
For starters, we will also use the <code>-t</code> flag, which specifies the <em>name</em> and <em>tag</em> of our image:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> YOUR-USERNAME/gradio-echo-bot:0.0.0 <span class="nt">-t</span> YOUR-USERNAME/gradio-echo-bot:latest </code></pre> </div> <p>As you can see, you can specify multiple tags.</p> <p>This build, once launched, will take some minutes to complete, and then you will have your images locally!</p> <p>If you want to make these images available to everyone, you need to log in to your Docker account:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker login <span class="nt">-u</span> YOUR-USERNAME <span class="nt">--password-stdin</span> </code></pre> </div> <p>With <code>--password-stdin</code>, Docker reads the password from standard input instead of an interactive prompt: you can type it and close the input with Ctrl+D, or pipe it in from a file. </p> <p>You won't use your Docker password here, but an <a href="proxy.php?url=https://docs.docker.com/security/for-developers/access-tokens/#create-an-access-token" rel="noopener noreferrer">access token</a> (follow the link for a guide on how to obtain it). </p> <p>Now let's push our image to the Docker Hub registry:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>docker push YOUR-USERNAME/gradio-echo-bot:0.0.0 docker push YOUR-USERNAME/gradio-echo-bot:latest </code></pre> </div> <p>The push generally takes some time, but after that our image will be live on Docker Hub: we published our first Docker image!🎉</p> <p>We will stop here for this article, but in the next one we will dive into more advanced build concepts🥰 </p> docker devops beginners tutorial
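<p>To close the loop, here is a short sketch of how you might run the freshly built image locally (the image name and tags are the ones assumed in the build step above; <code>other_app.py</code> is a hypothetical replacement script). Since <code>EXPOSE</code> only documents port 7860, we publish it explicitly with <code>-p</code>, and the second command shows how the <code>CMD</code> path can be overridden at run time while the <code>ENTRYPOINT</code> (<code>python3</code>) stays fixed:</p>

```shell
# Run the image we built and tagged above; -p actually publishes the port
# that EXPOSE 7860 merely documents, so the bot becomes reachable from the
# host at http://localhost:7860
docker run \
  --rm \
  -p 7860:7860 \
  YOUR-USERNAME/gradio-echo-bot:latest

# Override the CMD path (but not the python3 ENTRYPOINT) with a different
# script, here a hypothetical other_app.py mounted as a volume
docker run \
  --rm \
  -p 7860:7860 \
  -v "$(pwd)/other_app.py":/app/other_app.py \
  YOUR-USERNAME/gradio-echo-bot:latest \
  /app/other_app.py
```

<p>Note that <code>--rm</code> simply cleans up the container once it stops; it is optional and not part of the original build instructions.</p>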