This project is a further expansion on two previous projects:
To fully understand the functionality involved regarding image filtering, image object detection and the integration with the Telegram Bot, please read about the above projects.
Note
A full explanation of the Terraform deployment of the infra described below can be found in the Terraform section.
- A `VPC` in `us-east-1` or `us-east-2`, containing two Public Subnets, in the `us-east-1a` and `us-east-1b` or `us-east-2a` and `us-east-2b` `Availability Zones (AZ)` respectively.
- The services deployed within these subnets communicate with the world via an `Internet Gateway`, which is attached to the VPC.
- The `Polybot` service runs as a `Docker` container on two `EC2` machines (t3.micro), one in each AZ, behind an `Application Load Balancer (ALB)`.
  - I've created a sub-domain under the main `INT` domain `.int-devops.click` and attached it to the ALB.
  - I've created a self-signed certificate and attached it to my sub-domain for secure communication with the Telegram API.
- The `Yolo5` service runs as a `Docker` container, starting with a single `EC2` (t3.medium) which is instantiated via an `Auto Scaling Group (ASG)`.
  - The ASG is configured to scale up when the CPU reaches 20% utilization (for testing purposes).
  - The ASG makes use of a `Launch Template (LT)` to create the EC2 machines.
    - The LT uses User Data to automatically install what is needed, pull the latest Docker image from the `ECR` repository, and then run the Yolo5 service.
    - The LT is configured to deploy the EC2s inside the above VPC and in the specified subnets.
    - It also makes use of an existing `Key Pair` which is created separately for SSH.
    - It uses its own SG for the Yolo5's EC2 machines (read below).
- There is a `Security Group (SG)` for the ALB which restricts Inbound traffic to the CIDRs of the Telegram servers on port 8443 only, and Outbound to the Security Group of the Polybot's EC2 machines, also on port 8443.
- The SG for the Polybot's EC2s accepts Inbound traffic only from the ALB SG and SSH, and Outbound to All.
- The SG for the Yolo5's EC2s accepts Inbound traffic only for SSH, and Outbound to All.
- All EC2 machines have a Public IP enabled, for convenience only, for use with SSH.
- A `Secret Manager (SM)` which has two secrets in it:
  - Telegram Token
  - Sub-Domain Certificate
- There are two `SQS Queues`, one for `identify` and one for `results`, into which each of the EC2s can put messages and from which they can pull messages.
- A `DynamoDB` Table, into which the Image Object Detection results are written and from which they are read.
- An `S3 Bucket`, which holds the images that are to be identified and then the resulting images.
- I've created an `IAM Role`, with an inline policy, which follows the `Least Privilege` principle and only grants the absolutely necessary permissions to the EC2s.
  - For the Polybot, the role is attached to the two machines.
  - For the Yolo5, the role is part of the LT configuration and each machine that is created gets it.
- I use the latest Ubuntu 24.04 `AMI` for the creation of all the EC2s in this project.
- The user uploads an image on the Telegram App with a caption of `predict`.
- The Polybot service picks up the message and handles it by instantiating the `ObjectDetectionBot`.
- The image is then uploaded to S3 and a message with the `chatId` and `imgName` is sent to the `identify` SQS Queue.
- The Yolo5 service polls the identify SQS Queue for incoming messages. Once a message is picked up, it gets the `imgName`, downloads the image from S3 and the detection process kicks in.
- The resulting image is uploaded back to S3 and the summary is saved to DynamoDB. A message containing either failure or success details is sent to the `results` SQS Queue.
- The Polybot service polls the results SQS Queue for incoming results messages. Once a message is picked up, it gets the `prediction_id`, queries DynamoDB, gets the prediction output, parses it, gets the image name, downloads the image from S3, and responds back to the user with the image and a readable summary of what was found.
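The messaging side of the flow above can be sketched with boto3. The field names (`chatId`, `imgName`) follow the description; the queue URL is a placeholder, and the helper names are illustrative, not the project's actual code:

```python
import json


def build_identify_message(chat_id: int, img_name: str) -> str:
    """Message the Polybot sends to the identify SQS Queue after the S3 upload."""
    return json.dumps({"chatId": chat_id, "imgName": img_name})


def parse_identify_message(body: str) -> tuple:
    """What the Yolo5 service extracts when it polls the identify queue."""
    msg = json.loads(body)
    return msg["chatId"], msg["imgName"]


if __name__ == "__main__":
    # Actually sending requires AWS credentials and a real queue URL; sketch only.
    import boto3

    sqs = boto3.client("sqs")
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/identify",  # placeholder
        MessageBody=build_identify_message(42, "photos/cat.jpeg"),
    )
```

The same JSON body is what the Yolo5 consumer parses on the other side of the queue.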
.
├── .github
│ └── workflows
│ ├── backend-state-destroying.yaml
│ ├── backend-state-provisioning.yaml
│ ├── infra-destroying.yaml
│ ├── infra-provisioning-main.yaml
│ ├── infra-provisioning-region.yaml
│ ├── polybot-deployment.yaml
│ └── yolo5-deployment.yaml
├── .gitignore
├── AWS_Project.jpg
├── LICENSE
├── README.md
├── load_test.py
├── polybot
│ ├── .dockerignore
│ ├── Dockerfile
│ ├── __init__.py
│ ├── ansible
│ │ ├── ansible.cfg
│ │ ├── aws_ec2.yaml
│ │ └── playbook.yaml
│ ├── python
│ │ ├── __init__.py
│ │ ├── bot.py
│ │ ├── bot_utils.py
│ │ ├── flask_app.py
│ │ ├── img_proc.py
│ │ ├── process_messages.py
│ │ ├── process_results.py
│ │ └── requirements.txt
│ ├── uwsgi.ini
│ └── wsgi.py
├── tf_backend_state
│ ├── .gitignore
│ ├── .terraform.lock.hcl
│ ├── dev.tfvars
│ ├── main.tf
│ ├── prod.tfvars
│ ├── providers.tf
│ ├── terraform.plan
│ ├── terraform.tfstate
│ └── variables.tf
├── tf_infra
│ ├── .gitignore
│ ├── .terraform.lock.hcl
│ ├── main.tf
│ ├── modules
│ │ ├── dynamodb
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── ec2-key-pair
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── ecr-and-policy
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── iam-role-and-policy
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ ├── policy_template.tftpl
│ │ │ └── variables.tf
│ │ ├── polybot
│ │ │ ├── deploy.sh
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── secret-manager
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── sqs-queue
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ ├── sub-domain-and-cert
│ │ │ ├── generate_certificate.sh
│ │ │ ├── main.tf
│ │ │ ├── outputs.tf
│ │ │ └── variables.tf
│ │ └── yolo5
│ │ ├── deploy.sh.tftpl
│ │ ├── main.tf
│ │ ├── outputs.tf
│ │ └── variables.tf
│ ├── outputs.tf
│ ├── providers.tf
│ ├── region.us-east-1.tfvars
│ ├── region.us-east-2.tfvars
│ └── variables.tf
└── yolo5
├── .dockerignore
├── Dockerfile
├── ansible
│ ├── ansible.cfg
│ ├── aws_ec2.yaml
│ └── playbook.yaml
├── app.py
├── prediction_cleanup.sh
├── requirements.txt
└── yolo_utils.py

- In the Python code I'm using Threading. When the application starts, two threads get initiated: one for the Bot to poll the `results` SQS Queue and one for the Bot to poll an internal Python Queue for incoming messages from the Telegram app.
- Both threads get a single instance of the `bot_factory` so that the same bot setup is used in both. Both threads also get the app, to maintain context for globally declared values.
- The secrets, i.e. the Telegram Token and the Domain Certificate, are pulled from AWS Secret Manager and used in the code, but are not saved anywhere, thus improving security.
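A minimal sketch of the two-thread setup described above, using an in-process `queue.Queue` as a stand-in for both the results SQS Queue and the internal Telegram queue (the names here are illustrative, not the actual code):

```python
import queue
import threading


def poller(source: queue.Queue, handle, stop: threading.Event):
    """Generic poll loop: one instance polls the results queue,
    the other polls the internal queue of Telegram updates."""
    while not stop.is_set():
        try:
            msg = source.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(msg)


def start_pollers(results_q: queue.Queue, telegram_q: queue.Queue, bot):
    """Both threads share the same bot instance, mirroring the single
    bot_factory instance mentioned above."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=poller, args=(results_q, bot.handle_result, stop), daemon=True),
        threading.Thread(target=poller, args=(telegram_q, bot.handle_update, stop), daemon=True),
    ]
    for t in threads:
        t.start()
    return stop, threads
```

Setting the shared `stop` event shuts both pollers down cleanly, which is one reason a plain Flask server is easier to reason about here than a pre-forking server.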
- The Polybot Dockerfile no longer uses the uWSGI server; it now makes use of the regular Flask server. The reason for this has to do with my use of threading, which didn't work with uWSGI.
- The container is run with the `--restart always` flag so that when the machine stops and starts, or restarts for some reason, the container will immediately start as well.
- The Yolo5 service polls the `identify` SQS Queue for incoming messages placed there by the `Polybot` service.
- I've decoupled the services by introducing an additional SQS Queue, so that the Yolo5 doesn't make a POST request to the Polybot directly (via the ALB) but rather places a message in the `results` queue.
- In order to prevent container bloat, I've created a cleanup bash script, `prediction_cleanup.sh`, which I run as a cron job in the background inside the container. It deletes all the prediction files and images that are older than 2 minutes.
  - I added this to the `Dockerfile`.
- The container is run with the `--restart always` flag so that when the machine stops and starts, or restarts for some reason, the container will immediately start as well.
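The cleanup idea behind `prediction_cleanup.sh` (delete prediction files older than 2 minutes) can be sketched in Python; this is an illustration of the logic, not the actual script:

```python
import os
import time


def cleanup(directory: str, max_age_seconds: int = 120) -> list:
    """Delete files under `directory` whose mtime is older than
    max_age_seconds; return the paths that were deleted.
    Mirrors the cron'd prediction_cleanup.sh described above."""
    now = time.time()
    deleted = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                deleted.append(path)
    return deleted
```

Run periodically (cron or a loop), this keeps the container's writable layer from accumulating old prediction artifacts.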
Note
I've discovered that with the existing architecture the concat image filtering functionality doesn't work, because for every image in the group Telegram makes a separate HTTP request, causing the ALB to route the second request to the second machine, thus making it "lose" state.
In order to resolve this, an additional component has to be introduced, perhaps Redis or another Table in DynamoDB.
- I created a load-testing script, `load_test.py`, which sends messages directly to the identify SQS Queue to simulate increased traffic. The images must be pre-loaded to S3 and the image name list must be updated in the script itself.
- Navigate to the CloudWatch console and observe the metrics related to CPU utilization for your Yolo5 instances. You should see a noticeable increase in CPU utilization during the period of increased load.
- After ~3 minutes, as the CPU utilization crosses the threshold, CloudWatch alarms will be triggered. Check your ASG size; you should see that the desired capacity increases in response to the increased load.
- After approximately 15 minutes of reduced load, CloudWatch alarms for scale-in will be triggered.
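The load test can be sketched as a loop pushing identify-style messages to SQS. The queue URL and image names below are placeholders that would have to match your deployment (and the images must already exist in S3):

```python
import json


def build_load_messages(chat_id: int, image_names: list) -> list:
    """One identify-queue message body per pre-loaded S3 image."""
    return [json.dumps({"chatId": chat_id, "imgName": n}) for n in image_names]


if __name__ == "__main__":
    # Requires AWS credentials and a real queue URL; sketch only.
    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/identify"  # placeholder
    # Repeat the batch to keep the Yolo5 instances busy long enough to trip the alarm.
    for body in build_load_messages(42, ["cat.jpeg", "dog.jpeg"]) * 50:
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```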
Important
This has to be deployed prior to the main project, else the init will not work.
In order to manage the Terraform state, one has to create the backend infra, which includes the S3 bucket `talo-tf-s3-tfstate` for the state file itself and the DynamoDB Table `talo-tf-terraform-lock-table` for the locking mechanism, to prevent deployments from multiple sources overriding each other.
I've created a separate Terraform deployment for this purpose, in the `tf_backend_state` directory. It creates the above-mentioned services with the relevant failure and security configurations, such as versioning and encryption on the bucket. Also, both the S3 bucket and the DynamoDB Table get least-privilege Service Policies allowing only myself to execute specific actions on them.
The main project's infrastructure deployment is in the `tf_infra` directory.
Here I've followed TF best practices:
- Making use of modules to create a clear and easy-to-manage separation of the various components making up the whole project.
- Each module has its own main, variables and outputs.
- Separating variables from their values so that it's easy to alter values and redeploy.
- Separating the outputs.
- Names are dynamically created per region and environment to prevent any possible clash.
- All services, and the elements that make them up, are tagged.
The modules are in two categories:
- Global modules which are not service specific or are shared amongst multiple services.
- Custom modules which are made of a group of resources for a specific service.
I've made use of both ready-made modules from the official AWS section of the Terraform Registry and modules I've custom-built for this project's specific needs.
- Root main - Calls all other modules.
  - It makes use of a `data` resource to get the available AZs for the region.
  - It uses the terraform-aws-ami-ubuntu module to get the latest Ubuntu 24.04 distribution.
  - It has some declared locals for names, AZs, AMI and tags.
  - All values are passed in and then "trickle down" to the relevant modules.
  - Certain values come directly from the outputs of other modules.
- VPC - Creates all relevant services for the VPC, such as:
- Subnets
- Route tables
- Security groups
- Network connections
- S3 - Creates and configures the bucket.
- Route53 - Sub-Domain and Certificate - Creates a Self-Signed Certificate whose values are injected dynamically through the CI/CD inputs. It then creates an A Record under the main College domain using that certificate. It uses an `aws_route53_zone` data source to get information regarding the main domain.
- Secret Manager - I've made it generic so that multiple secrets can be created with it. It's able to create either `plain text` or `key-value` secrets, depending on the value that is passed into it.
  - In my case, two secrets are created: the Telegram Token as key-value, and the sub-domain certificate, which I later pull for use in the Python code.
- SQS - I've made it generic so that multiple queues can be created with it.
  - In my case, two queues are created: one for `identify` and one for `results`.
- DynamoDB - Creates the DynamoDB Table with its partition key and indexes, based on the values passed in.
- ECR and Lifecycle Policy - Creates an ECR Repository and a Lifecycle Policy, used to store the Docker images I build either manually or through the CI/CD process.
  - In my case the lifecycle policy only keeps one copy (the latest one) of each Docker image and only keeps 1 untagged image for 'caching'.
- IAM Role and Policy - Creates the IAM Role with the Policy that is needed to give the different services permission to talk to each other. It makes use of a `policy_template.tftpl` and dynamically replaces all the ARN placeholders, which are retrieved from the other modules and passed in.
- EC2 Key Pair - Creates an SSH Key Pair which I then use to SSH into all the EC2 machines created in this project. It also saves both private and public keys as physical files, which I later upload as artifacts so that they can be downloaded and used.
- Polybot - Creates all the relevant components which make up the Polybot service, such as:
  - An EC2 per AZ, using the `deploy.sh` file as the `user_data` to install everything that is needed for things to work.
  - An Application Load Balancer (ALB) using the official AWS module. It creates all the relevant resources, such as:
    - Security Group for the ALB
    - Listeners
    - Target Groups and Health Checks
  - Security Group for the EC2s
- Yolo5 - Creates all the relevant components which make up the Yolo5 service, such as:
  - Security Group for the EC2s
  - Launch Template with all the relevant configurations.
    - The launch template makes use of the `deploy.sh.tftpl` file for the user_data, into which values are passed dynamically.
  - Auto Scaling Group and Policy
  - SNS Topic for scaling event notifications
- The Infra deployment consists of three workflows:
  - Main, which triggers the next one
  - Region Specific
  - Destroy
Main
- The main workflow takes inputs for the region, sub-domain details and environment.
- Based on the region selection, it passes on to the next workflow the correct Telegram Token, which is saved in GitHub Secrets.
Region Specific
- Takes in the incoming values from the main workflow.
- Sets up Terraform, initializes it, selects the workspace, then plans and applies.
- Once the Terraform provisioning is done, it captures the paths of the SSH private and public keys from the outputs and uploads them as Artifacts for later use.
- It also captures all of the Terraform outputs into a file and uploads it as an Artifact for use in the following workflows.
Destroy
- The destroy workflow captures the latest successful `run_id` from the main workflow using a GitHub API command and sets it as a GitHub output for use in the following job.
- It uses the `run_id` output to download the Terraform outputs file from the Artifacts.
- It uses jq to extract values from the outputs file and sets them as GitHub outputs for use in the following job.
- The last job sets up Terraform, captures the correct Telegram Token from GitHub Secrets based on the region, initializes Terraform, selects the workspace, then plans and applies using the values that were passed down.
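The workflow does this with the GitHub API and jq; the same extraction logic, sketched in Python for illustration (the payload shape follows GitHub's "List workflow runs" API, which returns runs newest-first):

```python
def latest_successful_run_id(runs_payload: dict):
    """Return the id of the newest successful run from a
    GET /repos/{owner}/{repo}/actions/workflows/{id}/runs response,
    or None if there is no successful run."""
    for run in runs_payload.get("workflow_runs", []):
        if run.get("conclusion") == "success":
            return run["id"]
    return None
```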
- There are two services, namely `Polybot` and `Yolo5`, each deployed on manual trigger with its respective workflow.
- Both workflows are structured similarly:
  - Each consists of two jobs, namely `Build` and `Deploy`.
  - The Build job builds a new Docker image and pushes it to the ECR.
  - In the Deploy job I've made use of `Ansible` to deploy the application's new image on the machines and run the Docker container.
    - I'm using the `aws_ec2` Ansible plugin to dynamically build the inventory based on Tags that I've assigned to the Polybot and Yolo5 EC2 machines, `APP=talo-polybot` and `APP=talo-yolo5` respectively.
