QsingularityAi/premium-prediction-mlops

MLOps Premium Prediction Project

MLOps Pipeline · Kubernetes · Segmented Regression · Docker · MLflow · Monitoring

A complete end-to-end MLOps solution for insurance premium prediction using segmented regression models with Kubernetes orchestration.

📋 Table of Contents

  • Overview
  • Architecture
  • Features
  • Installation
  • Usage
  • Project Structure
  • Development
  • Deployment
  • Monitoring
  • Troubleshooting
  • License

πŸ” Overview

This project implements a comprehensive MLOps workflow for predicting insurance premiums. The system uses a segmented approach, training separate models for different premium ranges to improve prediction accuracy. The entire pipeline is containerized and can be deployed using Kubernetes, with monitoring, experiment tracking, and model versioning built-in.

✨ Why Segmented Regression?

Insurance premiums often exhibit different patterns across price ranges. By training specialized models for each segment (e.g., very low, low, medium, high, very high premiums), we can achieve better prediction accuracy compared to a single model approach.
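As a minimal sketch of the idea (the band edges and the mean-baseline "model" below are illustrative assumptions, not the project's actual configuration, which would live in config/config.yaml):

```python
import bisect

# Assumed premium band edges for the five segments named in the text.
SEGMENT_EDGES = [500.0, 1500.0, 5000.0, 15000.0]
SEGMENT_NAMES = ["very_low", "low", "medium", "high", "very_high"]

def assign_segment(premium: float) -> str:
    """Map a premium value onto its band name."""
    return SEGMENT_NAMES[bisect.bisect_right(SEGMENT_EDGES, premium)]

def train_segmented(premiums: list[float]) -> dict[str, float]:
    """Stand-in for per-segment training: fit the simplest possible
    'model' (the segment mean) for each band that has data."""
    by_segment: dict[str, list[float]] = {}
    for p in premiums:
        by_segment.setdefault(assign_segment(p), []).append(p)
    return {name: sum(vals) / len(vals) for name, vals in by_segment.items()}
```

In the real pipeline each "model" would be a regressor trained on that segment's feature rows; the mean baseline just keeps the sketch self-contained.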

πŸ—οΈ Architecture

The system consists of several interconnected components forming a complete MLOps pipeline:

                                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                            β”‚  Data Sources β”‚
                                            β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Data Pipeline                 β”‚ β”‚ β”‚                       β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚β—„β”˜ β”‚    MLflow Tracking    β”‚
β”‚ β”‚ Data Ingest │─►│Validation & │─►│ Feature  β”‚ β”‚   β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚  (DVC)      β”‚  β”‚Preprocessingβ”‚  β”‚Engineeringβ”‚ β”‚   β”‚ β”‚  Experiments     β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                      β”‚                             β”‚ β”‚  Parameters       β”‚ β”‚
                      β–Ό                             β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚           Model Training Pipeline           β”‚     β”‚ β”‚  Metrics         β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚     β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚Segment β”‚  β”‚ Model  β”‚  β”‚Feature Selectionβ”‚ │────► β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚Creation│─►│Training│─►│& Evaluation     β”‚ β”‚     β”‚ β”‚  Artifacts       β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚     β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚             Model Registry                   β”‚    β”‚                       β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”‚      Monitoring       β”‚
β”‚ β”‚ Model       β”‚  β”‚Versioningβ”‚  β”‚Deploymentβ”‚ β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Artifacts   │─►│& Metadata│─►│Approval  β”‚ β”‚    β”‚ β”‚  Data Drift      β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                         β”‚                         β”‚ β”‚  Model            β”‚ β”‚
                         β–Ό                         β”‚ β”‚  Performance      β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚             Serving Infrastructure          β”‚     β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚     β”‚ β”‚  System          β”‚ β”‚
β”‚ β”‚ FastAPI   β”‚  β”‚Kubernetesβ”‚  β”‚Prometheus β”‚ │────► β”‚  Metrics          β”‚ β”‚
β”‚ β”‚ Endpoints │─►│Deployment│─►│& Grafana  β”‚ β”‚     β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🌟 Features

  • Segmented Modeling: Trains specialized models for different premium ranges to improve accuracy
  • Automated Pipeline: Complete data-to-deployment workflow with minimal manual intervention
  • Experiment Tracking: MLflow integration for tracking model parameters, metrics, and artifacts
  • Containerized Deployment: Docker containers for consistent, reproducible deployment
  • Kubernetes Orchestration: Scalable and resilient deployment with Kubernetes
  • Monitoring & Alerting: Prometheus and Grafana for real-time monitoring of model and system performance
  • Model Registry: Versioning for models with comparison and approval workflows
  • CI/CD Ready: Infrastructure for continuous integration and deployment

🚀 Installation

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose
  • Kubernetes (Docker Desktop with Kubernetes enabled or a separate cluster)
  • Git

Quick Setup

# Clone the repository
git clone https://github.com/your-username/premium-prediction-mlops.git
cd premium-prediction-mlops

# Run setup script
chmod +x setup.sh
./setup.sh

# Activate virtual environment
source venv/bin/activate

Manual Setup

If you prefer to set up the project manually:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create required directories
mkdir -p data/{raw,processed} models logs artifacts

# Initialize DVC
dvc init

📊 Usage

Data Preparation

# Place training data in data/raw directory
# Track it with DVC
dvc add data/raw/train_data.csv
git add data/raw/train_data.csv.dvc
git commit -m "Add training data"

Training Models

# Edit configuration in config/config.yaml if needed
# Run training
python -m src.train

# View experiment results in MLflow
mlflow ui
# Open http://localhost:5000 in your browser
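Inside the training code, per-segment runs can be logged to MLflow roughly as follows (a sketch, not the actual src.train internals; the run-name convention is an assumption):

```python
def run_name(segment: str) -> str:
    """Run-name convention (assumed) for grouping one segment's experiments."""
    return f"premium-{segment}"

def log_segment_run(segment: str, params: dict, metrics: dict) -> None:
    """Log one trained segment's parameters and metrics as an MLflow run."""
    import mlflow  # imported lazily so the sketch reads without MLflow installed

    with mlflow.start_run(run_name=run_name(segment)):
        mlflow.log_params(params)    # e.g. {"model": "ridge", "alpha": 1.0}
        mlflow.log_metrics(metrics)  # e.g. {"rmse": 412.7, "r2": 0.81}
```

Runs named this way appear side by side in the MLflow UI, making per-segment comparison straightforward.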

Local Deployment

# Build and start services with Docker Compose
docker-compose up -d

# API documentation available at
# http://localhost:8000/docs

Making Predictions

# Single prediction
curl -X POST "http://localhost:8000/predict" \
     -H "Content-Type: application/json" \
     -d '{
        "features": {
            "Age": 35,
            "Vehicle_Age": 5,
            "Credit_Score": 720,
            "Annual_Income": 65000,
            "Previous_Claims": 1,
            "Insurance_Duration": 3
        }
     }'
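The same call can be made from Python with only the standard library (the payload shape mirrors the curl example above; the response fields depend on the API and are not assumed here):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"  # docker-compose port from above

def build_payload(features: dict) -> bytes:
    """Serialize a feature dict into the JSON body the endpoint expects."""
    return json.dumps({"features": features}).encode("utf-8")

def predict(features: dict) -> dict:
    """POST the features to the API; requires the service to be running."""
    req = request.Request(
        API_URL,
        data=build_payload(features),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```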

πŸ“ Project Structure

premium-prediction-mlops/
β”œβ”€β”€ config/                # Configuration files
β”‚   └── config.yaml        # Main configuration
β”œβ”€β”€ data/                  # Data directory (managed by DVC)
β”‚   β”œβ”€β”€ raw/               # Raw data files
β”‚   └── processed/         # Processed data files
β”œβ”€β”€ kubernetes/            # Kubernetes deployment manifests
β”œβ”€β”€ models/                # Trained model storage
β”œβ”€β”€ notebooks/             # Jupyter notebooks for exploration
β”œβ”€β”€ src/                   # Source code
β”‚   β”œβ”€β”€ api/               # API service
β”‚   β”‚   └── app.py         # FastAPI application
β”‚   β”œβ”€β”€ data/              # Data processing code
β”‚   β”‚   └── data_processor.py
β”‚   └── models/            # Model-related code
β”‚       β”œβ”€β”€ model_trainer.py
β”‚       └── model_predictor.py
β”œβ”€β”€ tests/                 # Unit and integration tests
β”œβ”€β”€ Dockerfile             # Docker container definition
β”œβ”€β”€ docker-compose.yml     # Local deployment configuration
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ setup.sh               # Setup script
└── README.md              # Project documentation

πŸ§‘β€πŸ’» Development

Development Workflow

  1. Create a feature branch:

    git checkout -b feature/your-feature-name
  2. Make your changes and run tests:

    pytest tests/
  3. Format and lint your code:

    make format
    make lint
  4. Commit your changes:

    git commit -m "Add meaningful commit message"
  5. Push to your fork and create a pull request

Using the Makefile

The Makefile in the root directory standardizes common development, testing, deployment, and operational tasks by wrapping longer command sequences in short, memorable targets (make <command>). This keeps workflows consistent across developers, reduces copy-paste errors, and gives a single entry point to each stage of the MLOps lifecycle.

Key commands include:

  • make setup: Set up the development environment.
  • make format: Format code using black and isort.
  • make lint: Run linters (flake8, black, isort).
  • make test: Run the test suite.
  • make train: Train the models.
  • make serve: Run the API server locally.
  • make docker-build: Build the Docker image.
  • make docker-run: Run the Docker container locally.
  • make deploy: Deploy the application to Kubernetes.
  • make monitor: Deploy monitoring components.
  • make clean: Clean up generated files.
  • make help: Show all available commands.

Refer to the Makefile itself for the full list and details.

Code Style

  • Follow PEP 8 guidelines
  • Use type hints
  • Write docstrings for all functions and classes
  • Include unit tests for new functionality
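A hypothetical helper (not taken from src/) illustrating the expected style:

```python
def premium_per_year(total_premium: float, duration_years: int) -> float:
    """Return the average premium paid per year of coverage.

    Args:
        total_premium: Total premium over the policy, in currency units.
        duration_years: Insurance duration in whole years; must be positive.

    Raises:
        ValueError: If duration_years is not positive.
    """
    if duration_years <= 0:
        raise ValueError("duration_years must be positive")
    return total_premium / duration_years
```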

🌐 Deployment

Kubernetes Deployment

# Apply Kubernetes manifests
kubectl apply -f kubernetes/namespace.yaml
kubectl apply -f kubernetes/configmap.yaml
kubectl apply -f kubernetes/deployment.yaml
kubectl apply -f kubernetes/service.yaml
kubectl apply -f kubernetes/ingress.yaml
kubectl apply -f kubernetes/monitoring.yaml

# Check deployment status
kubectl get all -n mlops-premium

Production Considerations

  • Use proper secrets management
  • Configure TLS certificates for HTTPS
  • Implement authentication for the API
  • Set appropriate resource limits
  • Configure backup and recovery procedures

📊 Monitoring

Key Metrics to Monitor

  • Model Performance: RMSE, MAE, R² for each segment
  • Prediction Latency: Response time for predictions
  • Data Drift: Changes in feature distributions
  • System Metrics: CPU, memory, network usage
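The project does not pin down a specific drift test; one common choice is the Population Stability Index (PSI), sketched here for a single numeric feature (the bin count and the 0.2 alert threshold are conventional defaults, not project settings):

```python
import math

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference sample and a live sample of one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate/retrain."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```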

Accessing Monitoring Dashboards

# Port-forward Prometheus
kubectl port-forward -n mlops-premium svc/prometheus-service 9090:9090

# Port-forward Grafana
kubectl port-forward -n mlops-premium svc/grafana-service 3000:3000

# Access in browser
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (default: admin/admin)

Model Retraining

Trigger model retraining when:

  1. Data drift is detected
  2. Model performance degrades
  3. On a regular schedule (e.g., monthly)

# Retrain, logging artifacts and registering the new model
python -m src.train --log-artifacts --register-model
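Conditions 1 and 2 can be combined into a simple gate (the thresholds below are illustrative assumptions, not project defaults):

```python
def should_retrain(psi: float, current_rmse: float, baseline_rmse: float,
                   psi_threshold: float = 0.2, degradation: float = 0.10) -> bool:
    """True when drift or performance degradation crosses the assumed thresholds."""
    drifted = psi > psi_threshold                              # condition 1
    degraded = current_rmse > baseline_rmse * (1 + degradation)  # condition 2
    return drifted or degraded
```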

🔧 Troubleshooting

Common Issues

  • API Issues: Check logs with docker-compose logs premium-api or kubectl logs -n mlops-premium deployment/premium-model-api
  • Training Issues: Check logs in the logs/ directory
  • Kubernetes Issues: Use kubectl describe pod -n mlops-premium <pod-name> for detailed information

For more detailed troubleshooting, refer to the CHECKLIST.md and QUICKSTART.md files.

📜 License

MIT License
