rahul201722/gmm-clustering-polars

GMM Clustering with Polars

This repository contains Python scripts that demonstrate how to perform Gaussian Mixture Model (GMM) clustering using Polars for data manipulation and scikit-learn for the machine learning component.

Files

  1. gmm_with_polars.py - Comprehensive example with visualization and performance comparison
  2. simple_gmm_polars.py - Simple, focused example of GMM with Polars
  3. requirements.txt - Required Python packages

Setup

1. Install Dependencies

pip install -r requirements.txt

Or install packages individually:

pip install polars numpy matplotlib seaborn scikit-learn pandas

2. Run the Examples

Simple Example

python simple_gmm_polars.py

Comprehensive Example (with plots)

python gmm_with_polars.py

What These Scripts Do

Simple Example (simple_gmm_polars.py)

  • Creates synthetic data with 3 natural clusters
  • Preprocesses data using Polars operations
  • Performs GMM clustering with scikit-learn
  • Analyzes results using Polars aggregations
  • Saves results to CSV

Comprehensive Example (gmm_with_polars.py)

  • All features of the simple example, plus:
  • Data visualization with matplotlib
  • Performance comparison between Polars and Pandas
  • Detailed clustering analysis and metrics
  • Model parameter inspection

Key Features

Polars Operations Used

  • pl.DataFrame() - Creating DataFrames
  • with_columns() - Adding computed columns
  • group_by().agg() - Aggregating data
  • select() - Selecting columns
  • to_numpy() - Converting to NumPy arrays
  • write_csv() - Saving results

GMM Features

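  • Automatic cluster detection from unlabeled data (the number of clusters is set with n_components)
  • Probability estimates for each prediction (predict_proba() returns per-cluster membership probabilities)
  • Multiple covariance types (covariance_type can be full, tied, diag, or spherical)
  • Model convergence information (the converged_ and n_iter_ attributes)
  • Cluster centers and parameters (the means_, covariances_, and weights_ attributes)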
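
The sketch below shows where these features appear in scikit-learn's GaussianMixture. The two-blob dataset is an assumption for the example; the attributes and methods used (predict_proba, converged_, n_iter_, weights_, means_, covariances_, bic) are part of scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs (illustrative data).
X = np.vstack(
    [
        rng.normal(loc=[0.0, 0.0], scale=0.5, size=(150, 2)),
        rng.normal(loc=[4.0, 4.0], scale=0.5, size=(150, 2)),
    ]
)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Soft assignments: one probability per component, each row sums to 1.
proba = gmm.predict_proba(X)
# Hard assignments: the most probable component for each sample.
labels = gmm.predict(X)

# Convergence information and fitted parameters.
print("converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")
print("weights:", gmm.weights_)
print("means:\n", gmm.means_)
print("covariances shape:", gmm.covariances_.shape)

# Compare covariance types by BIC (lower is better).
for cov_type in ("full", "tied", "diag", "spherical"):
    model = GaussianMixture(
        n_components=2, covariance_type=cov_type, random_state=0
    ).fit(X)
    print(cov_type, "BIC:", round(model.bic(X), 1))
```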

Example Output

The scripts will generate:

  1. Console output with clustering analysis
  2. CSV files with results
  3. Visualizations (comprehensive example)
  4. Performance comparisons

Customization

You can modify these scripts to:

  • Use your own data (replace the create_sample_dataset() function)
  • Change the number of clusters (n_components parameter)
  • Use different features for clustering
  • Adjust preprocessing steps
  • Change visualization styles

Why Polars?

Polars offers several advantages for data preprocessing in ML workflows:

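  • Speed: Often faster than Pandas, since the query engine is multithreaded and written in Rust
  • Memory efficiency: Columnar storage based on the Apache Arrow memory format
  • Lazy evaluation: The lazy API builds a query plan and optimizes it before execution
  • Modern API: Clean, consistent expression syntax
  • Type safety: Every column has a strict data type defined in the DataFrame schema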

This makes it an excellent choice for data preparation before applying machine learning algorithms like GMM.
