This repository contains Python scripts that demonstrate how to perform Gaussian Mixture Model (GMM) clustering using Polars for data manipulation and scikit-learn for the machine learning component.
Files in this repository:
- `gmm_with_polars.py`: Comprehensive example with visualization and performance comparison
- `simple_gmm_polars.py`: Simple, focused example of GMM with Polars
- `requirements.txt`: Required Python packages
To install the dependencies:

    pip install -r requirements.txt

Or install packages individually:

    pip install polars numpy matplotlib seaborn scikit-learn pandas

To run the examples:

    python simple_gmm_polars.py
    python gmm_with_polars.py

The simple example (`simple_gmm_polars.py`):
- Creates synthetic data with 3 natural clusters
- Preprocesses data using Polars operations
- Performs GMM clustering with scikit-learn
- Analyzes results using Polars aggregations
- Saves results to CSV
The comprehensive example (`gmm_with_polars.py`) includes all features of the simple example, plus:
- Data visualization with matplotlib
- Performance comparison between Polars and Pandas
- Detailed clustering analysis and metrics
- Model parameter inspection
Key Polars operations used:
- `pl.DataFrame()`: Creating DataFrames
- `with_columns()`: Adding computed columns
- `group_by().agg()`: Aggregating data
- `select()`: Selecting columns
- `to_numpy()`: Converting to NumPy arrays
- `write_csv()`: Saving results

scikit-learn GMM features demonstrated:
- Automatic cluster detection
- Probability estimates for each prediction
- Multiple covariance types
- Model convergence information
- Cluster centers and parameters
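
As a rough illustration of these features, the sketch below fits scikit-learn's `GaussianMixture` with each covariance type, then reads out convergence information, per-sample probabilities, and the fitted parameters. The synthetic data and the choice of `n_components=3` are assumptions for the example, not values taken from the scripts.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
blob_centers = ([0, 0], [4, 4], [0, 5])  # illustrative assumption
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in blob_centers])

# Compare the four covariance types and report convergence information.
for cov_type in ("full", "tied", "diag", "spherical"):
    model = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    print(
        f"{cov_type:>9}: converged={model.converged_}, "
        f"iterations={model.n_iter_}, BIC={model.bic(X):.1f}"
    )

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

# Probability estimates: one row per sample, one column per component.
proba = gmm.predict_proba(X)
print(proba[:3].round(3))

# Cluster centers and other fitted parameters.
print("means:\n", gmm.means_)
print("weights:", gmm.weights_)
print("covariances shape:", gmm.covariances_.shape)
```

Note that `n_components` must be supplied by the caller; comparing BIC across candidate values is one common way to pick it.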
The scripts will generate:
- Console output with clustering analysis
- CSV files with results
- Visualizations (comprehensive example)
- Performance comparisons
You can modify these scripts to:
- Use your own data (replace the `create_sample_dataset()` function)
- Change the number of clusters (`n_components` parameter)
- Use different features for clustering
- Adjust preprocessing steps
- Change visualization styles
Polars offers several advantages for data preprocessing in ML workflows:
- Speed: Faster than Pandas for many operations
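- Memory efficiency: Columnar, Apache Arrow-based storage that often uses less memory than Pandas
- Lazy evaluation: Queries built lazily are optimized as a whole before they run
- Modern API: Clean, consistent syntax
- Type safety: Strict, explicit data types for every column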
This makes it an excellent choice for data preparation before applying machine learning algorithms like GMM.