Inspiration

In predictive maintenance, ensuring reliable operation of critical machinery like electric motors can significantly reduce downtime and maintenance costs. Condition monitoring systems for motors often generate vast amounts of data, but processing all this data is computationally demanding. We wanted to design a smarter data selection method to train machine learning models efficiently without losing essential information, enabling faster, more accurate tracking of motor health degradation. This inspired us to create a data sampling algorithm that could balance broad coverage of operational scenarios with focused attention on high-use regions, delivering effective predictions within limited resources.

What it does

Our project enables a condition monitoring system to select the most informative data subsets from large datasets. Given an electric motor's frequency, power, and vibration levels, the algorithm identifies and selects 2,500 data points from an initial dataset of 500,000. This subset preserves the input space's overall coverage while prioritizing data points from frequently observed operating conditions. The selected points enable the system’s machine learning model to make accurate predictions about the motor’s health while reducing computational load.

How we built it

We implemented the sampling algorithm in two stages using a Gaussian Mixture Model (GMM):

Selection 1 (Uniform Sampling): We divided the input space (frequency and power) into a grid to cover the full range of operational conditions uniformly. This ensured that the algorithm sampled points from a wide variety of operating conditions.
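The grid-based stage can be sketched as follows. This is a minimal illustration, not our production code: the synthetic frequency/power ranges, the grid resolution, and the per-cell quota are all placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for motor telemetry: frequency (Hz) and power (kW).
# The column meanings and ranges here are illustrative only.
data = np.column_stack([
    rng.uniform(10, 60, 500_000),   # frequency
    rng.uniform(0, 150, 500_000),   # power
])

def uniform_grid_sample(X, n_bins=10, per_cell=5, seed=0):
    """Divide the 2-D input space into an n_bins x n_bins grid and draw
    up to `per_cell` points from every occupied cell, so sparse operating
    conditions are represented alongside common ones."""
    rng = np.random.default_rng(seed)
    # Assign each point to a grid cell along each axis.
    edges_f = np.linspace(X[:, 0].min(), X[:, 0].max(), n_bins + 1)
    edges_p = np.linspace(X[:, 1].min(), X[:, 1].max(), n_bins + 1)
    cell_f = np.clip(np.digitize(X[:, 0], edges_f) - 1, 0, n_bins - 1)
    cell_p = np.clip(np.digitize(X[:, 1], edges_p) - 1, 0, n_bins - 1)
    cell_id = cell_f * n_bins + cell_p

    chosen = []
    for c in np.unique(cell_id):
        idx = np.flatnonzero(cell_id == c)
        take = min(per_cell, idx.size)
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen)

idx = uniform_grid_sample(data, n_bins=10, per_cell=5)
subset = data[idx]
```

Because every occupied cell contributes at most the same quota, rare operating conditions are not drowned out by the most common ones.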

As an alternative for the density-based stage, we also tried Density-Based Spatial Clustering of Applications with Noise (DBSCAN), since it takes a density-based approach directly and yields a better spread; its results were more promising than GMM's.
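A minimal sketch of the DBSCAN alternative is shown below on synthetic data. The `eps` and `min_samples` values are illustrative assumptions and would need tuning to the real telemetry's scale.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)

# Synthetic stand-in: one dense operating regime plus scattered background.
dense = rng.normal([25, 40], [0.5, 1.0], size=(500, 2))
sparse = np.column_stack([rng.uniform(10, 60, 100),
                          rng.uniform(0, 150, 100)])
X = np.vstack([dense, sparse])

# eps and min_samples are placeholder values for this toy data.
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(X)

# Non-negative labels mark dense clusters; -1 marks noise, i.e. points
# in rarely visited regions of the operating space.
n_clusters = labels.max() + 1
noise_mask = labels == -1
```

Unlike a GMM, DBSCAN needs no preset number of components and explicitly separates low-density points as noise, which is one reason it can spread the selection more evenly.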

Selection 2 (Density-based Sampling): We performed GMM to identify high-density areas where the motor frequently operates. Additional data points were sampled proportionally from these regions to give the model a more refined understanding of typical operating states.
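The GMM stage can be sketched as below: fit the mixture, score every point's estimated density, and sample with probability proportional to that density so high-use regions contribute more points. The component count, sample size, and synthetic regimes are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic stand-in: two dense operating regimes plus uniform background.
regime_a = rng.normal([25, 40], [1.0, 3.0], size=(4000, 2))
regime_b = rng.normal([50, 120], [1.5, 5.0], size=(4000, 2))
background = np.column_stack([rng.uniform(10, 60, 2000),
                              rng.uniform(0, 150, 2000)])
X = np.vstack([regime_a, regime_b, background])

def density_weighted_sample(X, n_components=3, n_samples=1000, seed=0):
    """Fit a GMM, then draw points without replacement with probability
    proportional to the model's density estimate at each point."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    log_density = gmm.score_samples(X)                 # log p(x) per point
    weights = np.exp(log_density - log_density.max())  # stable rescaling
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(X), size=n_samples, replace=False, p=weights)

idx = density_weighted_sample(X, n_components=3, n_samples=1000)
```

Points from the frequently visited regimes dominate the draw, refining the model's picture of typical operating states while the uniform stage keeps the rest of the space covered.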

The combined dataset effectively balances the broad coverage and detailed focus needed for accurate model training.

Challenges we ran into

One of the primary challenges was balancing the need for broad coverage of the input space against focused sampling of high-density operating regions. We also faced computational challenges in handling large datasets: distributing the points across 75-100 clusters ran very slowly.

Accomplishments that we're proud of

We’re proud to have developed a robust sampling algorithm that significantly reduces computational costs while preserving predictive accuracy. By carefully selecting data points, we successfully trained a machine learning model that can track motor health with a fraction of the original data. Additionally, our density-based sampling approach effectively captures the motor’s typical operating states, ensuring reliable degradation tracking.

What we learned

We deepened our understanding of clustering, GMM, and the trade-offs involved in sampling large datasets for machine learning. We learned the importance of balancing data representation across different operational ranges and the need for efficient density estimation methods in real-time applications.

What's next for Baker Hughes Sampling Algorithm

Next, we plan to further optimize the algorithm to adapt dynamically to new operating conditions. As more data is collected, the algorithm could automatically adjust its density threshold to prioritize emerging high-density regions.
