Inspiration 💵

In the world of economics and finance 💰, there are virtually unlimited ways to interpret a dataset depending on how the data is processed and presented. This is even more true when you have two or more datasets that could come from seemingly unrelated businesses, markets, or subdisciplines under the umbrella term of finance. On top of all that, data analysis sometimes requires technical expertise in a particular piece of software or programming language. To keep everything simple, we wanted to build a web application 🌐 that compares multiple financial datasets pairwise, coupled with computationally lightweight machine learning algorithms 🖥️ for predictive analytics.

What it does 💵

Poly-Finance is a web application that compares any two financial datasets of equal length, as long as they share at least one feature column of unique values (think of it as an index). Users without extensive technical backgrounds, including finance professionals, can easily leverage machine learning techniques to pinpoint the set of features that best explains a chosen feature, any of which can come from either dataset. The main functionalities, in logical order, are as follows:

  • 🎯 Compare and contrast 2 datasets
  • 🎯 Select custom features for visualization
  • 🎯 Generate new features from current features
  • 🎯 Compute descriptive statistics
  • 🎯 Perform principal component analysis (PCA)
  • 🎯 Run machine learning algorithms to evaluate data
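The first step above, pairing two datasets on a shared key column, can be sketched with pandas (the column names here are hypothetical, not the app's actual schema):

```python
import pandas as pd

# Two equal-length datasets sharing one unique-valued key column ("date")
stocks = pd.DataFrame({"date": ["2023-01-01", "2023-01-02", "2023-01-03"],
                       "close": [100.0, 101.5, 99.8]})
rates = pd.DataFrame({"date": ["2023-01-01", "2023-01-02", "2023-01-03"],
                      "yield_10y": [3.5, 3.6, 3.4]})

# Inner-join on the shared key so features from both datasets line up row by row
merged = stocks.merge(rates, on="date", how="inner")
```

After the join, every downstream step (visualization, statistics, PCA, modeling) can draw columns from either source dataset.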

Ultimately, running machine learning algorithms for prediction gauges how well a blend of features from both datasets, including optional artificially generated features, relates to the feature being predicted.
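One simple way to gauge that relationship (a sketch, not the app's exact pipeline) is to fit a regression on the blended feature matrix and read off the R² score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # stand-in for blended features from both datasets
# Synthetic target that genuinely depends on the first two features
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)           # closer to 1 => features explain the target well
```

A high R² suggests the blended features carry real signal about the chosen target; a score near zero suggests the pairing adds little explanatory power.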

How we built it 💵

The main framework behind the web application is Streamlit 👑. This has been the longest and most intensive data science project we have ever tackled. Although most of the data manipulation was handled by the standard suite of Python data science libraries 📚, e.g., NumPy, pandas, and scikit-learn, Apache Spark 💥, accessed through PySpark 🐍 (its Python API), greatly facilitated some of the CRUD operations. Additionally, the interactive plots, whether the pairwise feature scatter plots or the PCA area plot, were built with the Plotly 📊 library.

Challenges we ran into 💵

We spent a HUGE amount of time and effort debugging code, re-aligning application features, and configuring environment variables. Our lack of experience with Apache Spark also slowed us down.

Accomplishments that we're proud of 💵

Having a working data science application with potentially practical use in industry, at least compared to past ones made for fun and experimentation, marks a significant milestone on its own.

What we learned 💵

We picked up quite a few skills relevant to data science. The major ones are below:

  • 🎯 Apache Spark through its Python API (PySpark)
  • 🎯 Polynomial feature generation for machine learning
  • 🎯 Review of important SQL commands

What's next for Poly Finance 💵

We hope to implement a feature that lets users customize data cleaning, transformation, and other preprocessing steps. Additional plot types may be added as the application grows in complexity; for example, time series and network drawings could suit datasets indexed by time or geospatial location, respectively.

Built With

  • apache-spark
  • pyspark
  • python
  • streamlit