CS_410 CourseProject

Team AHR

Members

● Anthony Petrotte ([email protected])

● Hrishikesh Deshmukh ([email protected])

● Rahul Jonnalagadda ([email protected])

Project Deliverables

All project deliverables are located in the deliverables folder.

Link to Software Usage Tutorial Presentation

Description

The goal of our project is twofold. First, we analyze the financial news cycle over different time intervals to create a time-interval sentiment metric for particular global securities. Second, we compile the intermediate sentiment results from the metric calculation into a time series dataset that can be compared to price movement in the underlying security. This dataset has many potential uses, including helping to identify securities driven by trades focused on the news cycle. Another interesting use could be identifying securities that are more prone to volatility from volume in individual (and potentially more naïve) investors trading in a more impulsive/emotional manner due to recently read information.

By collecting a corpus of recent news focused on a specific company/security and trimming it down into successively smaller relevant sub-documents, we aim to create a time-series sentiment calculation that can be compared to the price of the company/security. Although the collected data may have many potential uses, we have opted to use the data graphically as a time series. If you are familiar with the concept of a stock price indicator/oscillator, the resulting dataset can be used in a similar manner. This would be a useful addition to the task of stock screening, and could be incorporated into a computational trading strategy.

General Use

The general methodology for using this code base is to 1.) build the data and 2.) visualize the data you have compiled.

Building Data Set

The data_creation notebook serves as a fully functional example of how the code base is used to pipeline local datasets to the _data directory.

Visualizing

We have provided a plotly dash app to interactively graph the datasets you build!

There are two ways to run the dash app in the repo:

python app.py: run from the terminal; this automatically opens your browser to the correct localhost address
dash_app.ipynb: a Jupyter notebook version of app.py. When the cell for the dash app is run, it gives you a hyperlink to the localhost address with your dashboard

In addition, there are plenty of pre-compiled datasets in the _data directory that can already be viewed.

Setup

This section highlights what is needed to run locally and helpful links to get you started

Dependencies

This project is implemented in Python and uses the following libraries (grouped by use)

Data Processing
- numpy
- pandas
- sklearn
- scipy
- re
- os
- math
Online Data Retrieval
- BeautifulSoup
- requests
- urllib
Date/time formatting
- datetime
- dateutil
- time
- timeit
DNN for NLP (sentiment analysis)
- torch
- transformers
Graphing/Utilizing
- plotly
- dash

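Each of the third-party libraries can be installed via pip or by most typical methods used for installing Python libraries. Note that re, os, math, urllib, datetime, time, and timeit are part of the Python standard library and need no installation, and that some pip package names differ from the import names: sklearn is installed as scikit-learn, BeautifulSoup as beautifulsoup4, and dateutil as python-dateutil.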

Initial Setup

Needed:

API to get news data
API to get ticker information

Recommended APIs to get started

API to get news data

Polygon.io
- Free for news (pricey for market data)
- Best option for quickly getting started compiling news
- Very large page size limit (1000)
- Not good for company info, especially outside of the US

API to get ticker information

AlphaVantage
- Free
- Easy

Supported APIs

NewsApi (formerly google, now NewsAPI.org)
- Free Developer Version
- Only past 1 month's news
- Daily call limit: 100
- Page size limit: 100
Currents API
- Free
- Daily call limit: 600
- Page size limit: 200
Usearch (formerly contextual_web, now Web Search on RapidAPI)
- Need Rapid API account
- Freemium
- Daily call limit: 100
- Page size limit: 50
- Rate limit: 1 per second
Yahoo Finance
- Good for quick price info
- Was previously good for getting company info before it suddenly stopped working

Important Note on API keys

The API keys need to be put into the config.py file under their respective variables. Please see the details in the util folder for setting up the config.

Method

This section goes into detail about the method used to compile web results and extract sentiments, and serves as a how-to guide for using this code base.

In addition, the img subdirectory contains two diagrams: one of the entire model and a more focused one of the document to relevant sub-document process.

Setup

Pull in the API keys from config by making a config variable: config=config()
- This variable will need to be passed to any class that needs to call an API
Create a Ticker() object, passing in config
Initiate a Corpus() object

Getting web results and initiating Corpus

Using web_query:

Initiate a web_query object
- query_all(): runs through the available APIs and makes threaded calls to collect result urls
- compile_results(): combines the results of the API responses and deletes duplicate urls
- scrape_results(): scrapes the list of urls and compiles the raw website text to be given to the Corpus
- get_results(): returns the stored dataframe of urls and text
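
The compile_results() step can be pictured with a small pandas sketch. This is not the project's implementation: the function name combine_results, the column names url and source, and the trailing-slash normalisation are assumptions made purely for illustration.

```python
import pandas as pd

def combine_results(api_results):
    """Merge per-API result lists and drop duplicate URLs (first occurrence wins).

    api_results: list of lists of (url, source) tuples, one inner list per API.
    The names and columns here are hypothetical, not the project's own.
    """
    frames = [pd.DataFrame(rows, columns=["url", "source"]) for rows in api_results]
    combined = pd.concat(frames, ignore_index=True)
    # Normalise trivially different spellings of the same URL before comparing.
    combined["url"] = combined["url"].str.strip().str.rstrip("/")
    return combined.drop_duplicates(subset="url").reset_index(drop=True)

results = combine_results([
    [("https://example.com/a", "api_one"), ("https://example.com/b", "api_one")],
    [("https://example.com/a/", "api_two"), ("https://example.com/c", "api_two")],
])
print(results)
```

Keeping the first occurrence means the API queried first is credited as the source when two APIs return the same article.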

Setting up the Corpus:

Use the set_results() function to store the results from the web_query in the corpus for processing (needed for dataset building)
Use set_corpus() with the web_query documents and urls

Setting up Ranker and Initial Queries

Initiate the ranking function (in our case, a custom-built BM25 class from util.pyRanker)
Fit the ranker to the corpus documents
Build the queries used by the ranking function using build_queries() and passing in the ticker object

Initial Ranking

Rank corpus documents for relevance with rank_docs() passing in the ranker
Prune the documents with prune_docs(). This is pre-set to only prune documents that score 0 on the primary query using a standard BM25 score
- Note: these documents are not removed, only indexed in the pruned_docs for the sub-division process
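
For readers unfamiliar with BM25, the rank-then-prune idea can be sketched as follows. This is a minimal, self-contained Okapi BM25 and not the project's util.pyRanker class: the class name SimpleBM25, the whitespace tokenizer, the parameter defaults k1=1.5 and b=0.75, and the smoothed IDF variant are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()

class SimpleBM25:
    """Illustrative Okapi BM25 scorer (hypothetical, not util.pyRanker)."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b

    def fit(self, docs):
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(tokens) for tokens in self.doc_tokens]
        self.n_docs = len(docs)
        total = sum(len(tokens) for tokens in self.doc_tokens)
        self.avgdl = (total / self.n_docs) if total else 1.0
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)
                    for t, n in df.items()}
        return self

    def score(self, query):
        q_terms = tokenize(query)
        scores = []
        for tokens, tf in zip(self.doc_tokens, self.doc_tf):
            # Length normalisation term shared by all query terms for this doc.
            norm = self.k1 * (1 - self.b + self.b * len(tokens) / self.avgdl)
            s = 0.0
            for t in q_terms:
                f = tf.get(t, 0)
                if f:
                    s += self.idf[t] * f * (self.k1 + 1) / (f + norm)
            scores.append(s)
        return scores

docs = [
    "Acme shares rally after strong earnings",
    "Local weather forecast for the weekend",
    "Acme announces new product line",
]
scores = SimpleBM25().fit(docs).score("acme earnings")
# Mirror the "index, don't delete" behaviour: keep indices of non-zero docs.
keep = [i for i, s in enumerate(scores) if s > 0]
print(scores, keep)
```

Only the document with no query terms scores exactly 0, so a zero-score prune is a conservative filter that drops clearly off-topic pages while keeping everything else for sub-division.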

Sub-Dividing

Create a dictionary of sub-docs from the original documents by calling sub_divide() and passing in the Transformer's tokenizer (if needed)
- Note: The tokenizer is needed to ensure that the length of the subdocs created will not exceed the maximum token size used by the Transformer. The Corpus sets the standard DistilBERT tokenizer on initiation.
Rank the newly created subdocs using rank_subdocs() and pass in the ranker
Prune the subdocs with prune_subdocs(). This is pre-set similarly to prune_docs() and prunes subdocs that score 0 using the primary query and standard BM25 score
- Note: Like the prune_docs() function, this does not remove any subdocs, but creates an index of the ones to keep from the subdoc dictionary
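
The token-limit constraint described above can be illustrated with a greedy sentence packer. This is a sketch under stated assumptions, not the project's sub_divide(): the function name split_into_subdocs, the regex sentence splitter, and the whitespace token counter are stand-ins. With a real Hugging Face tokenizer, something like `lambda s: len(tokenizer.encode(s))` could be passed as count_tokens, although per-sentence counts would then only approximate the length of the joined chunk.

```python
import re

def split_into_subdocs(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    Hypothetical helper for illustration. A single sentence longer than
    max_tokens is kept as its own (oversized) chunk rather than being cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Acme beat estimates. Revenue rose ten percent. "
        "Analysts raised targets. The weather was mild.")
chunks = split_into_subdocs(text, max_tokens=7)
print(chunks)
```

Splitting on sentence boundaries keeps each subdoc readable on its own, which matters because each one is later ranked and classified independently.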

Relevant Set

After sub-dividing and pruning out the irrelevant material, make the relevant set by calling make_relevant()
Rank the relevant set with rank_relevant().
- This ranking function is pre-set to run BM25 with Structured Query Expansion. The expanded queries and their weights can be adjusted when calling build_queries()
If needed/wanted, you can further prune the relevant set using prune_relevant()
- Note: unlike the other two pruning functions, this directly adjusts the relevant_set and relevant_scores stored by the corpus
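
One plausible reading of a weighted, expanded query is a weighted sum of per-query scores. The sketch below shows only that combination step; it is an assumption about the general shape, not the project's rank_relevant(). The term-count scorer stands in for BM25, and the function names, queries, and weights are all hypothetical.

```python
def term_count_score(query, doc):
    """Stand-in relevance score: how often the query terms occur in the doc."""
    tokens = doc.lower().split()
    return float(sum(tokens.count(t) for t in query.lower().split()))

def expanded_score(doc, weighted_queries, score_fn=term_count_score):
    """Combine several sub-queries into one score using per-query weights."""
    return sum(w * score_fn(q, doc) for q, w in weighted_queries)

# Primary query first, then expansion queries with smaller weights.
weighted_queries = [("acme", 1.0), ("earnings revenue", 0.5), ("stock shares", 0.25)]
subdocs = [
    "acme earnings beat estimates",
    "acme shares fell as revenue and earnings slipped",
    "weather was mild",
]
scores = [expanded_score(d, weighted_queries) for d in subdocs]
print(scores)
```

Down-weighting the expansion queries lets related vocabulary lift a subdoc's rank without letting it outweigh a direct mention of the company.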

Sentiments

get_sentiments(): call this once a relevant set is established
- Runs each relevant subdoc through the Transformer's classifier to produce a sentiment score

Dataset Initiation

data_preprocess() will set up the needed dictionaries for creating pandas dataframes
- This is also where the scores are calculated, which are simply the sentiment weighted by the relevance
Build the two dataframes with build_fulldf() and build_pricedf()
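
The "sentiment weighted by relevance" calculation can be sketched with pandas as below. The column names, the sample values, the daily bucketing, and the normalisation by total relevance are assumptions for illustration; the project's data_preprocess() may aggregate differently.

```python
import pandas as pd

# Hypothetical per-subdoc results: publication time, sentiment, relevance.
subdocs = pd.DataFrame({
    "published": pd.to_datetime(
        ["2021-11-01 09:00", "2021-11-01 15:00", "2021-11-02 10:00"]),
    "sentiment": [0.8, -0.4, 0.5],
    "relevance": [2.0, 1.0, 4.0],
})

# Score of each subdoc: its sentiment weighted by its relevance.
subdocs["score"] = subdocs["sentiment"] * subdocs["relevance"]

# Aggregate per day so the result lines up with daily price data.
daily = subdocs.groupby(subdocs["published"].dt.floor("D")).agg(
    score_sum=("score", "sum"),
    relevance_sum=("relevance", "sum"),
)
# Relevance-weighted average sentiment for each day (assumed normalisation).
daily["weighted_sentiment"] = daily["score_sum"] / daily["relevance_sum"]
print(daily)
```

Dividing by the summed relevance keeps each day's value on the same scale as the raw sentiment, so days with many articles are not inflated simply by volume.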

Dataset for graphing

Initiate a data_manager object from util.data_manager.
- The data_manager() needs the location of the '_data' directory on initiation
Tell the data_manager to put the data in a retrievable place with data_manager.store_date(), passing in the ticker symbol and dataframes

Graphing

get_ticker_list() calls the data_manager to return the available list of stored tickers
get_pricedf() tells the data_manager to return the stored price_df.csv as a pandas dataframe. This should have everything needed to graph
get_fulldf() returns the full_df.csv. This is all of the data needed for recalculation or for recompiling the relevant documents, as it holds the urls

Note: All source sites used are listed in the references sub-directory. In addition, all papers that were foundational in the creation of the code base are in the references sub-directory in pdf format (for those who want a little light reading).
