Team AHR
Members
● Anthony Petrotte ([email protected])
● Hrishikesh Deshmukh ([email protected])
● Rahul Jonnalagadda ([email protected])
All project deliverables are located in the deliverables folder.
Link to Software Usage Tutorial Presentation
The goal of our project is, first, to analyze the financial news cycle over different time intervals to create a time-interval sentiment metric for particular global securities. Second, we compile the intermediate sentiment results from the metric calculation into a time-series dataset that can be compared to price movement in the underlying security. This dataset has many potential uses, including helping to identify securities driven by trades focused around the news cycle. Another interesting use could be identifying securities that are more prone to volatility from volume in individual (and potentially more naïve) investors trading in a more impulsive/emotional manner due to recently read information.
By collecting a corpus of recent news focused on a specific company/security and trimming it down into successively smaller relevant sub-documents, we aim to create a time-series sentiment calculation that can be compared to the price of the company/security. Although the collected data may have many potential uses, we have opted to use the data graphically as a time series. If you are familiar with the concept of a stock price indicator/oscillator, the resulting dataset can be used in a similar manner. This would be a useful tool for stock screening, and could be implemented as an addition to a computational trading strategy.
The general methodology for using this code base is to 1) build data and 2) visualize the data you have compiled.
The data_creation notebook serves as a fully functional example of how the code base is used to pipeline local datasets to _data.
We have provided a plotly dash app to interactively graph the datasets you build!
There are two ways to run the dash app in the repo:
- python app.py: run from the terminal, which will automatically open your browser to the correct localhost address
- dash_app.ipynb: a Jupyter notebook version of app.py. When the cell for the dash app is run, it will give you a hyperlink to the localhost address of your dashboard
In addition, there are plenty of pre-compiled datasets in the _data directory which can already be viewed.
This section highlights what is needed to run locally and helpful links to get you started
This project has been implemented with Python and uses the following open libraries (listed with their use):
- Data Processing
  - numpy
  - pandas
  - sklearn
  - scipy
  - re
  - os
  - math
- Online Data Retrieval
  - BeautifulSoup
  - requests
  - urllib
- Date/time formatting
  - datetime
  - dateutil
  - time
  - timeit
- DNN for NLP (sentiment analysis)
  - torch
  - transformers
- Graphing/Utilizing
  - plotly
  - dash
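Each of the third-party libraries can be installed via pip or by most typical methods used for installing Python libraries; re, os, math, urllib, datetime, time, and timeit ship with Python's standard library. Note that some pip package names differ from the import names: scikit-learn (sklearn), beautifulsoup4 (BeautifulSoup), and python-dateutil (dateutil).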
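A quick way to confirm the environment is ready is to check which of the third-party libraries listed above can be imported. The following is a minimal sketch and is not part of the repo: the helper name `find_missing` is hypothetical, and the module names are the import names of the listed libraries (BeautifulSoup is imported from the `bs4` package).

```python
from importlib.util import find_spec

# Import names of the third-party libraries listed above.
# BeautifulSoup lives in the "bs4" package.
REQUIRED_MODULES = [
    "numpy", "pandas", "sklearn", "scipy",
    "bs4", "requests", "dateutil",
    "torch", "transformers", "plotly", "dash",
]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

Any name it reports can then be installed with pip before running the notebooks or the dash app.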
- API to get news data
- API to get ticker information
- Polygon.io
  - Free for news (pricey for market data)
  - Best option for quickly getting started compiling news
  - Very large page size limit (1000)
  - Not good for company info, especially outside of the US
- AlphaVantage
  - Free
  - Easy
- NewsApi (formerly google, now NewsAPI.org)
  - Free Developer Version
  - Only past 1 month's news
  - Daily call limit: 100
  - Page size limit: 100
- Currents API
  - Free
  - Daily call limit: 600
  - Page size limit: 200
- Usearch (formerly contextual_web, now Web Search on RapidAPI)
  - Needs a RapidAPI account
  - Freemium
  - Daily call limit: 100
  - Page size limit: 50
  - Rate limit: 1 per second
- Yahoo Finance
  - Good for quick price info
  - Was previously good for getting company info before it suddenly stopped working
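To compare the free tiers, it can help to turn the listed limits into a rough daily ceiling on results. The sketch below is only back-of-the-envelope arithmetic. It assumes every call returns a full page and that "page size limit" means the maximum results per call; the function name is hypothetical.

```python
# Free-tier limits as listed above: (daily call limit, page size limit).
FREE_TIER_LIMITS = {
    "NewsApi": (100, 100),
    "Currents API": (600, 200),
    "Usearch": (100, 50),
}


def max_results_per_day(daily_calls, page_size):
    """Upper bound on results per day, assuming every call fills a page."""
    return daily_calls * page_size


for name, (calls, page) in FREE_TIER_LIMITS.items():
    print(f"{name}: up to {max_results_per_day(calls, page):,} results/day")
# NewsApi: up to 10,000 results/day
# Currents API: up to 120,000 results/day
# Usearch: up to 5,000 results/day
```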
The API keys need to be put into the config.py file under their respective variables. Please see the details in the util folder for setting up the config.
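As an illustration of keeping keys in one place, here is a minimal sketch of a key holder that reads its values from environment variables. It is not the repo's config.py: the class name `ApiConfig`, the attribute names, and the environment variable names are all hypothetical, so follow the util folder for the actual variable names.

```python
import os
from dataclasses import dataclass, field


def _env(name):
    """Build a dataclass field whose default is read from an environment variable."""
    return field(default_factory=lambda: os.environ.get(name, ""))


@dataclass
class ApiConfig:
    """Hypothetical key holder; attribute names are illustrative only."""
    polygon_key: str = _env("POLYGON_API_KEY")
    alphavantage_key: str = _env("ALPHAVANTAGE_API_KEY")
    newsapi_key: str = _env("NEWSAPI_API_KEY")
    currents_key: str = _env("CURRENTS_API_KEY")
    rapidapi_key: str = _env("RAPIDAPI_KEY")

    def missing(self):
        """Return the names of keys that have not been set."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    cfg = ApiConfig()
    if cfg.missing():
        print("Keys not set:", ", ".join(cfg.missing()))
```

Reading keys from the environment is one design choice among several; its main benefit is that keys stay out of version control.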
This section goes into detail about the method used to compile web results and extract sentiments, and serves as a how-to guide for using this code base.
In addition, the img subdirectory contains two diagrams: one of the entire model and a more focused one of the document to relevant sub-document process.
- Pull in API keys from config, making a config variable: config=config()
  - This variable will need to be passed to any class that needs to call an API
- Create a Ticker() object, passing in config
- Initiate a Corpus() object
- Initiate a web_query object
  - query_all(): runs through the available APIs and makes threaded calls to collect result URLs
  - compile_results(): combines the results of the API responses and deletes duplicate URLs
  - scrape_results(): scrapes the list of URLs and compiles the raw website text to be given to the Corpus
  - get_results(): returns the stored dataframe of URLs and text
- Use the set_results() function to store the results from the web_query in the corpus for processing (needed for dataset building)
- Use set_corpus() with the web_query documents and URLs
- Initiate the ranking function (in our case, a custom-built BM25 object class from util.pyRanker)
- Fit the ranker to the corpus documents
- Build the queries used by the ranking function using build_queries(), passing in the ticker object
- Rank corpus documents for relevance with rank_docs(), passing in the ranker
- Prune the documents with prune_docs(). This is pre-set to prune only 0-ranked documents on the primary query using a standard BM25 score
  - Note: these documents are not removed, only indexed in the pruned_docs for the sub-division process
- Create a dictionary of sub-docs from the original documents by calling sub_divide() and passing in the Transformer's tokenizer (if needed)
  - Note: The tokenizer is needed to ensure that the length of the subdocs created will not exceed the maximum token size used by the Transformer. The Corpus sets the standard distil-BERT tokenizer on initiation.
- Rank the newly created subdocs using rank_subdocs(), passing in the ranker
- Prune the subdocs with prune_subdocs(). This is pre-set similarly to prune_docs() and prunes 0-ranked subdocs using the primary query and standard BM25 score
  - Note: Like the prune_docs() function, this does not remove any subdocs, but creates an index of the ones to keep from the subdoc dictionary
- After sub-dividing and pruning out all the trash, make the relevant set by calling make_relevant()
- Rank the relevant set with rank_relevant().
  - This ranking function is pre-set to run a BM25 with Structured Query Expansion. The expanded query and its weights can be adjusted when calling build_queries()
- If needed/wanted, you can further prune the relevant set using prune_relevant()
  - Note: unlike the other two pruning functions, this directly adjusts the relevant_set and relevant_scores stored by the corpus
- get_sentiments(): call once a relevant set is established
  - Uses the Transformer's classifier, running each relevant subdoc through it to produce a sentiment score
- data_preprocess() will set up the needed dictionaries for creating pandas dataframes
  - This is also where the scores are calculated, which are simply the sentiment weighted by the relevance
- Build the two dataframes with build_fulldf() and build_pricedf()
- Initiate a data_manager object from util.data_manager.
  - The data_manager() needs the location of the '_data' directory on initiation
- Tell the data_manager to put the data in a retrievable place with data_manager.store_date(), passing in the ticker symbol and dataframes
- get_ticker_list() calls the data_manager to return the list of available stored tickers
- get_pricedf() tells the data_manager to return the stored price_df.csv as a pandas dataframe. This should have everything needed to graph
- get_fulldf() returns the full_df.csv. This holds all of the data needed for recalculation or for recompiling the relevant documents, as it holds the URLs
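The ranking, pruning, and scoring steps above can be sketched in a few lines of plain Python. This is a standalone illustration and not the repo's Ranker or Corpus code. All function names are hypothetical, and the BM25 formula is a common Okapi variant with a non-negative IDF. The sentiment values are made-up stand-ins for classifier output, and "weighted by the relevance" is assumed here to mean a simple product.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with an Okapi BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def prune_zero(scores):
    """Return indices of docs to keep (non-zero score); docs are not removed."""
    return [i for i, s in enumerate(scores) if s > 0]


def weighted_scores(sentiments, relevance, keep):
    """Weight each kept sub-doc's sentiment by its relevance (assumed: a product)."""
    return {i: sentiments[i] * relevance[i] for i in keep}


subdocs = [
    "acme shares rally after strong earnings".split(),
    "weather forecast calls for rain this weekend".split(),
    "acme faces lawsuit over product recall".split(),
]
query = ["acme", "earnings"]
sentiments = [0.9, 0.1, -0.8]  # hypothetical classifier outputs in [-1, 1]

relevance = bm25_scores(query, subdocs)
keep = prune_zero(relevance)
scores = weighted_scores(sentiments, relevance, keep)
print(keep)  # [0, 2]
```

The second sub-document shares no terms with the query, so it scores zero and is left out of the index of documents to keep, mirroring how the pruning functions index rather than delete.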
Note: All source sites used are in the references sub-directory. In addition, all papers that were foundational in the creation of the code base are in the references sub-directory in PDF format (for those who want a little light reading).