DEEP LEARNING PROJECT OUTLINE

Title: Fin-BERT 2.0

Who: Jimmy Phelan, Bilal Ali, Tony Zhao, Nathan Kwei (Jgphelan, Bali4, tbzhao, nkwei)

(Bilal) Introduction: What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

We are implementing an existing paper on financial sentiment analysis with pre-trained language models. The paper’s objective was to attain high accuracy in assessing and classifying financial text despite the scarcity of labeled data in that domain. The paper uses pre-trained language models because they require fewer labeled examples and can be fine-tuned on domain-specific text. We chose this paper because we wanted more experience fine-tuning language models and solving specialized NLP tasks.

What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.

This problem involves predicting sentiment labels from text, which is a classification problem.

(Tony) Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.

This article surveys applications of NLP task-solving in the financial world. The following applications are outlined as real-world problems that can be addressed through NLP: financial text analysis, risk assessment, structuring data, portfolio selection, stock prediction, customer management, and accounting and audits. NLP allows financial institutions to attack problems in novel ways, such as gauging the sentiment of texts to predict how markets may react to news.
This applies directly to the paper we are addressing in our project, since the model attempts to predict how pieces of news affect stock prices through sentiment analysis. The paper Financial Sentiment Analysis using FinBERT (Yang, UY, and Huang) is also interesting, as it applies a version of BERT fine-tuned on a financial corpus to achieve far superior accuracy in sentiment analysis for predicting finance-related outcomes.

In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.

Related implementations:
- GitHub Repository by ProsusAI: The official implementation of FinBERT, including training scripts, pre-trained models, and usage examples: ProsusAI/finBERT
- Sentiment Analysis on Financial News Headlines: A project utilizing FinBERT for analyzing sentiment in financial news headlines: GitHub Repository
- FinBERT by Yi Yang et al.: A separate implementation focusing on financial communications, including fine-tuned models for sentiment analysis, ESG classification, and forward-looking statements: yya518/FinBERT
- Financial News Sentiment Analysis: Another project utilizing FinBERT for analyzing sentiment in financial news headlines: Raviraj2000/Financial-News-Sentiment-Analysis-using-FinBERT

(Jimmy) Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?

We are using three datasets directly from the paper: TRC2-Financial, the Financial PhraseBank, and the FiQA Sentiment Dataset. However, there are some extra preprocessing steps we will have to take given the limited RAM available on Google Colab.

The TRC2-Financial corpus is a subset of Reuters’ TRC2 data, filtered to focus specifically on financial content. It contains 46,143 documents, over 29 million words, and nearly 400,000 sentences. We will have to filter the content and almost certainly sample the data.

The Financial PhraseBank is a dataset of 4,845 English sentences drawn from LexisNexis financial news articles. Each sentence is annotated by 16 experts with a sentiment label (positive, neutral, or negative) based on its potential impact on a company’s stock price. The dataset is relatively small, with around 3,101 examples typically used for training after splitting out portions for validation and testing, and will require only standard text cleaning and tokenization with the BERT tokenizer for our purposes.

Finally, the FiQA Sentiment Dataset, developed for a financial opinion mining and question answering challenge, totals 1,174 financial news headlines and tweets. Each entry is annotated with a continuous sentiment score in [-1, 1], as well as information about the target financial entity. Like the PhraseBank, standard text cleaning and tokenization will suffice for preprocessing, but the continuous targets may require slight adjustments in the model setup for either regression or classification tasks.
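One way to handle FiQA's continuous targets when training the classifier variant is to bin the scores into the PhraseBank's three labels. A minimal sketch; the ±0.1 neutral band is our own assumption for illustration, not a value from the paper:

```python
def score_to_label(score, neutral_band=0.1):
    """Map a FiQA sentiment score in [-1, 1] to a discrete label.

    Scores within +/- neutral_band of zero are treated as neutral;
    the band width is a tunable assumption, not a value from the paper.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score {score} outside [-1, 1]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

labels = [score_to_label(s) for s in (0.73, -0.41, 0.05)]
# -> ["positive", "negative", "neutral"]
```

The alternative, per the paper's setup, is to keep the scores continuous and train the regression head instead, in which case no binning is needed.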

(Jimmy) Methodology: What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

We’ll start with BERT. BERT uses a stack of Transformer encoders, each with multi-head self-attention and feed-forward layers. For sentiment analysis we’ll add a dense layer on top of BERT that takes the final hidden state corresponding to the [CLS] token as its input. This head can handle either task: for classification it maps the [CLS] representation to the aforementioned labels (positive, negative, neutral) using cross-entropy loss; for regression it predicts a continuous sentiment score, in which case we swap cross-entropy for mean squared error. For the training methodology, we will follow the paper's guidelines. Namely:

Gradual Unfreezing: We start by fine-tuning only the classifier layer while keeping the BERT layers frozen. Then, in stages, we progressively unfreeze the next lower layer.
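The head described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the paper's exact code: the 768 hidden size and 0.1 dropout are assumptions based on bert-base, and the BERT encoder itself is stubbed out with a random [CLS] vector:

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Dense head on top of BERT's final [CLS] hidden state.

    num_labels=3 gives a classifier (positive/negative/neutral);
    num_labels=1 gives a regressor for continuous FiQA-style scores.
    """
    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_hidden_state):
        # cls_hidden_state: (batch, hidden_size), BERT's output at [CLS]
        return self.classifier(self.dropout(cls_hidden_state))

# Classification: cross-entropy over the three labels
head = SentimentHead(num_labels=3)
cls_vec = torch.randn(4, 768)            # stand-in for BERT's [CLS] output
logits = head(cls_vec)                   # shape (4, 3)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 1]))

# Regression: mean squared error on a continuous score
reg_head = SentimentHead(num_labels=1)
scores = reg_head(cls_vec).squeeze(-1)   # shape (4,)
mse = nn.MSELoss()(scores, torch.zeros(4))
```

In the real model the stand-in `cls_vec` would be the [CLS] slice of the pre-trained BERT encoder's last hidden layer.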

Discriminative Fine-Tuning: Different parts of the model are updated at different learning rates. In particular, we will use lower learning rates for the lower layers of BERT and higher learning rates for the top layers. We will also use a schedule in which the learning rate first increases linearly up to a peak and then decreases linearly.

For the training environment, our implementation in Google Colab will require adjustments. We will need to reduce batch sizes or sequence lengths and keep memory constraints in mind during preprocessing. The paper originally used an Amazon p2.xlarge EC2 instance with one NVIDIA K80 GPU, 4 vCPUs, and 64 GiB of memory. Challenges will include memory constraints and generally resource-intensive pre-training. Implementing the learning-rate schedule and gradual unfreezing will be challenging, as we haven’t had extensive experience implementing these. The authors’ provided codebase should be a solid resource to reference should we run into any snags.
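The three training techniques (gradual unfreezing, discriminative learning rates, and the warmup/decay schedule) can be sketched as below. This is our own illustrative sketch, not the paper's code: the 12-layer count matches bert-base, but the base learning rate, decay factor, and warmup fraction are placeholder assumptions, and the encoder layers are stand-in modules:

```python
import torch
import torch.nn as nn

def discriminative_lrs(num_layers=12, top_lr=2e-5, decay=0.85):
    """Give lower BERT layers geometrically smaller learning rates."""
    # index 0 = lowest layer, index num_layers-1 = top layer
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def triangular_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr, then linear decay back to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Gradual unfreezing: everything starts frozen except the classifier head;
# each stage unfreezes one more layer from the top down.
layers = [nn.Linear(8, 8) for _ in range(12)]   # stand-ins for encoder layers
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False

def unfreeze_top(layers, n):
    """Unfreeze the top n encoder layers."""
    for layer in layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

unfreeze_top(layers, 1)                          # stage 1: top layer only

# One optimizer parameter group per layer, each with its own learning rate
lrs = discriminative_lrs()
optimizer = torch.optim.AdamW(
    [{"params": layer.parameters(), "lr": lr} for layer, lr in zip(layers, lrs)]
)
```

At each training step the per-group learning rates would then be scaled by `triangular_lr(step, total_steps)` relative to their peak values.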

(Bilal) Metrics: What constitutes “success?” What experiments do you plan to run?

We plan to run our model against the same datasets the paper used (within our computational limits) to see if our model can achieve similar results.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

The notion of accuracy does apply to our project. Our evaluation of success will use metrics similar to the paper’s:
Accuracy: how well our model assesses the sentiment of financial text against the provided labels.
Cross-entropy loss: weighted by the square root of the inverse frequency of each label, to account for high-frequency labels.
Macro F1 average: the average F1 (precision/recall) score over each class, which measures classification performance despite label imbalance.

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

The authors were hoping to achieve state-of-the-art sentiment scoring on financial texts, quantified on FiQA sentiment scoring and the Financial PhraseBank.

What are your base, target, and stretch goals?
Base: get within 10% of the accuracy the model in the paper attained.
Target: get within 5% of the accuracy the model in the paper attained.
Stretch: equal the paper’s accuracy (ambitious due to computational constraints).

(Nathan) Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

What broader societal issues are relevant to your chosen problem space?

Why is Deep Learning a good approach to this problem?

Deep learning is a good approach to this specific problem because of features unique to financial sentiment analysis. While general sentiment analysis and classification can rely on commonly and culturally accepted notions of which words correspond to which sentiment, finance is different: the sentimental meaning of words and phrases depends heavily on financial context, phrasing, jargon, and domain-specific information. Machine learning approaches like bag-of-words already perform worse than deep learning approaches, and these deficiencies become even more pronounced in a financial context, where there are fewer ‘indicator words’ that directly imply a specific sentiment. As such, deep learning is well suited to this task. Additionally, this problem involves limited labeled data. Using BERT and pre-training means we do not need huge amounts of labeled financial data, and the relations BERT has already learned are retained. By reusing BERT’s pre-trained deep learning architecture, we can find strong results with limited labeled financial data.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

Some of the major stakeholders in this problem are investors, traders, financial analysts, banks and other financial institutions, news sources, and the general public. Because the goal of FinBERT is to classify financial texts by sentiment with respect to a financial entity, the model can have major effects on these stakeholders. FinBERT automates a classification that all of these stakeholders must already perform. Given that FinBERT has shown accurate results, it could not only save these stakeholders time but also inform their decision making. Those decisions go on to actually move financial markets, which have an impact on virtually everybody in the world.
While FinBERT likely shouldn’t be the only factor in these decision makers’ hands, it can still have a large and real effect. As a result, mistakes in the FinBERT model can have serious consequences: traders might make poor investment decisions, automated systems might react incorrectly to news, and analysts could be misled about market sentiment. Others might then follow suit, which could lead to market inefficiencies or increased volatility.

(Together) Division of labor: Briefly outline who will be responsible for which part(s) of the project.

As of right now, this is what we were thinking. It’s definitely possible that we forgot some of the lower-level tasks, but we added all the main ones we thought of.

Preprocessing Data/Pre-training (Bilal and Tony)
- Data collection, including cutting some of the data (Tony)
- Split data into train/validate/test sets (Bilal)
- Tokenize (Tony)
- Normalize labels (Bilal)
- Pretrain model (Tony and Bilal)
Training/Testing/Coding out architecture (Nathan and Jimmy)
- Gradual Unfreezing (Jimmy)
- Discriminative Fine-Tuning (Nathan and Jimmy)
- Add classification heads (Nathan)
