- Data cleaning (handling missing values, duplicates).
- Distribution of sentiment classes (positive, neutral, negative).
- Most frequent words in each category.
- Word cloud visualization for different sentiment classes.
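A minimal sketch of the cleaning and EDA steps above, using pandas on a tiny stand-in table (the column names `title` and `sentiment` are assumptions about the dataset):

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the financial news dataset; real column names may differ.
df = pd.DataFrame({
    "title": [
        "Stocks rally as earnings beat estimates",
        "Shares plunge after weak guidance",
        "Company reports quarterly results",
    ],
    "sentiment": ["positive", "negative", "neutral"],
})

# Data cleaning: drop duplicates and rows with missing titles.
df = df.drop_duplicates().dropna(subset=["title"])

# Distribution of sentiment classes.
print(df["sentiment"].value_counts())

# Most frequent words per class (naive whitespace tokenization).
for label, group in df.groupby("sentiment"):
    words = " ".join(group["title"]).lower().split()
    print(label, Counter(words).most_common(3))
```

Word clouds would typically come from the `wordcloud` package, fed the same per-class word frequencies.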
- Convert text into a frequency matrix using CountVectorizer.
- Train models:
  - Naïve Bayes classifier
  - Logistic Regression classifier
- Compare performance using precision, recall, and F1-score.
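A hedged sketch of this baseline with scikit-learn; the six-title corpus is an illustrative stand-in for the real dataset, and in practice the metrics would be computed on a held-out split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    # Frequency matrix via CountVectorizer, then the classifier.
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(titles, labels)
    preds = model.predict(titles)  # use a train/test split on real data
    p, r, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    print(type(clf).__name__, f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```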
- Use TF-IDF Vectorization to represent text.
- Train models:
  - Support Vector Machine (SVM)
  - Random Forest
  - Gradient Boosting (XGBoost, LightGBM)
- Compare performance using precision, recall, and F1-score.
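The TF-IDF stage can be sketched the same way; here scikit-learn's `GradientBoostingClassifier` stands in for XGBoost/LightGBM to keep the example dependency-free:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

models = {
    "SVM": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in; swap in xgboost.XGBClassifier or lightgbm.LGBMClassifier.
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(titles, labels)
    # Macro-F1 on the training titles; use a held-out split in practice.
    print(name, f1_score(labels, pipe.predict(titles), average="macro"))
```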
- Apply Truncated SVD on TF-IDF matrix.
- Reduce feature dimensionality while preserving sentiment-related patterns.
- Train Logistic Regression, SVM, or Random Forest on the LSA-transformed data.
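The LSA pipeline above chains TF-IDF, Truncated SVD, and a classifier; a minimal sketch (with a tiny `n_components` to suit the toy corpus — a few hundred components would be more typical on real data):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# LSA = TF-IDF followed by Truncated SVD; the classifier sees the
# low-dimensional representation instead of the sparse TF-IDF matrix.
lsa_model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(max_iter=1000),
)
lsa_model.fit(titles, labels)
```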
- Use LDA to discover latent topics in financial news titles.
- Check whether sentiment classes align with discovered topics.
- Use topics as additional features for sentiment classification.
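A sketch of the LDA step: fit topics on the count matrix, eyeball topic–sentiment alignment, then append the per-title topic proportions as extra classification features:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

counts = CountVectorizer().fit_transform(titles)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_dist = lda.fit_transform(counts)  # per-title topic proportions

# Check whether sentiment classes concentrate in particular topics.
for label, topic in zip(labels, topic_dist.argmax(axis=1)):
    print(label, "-> topic", topic)

# Topic proportions as additional features alongside the raw counts.
features = np.hstack([counts.toarray(), topic_dist])
clf = LogisticRegression(max_iter=1000).fit(features, labels)
```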
- Convert text to embeddings (Word2Vec, FastText, or GloVe).
- Train an LSTM, GRU, or BiLSTM for sentiment classification.
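Training Word2Vec and an LSTM requires gensim plus a deep-learning framework; as a dependency-free stand-in, the sketch below mean-pools randomly initialized word vectors into a title embedding and fits a linear classifier. The real pipeline would replace the random table with trained Word2Vec/FastText/GloVe vectors and the pooling with an LSTM/GRU/BiLSTM reading the sequence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

rng = np.random.default_rng(0)
vocab = sorted({w for t in titles for w in t.lower().split()})
dim = 16
# Stand-in embedding table; swap in pretrained vectors here.
emb = {w: rng.normal(size=dim) for w in vocab}

def title_vector(title):
    # Mean-pool word vectors; an LSTM would read them sequentially instead.
    vecs = [emb[w] for w in title.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.stack([title_vector(t) for t in titles])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```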
- Fine-tune a pre-trained financial language model:
  - BERT
  - FinBERT (Financial Sentiment BERT)
  - RoBERTa
- Use these models for sentiment classification.
- Compare model performance across:
  - Baseline models (Lexicon, BoW)
  - Machine Learning models (SVM, Random Forest)
  - LSA- and LDA-enhanced models
  - Deep Learning & Transformer models
- Use confusion matrices, ROC curves, and SHAP analysis for feature importance.
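The evaluation step can be sketched with scikit-learn on any fitted probabilistic classifier (SHAP would additionally need the `shap` package on the fitted model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.pipeline import make_pipeline

titles = [
    "profits surge on record sales", "shares jump after strong earnings",
    "stock falls on weak outlook", "losses widen amid slowing demand",
    "company announces quarterly report", "firm schedules investor meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, labels)

# Confusion matrix with a fixed class order.
cm = confusion_matrix(labels, model.predict(titles), labels=model.classes_)
print(cm)

# One-vs-rest ROC AUC from predicted class probabilities (multi-class).
proba = model.predict_proba(titles)
print(roc_auc_score(labels, proba, multi_class="ovr"))
```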
- Using our dataset for transfer learning, the model learns to generate realistic financial headlines conditioned on a sentiment label and an instruction.
- At inference time, generate headlines with prompts like:

      Sentiment: positive
      Instruction: Generate a financial headline that includes at least one financial entity.
      Headline:

- This guides the model to include companies, indexes, or sectors in its output, addressing issues with empty or generic responses.
- Generate multiple headlines per sentiment (e.g., 10 positive, 10 negative, 10 neutral) for content diversification.
- Useful for data augmentation, report writing, or synthetic dataset creation.
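The conditioning prompts can be built programmatically before being fed to the generator; `build_prompt` below is a hypothetical helper following the prompt format above:

```python
def build_prompt(sentiment: str) -> str:
    """Build a conditioning prompt in the format used at inference time."""
    return (
        f"Sentiment: {sentiment} "
        "Instruction: Generate a financial headline that includes "
        "at least one financial entity. Headline:"
    )

# e.g., 10 prompts per sentiment class for batched generation
prompts = [
    build_prompt(s)
    for s in ("positive", "negative", "neutral")
    for _ in range(10)
]
print(len(prompts))  # 30
```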