This project demonstrates the development of a machine learning model to classify news articles as either fake or real. The project uses Natural Language Processing (NLP) techniques for text preprocessing and a Logistic Regression model for classification. The dataset consists of labeled news articles, and the model achieves high accuracy in distinguishing between fake and real news.
The project is organized into the following structure:
```
Fake_News_Detection/
├── data/        # Contains dataset files (True.csv and Fake.csv)
├── output/      # Saved models and vectorizers after training
├── scripts/     # Python scripts for preprocessing, training, and evaluation
└── README.md    # Project documentation
```
- `data/`: Contains the dataset files (`True.csv` for real news and `Fake.csv` for fake news).
- `output/`: Stores the trained Logistic Regression model (`fake_news_model.pkl`) and TF-IDF vectorizer (`tfidf_vectorizer.pkl`).
- `scripts/`: Contains Python scripts used for preprocessing, training, and evaluation.
The dataset used in this project consists of two CSV files:
- `True.csv`: Articles labeled as real news.
- `Fake.csv`: Articles labeled as fake news.
- Total articles: 44,898
- Real news: ~21,417
- Fake news: ~23,481
- Columns:
  - `title`: The title of the news article.
  - `text`: The body of the news article.
  - `subject`: The category of the news article (e.g., politics, world news).
  - `date`: The publication date of the article.
- Average text length: ~1,993 characters per article.
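A minimal sketch of what the combined dataset looks like, using two invented rows in place of the full `True.csv`/`Fake.csv` files; the column names match the dataset description above, and the labels (`1` = real, `0` = fake) follow the convention used in this project:

```python
import pandas as pd

# Toy stand-ins for data/True.csv (~21,417 rows) and data/Fake.csv (~23,481 rows)
real_df = pd.DataFrame({
    "title": ["Senate passes budget bill"],
    "text": ["The Senate on Thursday passed a budget bill..."],
    "subject": ["politicsNews"],
    "date": ["December 21, 2017"],
})
fake_df = pd.DataFrame({
    "title": ["You won't believe this"],
    "text": ["A shocking claim circulated online..."],
    "subject": ["News"],
    "date": ["December 21, 2017"],
})

# Label each source, then combine into one DataFrame for training
real_df["label"] = 1
fake_df["label"] = 0
df = pd.concat([real_df, fake_df], ignore_index=True)
print(df[["title", "label"]])
```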
The goal of this project is to:
- Preprocess text data using NLP techniques (e.g., tokenization, stopword removal).
- Convert text data into numerical features using TF-IDF Vectorization.
- Train a Logistic Regression model to classify articles as fake or real.
- Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score.
- Programming Language: Python
- Libraries:
- Pandas: Data manipulation and analysis.
- SpaCy: Natural Language Processing for text preprocessing.
- Scikit-learn: Machine learning algorithms and evaluation metrics.
- Joblib: Model serialization.
- Time: Execution time measurement.
- Data Loading:
  - Load `True.csv` and `Fake.csv` into Pandas DataFrames.
  - Combine both datasets into a single DataFrame with labels (`1` for real news, `0` for fake news).
- Text Preprocessing:
  - Convert text to lowercase.
  - Tokenize text into individual words.
  - Remove punctuation, numbers, and stopwords (e.g., "the," "is").
  - Truncate articles to the first 500 words to reduce processing time.
- Feature Extraction:
  - Use TF-IDF Vectorization to transform the cleaned text into numerical features, capped at 2,000 features.
- Model Training:
  - Split the dataset into training (80%) and testing (20%) sets.
  - Train a Logistic Regression model on the training set.
- Evaluation:
  - Evaluate the model on the test set using the accuracy score and a classification report.
- Save Model:
  - Save the trained model (`fake_news_model.pkl`) and vectorizer (`tfidf_vectorizer.pkl`) for future use.
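The steps above can be sketched end to end. This is a simplified stand-in, not the project's actual script: it uses a tiny invented corpus and a regex-based clean-up in place of the SpaCy preprocessing, but the TF-IDF cap, 80/20 split, model choice, and saved artifact names follow the description above.

```python
import re

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Tiny invented corpus standing in for the combined dataset (1 = real, 0 = fake)
texts = [
    "the senate passed a budget bill on thursday",
    "officials confirmed the trade agreement details",
    "the committee approved new infrastructure funding",
    "the court ruled on the appeal this week",
    "lawmakers debated the healthcare proposal at length",
    "shocking secret the media will not tell you",
    "you will not believe what this celebrity did",
    "miracle cure doctors do not want you to know about",
    "anonymous insider reveals explosive conspiracy claims",
    "this one weird trick exposes the hidden truth",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

def clean(text: str) -> str:
    # Simplified preprocessing: lowercase, keep letters only, and truncate
    # to 500 words (the actual script uses SpaCy for tokenization and
    # stopword removal).
    return " ".join(re.findall(r"[a-z]+", text.lower())[:500])

cleaned = [clean(t) for t in texts]

# Feature extraction: TF-IDF with the project's 2,000-feature cap
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(cleaned)

# 80/20 train/test split, then Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: accuracy score and classification report
preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

# Persist both artifacts under the names this project uses in output/
joblib.dump(model, "fake_news_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
```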
The Logistic Regression model achieved the following results on the test set:
| Metric | Value |
|---|---|
| Accuracy | 97.10% |
| Precision | 97% |
| Recall | 97% |
| F1-score | 97% |
```
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       469
           1       0.97      0.97      0.97       429

    accuracy                           0.97       898
   macro avg       0.97      0.97      0.97       898
weighted avg       0.97      0.97      0.97       898
```
- Total preprocessing time: ~291 seconds (~4 minutes and 51 seconds).
- The model performs equally well on both classes (fake and real news), achieving an F1-score of 97% for both classes.
- The high accuracy indicates that TF-IDF features combined with Logistic Regression are effective for this classification task.
Follow these steps to run the project locally:
- Install Python (>=3.7) on your system.
- Install the required libraries using pip, then download the SpaCy English model:
  ```
  pip install pandas spacy scikit-learn joblib
  python -m spacy download en_core_web_sm
  ```
- Clone this repository to your local machine:
  ```
  git clone https://github.com/Jasonpereira0/Projects.git
  cd Projects/Fake_News_Detection/
  ```
- Place the dataset files (`True.csv` and `Fake.csv`) in the `data/` directory.
- Run the script to train the model:
  ```
  python scripts/fake_news_detection.py
  ```
- Check the `output/` directory for the saved models:
  ```
  output/
  ├── fake_news_model.pkl     # Trained Logistic Regression model
  └── tfidf_vectorizer.pkl    # TF-IDF Vectorizer used for feature extraction
  ```
- Use these models to make predictions on new data.
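Scoring a new article with the saved artifacts might look like the following sketch. To keep the snippet self-contained, a two-row toy model is trained and dumped first, standing in for the files produced in `output/`; the key point is reloading with joblib and calling `transform` (never `fit_transform`) on new text:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the artifacts normally written to output/ by training
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([
    "the senate passed a budget bill",
    "shocking secret the media hides",
])
model = LogisticRegression().fit(X, [1, 0])
joblib.dump(model, "fake_news_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

# Later (or in another process): load the saved artifacts and classify
model = joblib.load("fake_news_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

article = "the senate passed a spending bill"
features = vectorizer.transform([article])  # transform, not fit_transform
prediction = model.predict(features)[0]     # 1 = real, 0 = fake
print("real" if prediction == 1 else "fake")
```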
This project can be extended in several ways:
- Implement deep learning models like LSTMs or Transformers (e.g., BERT) for better performance on more complex datasets.
- Include additional features like metadata (e.g., publication date, subject).
- Deploy as a web application using Flask or Streamlit for real-time predictions.
Contributions are welcome! If you have ideas or find issues, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License—see the LICENSE file for details.