An automated system designed to monitor FDA press announcements and extract key food safety information using AI-powered analysis.
The FDA Food Safety Monitor is a comprehensive tool that:
- Scrapes FDA press announcements for food safety-related news
- Uses AI/NLP techniques to extract structured information
- Generates detailed reports in CSV and JSON formats
- Provides summary statistics and insights
- Automated Web Scraping: Monitors FDA press announcements within specified date ranges
- Content Verification Logging: Logs raw HTML content from the first article for verification
- Keyword Filtering: Identifies relevant articles using configurable keywords
- AI-Powered Extraction: Extracts structured information including:
- Product names
- Contaminants/recall reasons
- Affected regions
- Company names
- Multiple Output Formats: Generates both CSV and JSON reports
- Summary Statistics: Provides insights on companies, contaminants, and trends
- Organized Output: All files are automatically saved to the /report folder
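The keyword-filtering step can be sketched as a simple case-insensitive match against the configured keyword list. The function name `is_relevant` is illustrative, not taken from the project code; `KEYWORDS` mirrors the value shown in config.py below.

```python
# Hypothetical sketch of the keyword filter; KEYWORDS mirrors config.py.
KEYWORDS = ["food", "recall", "outbreak", "salmonella", "listeria", "e. coli", "allergen"]

def is_relevant(article_text, keywords=KEYWORDS):
    """Return True if any configured keyword appears in the article text."""
    text = article_text.lower()
    return any(kw in text for kw in keywords)
```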
09-Indivior/
├── config.py # Configuration settings
├── main.py # Main orchestrator
├── website_scraper.py # Web scraping module
├── ai_processor.py # AI processing module
├── requirements.txt # Dependencies
├── README.md # This file
└── report/ # Output folder for all generated files
├── scraped_articles.json
├── final_report_YYYYMMDD_HHMMSS.csv
└── final_report_YYYYMMDD_HHMMSS.json
- Clone or download the project files
- Install dependencies:
  pip install -r requirements.txt
- Verify installation:
  python main.py
Edit config.py to customize the monitoring parameters:
# URLs to monitor
URL_LIST = ["https://www.fda.gov/news-events/fda-newsroom/press-announcements"]
# Date range for article search
START_DATE = "2024-01-01"
END_DATE = "2024-12-31"
# Keywords to identify relevant articles
KEYWORDS = ["food", "recall", "outbreak", "salmonella", "listeria", "e. coli", "allergen"]

Run the complete automated workflow:
python main.py

Web Scraping Only:
python website_scraper.py
- Generates: scraped_articles.json

AI Processing Only (requires scraped articles):
python ai_processor.py
- Generates: final_report_YYYYMMDD_HHMMSS.csv and final_report_YYYYMMDD_HHMMSS.json
All output files are saved in the /report folder.

scraped_articles.json contains the raw scraped article data:
[
{
"title": "Voluntary Recall of Specific Cheese Brand Due to Listeria Concerns",
"url": "https://www.fda.gov/example-article-1",
"publication_date": "2024-07-15",
"full_text": "The full text of the press release..."
}
]

The final reports (CSV and JSON) contain the following structured fields for each article:
- Title
- URL
- Publication Date
- Product Name
- Contaminant/Reason
- Affected Regions
- Company Name
- Processing Date
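A minimal sketch of how reports with these fields could be written. The helper name `write_reports` is an assumption; only the column names and the timestamped filename pattern come from the sections above.

```python
import csv
import json
import os
from datetime import datetime

# Column names assumed from the field list above.
FIELDS = ["Title", "URL", "Publication Date", "Product Name",
          "Contaminant/Reason", "Affected Regions", "Company Name",
          "Processing Date"]

def write_reports(rows, out_dir="report"):
    """Write the rows to timestamped CSV and JSON files in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_path = os.path.join(out_dir, f"final_report_{stamp}.csv")
    json_path = os.path.join(out_dir, f"final_report_{stamp}.json")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return csv_path, json_path
```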
The system includes content verification logging that:
- Captures the raw HTML content from the first article found during scraping
- Displays the first 500 characters of the HTML for verification
- Stops execution after the first article to prevent console flooding
- Helps verify that the scraping process is working correctly
Example output:
--- Verifying Content for URL: https://www.fda.gov/example-article ---
First 500 characters of raw HTML content:
<!DOCTYPE html>
<html lang="en">
<head>
<title>FDA Recall Notice</title>
</head>
<body>
<div class="field-item">
<p>Company XYZ is voluntarily recalling products due to contamination...</p>
--- End of Content Verification ---
*** Stopping execution after first article for content verification ***
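The verification logging described above could look roughly like this; the function name and exact wording are illustrative, not lifted from the project code.

```python
def log_first_article(url, html, limit=500):
    """Print a verification banner plus the first `limit` chars of raw HTML."""
    snippet = html[:limit]
    print(f"--- Verifying Content for URL: {url} ---")
    print(f"First {limit} characters of raw HTML content:")
    print(snippet)
    print("--- End of Content Verification ---")
    return snippet
```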
The AI processor applies pattern matching and NLP techniques that:
- Identifies specific food products mentioned in recalls
- Handles various naming conventions and formats
- Detects common foodborne pathogens (Listeria, Salmonella, E. coli)
- Identifies allergens and contamination types
- Recognizes foreign material contamination
- Extracts manufacturer and distributor information
- Handles various company naming formats
- Identifies US states and regions affected by recalls
- Parses distribution information
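As one illustration, pathogen detection of this kind is often implemented with case-insensitive regular expressions. The patterns below are a sketch, not the actual patterns used in ai_processor.py.

```python
import re

# Illustrative patterns for common foodborne pathogens.
PATHOGEN_PATTERNS = {
    "Listeria": re.compile(r"\blisteria\b", re.IGNORECASE),
    "Salmonella": re.compile(r"\bsalmonella\b", re.IGNORECASE),
    "E. coli": re.compile(r"\be\.?\s*coli\b", re.IGNORECASE),
}

def find_contaminants(text):
    """Return the names of all pathogens whose pattern matches the text."""
    return [name for name, pattern in PATHOGEN_PATTERNS.items()
            if pattern.search(text)]
```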
- Configuration: Set date range to monitor last 6 months
- Scraping: System finds 50 food safety articles
- Processing: AI extracts structured data from each article
- Reporting: Generate comprehensive reports showing:
- Top companies with recalls
- Most common contaminants
- Geographic distribution of issues
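Summary statistics like these can be computed with a couple of counters over the extracted records. The field names follow the report columns listed earlier; the function itself is a sketch, not the project's implementation.

```python
from collections import Counter

def summarize(records, top_n=5):
    """Count recalls per company and per contaminant across the records."""
    companies = Counter(r["Company Name"] for r in records
                        if r.get("Company Name"))
    contaminants = Counter(r["Contaminant/Reason"] for r in records
                           if r.get("Contaminant/Reason"))
    return companies.most_common(top_n), contaminants.most_common(top_n)
```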
The system includes robust error handling for:
- Network connectivity issues
- Malformed HTML content
- Missing article data
- File I/O errors
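For the network-related cases, a typical pattern is to wrap each request and fall back to None on failure. This is a sketch of that pattern; the project's actual handling may differ.

```python
import requests

def safe_get(url, timeout=10.0):
    """Fetch a URL, returning the body text or None on any request failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures too
        return response.text
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
```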
- requests: Web scraping and HTTP requests
- beautifulsoup4: HTML parsing
- pandas: Data processing and analysis
- Standard Python libraries: datetime, json, csv, re, os, sys
- FDA Website Blocking (404 Error):
  - The FDA website may block automated requests
  - Solution: Set TEST_MODE = True in config.py to use sample data
  - Alternative: Try running the script later or from a different network
- No articles found: Check the date range and keywords in config.py
- Network errors: Verify internet connection and FDA website availability
- Processing errors: Ensure scraped_articles.json exists and is valid
When the FDA website is blocking requests, you can use test mode:
# In config.py
TEST_MODE = True  # Use sample data instead of scraping

This will generate realistic sample data for testing the AI processing functionality.
For detailed debugging information, run individual components:
python website_scraper.py # Check scraping results
python ai_processor.py      # Check processing results

To monitor additional keywords, extend KEYWORDS in config.py:
KEYWORDS = ["food", "recall", "outbreak", "new_keyword"]

To monitor additional sources, extend URL_LIST in config.py:
URL_LIST = [
    "https://www.fda.gov/news-events/fda-newsroom/press-announcements",
    "https://additional-url.gov/announcements"
]

Modify ai_processor.py to add new extraction patterns or improve existing ones.
To contribute to this project:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is intended for educational and research purposes. Please review FDA website terms of use before extensive scraping.
For issues or questions, please check the troubleshooting section or create an issue in the project repository.
Note: This system is designed to monitor publicly available FDA press announcements. Always verify critical information with official FDA sources.