This file documents which scrapers are proven to work and should be used in the multi-agent pipeline, and which ones are historical/failed attempts that should be avoided.
These scrapers are proven successful and are actively used in backend/agents/agent_framework.py:
- Scraper: government_services_scraper
- Success Rate: 100% (109/109 records)
- Data Source: https://www.directory.gov.au/enquiry-lines
- Method: Unstructured web scraping with pagination
- Output: `data/raw/government_services.csv`
- Quality: Federal government phone numbers (1800, 1300, 13X numbers)
- Status: ✅ ACTIVE - Primary federal services scraper
- Scraper: nsw_hospitals_agent
- Success Rate: 100% (266/266 records)
- Data Source: NSW Health API via data.gov.au
- Method: Structured API data download
- Output: `data/raw/nsw_hospitals.csv`
- Quality: Complete NSW hospital dataset with phones, addresses, districts
- Status: ✅ ACTIVE - Best performing scraper
- Scraper: scamwatch_threat_agent
- Success Rate: 100% (13/13 threat indicators)
- Data Source: https://www.scamwatch.gov.au/news-alerts
- Method: Threat intelligence extraction from news articles
- Output: `data/raw/scamwatch_threats.csv`
- Quality: Verified scam phone numbers and organization impersonation data
- Status: ✅ ACTIVE - Essential for threat detection
- Scraper: acnc_data_agent
- Success Rate: 100% (12/12 organizational records)
- Data Source: https://data.gov.au ACNC CSV bulk dataset
- Method: Organizational verification (names, ABNs, addresses, purposes)
- Output: `data/raw/acnc_charities_picton.csv`
- Quality: Verified charity organizations for anti-scam verification
- Limitation: ⚠️ Contact details (phone/email/website) blocked by JavaScript protection
- Details: See `BUG_REPORT_ACNC.md` for technical analysis
- Status: ✅ ACTIVE - Proven charity data extraction
- Scraper: nsw_correct_scraper
- Success Rate: 90% (9/10 agencies)
- Data Source: https://www.service.nsw.gov.au/nswgovdirectory/
- Method: Two-stage unstructured scraping (directory → agency pages)
- Output: `data/raw/nsw_correct_directory.csv`
- Quality: NSW government agencies with phone/email/website/addresses
- Status: ✅ ACTIVE - Uses correct URL pattern that works
These scrapers had issues and should NOT be added to the pipeline:
- Issue: Limited to 10 agencies, incomplete implementation
- Status: ❌ DEPRECATED - Use `nsw_correct_scraper.py` instead
- Issue: Wrong URL patterns, low success rate
- Status: ❌ DEPRECATED - Use `nsw_correct_scraper.py` instead
- Issue: Direct website scraping without API foundation
- Status: ❌ DEPRECATED - Use `acnc_data_agent.py` instead
- Issue: Over-engineered version with reliability issues
- Status: ❌ DEPRECATED - Use `acnc_data_agent.py` instead
- Issue: Requires existing CSV input, not standalone
- Status: ❌ UTILITY ONLY - Not for pipeline, used by acnc_data_agent
When adding a new scraper to the pipeline, ensure:
- ✅ Update the `collector_tasks` list in `agent_framework.py`: `collector_tasks = [ ('scraper_name', 'backend/agents/scraper_file.py'), ]`
- ✅ Register the agent in the `main()` function: `coordinator.register_agent(CollectorAgentProxy("scraper_name", "backend/agents/scraper_file.py"))`
- ✅ Update the output path to save in `data/raw/`: `def save_to_csv(self, data, filename='data/raw/output.csv'):`
- ✅ Update `data_standardizer.py` to process the new data:
  - Add the filepath to the relevant standardization method
  - Update Path references to use `data/raw/filename.csv`
- ✅ Test the end-to-end pipeline with `python backend/run_pipeline.py`
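
The first and third checklist steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `ExampleScraper`, the `example_scraper` entry, and the `example_output.csv` filename are hypothetical. Only the `(name, script path)` tuple shape and the `save_to_csv(self, data, filename=...)` signature come from the checklist.

```python
import csv
from pathlib import Path

# Hypothetical entry, in the (name, script path) shape the checklist shows
# for the collector_tasks list in agent_framework.py.
collector_tasks = [
    ("example_scraper", "backend/agents/example_scraper.py"),
]


class ExampleScraper:
    """Hypothetical scraper illustrating the data/raw/ output convention."""

    def save_to_csv(self, data, filename="data/raw/example_output.csv"):
        """Write a list of dict records to CSV, creating the folder if needed."""
        if not data:
            return None  # nothing scraped, so no file is written
        path = Path(filename)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="", encoding="utf-8") as f:
            # Column order follows the keys of the first record.
            writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)
        return path
```

Defaulting `filename` to a path under `data/raw/` keeps the scraper's output where the checklist says `data_standardizer.py` should look for it, while still letting a caller override the location when testing.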
For reference, proven scrapers achieve:
- Success Rate: 90-100% data extraction
- Data Quality: Grade A (95%+ quality score from Critic Agent)
- Contact Coverage: Phone numbers mandatory, email/website preferred
- Format Compliance: Australian phone number formats, valid emails
- Source Reliability: Official government/charity APIs preferred
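
As a rough illustration of the success-rate and phone-format standards above, the sketch below checks a few common Australian number shapes and applies the 90% threshold. The patterns are deliberately simplified assumptions, and every function name here is hypothetical; the Critic Agent's actual scoring is not shown in this document.

```python
import re

# Simplified patterns for common Australian number shapes (an assumption
# for illustration; real validation has more cases than these).
AU_PHONE_PATTERNS = [
    re.compile(r"^1[38]00\d{6}$"),   # 1300 / 1800 numbers (10 digits)
    re.compile(r"^13\d{4}$"),        # 13X short numbers (6 digits)
    re.compile(r"^0[23478]\d{8}$"),  # landlines (02/03/07/08) and mobiles (04)
]


def normalize_phone(raw):
    """Strip spaces and punctuation, and rewrite a +61 prefix as a leading 0."""
    digits = re.sub(r"[^\d+]", "", raw)
    if digits.startswith("+61"):
        digits = "0" + digits[3:]
    return digits


def is_valid_au_phone(raw):
    """Return True if the normalized number matches one of the patterns."""
    number = normalize_phone(raw)
    return any(p.match(number) for p in AU_PHONE_PATTERNS)


def success_rate(extracted, attempted):
    """Percentage of attempted records that were successfully extracted."""
    return 100.0 * extracted / attempted if attempted else 0.0


def meets_pipeline_standard(extracted, attempted, minimum=90.0):
    """Apply the 90% minimum success rate listed in the standards."""
    return success_rate(extracted, attempted) >= minimum
```

For example, `meets_pipeline_standard(9, 10)` is true, since 9 of 10 agencies is exactly 90%, whereas 8 of 10 would fall below the threshold.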
Files to avoid in agent_framework.py:
- `nsw_focused_scraper.py`
- `nsw_gov_directory_scraper.py`
- `acnc_charity_scraper.py`
- `acnc_enhanced_agent.py`
- Any scraper with <90% success rate
- Any scraper without proper `data/raw/` output paths
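
One way to enforce the avoid list above is a small guard run against the `collector_tasks` list before the pipeline starts. This is a sketch under assumptions: `find_deprecated` and `writes_to_raw_dir` are hypothetical helpers, not functions that exist in `agent_framework.py`, and the task paths in the usage note assume the `backend/agents/` layout shown in the checklist.

```python
from pathlib import PurePosixPath

# File names taken from the avoid list above.
DEPRECATED_SCRAPERS = {
    "nsw_focused_scraper.py",
    "nsw_gov_directory_scraper.py",
    "acnc_charity_scraper.py",
    "acnc_enhanced_agent.py",
}


def find_deprecated(collector_tasks):
    """Return the (name, path) entries whose script is on the avoid list."""
    return [
        (name, path)
        for name, path in collector_tasks
        if PurePosixPath(path).name in DEPRECATED_SCRAPERS
    ]


def writes_to_raw_dir(output_path):
    """Return True if a relative output path sits under data/raw/."""
    return PurePosixPath(output_path).parts[:2] == ("data", "raw")
```

Calling `find_deprecated([("nsw_focused", "backend/agents/nsw_focused_scraper.py")])` returns that entry, while a list containing only `nsw_correct_scraper.py` returns an empty list.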
Pipeline Command: python backend/run_pipeline.py
Active Agents: 5 collectors + 2 processors
- ✅ government_services_scraper (109 records)
- ✅ nsw_hospitals_agent (266 records)
- ✅ scamwatch_threat_agent (13 records)
- ⚠️ acnc_data_agent (12 organizational records, contact details limited)
- ✅ nsw_correct_scraper (9 records)
Results: 402 safe contacts, 13 threat indicators, Grade A quality (95%)
Last Updated: 2025-08-30 - Enhanced pipeline with all proven scrapers